Opening Problem Statement
Meet Sarah, a market analyst at a telecommunications firm. Every week, she needs to monitor the latest news updates and technical releases from Colt, a telecom company. Unfortunately, Colt’s news site lacks an RSS feed, forcing Sarah to sift through each post’s web page manually. This tedious process consumes over four hours weekly, with high risk of missing critical updates or key technical keywords relevant to her team’s projects.
She often finds herself overwhelmed, manually copying links, dates, and lengthy articles, then drafting summaries and extracting keywords: all repetitive, time-consuming tasks. If she misses even one post, her team's decision-making suffers, and delayed responses to industry changes carry real costs for her company.
What This Automation Does
This n8n workflow streamlines Sarah’s entire news monitoring process by automating web scraping, AI summarization, and data storage. When run weekly via a scheduler trigger, it achieves the following:
- Extracts links and publication dates from Colt’s news page using precise CSS selectors.
- Filters news posts from only the last 7 days to focus on relevant updates.
- Fetches full content of each filtered news post from individual URLs.
- Uses OpenAI GPT-4 to generate concise news summaries capped at 70 words for quick comprehension.
- Extracts three key technical keywords from each news post’s content for tagging and indexing.
- Stores all collected data including title, date, link, summary, and keywords into a NocoDB SQL database for centralized access and further analysis.
This automation saves Sarah upwards of four hours weekly, eliminates human error in data collection, and keeps her team instantly updated with relevant technical news.
Prerequisites ⚙️
- n8n account – hosting your automation workflows.
- OpenAI account with API access – for GPT-4 driven summaries and keyword extraction.
- NocoDB instance and API token – an SQL-compatible no-code database for storing news data.
- Basic familiarity with CSS selectors – needed to pinpoint exact HTML elements for extraction.
- Optional: Self-hosted n8n for greater control and scalability, which you can learn more about at Hostinger’s n8n hosting guide.
Step-by-Step Guide to Build This News Extraction Workflow ✏️
Step 1: Schedule the Automation Weekly with Schedule Trigger
Navigate to Triggers → Schedule Trigger. Set the workflow to run every Wednesday at 4:32 AM by configuring the interval to weeks and selecting day 3 (Wednesday) at hour 4, minute 32. This ensures new posts are collected weekly without manual intervention.
Expected: The workflow automatically starts every Wednesday morning.
Common mistake: Forgetting to enable the workflow after setup.
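For reference, here is a minimal sketch of what the trigger's parameters look like in the exported workflow JSON, written as a JavaScript object. Exact property names may vary between n8n versions, so treat this as an approximation:
// Approximate Schedule Trigger configuration (verify against your n8n version)
const scheduleTriggerParams = {
  rule: {
    interval: [
      {
        field: "weeks",
        triggerAtDay: [3], // 3 = Wednesday
        triggerAtHour: 4,
        triggerAtMinute: 32,
      },
    ],
  },
};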
Step 2: Retrieve the News Page HTML via HTTP Request
Go to Nodes → HTTP Request. Use the URL https://www.colt.net/resources/type/news/, and set response format to text. This gets the raw HTML of the news page for analysis.
Expected: Raw HTML text is returned, visible in node output.
Common mistake: Not setting response format correctly may cause parsing errors.
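If you want to sanity-check the URL outside n8n first, the node performs the equivalent of a plain GET request. A minimal standalone sketch (Node.js 18+, run as an ES module):
// Fetch the raw HTML of the news page, mirroring the HTTP Request node
const url = "https://www.colt.net/resources/type/news/";
const response = await fetch(url);
const html = await response.text(); // "response format: text" in the node
console.log(html.slice(0, 500)); // preview the first 500 characters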
Step 3: Extract News Links with the HTML Node
Insert an HTML Node named “Extract the HTML with the right css class.” Configure operation as “extractHtmlContent”. Use CSS selector div:nth-child(9) > div:nth-child(3) > a:nth-child(2) to extract href attributes of links displayed on the news page.
Expected: You get an array of news post URLs.
Common mistake: Incorrect CSS selectors result in empty or incorrect link lists.
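To verify the selector outside n8n, you can reproduce the extraction with the cheerio library. This is only an illustration of what the HTML node does internally, not part of the workflow:
// npm install cheerio — illustrative equivalent of the HTML node's extraction
import * as cheerio from "cheerio";

const $ = cheerio.load(html); // html fetched in the previous step
const links = $("div:nth-child(9) > div:nth-child(3) > a:nth-child(2)")
  .map((_, el) => $(el).attr("href")) // collect the href attribute of each match
  .get();
console.log(links); // should print the array of news post URLs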
Step 4: Convert the Links Array into Individual Items with ItemLists Node
Add an ItemLists Node named “Create single link items”. Set the field to split out to the array produced by the previous step, so that each link becomes its own JSON item.
Expected: Each link becomes a separate item for further processing.
Common mistake: Forgetting to specify the correct source field for splitting.
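Conceptually, the split is equivalent to this Code-node snippet. It is only a sketch, and the field name data is an assumption based on the HTML node's default output property:
// Turn one item holding an array of links into one item per link
const links = $input.first().json.data; // "data" holds the extracted link array
return links.map(link => ({ json: { Link: link } }));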
Step 5: Extract Post Dates with HTML Node
Add another HTML Node named “Extract date,” using CSS selector div:nth-child(9) > div:nth-child(2) > span:nth-child(1) to extract the dates corresponding to each post on the main news page.
Expected: An array of post dates matching extracted links.
Common mistake: Dates not matching links if selectors are off.
Step 6: Convert Dates Array to Individual Items Using ItemLists Node
Add ItemLists Node named “Create single date items.” Set it to split out data as in step 4.
Expected: Dates available as individual items aligned with links.
Common mistake: Mismatch in length of links and dates arrays causing errors.
Step 7: Merge Dates and Links by Position
Use a Merge Node named “Merge date & links” with mode “combine” and combinationMode “mergeByPosition” to pair each link with its date.
Expected: Each news post item now contains both link and date.
Common mistake: Using wrong merge mode causes data misalignment.
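Merge-by-position pairs the nth item of one input with the nth item of the other. A standalone sketch of the semantics, using hypothetical values:
// Illustrative merge-by-position: item i of input 1 pairs with item i of input 2
const links = ["https://example.com/post-1", "https://example.com/post-2"];
const dates = ["12 March 2024", "14 March 2024"];

const merged = links.map((link, i) => ({ Link: link, Date: dates[i] }));
console.log(merged); // [{ Link: "...post-1", Date: "12 March 2024" }, ...]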
Step 8: Filter News Posts from the Last 7 Days with Code Node
Add a Code Node named “Select posts of last 7 days.” Use the following JavaScript code:
// Compute the cutoff date without mutating the current date
const now = new Date();
const sevenDaysAgo = new Date(now.getTime() - 7 * 24 * 60 * 60 * 1000);

// Keep only posts whose "Date" field falls within the last 7 days
const filteredItems = items.filter(item => {
  const postDate = new Date(item.json["Date"]);
  return postDate >= sevenDaysAgo;
});

return filteredItems;
This filters out news older than a week, keeping only current posts.
Expected: Only news posts from the past 7 days remain.
Common mistake: Date parsing errors if date formats differ.
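If the site's dates are not in a format that new Date() parses reliably, a more defensive variant can parse them with an explicit format. n8n's Code node exposes the Luxon library as DateTime; note that the format string 'd MMMM yyyy' below is an assumption and must be matched to the actual date text on the page:
// Defensive variant: parse post dates with an explicit Luxon format
const cutoff = DateTime.now().minus({ days: 7 });

const filteredItems = items.filter(item => {
  // Adjust 'd MMMM yyyy' to the real format, e.g. "12 March 2024"
  const postDate = DateTime.fromFormat(item.json["Date"], "d MMMM yyyy");
  return postDate.isValid && postDate >= cutoff;
});

return filteredItems;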
Step 9: Fetch Full News Content of Filtered Items via HTTP Request
Insert another HTTP Request Node named “HTTP Request1”. Use the URL from each item’s Link field dynamically (=$json["Link"]). This retrieves the complete HTML content of each individual news article.
Expected: Raw HTML of each news post is pulled for content extraction.
Common mistake: Forgetting to use dynamic expressions for URLs results in errors.
Step 10: Extract Title and Content from Each Post with HTML Node
Add an HTML Node named “Extract individual posts.” Configure with two CSS selectors:
- Title: h1.fl-heading > span:nth-child(1)
- Content: .fl-node-5c7574ae7d5c6 > div:nth-child(1)
This pulls the headline and main text of each article.
Expected: Structured title and content extracted.
Common mistake: Selectors may need adjustment if the site's structure changes.
Step 11: Merge Extracted Content with Corresponding Date and Link
Use a Merge Node named “Merge Content with Date & Link” in “combine” mode by position. This consolidates title, content, date, and links into one item.
Expected: Each item now fully represents one news post with metadata.
Common mistake: Out-of-sync merges due to oddly ordered data.
Step 12: Generate Summary of Each News Post with OpenAI GPT-4 Node
Add the OpenAI Node named “Summary.” Use the GPT-4 preview model. The prompt template is:
=Create a summary in less than 70 words {{ $json["content"] }}
This generates concise, readable abstracts for quick review.
Expected: Each post has a short AI-generated summary.
Common mistake: Missing API credentials causes failure.
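For context, the node sends the equivalent of a chat-completion request. Below is a minimal standalone sketch against OpenAI's REST API; the model identifier is an assumption (match whichever GPT-4 preview model your node selects), and the API key is read from an environment variable:
// Minimal equivalent of the "Summary" node's request (illustration only)
const content = "...full article text extracted in step 10...";

const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4-1106-preview", // assumed preview model; match your node's setting
    messages: [
      { role: "user", content: `Create a summary in less than 70 words ${content}` },
    ],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content); // the generated summary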
Step 13: Extract Three Technical Keywords with OpenAI Node
Use another OpenAI Node named “Keywords” with the prompt:
=name the 3 most important technical keywords in {{ $json["content"] }} ? just name them without any explanations or other sentences
This tags each post with relevant technical terms.
Expected: Three keywords returned per article.
Common mistake: Incorrect prompt format yields poor keywords.
Step 14: Rename and Prepare Summary & Keywords for Merging
Add two Set Nodes called “Rename Summary” and “Rename keywords.” Use expressions:
- Summary: =$json["message"]["content"], saved as summary
- Keywords: =$json["message"]["content"], saved as keywords
This lifts the nested OpenAI response field into clean, usable top-level keys.
Expected: Clean summary and keyword fields available.
Common mistake: Forgetting to specify output fields correctly.
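If you prefer a single Code node over the Set nodes, the renaming can be sketched like this (the message.content path follows the OpenAI node output referenced above):
// Equivalent of the "Rename Summary" Set node: lift the nested field to a clean key
return items.map(item => ({
  json: { summary: item.json.message.content },
}));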
Step 15: Merge Summaries and Keywords
Use a Merge Node named “Merge” with combination by position to combine the renamed summary and keywords outputs.
Expected: Each news post now has content, summary, and keywords grouped.
Common mistake: Mismatched item counts cause errors.
Step 16: Merge ChatGPT Output with News Metadata
Use “Merge ChatGPT output with Date & Link” node to combine all information from the AI output with the previously collected date and link metadata.
Expected: A final enriched news post item with all details.
Common mistake: Wrong merge mode could misalign data.
Step 17: Save Final News Records into NocoDB Database
Add the NocoDB Node named “NocoDB news database.” Connect your NocoDB API token and select the target table. Map fields:
- News_Source: Static value “Colt”
- Title, Date, Link, Summary, Keywords: Values from JSON fields
Expected: Structured news data saved in your database for further use.
Common mistake: Incorrect field mappings can cause data loss.
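The mapping corresponds to a record payload like the one below. If you ever need to insert records directly via NocoDB's REST API instead of the node, this sketch shows the idea; the v2 endpoint path, table ID placeholder, and xc-token header should be verified against your NocoDB version:
// Sketch: insert one news record via NocoDB's REST API (verify endpoint and IDs)
const record = {
  News_Source: "Colt", // static value
  Title: "Example headline", // taken from the merged JSON fields in the workflow
  Date: "12 March 2024",
  Link: "https://www.colt.net/resources/example-post/",
  Summary: "AI-generated summary...",
  Keywords: "keyword1, keyword2, keyword3",
};

await fetch("https://YOUR_NOCODB_HOST/api/v2/tables/YOUR_TABLE_ID/records", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "xc-token": process.env.NOCODB_API_TOKEN, // your NocoDB API token
  },
  body: JSON.stringify(record),
});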
Customizations ✏️
- Change News Source URL: Point the “Retrieve the web page for further processing” HTTP Request node at another news site, and adjust the CSS selectors accordingly.
- Adjust Date Range: In the “Select posts of last 7 days” Code Node, change the look-back window from 7 days to whatever time frame matches your schedule.
- Use Different AI Models: Swap the GPT-4 preview model for GPT-3.5, or tune the prompts in the “Summary” and “Keywords” nodes for customized outputs.
- Switch Database Target: Replace NocoDB with other SQL integrations available in n8n, such as MySQL or PostgreSQL nodes, to fit your infrastructure.
- Add Email Notifications: After data is stored, add a Gmail node to notify your team of new summaries automatically.
Troubleshooting 🔧
Problem: “No data extracted from HTML Node”
Cause: CSS selectors are incorrect or the webpage structure changed.
Solution: Use browser inspect tool to verify correct CSS selectors and update the HTML nodes accordingly.
Problem: “OpenAI API Key Unauthorized”
Cause: Expired or incorrect OpenAI API credentials.
Solution: Go to n8n credentials, check your OpenAI API key, and refresh if needed.
Problem: “Merge Node Data Misalignment”
Cause: Merged nodes use wrong combination mode or inputs are out-of-sync.
Solution: Ensure all merges are set to “combine” and “mergeByPosition” to correctly align items.
Pre-Production Checklist ✅
- Verify CSS selectors with browser inspector for links, dates, title, and content.
- Confirm OpenAI API credentials are active and permitted for GPT-4 usage.
- Test HTTP Requests separately to confirm pages and posts are reachable.
- Run workflow in debug mode to ensure data flows correctly after each node.
- Back up the database before first data insertion to prevent accidental overwrites.
- Schedule a test run shortly before the next scheduled execution to verify the trigger fires properly.
Deployment Guide
After building and testing this workflow, activate it in your n8n editor by toggling the active switch. The workflow will then automatically run as scheduled, fetching weekly updates.
Monitor execution logs in n8n for errors and review the NocoDB database for stored records. Adjust CSS selectors or date filters as required for website changes or evolving needs.
FAQs
Can I use other AI services instead of OpenAI?
While alternative AI providers exist, this workflow is tailored for OpenAI’s GPT-4. Switching would require adjusting prompt structures and authentication tokens.
Does this workflow consume many OpenAI tokens?
Each summary and keyword extraction call consumes tokens based on content length, so expect moderate usage aligned with your weekly news volume.
Is my news data securely stored?
NocoDB stores your data safely on your infrastructure or cloud provider of choice. Ensure secure API credential management in n8n to keep information protected.
Conclusion
By following this detailed guide, you’ve automated the extraction, summarization, and tagging of the latest news posts from a telecom site lacking an RSS feed. This saves significant manual effort, reduces missed updates, and centralizes data for better decision-making.
Sarah now spends under an hour each week monitoring updates instead of four or more, gaining both productivity and accuracy. Next, consider automating alerts via email or Slack based on keywords, or integrating sentiment analysis for deeper insights.