Structured Data Extraction & Sentiment Analysis with n8n and Bright Data

This workflow automates extracting structured data from web content using Bright Data’s Web Unlocker and Google Gemini AI models. It solves the challenge of manual data mining and sentiment analysis by delivering organized insights and trend detection automatically.
Workflow Identifier: 2016
Nodes in use: Manual Trigger, Sticky Note, Set, HTTP Request, Chain LLM, Google Gemini Chat Model, Information Extractor, Function, Read & Write File


Opening Problem Statement

Meet Sarah, a digital marketing analyst tasked with monitoring global news trends and extracting meaningful insights from web data. Every day, she spends hours manually scraping news articles, converting raw markdown content into clear text, and then analyzing sentiment and trending topics. Her time-consuming process frequently leads to missed insights and inconsistent reporting — costing her team valuable time and reducing competitive advantage.

This exact challenge is what the Structured Data Extract, Data Mining with Bright Data & Google Gemini n8n workflow is designed to resolve. Instead of manually copying content, decoding markdown, and sorting through noisy data, this automation streamlines the entire process with AI-powered extraction and sentiment analysis, saving Sarah hours per week and delivering reliable, actionable data.

What This Automation Does

When this workflow runs, here’s what it accomplishes for you:

  • Automatically retrieves web content from any URL using the trusted Bright Data Web Unlocker product, bypassing common scraping restrictions and delivering raw markdown data.
  • Converts markdown content into clean textual data using Google Gemini’s advanced language model through n8n’s LangChain integration.
  • Performs detailed topic extraction with structured data responses identifying key themes, confidence scores, summaries, and relevant keywords.
  • Clusters emerging trends by geographical location and category, offering insightful summaries and keyword mentions for strategic market analysis.
  • Conducts sentiment analysis on the extracted content customized by the user’s AI chain, providing nuanced understanding of underlying emotions or viewpoints.
  • Saves extracted topics and trends as JSON files locally, enabling easy archival or further processing without manual intervention.

The result? A powerful end-to-end data extraction and mining pipeline that turns complex web content into structured, insightful datasets effortlessly.

Prerequisites ⚙️

  • n8n account with access to create workflows 🔌
  • Bright Data account configured with access to the Web Unlocker product 🔐
  • Google PaLM API account to use Google Gemini Chat model for AI processing 🔑
  • Webhook service URL to receive notifications (e.g., webhook.site) 💬
  • Local system or server with write permission to save JSON files 📁

Optional: Self-host your n8n execution server for uninterrupted operation. You can explore self-hosting options such as buldrr with Hostinger to run this automation independently.

Step-by-Step Guide

Step 1: Start Manual Trigger

In your n8n workspace, click + New Workflow and add a Manual Trigger node. This node allows us to start the automation manually for testing. You should see a button labeled “Execute Workflow” once complete. This is your launchpad.

Common mistake: Forgetting to place the trigger node first can break the workflow execution chain.

Step 2: Configure ‘Set URL and Bright Data Zone’ Node

Add the Set node to assign the target website URL and Bright Data proxy zone name. In this workflow, the URL is preset to https://www.bbc.com/news/world and the zone to web_unlocker1.

Navigate to: Add Node → Core Nodes → Set. Enter the following assignments:

  • url: https://www.bbc.com/news/world
  • zone: web_unlocker1

Expected outcome: Workflow now knows which site to scrape and which Bright Data zone to route through.

Common mistake: Not updating the URL to your desired target will result in scraping the default URL content instead.

Step 3: Perform Bright Data Web Request

Add an HTTP Request node to call Bright Data’s API for web scraping. Set method to POST and URL to https://api.brightdata.com/request. Use the HTTP Header Auth credentials configured with your Bright Data API token.

In the body parameters, pass the dynamic values zone and url (with query parameters for unlocking). Set format to raw and data_format to markdown to receive markdown content.

Navigate to: Add Node → Core Nodes → HTTP Request

Expected outcome: Raw markdown content of the target page is fetched.

Common mistake: Using incorrect authentication headers may cause request failures.
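To make the request configuration concrete, here is a minimal sketch of the body the HTTP Request node sends to Bright Data. The helper function name is illustrative; the `zone`, `url`, `format`, and `data_format` fields mirror the parameters described above.

```javascript
// Sketch of the request body for POST https://api.brightdata.com/request.
// The zone and URL values come from the Set node in Step 2 - replace them
// with your own Bright Data zone and target page.
function buildBrightDataBody(zone, url) {
  return {
    zone: zone,              // Bright Data Web Unlocker zone name
    url: url,                // target page to scrape
    format: 'raw',           // return the raw response body
    data_format: 'markdown'  // request markdown instead of HTML
  };
}

const body = buildBrightDataBody('web_unlocker1', 'https://www.bbc.com/news/world');
```

Authentication is handled separately via the HTTP Header Auth credential holding your Bright Data API token.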

Step 4: Convert Markdown to Textual Data with Google Gemini

Chain in the LangChain – Chain LLM node configured to convert markdown to plain text. The node prompts Google Gemini with the instruction: “You are a markdown expert” and asks it to only output textual content, removing all links, scripts, and CSS.

Expected outcome: Clean, structured textual data extracted from complex markdown.

Common mistake: Misconfiguring the prompt can cause unwanted formatting or incomplete extraction.
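As a reference point, the prompt for this step might look like the sketch below. Only the “You are a markdown expert” instruction is taken from the workflow; the remaining wording is illustrative and should be adapted to your needs.

```javascript
// Illustrative prompt for the Chain LLM node. Only the first line is
// quoted from the workflow; the rest is an assumed phrasing of the
// conversion instruction described in this step.
const prompt = [
  'You are a markdown expert.',
  'Convert the following markdown into plain textual content only.',
  'Remove all links, scripts, and CSS.'
].join('\n');
```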

Step 5: Extract Topics with Structured Response

Use the Information Extractor node to analyze the textual data and produce structured output identifying topics with fields like topic name, confidence score, summary, and keywords.

Expected outcome: You receive concise, machine-readable topic insights from the content.

Common mistake: Incorrect JSON schema in node parameters can cause extraction errors.
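A minimal JSON schema sketch for this node is shown below, built from the fields named above (topic name, confidence score, summary, keywords). Property names are assumptions; match them to your actual Information Extractor configuration.

```javascript
// Illustrative JSON schema for the Information Extractor node.
// Field names are assumptions based on the fields described in this step.
const topicSchema = {
  type: 'object',
  properties: {
    topics: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          topic: { type: 'string', description: 'Name of the detected topic' },
          score: { type: 'number', description: 'Confidence score between 0 and 1' },
          summary: { type: 'string', description: 'Short summary of the topic' },
          keywords: { type: 'array', items: { type: 'string' } }
        },
        required: ['topic', 'score', 'summary', 'keywords']
      }
    }
  }
};
```

Validating this schema with a tool like JSONLint before pasting it into the node helps avoid the extraction errors mentioned below.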

Step 6: Analyze Trends by Location and Category

Another Information Extractor node performs clustering of emerging trends by location and category from the same data.

Expected outcome: Clustered trend data with location, category, key trends, confidence scores, and related mentions.

Common mistake: Not using the same input content can result in irrelevant or empty trend data.

Step 7: Use Google Gemini for Sentiment Analysis

Add a Google Gemini Chat Model node to run sentiment analysis on the extracted topics. This node uses the same PaLM API credentials and model version to get mood and opinion insights.

Expected outcome: Results containing sentiment summaries about the web content.

Common mistake: Forgetting to connect the input correctly prevents AI processing.

Step 8: Send Webhook Notifications

Use HTTP Request nodes configured to send the extracted textual data, sentiment summaries, and trends to a webhook URL like webhook.site. This allows monitoring outputs in real-time or integration with other services.

Expected outcome: Real-time notification of extracted data arrives at the webhook endpoint.

Common mistake: Not enabling the “Send Body” option can result in empty payloads.
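The body of such a notification might be assembled like the sketch below. The field names are illustrative assumptions, not the workflow's exact payload; align them with whatever your webhook consumer expects.

```javascript
// Illustrative webhook payload builder. All field names here are
// assumptions - adapt them to your receiving service.
function buildNotification(topics, trends, sentiment) {
  return {
    source: 'n8n-structured-data-extract',
    extractedAt: new Date().toISOString(),
    topics: topics,        // structured topic insights from Step 5
    trends: trends,        // clustered trend data from Step 6
    sentiment: sentiment   // sentiment summary from Step 7
  };
}

const payload = buildNotification([], [], 'neutral');
```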

Step 9: Save Extracted Data to Local Files

Use a Function node to encode the topic and trend JSON data into binary format required for file writing, then a Read & Write File node to save these JSON files onto disk.

Example Function code for encoding topics:

// Serialize the node's JSON output and base64-encode it as binary data,
// which is the format the Read & Write File node expects for file writes.
items[0].binary = {
  data: {
    data: Buffer.from(JSON.stringify(items[0].json, null, 2)).toString('base64')
  }
};
return items;

Expected outcome: The files d:\topics.json and d:\trends.json are created with the structured information.

Common mistake: Providing invalid paths or insufficient permissions causes write errors.
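To sanity-check the encoding used in the Function node, you can round-trip a sample object: encode it the same way, decode it back, and confirm it parses to identical JSON. The sample data below is made up purely for illustration.

```javascript
// Round-trip check for the base64 encoding used in the Function node:
// encode a sample object, decode it, and confirm the JSON survives intact.
const original = { topic: 'World news', score: 0.92 };

// Same encoding as the Function node above.
const encoded = Buffer.from(JSON.stringify(original, null, 2)).toString('base64');

// Decode back to JSON, as the Read & Write File node effectively does on disk.
const decoded = JSON.parse(Buffer.from(encoded, 'base64').toString('utf8'));
```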

Customizations ✏️

  • Change Target URL: In the Set URL and Bright Data Zone node, update the url field with your preferred site. This lets you scrape any website supported by Bright Data Web Unlocker.
  • Adjust Bright Data Zone: Modify the zone field to your configured Bright Data proxy zones. Helpful if you manage multiple zones for different regions.
  • Switch AI Model Versions: Change the model used in Google Gemini Chat Model nodes by selecting a different Gemini model version in the parameters, such as “gemini-2.0-advanced” for deeper analysis.
  • Customize Information Extractor Schema: Edit the JSON schema in the Information Extractor nodes to extract additional structured data fields like sentiment scores or entity recognition as per your needs.
  • Setup Alternate Webhook URLs: Update webhook URLs in the HTTP Request nodes to integrate with Slack, Discord, or your preferred notification platform.

Troubleshooting 🔧

Problem: “Authentication failed with Bright Data API”

Cause: Incorrect HTTP Header Auth token or expired API key.

Solution: Double-check your Bright Data credentials in n8n under Credentials → HTTP Header Auth. Reissue API tokens if necessary and update here.

Problem: “Invalid schema in Information Extractor node” leading to failures

Cause: Schema JSON format errors or missing required properties.

Solution: Validate your JSON schema with online tools like JSONLint. Ensure all required fields (topic, score, summary, keywords) exist.

Problem: “Webhook Request Body empty” notification not received

Cause: HTTP Request node’s “Send Body” option not enabled or wrong body parameters.

Solution: In HTTP Request nodes sending notifications, ensure ‘Send Body’ is checked and parameter names match expected keys.

Pre-Production Checklist ✅

  • Verify that the Manual Trigger executes and starts the workflow without errors.
  • Confirm Bright Data credentials and zone names correspond to your account and intended proxy usage.
  • Test the HTTP Request node and confirm the Bright Data API response contains markdown data.
  • Validate LangChain nodes properly convert markdown and extract structured topics and trends.
  • Ensure Google Gemini PaLM API credentials are valid and connected.
  • Perform test workflow runs and confirm webhook notifications are received with expected data.
  • Check file write location permissions to avoid errors when saving JSON output files.

Deployment Guide

Activate your workflow by clicking the Activate button in n8n. Schedule or trigger manually as needed.

Monitor execution via n8n’s built-in logs and webhook traffic to ensure data is flowing correctly.

For ongoing use, consider hosting n8n yourself to maintain control and avoid service interruptions.

FAQs

Can I use another scraping provider instead of Bright Data?

Yes, but you will need to modify the HTTP Request node’s URL and authentication to match the new provider’s API.

Does this workflow consume a lot of Google Gemini API credits?

The workflow uses multiple Google Gemini nodes, so costs depend on your API plan. Consider monitoring usage in Google Cloud Console.

How secure is my data during this automation?

All data transmissions use HTTPS. You control API credentials and webhook endpoints, ensuring privacy as per your service’s standards.

Conclusion

By following this guide, you have built a sophisticated automated pipeline that transforms web markdown content into clear, structured insights enriched with sentiment analysis and trend extraction. This setup saves hours of manual work in web scraping, data cleansing, and analysis.

Not only does this workflow deliver reliable, actionable data in JSON format saved locally, but it also enables instant notifications for monitoring trends and sentiment changes worldwide.

Next steps? Try expanding this workflow to analyze multiple URLs simultaneously or integrate with dashboards like Google Data Studio or Power BI for visual reporting.

With these tools in your arsenal, you’re ready to handle complex web data mining and AI analysis tasks like a pro!

Related Workflows

Automate Viral UGC Video Creation Using n8n + Degaus (Beginner-Friendly Guide)

Learn how to automate viral UGC video creation using n8n, AI prompts, and Degaus. This beginner-friendly guide shows how to import, configure, and run the workflow without technical complexity.

AI SEO Blog Writer Automation in n8n (Beginner Guide)

A complete beginner guide to building an AI-powered SEO blog writer automation using n8n.

Automate CrowdStrike Alerts with VirusTotal, Jira & Slack

This workflow automates processing of CrowdStrike detections by enriching threat data via VirusTotal, creating Jira tickets for incident tracking, and notifying teams on Slack for quick response. Save hours daily by transforming complex threat data into actionable alerts effortlessly.

Automate Telegram Invoices to Notion with AI Summaries & Reports

Save hours on financial tracking by automating invoice extraction from Telegram photos to Notion using Google Gemini AI. This workflow extracts data, records transactions, and generates detailed spending reports with charts sent on schedule via Telegram.

Automate Email Replies with n8n and AI-Powered Summarization

Save hours managing your inbox with this n8n workflow that uses IMAP email triggers, AI summarization, and vector search to draft concise replies requiring minimal review. Automate business email processing efficiently with AI guidance and Gmail integration.

Automate Email Campaigns Using n8n with Gmail & Google Sheets

This n8n workflow automates personalized email outreach campaigns by integrating Gmail and Google Sheets, saving hours of manual follow-up work and reducing errors in email sequences. It ensures timely follow-ups based on previous email interactions, optimizing communication efficiency.