Create AI-Ready Vector Datasets with Bright Data, Gemini & Pinecone

This workflow automates the extraction, formatting, and storage of web data into AI-ready vector datasets using Bright Data, Google Gemini, and Pinecone. It solves the problem of manual data processing by enabling efficient handling of complex data formats for large language models (LLMs).
Workflow Identifier: 1243
NODES in Use: Manual Trigger, HTTP Request, Set, Chain LLM, Information Extractor, AI Agent, LM Chat Google Gemini, Output Parser Structured, Text Splitter Recursive Character Text Splitter, Default Data Loader, Embeddings Google Gemini, Vector Store Pinecone, Sticky Note


Opening Problem Statement

Meet Sam, a data engineer struggling to prepare web data for AI applications. Every day, Sam spends countless hours manually scraping websites, cleaning up messy data, formatting it according to strict schemas, and finally storing it in vector databases for use with large language models (LLMs). This repetitive, error-prone process costs him valuable time and delays critical AI projects.

For Sam, the biggest headache is dealing with unstructured web data — like news articles from sources such as Hacker News — which must be transformed into highly specific, structured datasets. These datasets then need to be embedded and indexed into a vector database like Pinecone to power efficient AI searching and retrieval. Without automation, Sam faces hours of tedious work, inconsistent output quality, and frequent rework.

What This Automation Does

This n8n workflow automates Sam’s entire pipeline of creating AI-ready vector datasets from web data, using a suite of modern tools:

  • Starts with a manual trigger to fetch the latest web content from predefined URLs using Bright Data’s web unlocking API.
  • Formats the raw JSON response into a clearly structured dataset with titles, ranks, sites, points, users, ages, and comment counts using AI models.
  • Extracts and cleans the relevant information from HTML embedded in the response through a Google Gemini-powered AI agent, applying expert-level data extraction and formatting.
  • Segments long text data into manageable chunks using a Recursive Character Text Splitter, preparing it for downstream AI embeddings.
  • Creates high-quality vector embeddings of the cleaned text with Google Gemini’s embeddings model.
  • Inserts these vectors into the Pinecone vector store for lightning-fast semantic search and retrieval by AI applications.
  • Sends real-time webhook notifications with the structured data and AI agent responses for monitoring or further usage.

By automating these steps, Sam saves hours of manual work, reduces errors, and gets consistent, well-structured vector datasets ready to power advanced language models.

Prerequisites ⚙️

  • n8n account – Access to create and run workflows
  • Bright Data API credentials – For web scraping with the Web-Unlocker product
  • Google Gemini (PaLM) API credentials – To use the advanced text embedding and chat models
  • Pinecone API credentials – For managing vector databases
  • Webhook endpoint URL – To receive structured data notifications, e.g., from webhook.site

Step-by-Step Guide

1. Trigger the Workflow Manually

Navigate to your n8n editor and click Execute Workflow to start the process manually. This fires the Manual Trigger node named When clicking ‘Test workflow’ and sets the workflow running.

You should see the workflow progress as the data flows from one node to another. This sets the chain in motion.

Common mistake: Forgetting to start the run manually; this trigger does not fire on its own unless you add a schedule.

2. Configure URL and Webhook Fields

The Set Fields – URL and Webhook URL node sets crucial parameters:

  • url to the website to scrape, e.g., https://news.ycombinator.com?product=unlocker&method=api
  • webhook_url to where structured results will be posted (e.g., https://webhook.site/your-unique-url)

This is where you define what data source to target and where to send notifications.

Common mistake: Forgetting to update the url to your target source, or entering an incorrect webhook URL.

3. Make a Web Request via Bright Data

The Make a web request node calls Bright Data’s Web-Unlocker API to retrieve raw web data. This node uses a POST request with the zone name and target URL provided earlier.

POST https://api.brightdata.com/request
Headers: Authorization with your API Key
Body: { "zone": "web_unlocker1", "url": "= {{ $json.url }}", "format": "raw" }

You’ll see a large raw JSON response representing the scraped web content.

Common mistake: Missing or incorrect API credentials, or wrong zone names.
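Outside n8n, the same call can be sketched in a few lines. This is a minimal illustration built from the endpoint, zone, and body shown above; the Bearer scheme for the Authorization header is an assumption, so check your Bright Data dashboard for the exact format.

```python
import json
import urllib.request

BRIGHTDATA_API_KEY = "YOUR_API_KEY"  # placeholder, not a real key

def build_unlocker_request(target_url, zone="web_unlocker1"):
    """Build (without sending) the POST the 'Make a web request' node performs."""
    payload = {"zone": zone, "url": target_url, "format": "raw"}
    return urllib.request.Request(
        "https://api.brightdata.com/request",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Auth scheme assumed; verify against Bright Data's API docs
            "Authorization": f"Bearer {BRIGHTDATA_API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the built request to `urllib.request.urlopen` would perform the actual call; returning it unsent keeps the sketch testable without credentials.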

4. Format the Raw JSON Response

The Structured JSON Data Formatter node uses an AI language model to convert the complex raw data into a cleaner JSON format describing news items with ranks, titles, points, users, etc.

The prompt defines a JSON schema example to standardize output, improving data consistency.

Common mistake: Improper prompt formatting or schema mismatch causing parsing errors.
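As a rough illustration of what "schema example" means here, one formatted news item might look like the dictionary below. The field names follow the attributes listed earlier (ranks, titles, sites, points, users, ages, comment counts); the exact keys in your prompt's schema may differ.

```python
# Hypothetical shape of one formatted news item (keys are illustrative)
EXAMPLE_ITEM = {
    "rank": 1,
    "title": "Show HN: An example post",
    "site": "example.com",
    "points": 120,
    "user": "someuser",
    "age": "3 hours ago",
    "comments": 45,
}

REQUIRED_KEYS = {"rank", "title", "site", "points", "user", "age", "comments"}

def is_well_formed(item):
    """True when every expected field is present: a cheap guard against
    the schema-mismatch parsing errors mentioned above."""
    return REQUIRED_KEYS.issubset(item)
```

A check like `is_well_formed` run after the formatter node makes schema drift visible early instead of surfacing later as embedding or insert failures.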

5. Extract and Format Web Data Using AI Agent

The Information Extractor with Data Formatter node employs the Google Gemini chat model to analyze the formatted data and extract meaningful content collections.

This step transforms HTML-rich data into structured textual results ready for embedding.

Common mistake: Providing poorly formatted input text or missing credentials.

6. Launch the AI Agent for Advanced Formatting

The AI Agent node takes the extracted content and applies a final layer of AI-driven formatting for crisp, easy-to-consume textual output.

This node uses a defined prompt to carefully structure the response.

Common mistake: Skipping this step reduces data quality and the clarity of downstream embeddings.

7. Split Text into Chunks for Embedding

The Recursive Character Text Splitter divides long text responses into smaller manageable pieces. This is vital because embedding models have character limits.

Common mistake: Ignoring text splitting can cause embedding failures or truncated vectors.
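A minimal sketch of what chunking with overlap looks like. The actual node uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers natural boundaries such as paragraphs and sentences; the chunk size and overlap below are illustrative defaults, not the node's settings.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Naive fixed-size chunking with overlap between adjacent chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # advance by chunk_size minus overlap so chunks share context
        start += chunk_size - overlap
    return chunks
```

The overlap matters: it keeps sentences that straddle a chunk boundary retrievable from either side during semantic search.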

8. Load Split Data for Embedding Preparation

The Default Data Loader node prepares these text chunks as documents for embedding generation.

Common mistake: Failing to map text chunks correctly leads to incomplete data processing.

9. Generate Embeddings with Google Gemini

The Embeddings Google Gemini node sends each text chunk to Google Gemini’s “models/text-embedding-004” to produce vector representations.

Common mistake: Incorrect model naming or missing API keys will prevent embeddings generation.
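Outside n8n, the equivalent Gemini REST call sends a body like the one built below. The shape follows Google's public embedContent endpoint for models/text-embedding-004; verify it against the current Gemini API documentation before relying on it.

```python
def embed_request_body(text_chunk):
    """Body for POST .../models/text-embedding-004:embedContent
    (the API key is sent separately, e.g. as a query parameter)."""
    return {
        "model": "models/text-embedding-004",
        "content": {"parts": [{"text": text_chunk}]},
    }
```

Note that the model name includes the `models/` prefix; omitting it is one way to hit the "incorrect model naming" mistake called out above.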

10. Insert Vector Data into Pinecone

The Pinecone Vector Store node takes the embeddings and inserts them into the “hacker-news” Pinecone index.

This step makes your data instantly searchable in semantic vector form by AI tools.

Common mistake: Misconfigured Pinecone index names or missing credentials.
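Conceptually, each chunk/embedding pair becomes one vector record in the "hacker-news" index. The sketch below shows a generic Pinecone-style upsert body; the id scheme and the metadata key are illustrative choices, not what the n8n node emits.

```python
def upsert_payload(chunks, vectors):
    """Pair each text chunk with its embedding as one vector record."""
    return {
        "vectors": [
            {
                "id": f"hn-{i}",              # illustrative id scheme
                "values": vec,                 # the embedding itself
                "metadata": {"text": chunk},   # original text for retrieval
            }
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ]
    }
```

Storing the source text in metadata is what lets a later semantic search return readable passages rather than bare vectors.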

11. Send Webhook Notifications with Processed Data

The workflow utilizes two HTTP Request nodes (Webhook for structured data and Webhook for structured AI agent response) to send real-time notifications of structured output to predefined webhook URLs.

This is great for monitoring or triggering other workflows.

Common mistake: Incorrect webhook URLs or payload formats.
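The notification itself is a plain JSON POST. A minimal equivalent outside n8n, using the placeholder webhook URL from step 2:

```python
import json
import urllib.request

def build_notification(webhook_url, payload):
    """Build (without sending) the JSON POST to the webhook endpoint."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Calling `urllib.request.urlopen` on the returned request sends it; services like webhook.site then display the payload for inspection.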

Customizations ✏️

  • Change Source Website: In the Set Fields – URL and Webhook URL node, update the url field to scrape a different site or API endpoint.
  • Adjust AI Formatting Prompts: Modify the prompt text in the Structured JSON Data Formatter or AI Agent node to tweak how data is structured or summarized.
  • Switch Embeddings Model: In the Embeddings Google Gemini node, switch the modelName parameter to another Google Gemini embedding model if available for better performance.
  • Modify Pinecone Index: Change the index name (or namespace) in the Pinecone Vector Store node parameters to target a different index or namespace.
  • Custom Webhook Actions: Update the webhook URLs or payload structure in the webhook nodes to integrate with other services or dashboards.

Troubleshooting 🔧

  • Problem: “401 Unauthorized” from the Bright Data API.
    Cause: Invalid or expired API key.
    Solution: Open the Make a web request node's credentials and update them with a valid API key via HTTP Header Auth.
  • Problem: AI agent returns unexpected or empty data.
    Cause: Incorrect input text or missing Google Gemini API credentials.
    Solution: Verify input data format in the Information Extractor with Data Formatter node and confirm API credentials under Google Gemini nodes.
  • Problem: Pinecone insert fails with “index not found”.
    Cause: Misconfigured Pinecone index name.
    Solution: Confirm the Pinecone index name is correct in the Pinecone Vector Store node.

Pre-Production Checklist ✅

  • Check that all API credential nodes (Bright Data, Google Gemini, Pinecone) are configured and active.
  • Verify target URLs and webhook URLs are correct and accessible.
  • Run manual tests to confirm data flows cleanly through formatting, extraction, and embedding nodes.
  • Ensure text splitting produces manageable chunks without truncation.
  • Test webhook endpoints receive data as expected.

Deployment Guide

Activate this workflow by setting it to Active in your n8n dashboard. Since it starts with a manual trigger, you can run it on-demand or set up a schedule trigger for automatic periodic runs.

Monitor the workflow executions in n8n’s execution log to catch errors or delays.

If you want to self-host n8n for better privacy and control, consider services like Hostinger.

FAQs

Can I use another AI embedding model instead of Google Gemini?
Potentially yes, but you would need to replace the Embeddings Google Gemini node with one supporting your chosen model and adjust prompts accordingly.
Does this workflow consume a lot of API credits?
It depends on your level of usage, but the Bright Data API, Google Gemini calls, and Pinecone insertions each have associated costs. Monitor usage on your provider dashboards.
Is data sent to webhooks secure?
Data sent is only as secure as the webhook endpoint you provide. Use HTTPS endpoints and trusted services to ensure data privacy.

Conclusion

By following this guide, you’ve built an automated pipeline to extract, format, and store web data as AI-optimized vector datasets using Bright Data, Google Gemini, and Pinecone. This saves Sam — or you — hours of manual effort, dramatically reduces errors, and accelerates AI project timelines.

Next, you could expand this workflow to automate regular scraping schedules, add more sophisticated AI summarization, or integrate with other AI tools like chatbot platforms.

Keep experimenting, and enjoy the power of automation for your AI datasets!

