Enrich Company Data with n8n, OpenAI, and ScrapingBee

This workflow automates enriching company profiles from a Google Sheet by scraping website data with ScrapingBee and analyzing it using OpenAI GPT. It extracts business insights and updates your sheet automatically, saving hours of manual research.
Workflow Identifier: 2335
Nodes in use: googleSheets, splitInBatches, set, httpRequest, markdown, lmChatOpenAi, outputParserStructured, agent, toolWorkflow, executeWorkflowTrigger, webhook

Opening Problem Statement

Meet Lisa, a business analyst at a B2B marketing firm. Every week, Lisa receives a spreadsheet filled with hundreds of company names and their website URLs. Her job is to enrich this list with critical business details like the company’s core activities, main products, value propositions, and customer profiles. Doing this manually means Lisa wastes countless hours scouring company websites, copying data, and guessing business models. On top of that, mistakes creep in, some insights are missed, and the firm loses out on targeting opportunities and strategic decisions.

Lisa’s challenge is specific: How can she automatically enrich her company list with reliable business insights without spending hours or risking errors? This is exactly the gap the n8n workflow “Enrich Company Data from Google Sheet with OpenAI Agent and Scraper Tool” fills.

What This Automation Does

This workflow runs a seamless process that takes a list of companies from Google Sheets, scrapes their websites, uses an AI agent to analyze the content, and then updates the sheet with structured enriched data. Here’s what happens each time it runs:

  • Retrieves company names and websites from your Google Sheet.
  • Loops through each company to fetch the homepage content via ScrapingBee (a web scraper).
  • Passes the scraped HTML, converted to Markdown, to the OpenAI GPT-4o-mini model.
  • Extracts specific business information: Business Area, Products/Offers, Value Proposition, Business Model, and Ideal Customer Profile (ICP).
  • Handles edge cases such as insufficient data or mismatched website content with clear diagnostics.
  • Automatically updates the original Google Sheet rows with the newly enriched data.

The benefits are concrete: save hours (or even days) on each weekly update, reduce errors, and get richer, ready-to-use business insights directly in your spreadsheet.

Prerequisites ⚙️

  • Google Sheets account with your list of companies (must include at least “Company” and “Website” columns).
  • ScrapingBee API key for reliable website scraping.
  • OpenAI API key with GPT-4o-mini model access.
  • n8n account with workflow creation and credentials configured.
  • Basic familiarity with API credentials management in n8n.
  • Optional: a hosting provider if you self-host n8n (such as Hostinger via buldrr.com).

Step-by-Step Guide

Step 1: Set up Google Sheets with your Company List

Open Google Sheets and prepare a sheet with these columns: Company, Website. Include additional columns like Business Area, Offer, Value Proposition, Business Model, ICP, and Additional Information for enrichment outputs.

Example entry:

Company: ACME Corp
Website: https://www.acmecorp.com

You should see your list ready to be read by the workflow.

Common mistake: Missing or incorrect URLs will cause scraping failures later.

Step 2: Configure the “Get rows from Google Sheet” node

In n8n, add a Google Sheets node and select operation: read rows, connect your Google API credentials, and enter the documentId for your Sheet. Specify the sheet name (e.g., “Sheet1” or “gid=0”).

This node retrieves the full company list to start the process.

Step 3: Add a “Loop Over Items” node

Drag the SplitInBatches node (named “Loop Over Items”) to handle one company at a time. Connect the Google Sheets node’s output here.

This ensures each company’s website is processed independently.

Step 4: Set the URL for Scraping

Add a Set node (named “Set company url”) to assign the company website URL into a field named url. Use an expression such as {{ $json.Website }} to dynamically pull the URL for each iteration — n8n expressions are case-sensitive, so the key must match your column header exactly.

Step 5: Configure the ScrapingBee HTTP Request node

Add an HTTP Request node (“ScrapingBee : Scrape company’s homepage data”) and select GET method.

Set the URL to https://app.scrapingbee.com/api/v1 and add two query parameters: api_key with your ScrapingBee API key, and url set to the company URL from the previous node: {{$json.url}}.

This will fetch the raw HTML content of the homepage.
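Outside n8n, the request this node sends can be sketched in Python. This builds the same GET URL the HTTP Request node assembles (the endpoint and parameter names come from the step above; the key value is a placeholder):

```python
from urllib.parse import urlencode

SCRAPINGBEE_ENDPOINT = "https://app.scrapingbee.com/api/v1"

def build_scrape_url(api_key: str, target_url: str) -> str:
    """Build the GET request URL that the HTTP Request node sends to ScrapingBee."""
    # api_key authenticates the request; url is the page to scrape
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{SCRAPINGBEE_ENDPOINT}?{query}"

request_url = build_scrape_url("YOUR_API_KEY", "https://www.acmecorp.com")
```

Issuing a GET against that URL returns the rendered HTML of the target homepage in the response body.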

Step 6: Convert HTML to Markdown

Add the Markdown node “HTML to Markdown” to convert the HTML data (from ScrapingBee) into Markdown format. Set its HTML input to {{$json.data}}; converting to Markdown strips markup and sharply reduces token usage when the content is sent to OpenAI.
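The Markdown node handles the conversion inside n8n. To see why this saves tokens, here is a minimal stdlib sketch that keeps only visible text and drops tags, scripts, and styles (a rough stand-in — a real Markdown converter also preserves headings and links):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only visible text, dropping tags, <script>, and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

page = "<html><head><style>p{color:red}</style></head><body><h1>ACME</h1><p>We make anvils.</p></body></html>"
print(html_to_text(page))  # the style rules are gone; only the two text lines remain
```

The stripped version is a fraction of the original HTML's size, which is exactly the effect you want before the content reaches the GPT model.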

Step 7: Configure the OpenAI Chat Model node

Add the “OpenAI Chat Model” (LangChain GPT model) node, select GPT-4o-mini, and connect your OpenAI API credentials.

Step 8: Set up Structured Output Parser

Add the “Structured Output Parser” LangChain node with a manual schema specifying fields: Business Area, Offers or Product, Value Proposition, Business Model, Ideal Customer Profile, and Additional Information with sub-properties.

This parser organizes the AI response into a structured JSON for easy Google Sheets update.
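A manual schema for these fields could look roughly like the following (shown here as a Python dict; the top-level field names match the ones listed above, while the sub-properties under Additional Information are hypothetical examples, since the workflow only says sub-properties exist):

```python
# JSON Schema sketch for the Structured Output Parser node.
company_schema = {
    "type": "object",
    "properties": {
        "Business Area": {"type": "string"},
        "Offers or Product": {"type": "string"},
        "Value Proposition": {"type": "string"},
        "Business Model": {"type": "string"},
        "Ideal Customer Profile": {"type": "string"},
        "Additional Information": {
            "type": "object",
            "properties": {
                # Hypothetical sub-properties for edge-case diagnostics:
                "data_sufficient": {"type": "boolean"},
                "notes": {"type": "string"},
            },
        },
    },
    "required": [
        "Business Area", "Offers or Product", "Value Proposition",
        "Business Model", "Ideal Customer Profile",
    ],
}
```

Keeping the field names identical to your sheet's column headers makes the final mapping step trivial.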

Step 9: Build AI Agent with Subnodes

Create an “AI Agent” node configured to receive scraped data and extract the specified company details using the OpenAI Chat Model, the Scraping Workflow Call, and the Structured Output Parser. The prompt instructs the AI to give English output and handle edge cases.

Step 10: Update the Company Row in Google Sheets

Add another Google Sheets node “Update Company’s Row on Google Sheet” to write the parsed AI output back into the correct row number using the matching column “row_number”.

Map the AI output fields to Google Sheets columns: ICP, Offer, Business area, Business Model, Value proposition, and Additional information.
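Conceptually, this step matches each AI result back to its source row via row_number and overwrites only the enriched columns. In plain Python the merge logic looks like this (a sketch of the matching behaviour, not the Sheets API call itself):

```python
def merge_enrichment(rows, enriched):
    """Merge AI output into sheet rows, matching on the row_number column."""
    by_row = {item["row_number"]: item for item in enriched}
    for row in rows:
        update = by_row.get(row["row_number"])
        if update:
            # Copy every enriched field except the matching key itself
            row.update({k: v for k, v in update.items() if k != "row_number"})
    return rows

rows = [{"row_number": 2, "Company": "ACME Corp", "ICP": ""}]
enriched = [{"row_number": 2, "ICP": "Mid-market manufacturers"}]
merge_enrichment(rows, enriched)
```

If row_number is missing or duplicated, the match silently fails or hits the wrong row — which is why the troubleshooting section below stresses populating that column.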

Customizations ✏️

  • Extend Scraper to Additional Pages: Change the ScrapingBee URL query in the HTTP Request node to also scrape “About Us” or “Pricing” pages by appending paths. This will give richer data but use extra tokens.
  • Enhanced Data Validation: Add error handling to verify if scraped data is sufficient, then skip or flag rows with insufficient info using the “Additional Information” fields.
  • Integrate a CRM: Replace Google Sheets nodes with a database or CRM integration node to directly enrich your client records.
  • Multi-language Support: Update the AI Agent prompt to accommodate multiple languages and perform translations as needed.
  • Trigger from Form Submissions: Add a Webhook or Form trigger that takes real-time leads and enriches their company data live.

Troubleshooting 🔧

  • Problem: “No data returned from ScrapingBee”
    Cause: Incorrect API key or URL parameter missing.
    Solution: Verify your ScrapingBee API key in the HTTP node. Double-check the URL field uses {{$json.url}} and the company website URLs are valid.
  • Problem: “OpenAI API error: max tokens exceeded”
    Cause: Sent HTML too large or complex.
    Solution: Use the “HTML to Markdown” node to reduce token usage; consider limiting scope of scraping to homepage only.
  • Problem: “Mismatch in row update or no update made”
    Cause: Incorrect matching columns or missing row_number.
    Solution: Make sure your Google Sheet has a row_number column populated and the update node uses this column to match rows exactly.

Pre-Production Checklist ✅

  • Confirm Google Sheets document ID and sheet name match your environment.
  • Ensure all API keys (OpenAI, ScrapingBee, Google) are valid and active.
  • Test scraping on a few sample URLs to verify response content is retrieved correctly.
  • Run AI Agent manually on test data to verify correct structured outputs.
  • Backup your Google Sheet or use a copy before running full automation.

Deployment Guide

Activate your workflow in n8n after final configuration of credentials and test runs. Trigger it manually or set a webhook to start the process. Monitor executions through the n8n UI execution logs to confirm success or diagnose failures.

FAQs

  • Can I use other scraping tools? Yes, but ScrapingBee is recommended for its reliability and API features integrated with this workflow.
  • Does this consume many OpenAI tokens? Token usage depends on the website content size; converting HTML to Markdown reduces token consumption.
  • Is my data secure? Yes, with proper API key management and n8n hosting security, your data remains private.
  • Can it handle hundreds of companies? Yes, but consider running the workflow in batches with the SplitInBatches node to avoid rate limits.
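The SplitInBatches behaviour mentioned in the last answer amounts to chunking the company list so each batch can be processed (and rate-limited) independently; a minimal Python equivalent:

```python
def split_in_batches(items, batch_size):
    """Yield successive fixed-size chunks, like n8n's SplitInBatches node."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

companies = [f"company-{n}" for n in range(7)]
batches = list(split_in_batches(companies, 3))
# Seven companies in batches of 3 -> three batches: sizes 3, 3, and 1
```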

Conclusion

By following this guide, you’ve built a sophisticated automation that enriches your company data effortlessly. You save hours, even days, of manual research each week, improve data accuracy, and gain deeper marketing and sales insights without lifting a finger.

Next steps? Consider automating lead enrichment directly into your CRM, expanding scraping to additional company pages, or integrating notifications to your sales team via Slack or Teams whenever new insights are added.

This workflow transforms tedious research into reliable, automated enrichment — empowering smarter business decisions every day.
