Automate Company Profile Extraction with n8n and OpenAI

This workflow automates extracting business value propositions and classifications directly from company websites using n8n and OpenAI, saving hours of manual research and data entry.
Workflow Identifier: 1457
Nodes in use: Manual Trigger, Google Sheets, Split In Batches, HTTP Request, HTML Extract, Code, OpenAI, Merge, Wait

1. Opening Problem Statement

Meet Lucas, a researcher at a B2B marketing agency, tasked with compiling detailed company profiles from a large list of domains. Manually visiting each website to capture their core value proposition, target audience, industry classification, and market type is a tedious, error-prone task. Lucas wastes hours each week copying text, guessing company focus areas, and inputting data into Google Sheets. This delays campaign launches and frustrates client teams awaiting accurate intel.

This workflow directly addresses Lucas’s pain by extracting, summarizing, and classifying company website data automatically. It turns drudgery into a streamlined, repeatable process with minimal manual oversight.

2. What This Automation Does

When you run this n8n workflow, here is what happens:

  • It reads a list of company domain URLs from a Google Sheets spreadsheet.
  • For each domain, it makes HTTP requests to fetch website HTML content.
  • Extracts the full HTML content from each page.
  • Cleans up extracted HTML to readable text focused on the website body.
  • Sends a prompt to OpenAI to generate key company insights: value proposition, industry, target audience, and whether they operate B2B or B2C.
  • Parses the structured JSON response from OpenAI to extract data fields.
  • Merges generated data back with the original domain information.
  • Updates the original Google Sheet with the new insights in corresponding columns.
  • Waits before processing the next batch to avoid rate limits or overload.

This saves Lucas countless hours of manual web research and data entry, improving accuracy and allowing focus on strategic analysis instead.

3. Prerequisites

  • n8n account (cloud or self-hosted) to run workflows.
  • Google Sheets account with OAuth2 credentials connected in n8n, and access to the sheet containing company domains.
  • OpenAI API key configured as credentials inside n8n.
  • Basic familiarity with n8n nodes like HTTP Request and Code, though this post walks through each step.

4. Step-by-Step Guide

Step 1: Configure the Manual Trigger

Navigate to the When clicking “Execute Workflow” node. This node allows you to start the workflow manually for testing or batch runs.

Click on the node and verify settings are default (no extra parameters). This triggers the full process when you hit the “Execute Workflow” button.

Expected outcome: Ready to start processing when activated.

Step 2: Read Company Domains from Google Sheets

Locate the Read Google Sheets node. Configure it to point to your spreadsheet URL containing the Domain column with website URLs.

Set Sheet Name to the relevant sheet (usually Sheet1). Connect your Google Sheets OAuth2 credentials.

Expected outcome: Pulls list of domains to process.

Common mistake: Incorrect sheet name or missing OAuth credentials causes failure to fetch data.

Step 3: Split the List Into Batches

Open the Split In Batches node, which processes your domain list in chunks so that downstream calls are not overloaded.

The node takes input from the previous Google Sheets node and, with a batch size of 1, outputs one domain at a time.

Expected outcome: Domains processed sequentially in manageable chunks.
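
To make the idea of batching concrete, the sketch below splits an array of rows into fixed-size chunks in plain JavaScript. It is only an illustration, not how the Split In Batches node is implemented; the chunk helper and the sample rows are hypothetical.

```javascript
// Hypothetical helper: split a list of rows into batches of `size` items.
function chunk(rows, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}

// Sample rows shaped like the Google Sheets output (Domain column only).
const rows = [
  { Domain: 'example.com' },
  { Domain: 'example.org' },
  { Domain: 'example.net' },
];
const batches = chunk(rows, 2);
console.log(batches.length); // 2 (two rows in the first batch, one in the second)
```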

Step 4: Fetch Website HTML Content

Select the HTTP Request node. Set the URL property to dynamically use the domain from the current batch item: https://www.{{ $node["Split In Batches"].json["Domain"] }}.

Enable follow redirects in the node options to get the final page content.

Expected outcome: HTML content of the homepage returned.

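Common mistake: A wrong URL template, or Domain values that do not fit it. Because the template already prepends https://www., the Domain column should hold bare domains such as example.com; entries that already include a protocol or www. produce malformed URLs.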
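
If your sheet mixes bare domains with full URLs, one option is to normalize each value before the HTTP Request node, for example in a Code node. The sketch below shows the idea in plain JavaScript; the toHomepageUrl helper is hypothetical, and it assumes every site is reachable under https://www., as the URL template in this step does.

```javascript
// Hypothetical helper: turn a Domain cell into the URL to fetch.
function toHomepageUrl(domain) {
  const bare = String(domain)
    .trim()
    .replace(/^https?:\/\//i, '') // drop an existing protocol
    .replace(/^www\./i, '')       // drop an existing www. prefix
    .replace(/\/.*$/, '');        // drop any path after the host
  if (bare === '') {
    throw new Error('Domain cell is empty');
  }
  return `https://www.${bare.toLowerCase()}`;
}

console.log(toHomepageUrl('example.com'));                     // https://www.example.com
console.log(toHomepageUrl(' HTTP://www.Example.com/about ')); // https://www.example.com
```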

Step 5: Extract HTML Content Using CSS Selector

Open the HTML Extract node, which is configured to extract the content of the html tag using the CSS selector html.

This captures the full page markup for further processing.

Expected outcome: Full HTML of the page stored in data for cleaning.

Step 6: Clean and Reduce Content with Code Node

Go to the Clean Content node, a JavaScript code node that trims whitespace and removes line breaks and excessive spaces from the extracted HTML content.

It also truncates the content to the first 10,000 characters to keep prompts manageable.

Code snippet:

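// Normalize whitespace in the extracted HTML and keep a shortened copy for the prompt.
if ($input.item.json.body) {
  // Collapse line breaks and runs of whitespace into single spaces, then trim both ends.
  $input.item.json.content = $input.item.json.body.replace(/\s+/g, ' ').trim();
  // Keep only the first 10,000 characters so the prompt stays manageable.
  $input.item.json.contentShort = $input.item.json.content.slice(0, 10000);
}
return $input.item;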

Expected outcome: Clean text ready for OpenAI analysis.
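
The snippet above only normalizes whitespace, so tags, scripts, and styles still count toward the 10,000-character limit. As an optional extra step, a sketch like the following strips markup before truncating. The htmlToText helper is hypothetical and not part of the original workflow, and regex-based stripping is crude and will not handle every page.

```javascript
// Hypothetical extra cleaning step: remove markup so more of the character
// budget is spent on visible text.
function htmlToText(html) {
  return String(html)
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ') // script and style blocks
    .replace(/<!--[\s\S]*?-->/g, ' ')                           // HTML comments
    .replace(/<[^>]+>/g, ' ')                                   // remaining tags
    .replace(/\s+/g, ' ')                                       // collapse whitespace
    .trim();
}

const sample =
  '<html><head><style>p{color:red}</style></head><body><h1>Acme</h1>\n' +
  '<p>We build   rockets.</p><script>track()</script></body></html>';
console.log(htmlToText(sample)); // Acme We build rockets.
```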

Step 7: Generate Business Insights Using OpenAI Node

In the OpenAI node, use the prompt that feeds the cleaned website content to OpenAI. The prompt instructs the AI to summarize the company’s value proposition in less than 25 words, identify the industry (choosing from a predefined list), guess the target audience, and determine if the business is B2B or B2C.

Make sure you set the max tokens, temperature, and top P settings as desired for consistent outputs.

Expected outcome: AI returns structured JSON with four fields about the company.
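
For orientation, here is one way such a prompt could be assembled from the cleaned content. It is a sketch, not the workflow's actual prompt: the buildPrompt helper, the wording, and the industry list are all hypothetical. Only the four field names come from this guide.

```javascript
// Hypothetical industry list; replace it with the list used in your own prompt.
const INDUSTRIES = ['Technology', 'Finance', 'Healthcare', 'Retail', 'Other'];

// Build a prompt asking for the four fields as a single JSON object.
function buildPrompt(contentShort, industries = INDUSTRIES) {
  return [
    'This is the text of a company website:',
    '"""',
    contentShort,
    '"""',
    'Respond with one JSON object and nothing else, using these keys:',
    '- value_proposition: the company value proposition in less than 25 words',
    `- industry: one of ${industries.join(', ')}`,
    '- target_audience: your best guess at the target audience',
    '- market: "B2B" or "B2C"',
  ].join('\n');
}

const examplePrompt = buildPrompt('Acme builds rockets for satellite operators.');
console.log(examplePrompt);
```

Asking for a single JSON object with fixed keys makes the parsing step that follows more predictable.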

Step 8: Parse JSON Response Into Usable Fields

Use the Parse JSON code node to extract the properties value_proposition, industry, target_audience, and market from the raw JSON text returned by OpenAI.

Code snippet:

// Parse the OpenAI response once, then copy the four fields onto the item.
const parsed = JSON.parse($input.item.json.text);
$input.item.json.value_proposition = parsed.value_proposition;
$input.item.json.industry = parsed.industry;
$input.item.json.market = parsed.market;
$input.item.json.target_audience = parsed.target_audience;
return $input.item;

Expected outcome: Extracted values are now separate fields within the workflow data.
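
JSON.parse throws if the model adds any text around the JSON, which stops the run for that item. A more defensive variant is sketched below in plain JavaScript. The parseInsights helper and the parse_error field are hypothetical additions, not part of the original workflow.

```javascript
const FIELDS = ['value_proposition', 'industry', 'target_audience', 'market'];

// Hypothetical helper: parse the model's reply without throwing.
function parseInsights(text) {
  const raw = String(text);
  const empty = Object.fromEntries(FIELDS.map((f) => [f, null]));
  // Take everything from the first "{" to the last "}" to skip stray prose.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    return { ...empty, parse_error: 'no JSON object found' };
  }
  try {
    const parsed = JSON.parse(raw.slice(start, end + 1));
    const picked = Object.fromEntries(FIELDS.map((f) => [f, parsed[f] ?? null]));
    return { ...picked, parse_error: null };
  } catch (err) {
    return { ...empty, parse_error: err.message };
  }
}

const reply =
  'Here is the JSON: {"value_proposition":"Rockets for satellites",' +
  '"industry":"Technology","target_audience":"Satellite operators","market":"B2B"}';
const insights = parseInsights(reply);
console.log(insights.market);      // B2B
console.log(insights.parse_error); // null
```

Rows with a non-null parse_error can then be reviewed by hand instead of failing the whole run.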

Step 9: Merge Original and AI Data

The Merge node combines the original domain data and the AI-generated insights into a single item for updating the spreadsheet.

Verify the merge mode is set to merge by position.

Expected outcome: A complete dataset ready to save.
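
Merging by position pairs the first item of one input with the first item of the other, the second with the second, and so on. The sketch below illustrates that pairing in plain JavaScript. The mergeByPosition helper is hypothetical, and dropping unpaired items is an assumption about how the node is configured.

```javascript
// Hypothetical illustration of a position-based merge: item i from the first
// input is combined with item i from the second input.
function mergeByPosition(originalRows, insights) {
  const length = Math.min(originalRows.length, insights.length);
  const merged = [];
  for (let i = 0; i < length; i += 1) {
    merged.push({ ...originalRows[i], ...insights[i] });
  }
  return merged;
}

const merged = mergeByPosition(
  [{ Domain: 'example.com' }, { Domain: 'example.org' }],
  [{ market: 'B2B' }, { market: 'B2C' }],
);
console.log(merged[1]); // { Domain: 'example.org', market: 'B2C' }
```

Because pairing depends only on order, both inputs need the same number of items in the same order. If one branch loses an item, for example after a failed request, the insights that follow attach to the wrong domain.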

Step 10: Update Google Sheets With New Company Data

Configure the Update Google Sheets node to match rows by the Domain column and populate columns Value Proposition, Industry, Target Audience, and Market with the AI data.

Make sure OAuth credentials are properly connected and spreadsheet access is granted.

Expected outcome: Your sheet updates with fresh business insights per domain.

Step 11: Wait Before Next Batch

The Wait node pauses processing for a configurable amount of time (in seconds) between batches to avoid API rate limits or overloads.

Expected outcome: Smooth, error-free batch processing.
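
When choosing a wait time, it helps to estimate how much pause time a full run adds. The arithmetic is sketched below, assuming one pause after every batch. The estimateWaitTime helper and the sample numbers are hypothetical.

```javascript
// Hypothetical estimate of the total pause time the Wait node adds to a run,
// assuming one pause after every batch.
function estimateWaitTime(domainCount, batchSize, waitSeconds) {
  const batches = Math.ceil(domainCount / batchSize);
  const totalSeconds = batches * waitSeconds;
  return { batches, totalSeconds, totalMinutes: totalSeconds / 60 };
}

const estimate = estimateWaitTime(300, 1, 5);
console.log(estimate); // { batches: 300, totalSeconds: 1500, totalMinutes: 25 }
```

This counts only the pauses; time spent on HTTP requests and OpenAI calls comes on top.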

5. Customizations

  • Change Industry List in OpenAI Prompt: Modify the industry set inside the OpenAI prompt to better fit your target sectors (e.g., to add “Technology” or “Nonprofit”).
  • Adjust Wait Time: In the Wait node, increase or decrease pause duration between batches for faster or more compliant runs.
  • Expand Extraction CSS Selector: Tweak the HTML Extract node’s CSS selector from html to a more specific container (e.g., body or #main-content) for cleaner text extraction depending on site structure.
  • Increase Content Length: In the Clean Content code node, increase the slice limit from 10,000 characters if you want more content sent to OpenAI for deeper understanding.
  • Batch Size: Configure the Split In Batches node batch size to balance performance and API cost.

6. Troubleshooting

Problem: “HTTP Request node fails with 404 or timeout”

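Cause: Some domain entries might be incomplete, already include “https://” or “www.” (which the URL template adds again), or redirect unexpectedly.

Solution: Store bare domains such as example.com in the Domain column, or adjust the URL expression in the HTTP Request node to match how your domains are stored. Test with known working URLs first.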

Problem: “OpenAI node returns invalid JSON or no response”

Cause: Prompt formatting issues or API quota exceeded.

Solution: Review the prompt for syntax errors. Check your OpenAI API rate limits and billing. Enable continue on fail to prevent total workflow failure.

Problem: “Google Sheets Update does not reflect changes”

Cause: Incorrect matching column or insufficient permissions.

Solution: Verify the “valueToMatchOn” and “columnToMatchOn” settings exactly match sheet headings. Confirm your OAuth token has write access.

7. Pre-Production Checklist

  • Verify all Google Sheets credentials are authorized and spreadsheet URLs are correct.
  • Test the HTTP Request node with sample domains to ensure fetch success.
  • Run OpenAI node in test mode with a sample website content to check output format.
  • Confirm merge outputs combined data correctly before updating sheets.
  • Run workflow manually with a small batch before full production.
  • Backup your Google Sheet to prevent accidental data loss.

8. Deployment Guide

Once tested, run the workflow on demand with the manual trigger. To run it on a schedule instead, add a scheduling trigger node and activate the workflow with the “Activate” button.

Monitor execution via n8n’s execution logs to catch any failed nodes quickly.

This workflow can be self-hosted using platforms like Hostinger (https://buldrr.com/hostinger) if you prefer full control over API credentials and execution.

9. FAQs

Can I use a different NLP provider instead of OpenAI?

Yes, you can replace the OpenAI node with other NLP or AI services that accept textual prompts and return JSON. Just adjust the prompt format accordingly.

Does this workflow consume many OpenAI API credits?

Each domain processed triggers one OpenAI call, so costs scale linearly with the number of domains. Shortening the prompt and lowering the content limit in the Clean Content node reduce the credits consumed per call.

Is my company data safe in this workflow?

Requests to OpenAI and Google travel over encrypted connections. Beyond that, your data is only as safe as the way you manage your API keys and Google access.

Can I process hundreds of domains at once?

Yes, but adjust batch sizes and wait times to avoid timeouts and rate limits.

10. Conclusion

By following this guide, you’ve automated extracting valuable company profiles from just domain names using n8n, Google Sheets, and OpenAI. You save hours of manual research weekly, gain consistent insights, and update your CRM or marketing databases faster.

Next, consider automations to analyze social media sentiment for these companies or integrate with email campaign tools to target prospect segments intelligently.

Keep experimenting and refining. Automation is about making your work smarter, not harder!
