The Problem
Meet Anna, a marketing analyst who spends countless hours each week manually visiting company websites to extract their social media profiles. She combs through webpages, clicks links, and copies URLs, a tedious process prone to mistakes and delays. With hundreds of companies to analyze, Anna often finds herself overwhelmed, losing 10+ hours weekly that could be better spent on campaign strategy than on data gathering.
This is exactly the challenge this n8n workflow solves: it combines AI-powered crawling with smart URL extraction to retrieve social media links autonomously, quickly, and accurately.
What This Automation Does
When you run this workflow, here’s what happens step-by-step:
- Fetches company names and websites from a Supabase database table as input.
- Crawls each company website using an OpenAI-powered AI crawler agent configured with tools to retrieve all page text and URLs.
- Extracts social media profile links from the website content and links, normalizing and filtering the URLs.
- Formats the extracted data into a unified JSON structure listing social media platforms and their URLs (see the sketch after this list).
- Saves the collected social media information back into a Supabase database table for further use or analytics.
- Handles URL and HTML content processing with built-in nodes like HTTP Request, HTML extraction, and Markdown conversion to ensure reliable data parsing and cleanup.
Overall, this workflow saves many hours of manual data gathering, improves accuracy, and can scale effortlessly to process hundreds or thousands of company profiles.
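For reference, here is a minimal sketch of the unified structure this guide assumes the workflow produces for each company. The field names are illustrative, not fixed by n8n; match them to whatever your prompt and tables actually use:

```typescript
// Illustrative shape of the unified JSON assembled per company (assumed field names).
interface SocialProfile {
  platform: string; // e.g. "linkedin", "instagram", "x"
  url: string;      // normalized absolute profile URL
}

// Example of what one company's extracted profiles might look like:
const example: SocialProfile[] = [
  { platform: "linkedin", url: "https://www.linkedin.com/company/acme" },
  { platform: "instagram", url: "https://www.instagram.com/acme" },
];
```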
Prerequisites ⚙️
- n8n account (cloud or self-hosted) 🔌
- Supabase account with two tables: companies_input and companies_output 🔐
- OpenAI API key configured in n8n for the GPT-4o model 🔑
- Basic knowledge of n8n workflow creation and credentials setup
- Optional: Proxy service for better web crawling performance
Step-by-Step Guide
1. Set up Supabase tables for input and output
Create two tables – companies_input with fields for company name and website URL, and companies_output to store extracted social media profiles.
Ensure your API credentials for Supabase are ready to connect in n8n.
Common mistake: if the field names for company name and website in your tables don't exactly match the mappings used in later nodes, data won't flow correctly.
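As a reference, here is a minimal sketch of the row shapes this guide assumes for the two tables. The column names are illustrative; what matters is that the names you choose match the field mappings in the Set and Insert nodes later on:

```typescript
// Assumed row shapes for the two Supabase tables (names are illustrative).

// companies_input: the list of companies to crawl.
interface CompanyInputRow {
  id: number;
  name: string;    // company name
  website: string; // company website URL
}

// companies_output: where extracted profiles are written back.
interface CompanyOutputRow {
  id?: number;     // let Supabase generate the primary key
  name: string;
  website: string;
  socialmedias: { platform: string; url: string }[]; // stored as a JSON column
}
```

If you create the tables in the Supabase SQL editor, a JSON/JSONB column works well for the social media field, since Supabase runs on PostgreSQL.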
2. Add a Manual Trigger Node
In the n8n editor, click Add Node → Core Nodes → Manual Trigger. This lets you start the workflow manually.
You should see a manual trigger node on the canvas ready to connect.
3. Configure Supabase “Get All” Node to fetch companies
Click Add Node → Supabase → Get All. Set table to companies_input.
Connect Manual Trigger output to this node.
This fetches all companies from your database to process.
4. Use Set Node to select only company name and website
Add a Set node and configure it to include only name and website fields. Connect output of Supabase “Get All” node here.
This helps focus the workflow on necessary inputs only.
5. Add the LangChain AI agent node named “Crawl website”
This is the core node where OpenAI GPT-4o is configured to act as an AI crawling agent.
Set the prompt to instruct the agent to crawl each company's website and extract its social media profile URLs.
The agent uses two helper tool workflows inside: “Text” to get all page text and “URLs” to get all links on the site.
Ensure “retryOnFail” is enabled to handle transient errors.
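The exact prompt wording is up to you; the sketch below shows the kind of instruction the agent could be given. The wording is only a starting point, and the `{{ $json.website }}` / `{{ $json.name }}` placeholders assume the standard n8n expression syntax resolved per item:

```typescript
// A possible prompt for the "Crawl website" agent (illustrative wording only).
const crawlPrompt = `
You are a web crawler. Visit the website {{ $json.website }} of the company "{{ $json.name }}".
Use the "Text" tool to read page content and the "URLs" tool to list links on the site.
Return ONLY the social media profile URLs you find (LinkedIn, X/Twitter, Facebook,
Instagram, YouTube, TikTok) as a JSON array of { "platform": ..., "url": ... } objects.
If no profiles are found, return an empty array.
`;
```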
6. Implement the Text tool workflow
This workflow fetches the entire HTML content from a URL and converts it to markdown text to be analyzed.
The steps: set the domain, add the protocol if it is missing, fetch the HTML with an HTTP Request node, then convert it to clean text with a Markdown node.
The resulting markdown is returned to the AI agent as the tool's output.
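In code terms, the Text tool does roughly the following. This is only a sketch of the same logic; the helper name `htmlToMarkdown` stands in for the Markdown node and is not an actual n8n function:

```typescript
// Rough sketch of the Text tool's logic (illustrative, not the real node chain).
async function getPageText(domain: string): Promise<string> {
  // 1. Add a protocol if the stored website lacks one.
  const url = /^https?:\/\//i.test(domain) ? domain : `https://${domain}`;

  // 2. Fetch the raw HTML (done by the HTTP Request node in the real workflow).
  const response = await fetch(url, {
    headers: { "User-Agent": "Mozilla/5.0 (compatible; n8n-crawler)" },
  });
  const html = await response.text();

  // 3. Convert HTML to markdown so the agent receives clean, readable text
  //    (handled by the Markdown node inside n8n).
  return htmlToMarkdown(html);
}

// Placeholder standing in for the Markdown node's conversion.
declare function htmlToMarkdown(html: string): string;
```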
7. Implement the URLs tool workflow
This workflow scrapes the webpage for anchor tag URLs using an HTML extraction node, then splits out the URLs to individual items.
It removes duplicates, empty hrefs, invalid URLs, and aggregates cleaned URL data.
This data is then fed back to the AI agent’s crawling logic for deeper exploration.
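In code terms, the cleanup the URLs tool performs looks roughly like this. It is a sketch of the filtering logic only, not the actual node configuration:

```typescript
// Sketch of the URL cleanup performed by the URLs tool (illustrative).
function cleanUrls(hrefs: string[], baseUrl: string): string[] {
  const seen = new Set<string>();
  const cleaned: string[] = [];

  for (const href of hrefs) {
    // Drop empty hrefs and non-page links such as anchors and mailto:.
    if (!href || href.startsWith("#") || href.startsWith("mailto:")) continue;

    let absolute: string;
    try {
      // Resolve relative links against the page URL; invalid URLs throw and are skipped.
      absolute = new URL(href, baseUrl).toString();
    } catch {
      continue;
    }

    // Keep each URL only once.
    if (!seen.has(absolute)) {
      seen.add(absolute);
      cleaned.push(absolute);
    }
  }
  return cleaned;
}
```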
8. Parse the AI agent’s JSON output
Use the LangChain JSON Parser node with a defined JSON schema matching expected social media platform and URL arrays.
The parsed result is assigned to an array field for further downstream processing.
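A minimal example of a schema that would match the output shape described earlier is shown below. The property names are assumptions; align them with whatever your prompt asks the model to return:

```typescript
// Example JSON Schema for the parser node (illustrative property names).
const socialMediaSchema = {
  type: "object",
  properties: {
    socialmedias: {
      type: "array",
      items: {
        type: "object",
        properties: {
          platform: { type: "string" },
          url: { type: "string" },
        },
        required: ["platform", "url"],
      },
    },
  },
  required: ["socialmedias"],
};
```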
9. Use Set node to assign social media array and merge data
Map the parsed social media data into a new field, then merge it with the input company data using a Merge node in combine mode, matching items by position.
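Combining by position simply pairs the first parsed result with the first input company, the second with the second, and so on. Conceptually (illustrative field names again):

```typescript
// Conceptual view of "combine by position": item i of each branch is merged into one item.
const companies = [{ name: "Acme", website: "acme.example.com" }];
const parsed = [
  { socialmedias: [{ platform: "linkedin", url: "https://www.linkedin.com/company/acme" }] },
];

const merged = companies.map((company, i) => ({ ...company, ...parsed[i] }));
// merged[0] now holds name, website, and socialmedias together, ready for the insert step.
```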
10. Save data back to Supabase
Add a Supabase Insert node configured to insert or update rows in the companies_output table.
This commits the extracted social media data for later review or automation tasks.
11. Activate and test workflow
Once all nodes are configured with credentials, run a manual test by triggering the workflow.
You should see social media links extracted and saved in your Supabase output table.
Common mistakes: Incorrect API keys, missing URL protocols, or malformed JSON parser schema can cause failures.
Customizations ✏️
- Change extracted data type: Modify the AI prompt in the “Crawl website” node to extract contact emails or phone numbers instead of social profiles.
- Use different databases: Replace Supabase nodes with Airtable, Google Sheets, or MySQL nodes as preferred database sources.
- Add proxy support: In the HTTP Request nodes inside the Text and URLs tool workflows, configure proxy settings for scraping hard-to-access sites.
- Expand to multi-page crawls: Extend the URLs tool workflow to recursively visit linked pages within a domain for richer data extraction (a rough sketch follows this list).
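For the multi-page idea, a bounded same-domain crawl could look roughly like this. It is a sketch under assumed constraints (a small page limit, naive regex link extraction); in practice you would build it from additional HTTP Request and HTML nodes or an n8n Code node:

```typescript
// Rough sketch of a bounded same-domain crawl (illustrative only).
async function crawlSameDomain(startUrl: string, maxPages = 5): Promise<string[]> {
  const visited = new Set<string>();
  const queue = [startUrl];
  const pages: string[] = [];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    // Fetch the page and keep its HTML for later extraction.
    const html = await (await fetch(url)).text();
    pages.push(html);

    // Naively extract href values and enqueue only same-domain links.
    const links = Array.from(html.matchAll(/href="([^"]+)"/g), (m) => m[1]);
    for (const link of links) {
      try {
        const next = new URL(link, url);
        if (next.hostname === new URL(startUrl).hostname) queue.push(next.toString());
      } catch {
        /* ignore invalid URLs */
      }
    }
  }
  return pages;
}
```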
Troubleshooting 🔧
Problem: “HTTP request fails with 403 or timeouts”
Cause: The target website blocks automated scraping, or the User-Agent header is missing.
Solution: Add a browser-like User-Agent header or proxy settings under the HTTP Request node's options, then retest.
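For example, a browser-like header set you could add under the HTTP Request node's header options might look like this. The values are illustrative; any mainstream browser User-Agent string works:

```typescript
// Illustrative browser-like headers to reduce 403s on scraping-averse sites.
const browserHeaders = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  "Accept": "text/html,application/xhtml+xml",
  "Accept-Language": "en-US,en;q=0.9",
};
```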
Problem: “AI agent returns malformed JSON”
Cause: Prompt or schema mismatch causing parsing errors.
Solution: Verify JSON schema in the JSON Parser node matches the AI output format exactly. Simplify prompts as needed.
Problem: “Supabase insert node throws error on missing fields”
Cause: Input data does not match target table schema.
Solution: Double-check field mappings in the Set and Insert nodes to ensure all required fields exist.
Pre-Production Checklist ✅
- Validate API credentials for OpenAI and Supabase are set up correctly.
- Test HTTP requests inside tool workflows manually with sample URLs.
- Run the AI agent node in isolation using a single input to verify expected output JSON format.
- Review and confirm JSON Parser schema matches anticipated social media JSON structure.
- Create backups of Supabase tables before bulk inserts.
Deployment Guide
After final testing, toggle the workflow to Active. If you want scheduled runs instead of manual ones, swap the Manual Trigger for a Schedule Trigger node.
Use n8n’s built-in execution logs and error handling features to monitor success and exceptions.
Schedule periodic runs, or trigger the workflow whenever a company record is updated, depending on your use case.
FAQs
Can I modify this workflow to extract other data than social media links?
Yes, you can change the AI agent’s prompt and the JSON output schema in the LangChain JSON Parser node to extract different structured information like emails, phone numbers, or company summaries.
Does this workflow consume OpenAI API credits rapidly?
The AI crawling process involves multiple API calls per website, so plan accordingly. The GPT-4o model gives high-quality responses but also consumes more credits than smaller models.
Is my data secure during this automation?
Data moves directly between your n8n instance, Supabase, and the OpenAI API over HTTPS. Keep your API keys private and make sure your n8n instance is hosted securely.
Can this workflow scale to hundreds of companies?
Yes, given proper API limits and database handling, this workflow can be expanded to batch process large volumes of websites with minimal manual intervention.
Conclusion
By building this n8n workflow, you have created an autonomous AI-powered crawler that smartly navigates company websites to extract social media profile links. This automation replaces tedious manual scraping, saving you hours weekly and improving data accuracy for marketing or research purposes.
Next steps to enhance this workflow include adding multi-page recursive crawling, integrating contact info extraction, or connecting results to analytics dashboards. Dive in and make web data extraction effortless with n8n and OpenAI!