1. Opening Problem Statement
Meet Sarah, a marketing analyst at a digital agency tasked with building comprehensive social media contact lists for hundreds of client companies. Every week, she spends hours manually visiting company websites, hunting for social media profile links buried in footers or contact pages. The process is not only tedious but also error-prone: she often misses links or collects outdated URLs. The time lost to this repetitive task adds up to dozens of hours monthly, cutting into strategic planning and creative work.
Imagine if Sarah could automate this crawling and extraction, freeing her time while reliably gathering every social media profile in a standardized format. That’s exactly what this n8n workflow accomplishes: it autonomously crawls company websites and extracts social media profile URLs using an AI agent and custom web scraping sub-workflows.
2. What This Automation Does
This workflow, powered by n8n and a LangChain AI agent, performs an automated crawl of the given company websites and outputs unified social media profile data. Here’s what happens when the workflow runs:
- Retrieve company list from a Supabase database containing company names and websites.
- Crawl each website autonomously using an AI agent that leverages a text retrieval tool and a URL retrieval tool to gather all page text and extract all links.
- Identify and extract social media URLs from the gathered data with AI interpreting which extracted URLs correspond to social platforms.
- Parse AI output into structured JSON that organizes platforms and corresponding profile URLs.
- Insert structured data back into a Supabase database for further use in marketing or analytics automation.
- Support dynamic link discovery by crawling additional linked pages from the initial company site to ensure deeper profiling.
This automation saves Sarah up to 20 hours weekly by removing manual website navigation, link searching, and data formatting steps, drastically improving accuracy and consistency in gathered social media data.
3. Prerequisites
- n8n account (self-hosted or cloud)
- Supabase account with a table containing company names and websites
- OpenAI API key for GPT-4 or compatible model
- Basic familiarity with n8n workflows and database credentials setup
4. Step-by-Step Guide
Step 1: Set up Supabase and Import Company Data
Log into your Supabase project and create a table named companies_input with columns for name and website. Populate it with company data you wish to crawl.
Expected outcome: You’ll have a clean list of companies ready to be fetched by n8n. Common mistake: column names that don’t match the node mappings; make sure they match exactly, or update the mappings accordingly.
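If you prefer to seed the table from a script rather than the Supabase UI, a minimal sketch using @supabase/supabase-js (v2) might look like the following; the project URL, key, and sample rows are placeholders:

```ts
// Seed companies_input with a few rows. Table and column names match
// Step 1; SUPABASE_URL and SUPABASE_SERVICE_KEY are placeholders for
// your own project credentials (keep the service key server-side).
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

async function seed() {
  const { error } = await supabase.from("companies_input").insert([
    { name: "Acme Corp", website: "https://acme.example" },
    { name: "Globex", website: "https://globex.example" },
  ]);
  if (error) throw error;
  console.log("Seeded companies_input");
}

seed();
```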
Step 2: Configure Supabase Credentials in n8n
Navigate to n8n credentials settings and add your Supabase credentials for secure API access.
You should see a successful connection when testing the credentials; incorrect credentials will block data retrieval in every later step.
Step 3: Initialize Manual Trigger Node
Start your workflow with the Manual Trigger node to test and run the automation on demand.
This allows controlled execution during setup and debugging.
Step 4: Add “Get companies” Supabase Node
Select the Supabase node, set the operation to getAll, and connect it to the Manual Trigger. Configure it to fetch all records from companies_input.
Expected outcome: You get the full list of companies to process. Watch for API rate limits or an empty-table result.
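For reference, each item the node emits should look roughly like this (assuming the companies_input table from Step 1, plus Supabase’s auto-generated id):

```ts
// Expected shape of one "Get companies" output item.
type CompanyRow = {
  id: number;      // auto-generated by Supabase
  name: string;    // e.g. "Acme Corp"
  website: string; // e.g. "https://acme.example"
};
```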
Step 5: Extract and Map Company Name and Website
Use a Set node to keep only name and website fields. This declutters data and prepares for crawling.
Incorrect field names or typos here can cause silent data loss further down the pipeline.
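If you prefer a Code node over a Set node, a minimal equivalent in n8n’s JavaScript Code node (run once for all items) could be:

```ts
// Keep only the name and website fields from each incoming item.
// $input is provided by n8n inside Code nodes; the field names assume
// the companies_input columns from Step 1.
return $input.all().map((item) => ({
  json: {
    name: item.json.name,
    website: item.json.website,
  },
}));
```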
Step 6: Use LangChain AI Agent Node “Crawl website”
This central node runs the autonomous crawler. Its parameters include a prompt instructing the AI to extract social media links starting from the company’s website URL.
The AI agent uses two embedded tools:
- Text retrieval tool: Fetches all readable page text by making HTTP requests to the website and converting HTML to Markdown.
- URL retrieval tool: Scrapes all anchor links from the page, removes duplicates, filters out invalid URLs, and resolves relative paths into absolute URLs with protocols.
The agent calls these tools iteratively to explore the site’s pages and collect social profile links; the sketch below shows the kind of link extraction the URL tool performs.
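Here is a standalone TypeScript sketch of the same idea: fetch a page, pull out anchor hrefs, resolve them to absolute URLs, and de-duplicate. The regex stands in for a real HTML parser, which keeps the sketch short but makes it brittle on unusual markup:

```ts
// Extract all absolute http(s) links from a page (Node 18+, built-in fetch).
async function extractLinks(pageUrl: string): Promise<string[]> {
  const res = await fetch(pageUrl, {
    headers: { "User-Agent": "Mozilla/5.0 (compatible; n8n-crawler)" },
  });
  const html = await res.text();

  const links = new Set<string>();
  for (const match of html.matchAll(/<a\s[^>]*href="([^"#]+)"/gi)) {
    try {
      // new URL() resolves relative hrefs against the page URL and
      // throws on malformed ones, which we simply skip.
      const url = new URL(match[1], pageUrl);
      if (url.protocol === "http:" || url.protocol === "https:") {
        links.add(url.href);
      }
    } catch {
      // invalid URL, ignore
    }
  }
  return [...links];
}
```

From that list, the AI decides which links correspond to social platforms, as described in the overview.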
Step 7: Parse AI Output into Structured JSON
Connect the agent’s output to the JSON Parser node. This node enforces a strict schema that expects platform names mapped to arrays of URLs, structuring the data uniformly.
This allows easy insertion and querying of the results later.
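One way to express the strict schema, shown here as a TypeScript type with a conforming example (the exact platform keys are an assumption; match them to whatever your prompt asks the agent to return):

```ts
// Platform names mapped to arrays of profile URLs.
type SocialProfiles = {
  linkedin: string[];
  facebook: string[];
  instagram: string[];
  x: string[]; // formerly Twitter
  youtube: string[];
};

// A conforming agent output for one company:
const example: SocialProfiles = {
  linkedin: ["https://www.linkedin.com/company/acme"],
  facebook: [],
  instagram: ["https://www.instagram.com/acme"],
  x: ["https://x.com/acme"],
  youtube: [],
};
```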
Step 8: Consolidate Data and Prepare for Database Insertion
Use Set and Merge nodes to combine the company name, website, and extracted social media JSON into one object.
This unified data object can then be inserted into your output Supabase table efficiently.
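The consolidated object might look like this (field names are assumptions; align them with your companies_output columns):

```ts
// One row ready for insertion: company identity plus parsed socials.
const row = {
  name: "Acme Corp",
  website: "https://acme.example",
  socials: {
    linkedin: ["https://www.linkedin.com/company/acme"],
    x: ["https://x.com/acme"],
  },
};
```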
Step 9: Insert Results into Supabase Output Table
Add a Supabase node configured to insert data into companies_output. Map the combined data to the table fields.
Verify data appears correctly in Supabase, ready for downstream use.
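The n8n Supabase node performs this insert for you, but for reference the equivalent call in @supabase/supabase-js looks like this, assuming companies_output has text columns name and website plus a jsonb column socials:

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

// Insert one consolidated row (shape sketched in Step 8) into companies_output.
async function saveResult(row: { name: string; website: string; socials: unknown }) {
  const { error } = await supabase.from("companies_output").insert([row]);
  if (error) throw error;
}
```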
Step 10: Run and Monitor Workflow
Trigger the manual node and watch execution logs to ensure each step runs successfully. Adjust settings if errors related to URLs, HTTP requests, or API calls occur.
5. Customizations
- Change extracted data type: Modify the AI prompt in the “Crawl website” node to extract emails or contact details instead of social media URLs.
- Add proxy support: The HTTP Request nodes inside the embedded tool workflows support a proxy option; configure one to improve crawl success on websites that restrict automated traffic (see the sketch after this list).
- Expand site crawl depth: Adjust the embedded crawler logic to follow additional layers of links for deeper data capture.
- Use alternate database: Replace Supabase nodes with MySQL/PostgreSQL for data input/output.
- Output format adjustment: Modify the JSON parser schema to include additional metadata fields like profile descriptions.
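On the proxy customization above: the HTTP Request node exposes a proxy setting in its options, but if your tool logic lives in a Code node instead, a minimal sketch using undici’s ProxyAgent (the proxy URL is a placeholder) would be:

```ts
// Route a request through an HTTP proxy with undici.
import { fetch, ProxyAgent } from "undici";

const res = await fetch("https://example.com", {
  dispatcher: new ProxyAgent("http://user:pass@proxy.example:8080"),
});
console.log(res.status);
```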
6. Troubleshooting
Problem: “Supabase node returns empty data or authentication errors.”
Cause: Invalid or expired Supabase credentials.
Solution: Re-check credentials, regenerate API keys, and update in n8n credentials.
Problem: “AI agent fails to extract social media URLs or returns incomplete data.”
Cause: An overly strict prompt, or a site that blocks automated crawlers (bot protection, JavaScript-rendered links).
Solution: Relax AI prompt constraints, confirm site accessibility, or add proxy configuration in HTTP nodes.
Problem: “HTTP Request nodes timeout or fail on certain websites.”
Cause: Sites blocking requests or slow response.
Solution: Enable retries (Retry On Fail in the node settings), increase the timeout, or route requests through proxy servers; a minimal retry sketch follows below.
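For the timeout case, this is the behavior you are configuring, shown as a standalone sketch (Node 18+ for built-in fetch and AbortSignal.timeout): abort slow requests and back off between attempts.

```ts
// Fetch a page with a per-attempt timeout and exponential backoff.
async function fetchWithRetry(url: string, attempts = 3, timeoutMs = 15_000): Promise<string> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      lastError = err;
      // Back off 1s, 2s, 4s, ... before the next attempt.
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, 1000 * 2 ** i));
    }
  }
  throw lastError;
}
```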
7. Pre-Production Checklist
- Verify valid Supabase credentials and table existence (companies_input, companies_output).
- Confirm OpenAI API key is active and has quota left.
- Test URL accessibility via HTTP Request nodes independently.
- Check prompt clarity by testing it manually in the OpenAI Playground if needed.
- Perform test runs with a small subset of companies.
8. Deployment Guide
Once tested, replace the Manual Trigger with a Schedule Trigger node (e.g., a weekly run) and activate the workflow. Use n8n’s execution logs to monitor runs, and set up notifications on failures.
For higher scale, consider self-hosting n8n (for example on a VPS from a provider such as Hostinger) for better performance and control over security.
9. FAQs
Q: Can I extract other data besides social media links?
A: Yes! By editing the AI prompt and parser schema, you can customize the crawler to extract emails, phone numbers, or company summaries.
Q: Does this use many OpenAI API credits?
A: The agent makes multiple model calls per site (each tool invocation plus its reasoning steps), so crawling large datasets scales cost roughly linearly with the number of sites and pages visited.
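For a rough estimate: if the agent averages, say, five model calls per site at around 2,000 tokens each, 100 sites is on the order of a million tokens per run; multiply by your model’s current per-token pricing to budget accordingly.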
Q: Is extracted data securely handled?
A: Supabase handles data storage securely; ensure your API keys and n8n instances are well protected.
10. Conclusion
You just built an autonomous AI-powered social media crawler that efficiently extracts social media profile links from company websites and stores them systematically in a database.
This saves you up to 20 hours weekly and enhances data accuracy and business insights for marketing teams.
Next steps? Consider automations to enrich data analytics dashboards, trigger marketing outreach campaigns, or extract other contact data types using similar AI-driven workflows.
Get started with confidence, knowing n8n and LangChain give you powerful flexibility to customize and scale your crawlers with ease.