Autonomous AI Social Media Crawler with n8n and LangChain

Discover how to automatically crawl company websites to extract social media profile links using n8n, a LangChain AI agent, and Supabase. This workflow saves hours of manual research and organizes data systematically.
Workflow Identifier: 2064
Nodes in use: manualTrigger, supabase, set, agent, lmChatOpenAi, outputParserStructured, httpRequest, html, splitOut, removeDuplicates, aggregate, markdown, merge


1. Opening Problem Statement

Meet Sarah, a marketing analyst at a digital agency tasked with building comprehensive social media contact lists for hundreds of client companies. Every week, she spends countless hours manually visiting company websites, hunting for social media profile links buried in footers or contact pages. This manual process is not only tedious but prone to errors—she often misses critical links or collects outdated URLs. Time wasted on this repetitive task adds up to dozens of hours monthly, cutting into strategic planning and creative work.

Imagine if Sarah could automate this crawl and data extraction process, freeing her time and reliably gathering all social media profiles in a standardized format. That’s exactly what this unique n8n workflow accomplishes by autonomously crawling company websites and extracting all social media profile URLs using advanced AI and custom web scraping workflows.

2. What This Automation Does

This workflow, powered by n8n and LangChain AI agent, performs an automated crawl of given company websites and outputs unified social media profile data. Here’s what happens when the workflow runs:

  • Retrieve company list from a Supabase database containing company names and websites.
  • Crawl each website autonomously using an AI agent that leverages a text retrieval tool and a URL retrieval tool to gather all page text and extract all links.
  • Identify and extract social media URLs from the gathered data with AI interpreting which extracted URLs correspond to social platforms.
  • Parse AI output into structured JSON that organizes platforms and corresponding profile URLs (the data shapes are sketched just after this list).
  • Insert structured data back into a Supabase database for further use in marketing or analytics automation.
  • Support dynamic link discovery by crawling additional linked pages from the initial company site to ensure deeper profiling.
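
To make the flow concrete, here is a minimal sketch of the data shapes involved, in TypeScript. The field and platform names are illustrative assumptions rather than the workflow's exact schema:

```typescript
// Sketch of the data shapes flowing through the workflow.
// Field and platform names are illustrative assumptions, not the
// exact schema of the original workflow.

// Row fetched from the companies_input table in Supabase.
interface CompanyInput {
  name: string;
  website: string; // e.g. "https://example.com"
}

// Structured JSON the AI agent is asked to produce per company:
// platform names mapped to one or more profile URLs.
interface SocialProfiles {
  [platform: string]: string[]; // e.g. { linkedin: ["https://linkedin.com/company/acme"] }
}

// Row inserted into the companies_output table.
interface CompanyOutput extends CompanyInput {
  socials: SocialProfiles;
}
```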

This automation saves Sarah up to 20 hours weekly by removing manual website navigation, link searching, and data formatting steps, drastically improving accuracy and consistency in gathered social media data.

3. Prerequisites 

  • n8n account (self-hosted or cloud) 
  • Supabase account with a table containing company names and websites 
  • OpenAI API key for GPT-4 or compatible model 
  • Basic familiarity with n8n workflows and database credentials setup 

4. Step-by-Step Guide

Step 1: Set up Supabase and Import Company Data

Log into your Supabase project and create a table named companies_input with columns for name and website. Populate it with company data you wish to crawl.

Expected outcome: You’ll have a clean list of companies ready to be fetched by n8n. Common mistake: column names that don’t match the node mappings exactly; either rename the columns or update the mappings.
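
If you prefer to seed the table from code rather than the Supabase dashboard, a minimal sketch with the Supabase JS client might look like this (the project URL, key handling, and sample rows are placeholders):

```typescript
// Seed companies_input using the Supabase JS client.
// Table and column names follow this guide; URL and keys are placeholders.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  "https://your-project.supabase.co",  // placeholder project URL
  process.env.SUPABASE_SERVICE_KEY!    // keep keys out of source code
);

async function seedCompanies(): Promise<void> {
  const { error } = await supabase.from("companies_input").insert([
    { name: "Acme Corp", website: "https://acme.example.com" },
    { name: "Globex", website: "https://globex.example.com" },
  ]);
  if (error) throw error;
}

seedCompanies().catch(console.error);
```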

Step 2: Configure Supabase Credentials in n8n

Navigate to n8n credentials settings and add your Supabase credentials for secure API access.

You should see a successful connection when testing the credentials. Incorrect credentials will block data retrieval.

Step 3: Initialize Manual Trigger Node

Start your workflow with the Manual Trigger node to test and run the automation on demand.

This allows controlled execution during setup and debugging.

Step 4: Add “Get companies” Supabase Node

Select the Supabase node, set operation to getAll, and connect it to the manual trigger. Configure it to fetch all records from companies_input.

Expected outcome: You get the full list of companies to process. Watch for API limits or empty table errors.

Step 5: Extract and Map Company Name and Website

Use a Set node to keep only the name and website fields. This declutters the data and prepares it for crawling.

Incorrect field names or typos here can cause data loss down the pipeline.
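
For reference, the same mapping expressed as an n8n Code node ("Run Once for All Items" mode) would look roughly like this; $input is provided by the Code node runtime, and Code nodes execute JavaScript, so this works as-is:

```typescript
// Equivalent of the Set node as an n8n Code node: keep only the
// name and website fields from each incoming item.
return $input.all().map((item) => ({
  json: {
    name: item.json.name,
    website: item.json.website,
  },
}));
```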

Step 6: Use LangChain AI Agent Node “Crawl website”

This central node runs the autonomous crawler. Parameters include the prompt asking the AI to extract social media links from the website URL.

The AI agent uses two embedded tools:

  • Text retrieval tool: Fetches all readable page text by making HTTP requests to the website and converting HTML to Markdown.
  • URL retrieval tool: Scrapes all anchor links from the website, cleans duplicates, filters invalid URLs, and sets full URL paths with protocols.

The agent iteratively uses these tools to explore site pages and collect social profile links.
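
As a rough illustration of what the URL retrieval tool does, here is a standalone sketch of anchor extraction and normalization; the actual tool is built from HTTP Request, HTML, Split Out, and Remove Duplicates nodes rather than custom code:

```typescript
// Sketch of the URL retrieval tool's logic: pull anchor hrefs from raw
// HTML, resolve them against the page URL, drop non-http(s) schemes,
// and de-duplicate. Not the exact node configuration in the workflow.
function extractLinks(html: string, pageUrl: string): string[] {
  const hrefs = [...html.matchAll(/<a[^>]+href=["']([^"']+)["']/gi)]
    .map((m) => m[1]);

  const resolved = new Set<string>();
  for (const href of hrefs) {
    try {
      const url = new URL(href, pageUrl); // resolves relative paths against the page URL
      if (url.protocol === "http:" || url.protocol === "https:") {
        resolved.add(url.href); // the Set removes duplicate links
      }
    } catch {
      // malformed hrefs throw here; mailto:/tel:/javascript: links
      // fail the protocol check above instead
    }
  }
  return [...resolved];
}
```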

Step 7: Parse AI Output into Structured JSON

Connect the agent’s output to the Structured Output Parser node. This node applies a strict schema, expecting platform names and arrays of URLs, to structure the data uniformly.

This allows easy insertion and querying of the results later.
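
A schema along these lines would satisfy the parser's expectation of platform names mapped to arrays of URLs; the platform keys here are assumptions, so adjust them to match your workflow:

```typescript
// Illustrative JSON schema for the Structured Output Parser node.
// The platform keys are assumptions; the workflow's exact schema may differ.
const socialProfilesSchema = {
  type: "object",
  properties: {
    linkedin:  { type: "array", items: { type: "string" } },
    twitter:   { type: "array", items: { type: "string" } },
    facebook:  { type: "array", items: { type: "string" } },
    instagram: { type: "array", items: { type: "string" } },
    youtube:   { type: "array", items: { type: "string" } },
  },
  additionalProperties: false,
} as const;
```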

Step 8: Consolidate Data and Prepare for Database Insertion

Use Set and Merge nodes to combine the company name, website, and extracted social media JSON into one object.

This unified data object can then be inserted into your output Supabase table efficiently.
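
The unified object might look like this (column names are assumptions; match them to your companies_output table):

```typescript
// Example of the merged record after the Set and Merge nodes,
// ready for insertion into companies_output.
const mergedRecord = {
  name: "Acme Corp",
  website: "https://acme.example.com",
  socials: {
    linkedin: ["https://www.linkedin.com/company/acme"],
    twitter: ["https://twitter.com/acme"],
  },
};
```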

Step 9: Insert Results into Supabase Output Table

Add a Supabase node configured to insert data into companies_output. Map the combined data to the table fields.

Verify data appears correctly in Supabase, ready for downstream use.

Step 10: Run and Monitor Workflow

Trigger the manual node and watch execution logs to ensure each step runs successfully. Adjust settings if errors related to URLs, HTTP requests, or API calls occur.

5. Customizations 

  • Change extracted data type: Modify the AI prompt in the “Crawl website” node to extract emails or contact details instead of social media URLs.
  • Add proxy support: In HTTP Request nodes inside the embedded tool workflows, configure proxies to improve crawling accuracy for restricted websites.
  • Expand site crawl depth: Adjust the embedded crawler logic to follow additional layers of links for deeper data capture (see the sketch after this list).
  • Use alternate database: Replace Supabase nodes with MySQL/PostgreSQL for data input/output.
  • Output format adjustment: Modify the JSON parser schema to include additional metadata fields like profile descriptions.
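
For the crawl-depth customization, a breadth-first walk like the following captures the idea; it reuses the extractLinks helper sketched in Step 6 and stays on the starting host:

```typescript
// Sketch of deeper crawling: a breadth-first walk that stays on the
// same host and stops at a fixed depth. Depends on extractLinks from
// the Step 6 sketch; requires Node 18+ for the built-in fetch.
async function crawl(startUrl: string, maxDepth = 2): Promise<string[]> {
  const seen = new Set<string>([startUrl]);
  let frontier = [startUrl];

  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const html = await (await fetch(url)).text();
      for (const link of extractLinks(html, url)) {
        // follow only same-host links that haven't been visited yet
        if (new URL(link).host === new URL(startUrl).host && !seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...seen];
}
```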

6. Troubleshooting 

Problem: “Supabase node returns empty data or authentication errors.”
Cause: Invalid or expired Supabase credentials.
Solution: Re-check credentials, regenerate API keys, and update in n8n credentials.

Problem: “AI agent fails to extract social media URLs or returns incomplete data.”
Cause: Overly strict prompt or website design blocking crawler.
Solution: Relax AI prompt constraints, confirm site accessibility, or add proxy configuration in HTTP nodes.

Problem: “HTTP Request nodes timeout or fail on certain websites.”
Cause: Sites blocking requests or slow response.
Solution: Add retries, increase timeout settings, or use proxy servers.
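
The HTTP Request node exposes retry and timeout options in its settings; for reference, the behavior those options provide is roughly the following:

```typescript
// Standalone sketch of retry-with-backoff and a timeout, equivalent to
// the HTTP Request node's retry/timeout settings. AbortSignal.timeout
// requires Node 17.3+ or a modern browser.
async function fetchWithRetry(url: string, retries = 3, timeoutMs = 10_000): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (attempt === retries) throw err;
      await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt)); // exponential backoff
    }
  }
  throw new Error("unreachable");
}
```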

7. Pre-Production Checklist 

  • Verify valid Supabase credentials and table existence (companies_input, companies_output).
  • Confirm OpenAI API key is active and has quota left.
  • Test URL accessibility via HTTP Request nodes independently.
  • Check AI prompt clarity by manual prompt tests in OpenAI playground if needed.
  • Perform test runs with a small subset of companies.

8. Deployment Guide

Once tested, replace the Manual Trigger with a Schedule Trigger node set to a periodic schedule (e.g., weekly) and activate the workflow. Use n8n’s execution logs to monitor runs and set up notifications on failures.

For higher scale, consider running n8n in self-hosted mode via services like Hostinger for better performance and security.

9. FAQs

Q: Can I extract other data besides social media links?
A: Yes! By editing the AI prompt and parser schema, you can customize the crawler to extract emails, phone numbers, or company summaries.

Q: Does this use many OpenAI API credits?
A: The agent makes multiple model calls per site crawl, so bulk crawling large datasets will incur costs accordingly.

Q: Is extracted data securely handled?
A: Supabase handles data storage securely; ensure your API keys and n8n instances are well protected.

10. Conclusion

You just built an autonomous AI-powered social media crawler that efficiently extracts social media profile links from company websites and stores them systematically in a database.

This saves you up to 20 hours weekly and enhances data accuracy and business insights for marketing teams.

Next steps? Consider automations to enrich data analytics dashboards, trigger marketing outreach campaigns, or extract other contact data types using similar AI-driven workflows.

Get started with confidence, knowing n8n and LangChain give you powerful flexibility to customize and scale your crawlers with ease.
