Extract Social Media Links with n8n and OpenAI AI Crawler

This powerful n8n workflow automates the extraction of social media profile links from company websites using AI-driven crawling and URL scraping. It solves the tedious manual task of gathering social media data, saving hours and improving accuracy.
Workflow Identifier: 2165
NODES in Use: toolWorkflow, lmChatOpenAi, outputParserStructured, set, manualTrigger, supabase, httpRequest, html, splitOut, removeDuplicates, filter, aggregate, markdown, merge, stickyNote


Problem Statement

Meet Anna, a marketing analyst who spends countless hours each week manually visiting company websites to extract their social media profiles. She scrapes through webpages, clicks links, and copies URLs — a tedious process prone to mistakes and delays. With hundreds of companies to analyze, Anna often finds herself overwhelmed, wasting 10+ hours weekly that could be better used to strategize campaigns rather than gather data.

This is exactly the challenge this automated n8n workflow solves: it combines AI-driven crawling with smart URL extraction to retrieve social media links autonomously, with high accuracy and speed.

What This Automation Does

When you run this workflow, here’s what happens step-by-step:

  • Fetches company names and websites from a Supabase database table as input.
  • Crawls each company website using an OpenAI-powered AI crawler agent configured with tools to retrieve all page text and URLs.
  • Extracts social media profile links from the website content and links, normalizing and filtering the URLs.
  • Formats the extracted data into a unified JSON structure listing social media platforms and their URLs.
  • Saves the collected social media information back into a Supabase database table for further use or analytics.
  • Handles URL and HTML content processing with built-in nodes like HTTP Request, HTML extraction, and Markdown conversion to ensure reliable data parsing and cleanup.

Overall, this workflow saves many hours of manual data gathering, improves accuracy, and can scale effortlessly to process hundreds or thousands of company profiles.
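The core of the "extract and normalize" step is matching crawled URLs against known social platforms. A minimal Python sketch of that idea (the function name and the pattern list are illustrative, not part of the workflow itself):

```python
import re

# Known social platforms to match against (illustrative list; extend as needed).
SOCIAL_PATTERNS = {
    "linkedin": re.compile(r"linkedin\.com/(company|in)/", re.I),
    "twitter": re.compile(r"(twitter|x)\.com/", re.I),
    "facebook": re.compile(r"facebook\.com/", re.I),
    "instagram": re.compile(r"instagram\.com/", re.I),
    "youtube": re.compile(r"youtube\.com/", re.I),
}

def classify_social_links(urls):
    """Return {platform: url} for the first matching URL per platform."""
    found = {}
    for url in urls:
        for platform, pattern in SOCIAL_PATTERNS.items():
            if platform not in found and pattern.search(url):
                found[platform] = url
    return found
```

In the actual workflow this matching happens inside the AI agent, but having the logic spelled out makes it easier to sanity-check the agent's output.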

Prerequisites ⚙️

  • n8n account (cloud or self-hosted) 🔌
  • Supabase account with two tables: companies_input and companies_output 🔐
  • OpenAI API key configured in n8n for GPT-4o model 🔑
  • Basic knowledge of n8n workflow creation and credentials setup
  • Optional: Proxy service for better web crawling performance

Step-by-Step Guide

1. Set up Supabase tables for input and output

Create two tables – companies_input with fields for company name and website URL, and companies_output to store extracted social media profiles.

Ensure your API credentials for Supabase are ready to connect in n8n.

Common mistake: Forgetting to map exact field names for company name and website can cause data flow issues.
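To catch that mapping mistake early, you can check each input row for the fields the workflow expects before processing. A small sketch (the field names `name` and `website` are assumptions; match them to your actual table columns):

```python
def validate_company_row(row):
    """Return the list of required fields missing from a companies_input row.

    Field names ('name', 'website') are assumptions here; they must match
    your Supabase table exactly or the downstream Set node will map nulls.
    """
    return [f for f in ("name", "website") if not row.get(f)]
```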

2. Add a Manual Trigger Node

In n8n editor, click Add Node → Core Nodes → Manual Trigger. This allows you to manually start the workflow.

You should see a manual trigger node on the canvas ready to connect.

3. Configure Supabase “Get All” Node to fetch companies

Click Add Node → Supabase → Get All. Set table to companies_input.

Connect Manual Trigger output to this node.

This fetches all companies from your database to process.

4. Use Set Node to select only company name and website

Add a Set node and configure it to include only name and website fields. Connect output of Supabase “Get All” node here.

This helps focus the workflow on necessary inputs only.

5. Add the LangChain AI agent node named “Crawl website”

This is the core node where OpenAI GPT-4o is configured to act as an AI crawling agent.

Set the prompt to instruct it to crawl given company websites specifically to extract social media profile URLs.

The agent uses two helper tool workflows inside: “Text” to get all page text and “URLs” to get all links on the site.

Ensure “retryOnFail” is enabled to handle transient errors.
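The exact prompt wording is up to you; what matters is naming the two tools and pinning down the output shape. A hedged example of what the agent prompt might look like (the JSON field names are illustrative and must match your output parser's schema):

```python
# Illustrative system prompt for the "Crawl website" agent; adapt wording to taste.
CRAWL_PROMPT = """You are a web crawler agent. Given a company name and website,
use the Text tool to read page content and the URLs tool to list links.
Find the company's official social media profiles (LinkedIn, Twitter/X,
Facebook, Instagram, YouTube). Return ONLY JSON in this shape:
{"socialmedia": [{"platform": "...", "url": "..."}]}"""

def build_user_message(name, website):
    """Format one company as the agent's user message."""
    return f"Company: {name}\nWebsite: {website}"
```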

6. Implement the Text tool workflow

This workflow fetches the entire HTML content from a URL and converts it to markdown text to be analyzed.

The steps: set the domain, prepend the protocol if it is missing, fetch the HTML with an HTTP Request node, then convert it to markdown with a Markdown node.

The resulting text is returned to the AI agent for analysis.
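The Text tool's transformation can be sketched in plain Python (stdlib only; this stands in for the HTTP Request and Markdown nodes, and the text extraction is deliberately crude):

```python
from html.parser import HTMLParser

def ensure_protocol(domain):
    """Prepend https:// when the input lacks a scheme (the 'add protocol' step)."""
    return domain if domain.startswith(("http://", "https://")) else "https://" + domain

class TextExtractor(HTMLParser):
    """Crude HTML-to-text pass, standing in for the workflow's Markdown node.
    (It does not skip <script>/<style> contents; a real converter should.)"""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```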

7. Implement the URLs tool workflow

This workflow scrapes the webpage for anchor tag URLs using an HTML extraction node, then splits out the URLs to individual items.

It removes duplicates, empty hrefs, invalid URLs, and aggregates cleaned URL data.

This data is then fed back to the AI agent’s crawling logic for deeper exploration.
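The URLs tool's extract-filter-dedupe pipeline can be sketched with the standard library (this mirrors the HTML extraction, Filter, and Remove Duplicates nodes; it is a sketch, not the workflow's actual code):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, like the HTML extraction node."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def clean_urls(html, base_url):
    """Extract, absolutize, filter, and de-duplicate links (order preserved)."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, out = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        if urlparse(url).scheme not in ("http", "https"):
            continue  # drop mailto:, javascript:, empty fragments, etc.
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out
```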

8. Parse the AI agent’s JSON output

Use the LangChain JSON Parser node with a defined JSON schema matching expected social media platform and URL arrays.

The parsed result is assigned to an array field for further downstream processing.
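Equivalently, in plain Python the parser's job is to load the JSON and reject anything that does not match the expected shape (the `socialmedia`/`platform`/`url` field names below are illustrative; use whatever your schema defines):

```python
import json

# Example of the shape the Structured Output Parser should enforce.
SAMPLE_AI_OUTPUT = (
    '{"socialmedia": [{"platform": "linkedin",'
    ' "url": "https://linkedin.com/company/acme"}]}'
)

def parse_agent_output(raw):
    """Parse and lightly validate the agent's JSON, raising on shape mismatches."""
    data = json.loads(raw)
    entries = data["socialmedia"]
    for entry in entries:
        if not ("platform" in entry and "url" in entry):
            raise ValueError(f"Bad entry: {entry}")
    return entries
```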

9. Use Set node to assign social media array and merge data

Map the parsed social media data into a new field, then merge it with input company data using a Merge node in combine mode by position.
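"Combine by position" simply pairs the Nth input item with the Nth result. A sketch of what the Merge node does here (field names are illustrative):

```python
def merge_by_position(companies, social_lists):
    """Combine input rows with extracted results pairwise, like n8n's
    Merge node in 'combine by position' mode."""
    return [
        {**company, "socialmedia": social}
        for company, social in zip(companies, social_lists)
    ]
```

Note that positional merging assumes both branches preserve item order and count; a dropped item on one side shifts every pairing after it.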

10. Save data back to Supabase

Add a Supabase Insert node configured to insert or update rows in the companies_output table.

This commits the extracted social media data for later review or automation tasks.

11. Activate and test workflow

Once all nodes are configured with credentials, run a manual test by triggering the workflow.

You should see social media links extracted and saved in your Supabase output table.

Common mistakes: Incorrect API keys, missing URL protocols, or malformed JSON parser schema can cause failures.

Customizations ✏️

  • Change extracted data type: Modify the AI prompt in the “Crawl website” node to extract contact emails or phone numbers instead of social profiles.
  • Use different databases: Replace Supabase nodes with Airtable, Google Sheets, or MySQL nodes as preferred database sources.
  • Add proxy support: In the HTTP Request nodes inside the Text and URLs tool workflows, configure proxy settings for scraping hard-to-access sites.
  • Expand to multi-page crawls: Extend the URLs tool workflow to recursively visit linked pages within a domain for richer data extraction.
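The multi-page idea from the last bullet is a bounded same-domain breadth-first walk. A network-free sketch (the `link_map` dict stands in for live fetching, mapping each URL to the links found on that page):

```python
from collections import deque
from urllib.parse import urlparse

def crawl_same_domain(start_url, link_map, max_pages=10):
    """Breadth-first walk of same-domain links, capped at max_pages.
    `link_map` is a stand-in for live page fetches."""
    domain = urlparse(start_url).netloc
    seen, queue, visited = {start_url}, deque([start_url]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in link_map.get(url, []):
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append(link)
    return visited
```

The `max_pages` cap matters in practice: every extra page is another round of HTTP requests and AI-agent tokens.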

Troubleshooting 🔧

Problem: “HTTP request fails with 403 or timeouts”

Cause: Target website blocks automated scraping or user-agent is missing.

Solution: Add user-agent headers or proxy settings in HTTP Request nodes under options. Test with browser headers.

Problem: “AI agent returns malformed JSON”

Cause: Prompt or schema mismatch causing parsing errors.

Solution: Verify JSON schema in the JSON Parser node matches the AI output format exactly. Simplify prompts as needed.

Problem: “Supabase insert node throws error on missing fields”

Cause: Input data does not match target table schema.

Solution: Double-check field mappings in the Set and Insert nodes to ensure all required fields exist.

Pre-Production Checklist ✅

  • Validate API credentials for OpenAI and Supabase are set up correctly.
  • Test HTTP requests inside tool workflows manually with sample URLs.
  • Run the AI agent node in isolation using a single input to verify expected output JSON format.
  • Review and confirm JSON Parser schema matches anticipated social media JSON structure.
  • Create backups of Supabase tables before bulk inserts.

Deployment Guide

After final testing, activate the workflow. If you want scheduled runs rather than manual execution, swap the Manual Trigger for a Schedule Trigger before toggling it active.

Use n8n’s built-in execution logs and error handling features to monitor success and exceptions.

Schedule periodic runs, or trigger the workflow on each company update, whichever fits your use case.

FAQs

Can I modify this workflow to extract other data than social media links?

Yes, you can change the AI agent’s prompt and the JSON output schema in the LangChain JSON Parser node to extract different structured information like emails, phone numbers, or company summaries.

Does this workflow consume OpenAI API credits rapidly?

The AI crawling process involves multiple API calls per website, so plan accordingly. Using GPT-4o model gives high-quality responses but also higher credit consumption.

Is my data secure during this automation?

Data moves between your n8n instance, OpenAI, and Supabase over HTTPS. Keep your API keys private and make sure your n8n instance is hosted securely.

Can this workflow scale to hundreds of companies?

Yes, given proper API limits and database handling, this workflow can be expanded to batch process large volumes of websites with minimal manual intervention.

Conclusion

By building this n8n workflow, you have created an autonomous AI-powered crawler that smartly navigates company websites to extract social media profile links. This automation replaces tedious manual scraping, saving you hours weekly and improving data accuracy for marketing or research purposes.

Next steps to enhance this workflow include adding multi-page recursive crawling, integrating contact info extraction, or connecting results to analytics dashboards. Dive in and make web data extraction effortless with n8n and OpenAI!

Promoted by BULDRR AI
