Automate Indeed Company Data Scraping & Summarization with n8n & Google Gemini

Struggling to extract and summarize company data from Indeed efficiently? This n8n workflow reads company URLs from Airtable, scrapes the pages via Bright Data, summarizes the content with Google Gemini, and sends the results to a webhook, saving hours of manual research for HR and recruitment teams.
Workflow Identifier: 2341
NODES in Use: Manual Trigger, Set, Airtable, SplitInBatches, Wait, If, HTTP Request, Chain LLM, Chain Summarization, Langchain Agent, Sticky Note, Markdown

1. Opening Problem Statement

Meet Sarah, an HR analyst working in a fast-paced recruitment firm. Every day, Sarah needs to gather extensive company profiles from Indeed to assess potential employers for her candidates. This involves manually opening multiple Indeed pages, copying company information, and summarizing data points, a task that often eats up over 5 hours weekly. Errors creep in due to copy-paste fatigue, and valuable insights get lost in the chaos, causing delays and suboptimal candidate matches. What Sarah needs is a reliable automated system that can extract, summarize, and organize Indeed company data swiftly and accurately.

2. What This Automation Does

This specialized n8n workflow streamlines Sarah’s tedious process by automating Indeed company data scraping and synthesizing critical information through AI summarization. When triggered, it performs the following tasks:

  • Fetches company URLs from an Airtable base where Indeed links are maintained
  • Uses Bright Data’s Web Unlocker to scrape raw company data from Indeed programmatically
  • Converts the scraped markdown content into plain textual data using an AI-powered markdown extractor
  • Summarizes the extracted data via Google Gemini Chat Model’s advanced large language model (LLM)
  • Formats the summary through an expert AI agent specialized in Indeed data to maintain clarity and relevance
  • Pushes formatted data to a configured webhook URL for notifications or further processing

By automating these steps, Sarah saves an estimated 4+ hours weekly, avoids copy-paste errors, and gets consistent, structured summaries to support HR decision-making.

3. Prerequisites ⚙️

  • n8n account for workflow automation setup 🔌
  • Airtable account with a base named “Indeed” where company URLs are stored 📊
  • Bright Data account with access to the Web Unlocker Zone “web_unlocker1” for scraping access 🔐
  • Google Gemini (PaLM API) credentials for AI-based summarization and chat tasks 🔑
  • Configured HTTP Header Authentication credentials for Bright Data API requests ⏱️
  • A webhook service (e.g., webhook.site) to receive notifications 📧

4. Step-by-Step Guide

Step 1: Start with Manual Trigger

Navigate to the n8n editor. Add the Manual Trigger node called “When clicking ‘Test workflow’”. This allows you to start the workflow manually for testing and debugging.

After adding, click the node and hit “Execute Node” to confirm manual triggering works. You should see the execution flow start from this node.

Common mistake: Forgetting to activate or save the workflow before testing.

Step 2: Set Bright Data Zone

Add a Set node named “Set Bright Data Zone” that assigns the string value “web_unlocker1” to a field named “zone”. This determines which Bright Data zone handles the scraping requests.

In the node parameters, add an assignment with name “zone” and value “web_unlocker1”. This will be referenced later in the HTTP Request node.

You should see this variable available in the workflow data output.

Step 3: Pull Company URLs from Airtable

Add the Airtable node configured with your Airtable Personal Access Token. Select the base “Indeed” and the table listing company URLs (e.g., “Table 1”).

Make sure your Airtable table has a field with Indeed company URLs under the field name “Link”.

Execute this node to confirm it pulls your records. You should see JSON outputs matching your Airtable data.

Common mistake: Not setting the Airtable API credentials correctly or using an empty base/table reference.
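To make the shape of the pulled data concrete, here is a minimal Python sketch of turning an Airtable list-records response into one item per company. It assumes the standard Airtable REST response layout (a `records` array whose entries carry `id` and `fields`); the helper name `records_to_items`, the sample payload, and the example company URL are hypothetical. Only the field name “Link” comes from this guide.

```python
from typing import Any


def records_to_items(payload: dict[str, Any], field: str = "Link") -> list[dict[str, Any]]:
    """Flatten an Airtable list-records response into one item per record."""
    items = []
    for record in payload.get("records", []):
        fields = record.get("fields", {})
        # Airtable leaves empty fields out of the response, so default to "".
        items.append({"id": record.get("id"), field: fields.get(field, "")})
    return items


# Hypothetical sample payload, for illustration only.
sample_payload = {
    "records": [
        {"id": "rec1", "fields": {"Link": "https://www.indeed.com/cmp/Example-Co"}},
        {"id": "rec2", "fields": {}},
    ]
}

items = records_to_items(sample_payload)
print(items)
```

The second record has no “Link” value, which is exactly the case the later If node is meant to catch.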

Step 4: Loop Through Each Company Record

Connect a SplitInBatches node labeled “Loop Over Items” to handle each company URL one at a time. This controls load and avoids sending too many requests to Bright Data at once.

Verify batching limits; by default, it processes one record per batch.

You should see each company processed sequentially when running the workflow.

Step 5: Add Wait Time Between Requests

Add a Wait node after the loop node and configure it to pause for 10 seconds before each HTTP request. The pause keeps the request cadence steady and reduces the risk of rate limiting or bans from Bright Data or Indeed.
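The loop-plus-wait pattern from Steps 4 and 5 can be sketched in a few lines of Python. This is an illustration of the idea, not how n8n implements its nodes: the generator name `paced` and the placeholder URLs are hypothetical, and this version pauses between items rather than before the first one. The `sleep` parameter is injectable so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator


def paced(
    items: Iterable[str],
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield items one at a time, pausing before every item after the first."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(wait_seconds)
        yield item


# Record the requested pauses instead of sleeping, so the demo runs instantly.
pauses: list[float] = []
processed = list(paced(["url-a", "url-b", "url-c"], sleep=pauses.append))
print(processed)
print(pauses)
```

With the default `time.sleep`, three URLs would take at least 20 seconds of waiting, which is the trade-off the Wait node makes in exchange for a gentler request rate.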

Step 6: Verify Non-empty Links

Use the If node titled “If Link field is not empty” to confirm that each record pulled from Airtable actually contains a URL.

Set the condition to check if the field “Link” is not empty before continuing to the scraping step.
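As a rough equivalent of this condition in code, the check below keeps only items whose “Link” value is a non-blank string. The function name `link_is_not_empty` and the sample items are hypothetical, and the sketch is deliberately a little stricter than a plain emptiness test: it also rejects whitespace-only values.

```python
from typing import Any


def link_is_not_empty(item: dict[str, Any], field: str = "Link") -> bool:
    """Return True only when the item carries a non-blank string in `field`."""
    value = item.get(field)
    return isinstance(value, str) and value.strip() != ""


# Hypothetical items: one usable link, one blank value, one missing field.
candidates = [
    {"Link": "https://www.indeed.com/cmp/Example-Co"},
    {"Link": "   "},
    {},
]
to_scrape = [item for item in candidates if link_is_not_empty(item)]
print(to_scrape)
```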

Step 7: Perform Indeed Web Scraping HTTP Request

Add an HTTP Request node named “Perform Indeed Web Request” configured with:

  • Method: POST
  • URL: https://api.brightdata.com/request
  • Body parameters: zone, url (the Indeed company link with product=unlocker and method=api appended), format (raw), and data_format (markdown)
  • Authentication: HTTP Header Auth with Bright Data credentials

This node fetches raw company data via Bright Data’s Web Unlocker from Indeed.

Common mistake: Incorrect URL or body parameters causing request failure.
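The request described above can be sketched with Python's standard library. Several details here are assumptions rather than facts from this guide: that the body is sent as JSON, that the credential travels in an `Authorization: Bearer ...` header, and that product=unlocker and method=api are appended to the Indeed link as query parameters. The helper names, the token placeholder, and the example company URL are hypothetical. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BRIGHT_DATA_ENDPOINT = "https://api.brightdata.com/request"


def with_unlocker_params(link: str) -> str:
    """Append product=unlocker and method=api to the link's query string."""
    parts = urlsplit(link)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("product", "unlocker"), ("method", "api")]
    return urlunsplit(parts._replace(query=urlencode(query)))


def build_unlocker_request(link: str, zone: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one company page."""
    body = {
        "zone": zone,
        "url": with_unlocker_params(link),
        "format": "raw",
        "data_format": "markdown",
    }
    return urllib.request.Request(
        BRIGHT_DATA_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed header scheme; use whatever your HTTP Header Auth credential defines.
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


request = build_unlocker_request(
    "https://www.indeed.com/cmp/Example-Co", "web_unlocker1", "YOUR_API_TOKEN"
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Passing `request` to `urllib.request.urlopen` would send it. Inspecting the built body first is a cheap way to catch the malformed URL or parameter mistakes mentioned above.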

Step 8: Convert Markdown Data to Text

Use the Chain LLM node “Markdown to Textual Data Extractor”, whose prompt instructs the model to act as a markdown expert. It takes the raw markdown returned by the scraping request and outputs clean plain text.

This NLP step prepares the data for summarization.

Step 9: Summarize Company Data with Google Gemini

Add the Chain Summarization node “Indeed Summarizer” using the Google Gemini Chat Model to condense large textual data into concise summaries.

You will need to link your Google Gemini (PaLM API) credentials here.

Step 10: Format and Push Summary via AI Agent

Connect the Langchain Agent node titled “Indeed Expert AI Agent” to shape the summarization results into a structured JSON summary specific to Indeed company data, ready to be pushed to the final webhook.

This agent implements a customized prompt directing the AI to prepare the output for downstream systems.

Step 11: Send Formatted Data to Webhook

Finally, use an HTTP Request node “Webhook HTTP Request” to POST the formatted summary to a provided webhook URL (e.g., webhook.site).

This enables real-time notification or integration with other tools.
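For anyone testing the receiving side outside n8n, the webhook call boils down to a JSON POST. The sketch below builds such a request with Python's standard library; the webhook URL, the payload keys, and the helper name `build_webhook_request` are hypothetical placeholders, since the guide does not specify the exact summary schema.

```python
import json
import urllib.request
from typing import Any


def build_webhook_request(webhook_url: str, summary: dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying the formatted summary."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(summary).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and payload, for illustration only.
summary = {"company": "Example Co", "summary": "Short summary text goes here."}
webhook_request = build_webhook_request("https://webhook.site/your-unique-id", summary)
print(webhook_request.get_method(), webhook_request.full_url)
print(webhook_request.data.decode("utf-8"))
```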

5. Customizations ✏️

  • Change scraping zone: In the “Set Bright Data Zone” node, change the zone value to another Bright Data zone if you want to target different regions or unlocker types.
  • Adjust wait duration: Modify the “Wait” node timing from 10 seconds to any preferred interval to manage API rate limits or speed.
  • Target different Airtable base/table: Change the Airtable node settings to pull from any other base or table that stores URLs for different job boards or data sources.
  • Switch AI models: Substitute Google Gemini with other supported AI chat models in Langchain nodes for different summarization tones or languages.
  • Update webhook endpoint: Change the URL in the “Webhook HTTP Request” node to integrate with your CRM, Slack, or custom dashboards.

6. Troubleshooting 🔧

Problem: HTTP Request returns 403 Forbidden from Bright Data API

Cause: Incorrect or expired HTTP header authentication credentials or zone misconfiguration.

Solution: Verify header auth credentials in the node settings. Confirm the “zone” field matches an active Bright Data zone in your account. Re-authenticate if needed.

Problem: Google Gemini API call fails or returns empty summary

Cause: Invalid or missing Google Gemini (PaLM API) credentials, or exceeded rate limits.

Solution: Double-check the Google Gemini credentials linked in Langchain nodes. Monitor API usage limits and renew API keys as necessary.

Problem: Airtable node returns empty data

Cause: Wrong base ID, table ID, or lack of records with valid “Link” fields.

Solution: Confirm Airtable API credentials, base, and table configuration. Make sure the “Link” column is populated.

7. Pre-Production Checklist ✅

  • Verify Airtable connection by successfully pulling company records.
  • Test Bright Data HTTP requests with one sample URL to ensure scraping functionality.
  • Validate Google Gemini summarization returns meaningful content.
  • Ensure the webhook URL correctly receives POST requests.
  • Run the workflow manually and monitor logs for sequential processing.
  • Back up your Airtable data and test on a small record set before full deployment.

8. Deployment Guide

Activate the n8n workflow and, if you need regular company data updates, add a schedule-based trigger.

Monitor executions from the n8n dashboard, particularly API responses from Bright Data and Google Gemini for errors or quota issues.

Integrate the webhook receiver with your CRM, HR analytics tool, or notification system to utilize the company summaries effectively.

9. FAQs

  • Can I replace Bright Data with another web scraping service? Yes, but you need to adjust the HTTP request parameters to match that service's API format.
  • Does summarization consume a lot of API credits? It depends on your Google Gemini PaLM quota, but summarized requests reduce overall token usage compared to raw data processing.
  • Is this workflow secure for proprietary company data? The data flows through authenticated API calls; however, always review your API key management and webhook privacy.
  • Can it handle hundreds of company URLs? Yes, but consider batching and wait nodes to respect API limits and avoid bans.
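When planning a large run, it helps to estimate how long the wait step alone will take. The small calculation below is a back-of-the-envelope sketch: the function name `minimum_run_minutes` is hypothetical, and the 20 seconds of per-company processing time in the second call is an invented figure used only to show the arithmetic.

```python
def minimum_run_minutes(
    url_count: int, wait_seconds: float = 10.0, seconds_per_item: float = 0.0
) -> float:
    """Estimate total run time in minutes for sequential, paced processing."""
    return url_count * (wait_seconds + seconds_per_item) / 60


# 300 URLs with the guide's 10-second wait and no other overhead.
print(minimum_run_minutes(300))  # 50.0

# Same run, assuming a hypothetical 20 seconds of scraping and AI time per company.
print(minimum_run_minutes(300, seconds_per_item=20))  # 150.0
```

So a few hundred companies is feasible, but the run is measured in hours rather than minutes once scraping and summarization time is included.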

10. Conclusion

By finishing this tutorial, you’ve built an advanced n8n workflow that automates Indeed company data scraping, leverages Bright Data’s unlocker technology, and applies Google Gemini AI for summarization. Sarah’s struggle with hours of manual research is replaced by efficient data extraction and well-organized summaries delivered directly through webhooks.

This automation saves significant time, reduces copy-paste errors, and helps HR professionals and recruiters make better hires faster.

Next steps? Consider extending the workflow to integrate Slack notifications for immediate alerts or add Google Sheets export for record keeping and analytics!
