Opening Problem Statement
Meet Sarah, an HR analyst at a fast-growing tech startup. Every week, she manually searches and compiles company reviews and insights from Glassdoor to understand what employees think about competitors. This process consumes over 4 hours weekly, often resulting in outdated or incomplete data due to manual copy-pasting errors. Sarah needs a reliable way to automate this repetitive research task to provide up-to-date, concise summaries for strategic HR decisions.
This is where an n8n workflow steps in. It combines web scraping with AI summarization to streamline Sarah's process, specifically targeting Glassdoor company pages for detailed insights.
What This Automation Does
When this workflow runs, it performs several key actions tailored for extracting and summarizing Glassdoor company data:
- Uses Bright Data API to scrape the latest Glassdoor company overview snapshots automatically.
- Polls the scraping job status in a loop until the data is ready.
- Downloads comprehensive Glassdoor snapshot JSON data once available.
- Breaks the raw text down into manageable chunks using recursive character text splitting to prepare it for AI.
- Uses Google Gemini Flash Thinking experimental model for advanced AI summarization of the scraped content.
- Sends the summarized results to a configured webhook for immediate consumption or integration into other systems.
Sarah and her team can save 4+ hours weekly, eliminate manual copy-paste errors, and get consistently formatted, AI-powered insights.
Prerequisites ⚙️
- Bright Data API account with dataset access for Glassdoor scraping 🔐
- Google Gemini (PaLM) API account configured in n8n for AI-powered summarization 🔑
- n8n automation platform access (cloud or self-hosted) ⏱️
- Webhook endpoint URL to receive summary results (e.g., webhook.site)
Step-by-Step Guide
Step 1: Trigger the Workflow Manually
Navigate in n8n to the Manual Trigger node named When clicking ‘Test workflow’. Click Execute Workflow to start the process.
You should see the workflow progress in real-time in n8n.
Common Mistake: Forgetting to activate the workflow after configuring nodes.
Step 2: Trigger Bright Data to Scrape Glassdoor
The HTTP Request to Glassdoor node sends a POST request to Bright Data’s API with the Glassdoor company URL (e.g., Apple’s Glassdoor overview page) to start a new scraping snapshot.
Details:
- Method: POST
- URL: https://api.brightdata.com/datasets/v3/trigger
- Body (JSON): [{"url": "https://www.glassdoor.co.uk/Overview/Working-at-Apple-EI_IE1138.11,16.htm"}]
- Query: dataset_id for the Glassdoor dataset, include_errors flag
- Headers: Bright Data authentication via header token
This submits a scraping job and returns a snapshot_id that the following steps rely on.
Common Mistake: Incorrect dataset ID or missing auth header causes 401 Unauthorized errors.
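If it helps to see what this node does outside of n8n, here is a minimal Python sketch of the same request. The Bearer-style Authorization header, the placeholder token, and the placeholder dataset ID are assumptions you must replace with your own Bright Data values.

```python
# Illustrative sketch of the trigger request the HTTP Request to Glassdoor node sends.
# BRIGHT_DATA_TOKEN and DATASET_ID are placeholders; the Bearer header format is an assumption.
import requests

BRIGHT_DATA_TOKEN = "YOUR_BRIGHT_DATA_API_TOKEN"  # placeholder
DATASET_ID = "YOUR_GLASSDOOR_DATASET_ID"          # placeholder

response = requests.post(
    "https://api.brightdata.com/datasets/v3/trigger",
    headers={"Authorization": f"Bearer {BRIGHT_DATA_TOKEN}"},
    params={"dataset_id": DATASET_ID, "include_errors": "true"},
    json=[{"url": "https://www.glassdoor.co.uk/Overview/Working-at-Apple-EI_IE1138.11,16.htm"}],
)
response.raise_for_status()
snapshot_id = response.json()["snapshot_id"]  # the workflow stores this in the next step
print("Snapshot submitted:", snapshot_id)
```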
Step 3: Store the Snapshot ID
The Set Snapshot Id node extracts the snapshot_id from the HTTP response and stores it in the workflow for subsequent nodes.
You should see the snapshot ID available as a variable in the next node.
Step 4: Check the Snapshot Status in a Loop
The Check Snapshot Status node polls Bright Data's API to check whether the scraping job is complete.
- Method: GET
- URL: https://api.brightdata.com/datasets/v3/progress/{{ $json.snapshot_id }}
- Authentication: HTTP header token
The If node evaluates the status property. If the status is “ready”, the workflow proceeds; otherwise, a Wait for 30 seconds node pauses before polling again.
Common Mistake: A mis-wired wait loop can poll forever or move on before the data is actually ready.
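The polling logic is easier to reason about as a plain loop. The sketch below mirrors what the Check Snapshot Status, If, and Wait for 30 seconds nodes do together; the "ready" status value comes from the workflow above, while the header format and the attempt cap are assumptions.

```python
# Illustrative polling loop mirroring Check Snapshot Status -> If -> Wait for 30 seconds.
import time
import requests

def wait_until_ready(snapshot_id: str, token: str, poll_seconds: int = 30, max_attempts: int = 40):
    """Poll Bright Data until the snapshot reports status == "ready"."""
    for _ in range(max_attempts):
        progress = requests.get(
            f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}",
            headers={"Authorization": f"Bearer {token}"},  # assumed header format
        )
        progress.raise_for_status()
        if progress.json().get("status") == "ready":
            return
        time.sleep(poll_seconds)  # same idea as the Wait for 30 seconds node
    raise TimeoutError(f"Snapshot {snapshot_id} not ready after {max_attempts} polls")
```

Capping the number of attempts is the script equivalent of guarding against the infinite-loop mistake mentioned above.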
Step 5: Download the Snapshot JSON Data
Once ready, the Download the Snapshot Response node fetches the full JSON snapshot from Bright Data.
- GET request to https://api.brightdata.com/datasets/v3/snapshot/{{ snapshot_id }}
- The request includes the query parameter format=json
You should receive a complete JSON with raw Glassdoor company data.
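In script form, the download is a single GET with format=json; the Bearer header remains an assumption, and the snapshot ID is the one returned in Step 2.

```python
# Illustrative download of the finished snapshot as JSON.
import requests

def download_snapshot(snapshot_id: str, token: str):
    resp = requests.get(
        f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}",
        headers={"Authorization": f"Bearer {token}"},  # assumed header format
        params={"format": "json"},
    )
    resp.raise_for_status()
    return resp.json()  # raw Glassdoor company data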
Step 6: Split the Downloaded Text for AI
The Recursive Character Text Splitter node breaks the large text response into smaller chunks with a 100-character overlap, preparing it for efficient AI summarization.
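Conceptually, the splitter cuts the text into overlapping windows. Below is a simplified, non-recursive sketch: the 1,000-character chunk size is a hypothetical default (only the 100-character overlap is specified above), and the real splitter also tries to break on separators such as paragraphs and sentences before falling back to raw characters.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list:
    """Simplified character splitter: fixed-size windows with overlap.

    chunk_size=1000 is a hypothetical default; the workflow only specifies
    the 100-character overlap. The actual Recursive Character Text Splitter
    prefers breaking on paragraphs and sentences before raw characters.
    """
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```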
Step 7: Load Data for AI Summarization
The Default Data Loader node wraps these text chunks as documents so the summarization chain can process them.
Step 8: Summarize with Google Gemini AI
The Google Gemini Chat Model node uses the experimental Gemini Flash Thinking model (models/gemini-2.0-flash-thinking-exp-01-21) to generate a concise summary of the Glassdoor company overview.
The node's model parameter is set to this model name, and it authenticates with your Google PaLM API credentials.
Common Mistake: Missing or invalid Google API credentials lead to failed AI summarization calls.
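Outside n8n, a rough equivalent of this call can be sketched with Google's google-generativeai Python package. The prompt wording below is illustrative, not what the summarization chain actually sends, and the API key is a placeholder.

```python
# Rough equivalent of the Google Gemini Chat Model node (pip install google-generativeai).
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder credential
model = genai.GenerativeModel("models/gemini-2.0-flash-thinking-exp-01-21")

def summarize(chunk: str) -> str:
    # Illustrative prompt; n8n's summarization chain constructs its own prompts.
    prompt = f"Summarize the following Glassdoor company data concisely:\n\n{chunk}"
    return model.generate_content(prompt).text
```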
Step 9: Final Summarization Chain
The Summarization of Glassdoor Response node aggregates the AI outputs into a final concise summary document.
Step 10: Send Summary via Webhook
The Configure Webhook Notification node sends the text summary to an external HTTP endpoint, such as a webhook.site testing URL.
This lets you integrate the summarized insights into Slack, email, or internal dashboards easily.
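Delivering the result is just an HTTP POST. Here is a minimal sketch, assuming your endpoint accepts JSON; the summary field name is hypothetical, so match whatever your receiving system expects.

```python
# Minimal webhook delivery sketch; the "summary" field name is hypothetical.
import requests

def send_summary(webhook_url: str, summary_text: str) -> None:
    resp = requests.post(webhook_url, json={"summary": summary_text})
    resp.raise_for_status()

# Example usage with a webhook.site testing URL (replace with your own):
# send_summary("https://webhook.site/your-unique-id", "Apple Glassdoor overview: ...")
```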
Customizations ✏️
- Change Company URL
In the HTTP Request to Glassdoor node, replace the URL in the JSON body with your target company's Glassdoor page. This lets you query any company's data dynamically.
- Adjust Text Split Size
Modify the Recursive Character Text Splitter to change the chunk size or chunkOverlap for longer or shorter summaries.
- Use Different Google Gemini Models
In the Google Gemini Chat Model node, select an alternative Gemini model if available to adjust the summary's detail or style.
- Send Summaries to Other Destinations
Update Configure Webhook Notification with your own webhook URL or integrate with Slack, email, or other platforms.
Troubleshooting 🔧
Problem: “401 Unauthorized” error when triggering scraping
Cause: Invalid or missing Bright Data API header token
Solution: Go to the HTTP Request to Glassdoor node → Edit credentials → Re-enter valid Bright Data header authentication token.
Problem: AI summarization fails
Cause: Missing or invalid Google PaLM API credentials
Solution: Check Google Gemini Chat Model node → Verify Google API credentials are correctly set.
Problem: Snapshot status never reaches “ready”
Cause: Bright Data scraping job stuck or delayed
Solution: Check the dataset ID in the HTTP Request to Glassdoor node, confirm the scraping job in the Bright Data dashboard, and increase the wait duration in the Wait for 30 seconds node.
Pre-Production Checklist ✅
- Verify Bright Data API credentials and dataset ID are correct.
- Confirm Google PaLM API credentials are active and correct.
- Test that the manual trigger successfully initiates a scraping job.
- Ensure snapshot status polling completes with the "ready" state.
- Check that the final summary is generated and delivered to the webhook endpoint.
- Backup workflow configuration before major changes.
Deployment Guide
Once fully tested, activate your n8n workflow by clicking Activate in the top right corner.
Schedule this workflow to run on a timer trigger if needed for regular updates, or integrate a webhook trigger from your HR system.
Monitor workflow executions and logs within n8n to ensure scraping and AI summarization run smoothly.
FAQs
Q: Can I use this workflow to scrape other websites?
A: Yes, by adjusting the Bright Data dataset and the URL JSON body, but you’ll need appropriate scraping datasets for your target site.
Q: Does this consume Google Gemini API quota?
A: Yes, each summarization call uses Google PaLM API credits depending on content length.
Q: Is my scraped data and summary secure?
A: Data privacy depends on your Bright Data account setup and webhook security. Use encrypted webhooks or private channels where necessary.
Q: Can this handle multiple companies at once?
A: The workflow currently triggers one scraping job per execution, but it can be adapted; one possible approach is sketched below.
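Because the trigger body is already a JSON array, one adaptation to try is batching several company URLs in a single trigger request. This workflow does not do so, and whether a single snapshot covers multiple inputs depends on your Bright Data dataset, so verify the behavior in the Bright Data dashboard first. The second URL below is a placeholder.

```python
# Hypothetical adaptation: batch several company URLs in one trigger request.
# Verify in the Bright Data dashboard whether your dataset supports this.
company_urls = [
    {"url": "https://www.glassdoor.co.uk/Overview/Working-at-Apple-EI_IE1138.11,16.htm"},
    {"url": "https://www.glassdoor.co.uk/Overview/Working-at-Another-Company-EI_IE0000.htm"},  # placeholder
]
# Pass company_urls as the JSON body of the trigger request from Step 2.
```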
Conclusion
By setting up this detailed n8n workflow, you have automated everything from scraping Glassdoor data via Bright Data to generating AI-powered summaries using Google Gemini. You’ve just saved countless hours of manual research and minimized data errors for more informed HR decisions.
Next up, consider extending your workflow to analyze sentiment trends, aggregate multiple company profiles, or push summaries into Slack or email newsletters automatically.
Remember, with n8n’s powerful integrations and AI capabilities, automating HR data workflows like this is both achievable and scalable. Start experimenting and tailor it further to your unique needs!