Scrape Multipage Websites with Jina.ai and n8n

This workflow automates scraping entire multipage websites using Jina.ai without needing API keys. It streamlines extracting, filtering, and saving website content directly to Google Drive, saving hours of manual data collection.

Nodes used: manualTrigger, set, httpRequest, xml, splitOut, filter, limit, splitInBatches, code, googleDrive, wait, stickyNote


Opening Problem Statement

Meet Sarah, a digital content manager at a tech startup. She needs to gather comprehensive, up-to-date information from multiple webpages of a frequently updated website about AI tools and agents for her team’s research and newsletters. Doing this manually means opening each page, copying content, saving it, and making sure she doesn’t miss anything important. This tedious process takes her hours every week and is prone to human error, such as missing pages or outdated content. Sarah worries about the lost hours and potential inaccuracies that could impact her team’s decisions.

This is where the “💡🌐 Essential Multipage Website Scraper with Jina.ai” n8n workflow shines by automating website scraping tasks. It directly extracts multipage content using the Jina.ai scraping service without requiring an API key, filters relevant pages by topic, extracts structured content, and automatically archives everything neatly in Google Drive. Sarah gains back precious time, ensures completeness, and improves accuracy.

What This Automation Does

When you trigger this workflow in n8n, it:

  • Starts with a sitemap URL to get all pages of the website quickly.
  • Parses the sitemap XML into a usable list of URLs.
  • Filters the page list by keywords such as “agent” or “tool”, ensuring only relevant pages are scraped.
  • Limits the crawl to the first 20 URLs to prevent excessive scraping and throttles requests with a wait node to avoid server overload.
  • Uses Jina.ai HTTP requests to scrape each page’s content, extracting title and markdown content via JavaScript code.
  • Saves scraped pages as markdown files in a Google Drive folder automatically.

This workflow saves hours of manual scraping, reduces human errors, and keeps your content archive organized and up to date.

Prerequisites ⚙️

  • n8n account (cloud or self-hosted) 🔌
  • Google Drive account with OAuth credentials configured in n8n 🔑📁
  • Access to the Jina.ai scrape endpoint (no API key needed)
  • Website sitemap XML URL for the target site

Optionally, if you want to self-host n8n, you can check out platforms like Hostinger for an easy setup: Hostinger n8n Hosting.

Step-by-Step Guide

1. Start with the Manual Trigger

In the n8n editor, click New Workflow. Add a Manual Trigger node named “When clicking ‘Test workflow’”. This lets you run the workflow on demand, which is perfect for testing; you can swap in a scheduled trigger later for unattended runs.

You’ll see a button to manually execute the flow. This starts your scraping process.

2. Set Website Sitemap URL

Add a Set node called “Set Website URL”. Enter the sitemap URL of the site you want to scrape. For this example, it’s https://ai.pydantic.dev/sitemap.xml.

This passes the sitemap URL forward as JSON data named sitemap_url.
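
After this node runs, the item passed downstream should look roughly like this (the exact shape depends on how you configure the Set node):

{
  "sitemap_url": "https://ai.pydantic.dev/sitemap.xml"
}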

3. Fetch the Sitemap XML

Add an HTTP Request node named “Get List of Website URLs”. Set the URL field to {{ $json.sitemap_url }} so it dynamically loads the URL from the previous node.

This requests the XML sitemap, which contains the list of all pages on the website.

You should see the raw XML fetched as a response after execution.
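
For reference, sitemaps follow a standard XML format, so the response should look something like this (URLs illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://ai.pydantic.dev/</loc>
  </url>
  <url>
    <loc>https://ai.pydantic.dev/agents/</loc>
  </url>
</urlset>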

4. Convert XML to JSON

Add an XML node called “Convert to JSON”. Connect it to the HTTP Request node. This converts the XML sitemap into JSON format which is easier to work with in n8n.

Check the output it produces to verify URL arrays are extracted.
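
The converted JSON should look roughly like the sketch below, which is why the next step splits on urlset.url:

{
  "urlset": {
    "url": [
      { "loc": "https://ai.pydantic.dev/" },
      { "loc": "https://ai.pydantic.dev/agents/" }
    ]
  }
}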

5. Extract URLs from JSON

Add a SplitOut node named “Create List of Website URLs”. In the “Fields to Split Out” parameter, enter urlset.url. This extracts each URL from the sitemap JSON array as a separate item for iteration.

This makes an iterable list so the workflow can scan each page individually.
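
Each resulting item should carry a single URL under the loc key, which is the field the Jina.ai scraper node references later:

{ "loc": "https://ai.pydantic.dev/agents/" }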

6. Filter URLs by Relevant Topics or Pages

Add a Filter node called “Filter By Topics or Pages”. Configure the conditions with the OR combinator so the node keeps URLs that meet any of the following:

  • Exactly equal https://ai.pydantic.dev/
  • Contain the word “agent” (case-insensitive)
  • Contain the word “tool” (case-insensitive)

This filters out URLs unrelated to your topic, focusing on important page categories.
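
Expressed as plain JavaScript, the intent of these OR-combined conditions is roughly:

// Rough JavaScript equivalent of the Filter node's OR conditions
const keep = (loc) =>
  loc === 'https://ai.pydantic.dev/' ||
  loc.toLowerCase().includes('agent') ||
  loc.toLowerCase().includes('tool');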

7. Limit URLs to First 20

Use a Limit node named “Limit” to restrict the number of URLs processed to 20, preventing excessive scraping or server overload.

8. Batch Process URLs with SplitInBatches

Add a SplitInBatches node named “Loop Over Items”. This takes the filtered and limited URLs and processes them one at a time (the default batch size). This pacing helps avoid hitting request limits or getting blocked.

9. Call Jina.ai Scraper per URL

Add an HTTP Request node named “Jina.ai Web Scraper”. Set the URL field to:
=https://r.jina.ai/{{ $json.loc }}
This dynamically calls the Jina.ai scraping service on each URL.

The response will contain scraped page data in raw text.
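
The raw text from r.jina.ai typically begins with a few header lines followed by the page body, which is exactly what the next step parses (values illustrative, content abbreviated):

Title: Agents - PydanticAI
URL Source: https://ai.pydantic.dev/agents/
Markdown Content:
# Agents
…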

10. Extract Title and Markdown Content Using Code Node

Add a Code node named “Extract Title & Markdown Content”. Paste the following JavaScript code to parse the Jina.ai response:

// Get the raw text output from the Jina.ai HTTP Request node
const data = $input.first().json.data;

// Regular expression to capture the "Title:" header line
const titleRegex = /^Title:\s*(.+)$/m;
// Regular expression to capture everything after "Markdown Content:"
const markdownRegex = /Markdown Content:\n([\s\S]+)/;

// Extract the title using the first capture group
const titleMatch = data.match(titleRegex);
const title = titleMatch ? titleMatch[1].trim() : '';

// Extract the markdown content using the first capture group
const markdownMatch = data.match(markdownRegex);
const markdown = markdownMatch ? markdownMatch[1].trim() : '';

// Return a single n8n item containing the title and markdown
return [{ json: { title, markdown } }];

This extracts structured title and content in markdown ready for saving.

11. Save Content to Google Drive

Add a Google Drive node named “Save Webpage Contents to Google Drive”. Configure it to create a new text file:

  • Name: Use expressions to combine the page URL and title, e.g. {{ $('Loop Over Items').item.json.loc }} - {{ $json.title }}
  • Content: Use the markdown extracted from the previous step.
  • Drive & Folder: Default to “My Drive” and root folder or your preferred folder.

This archives your web scraping output as individual markdown files accessible anytime.
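
As an illustration, the Name expression resolves per item roughly like this (values hypothetical):

// loc   = "https://ai.pydantic.dev/agents/"
// title = "Agents - PydanticAI"
// Name  → "https://ai.pydantic.dev/agents/ - Agents - PydanticAI"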

12. Include Wait Between Requests

Add a Wait node named “Wait” that pauses briefly before the loop continues to the next item, controlling the request rate and respecting the target server’s limits.

Customizations ✏️

  • Broaden Filters: In the “Filter By Topics or Pages” node, add or change keywords to match your desired page topics.
  • Increase URL Limit: In the “Limit” node, expand maxItems to scrape more pages, but be mindful of rate limits.
  • Change Output Folder: In the “Save Webpage Contents to Google Drive” node, select a specific folder where your markdown files should be saved.
  • Add Additional Processing: Insert more Code nodes or HTTP requests after scraping for extra content transformation or analytics; see the sketch below.
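
For example, here is a minimal sketch of an extra Code node (hypothetical, not part of the original workflow) that annotates each scraped item with a word count:

// Hypothetical post-processing Code node: add a word count to each item
const items = $input.all();
for (const item of items) {
  const markdown = item.json.markdown || '';
  item.json.word_count = markdown.split(/\s+/).filter(Boolean).length;
}
return items;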

Troubleshooting 🔧

Problem: “404 Not Found” from the “Get List of Website URLs” HTTP Request node

Cause: The sitemap URL may be incorrect or the website does not have a sitemap.xml file at the specified location.

Solution: Verify the sitemap URL https://ai.pydantic.dev/sitemap.xml is correct and accessible via browser.

Problem: Jina.ai returns empty or malformed data

Cause: The Jina.ai endpoint might have issues or the URL passed is invalid.

Solution: Check the URL format passed in “Jina.ai Web Scraper” and test it directly in a browser (e.g. open https://r.jina.ai/https://ai.pydantic.dev/agents/). Also, monitor Jina.ai service status.

Problem: Google Drive file creation fails

Cause: OAuth credentials might be missing or improperly configured.

Solution: Reauthenticate your Google Drive account in n8n under Credentials and ensure proper permissions for file creation.

Pre-Production Checklist ✅

  • Make sure your sitemap URL is valid and accessible.
  • Confirm Google Drive OAuth credentials are set and working.
  • Test the HTTP Request to Jina.ai for a sample URL and verify output.
  • Run a limited batch with the Limit node to avoid overload.
  • Backup any existing data in target Google Drive folder to prevent overwrites.

Deployment Guide

Once tested, activate the workflow in n8n by toggling the Active switch. To run it unattended, replace the Manual Trigger with a Schedule Trigger node or drive it from an external scheduler.

Monitor execution logs in n8n to catch any errors quickly. For large scale projects, consider splitting batches further or adding throttling.

FAQs

  • Can I scrape websites without a sitemap?
    While this workflow uses sitemaps for URL discovery, you can modify it to crawl pages by other means, but that requires additional URL discovery logic.
  • Does this consume Jina.ai API credits?
    No API key is needed, so scraping doesn’t consume a paid credit quota; note that unauthenticated requests to r.jina.ai are typically subject to stricter rate limits.
  • Is my data secure on Google Drive?
    Your data is stored in your Google Drive account, secured by OAuth and Google’s standard protections.
  • Can this workflow handle large websites?
    The Limit and batching nodes help manage scale, but extremely large websites may need additional optimizations.

Conclusion

By following this tutorial, you have set up an n8n workflow that automatically scrapes an entire multipage website (here, one focused on AI tools and agents) using Jina.ai, filters down to relevant pages, extracts structured markdown content, and saves it neatly in Google Drive.

Content managers like Sarah can save hours weekly, avoid manual errors, and maintain up-to-date archives without coding overhead.

Next, consider automations to transform scraped data into reports, send summaries via email, or update team collaboration docs.

Happy scraping and automating!

Related Workflows

  • Automate Viral UGC Video Creation Using n8n + Degaus (Beginner-Friendly Guide): Learn how to automate viral UGC video creation using n8n, AI prompts, and Degaus. This beginner-friendly guide shows how to import, configure, and run the workflow without technical complexity.
  • AI SEO Blog Writer Automation in n8n (Beginner Guide): A complete beginner guide to building an AI-powered SEO blog writer automation using n8n.
  • Automate CrowdStrike Alerts with VirusTotal, Jira & Slack: Automates processing of CrowdStrike detections by enriching threat data via VirusTotal, creating Jira tickets for incident tracking, and notifying teams on Slack for quick response.
  • Automate Telegram Invoices to Notion with AI Summaries & Reports: Automates invoice extraction from Telegram photos to Notion using Google Gemini AI, records transactions, and sends scheduled spending reports with charts via Telegram.
  • Automate Email Replies with n8n and AI-Powered Summarization: Uses IMAP email triggers, AI summarization, and vector search to draft concise replies that require minimal review.
  • Automate Email Campaigns Using n8n with Gmail & Google Sheets: Automates personalized email outreach campaigns by integrating Gmail and Google Sheets, with timely follow-ups based on previous email interactions.