Web scraper: extract website content from sitemaps to Google Drive
A reliable, no-frills web scraper that extracts content directly from websites using their sitemaps. Perfect for content audits, migrations, and research when you need straightforward HTML extraction without external dependencies.
How It Works

This streamlined workflow takes a practical approach to web scraping by leveraging XML sitemaps and direct HTTP requests. Here's how it delivers consistent results:
Direct Sitemap Processing: The workflow starts by fetching your target website's XML sitemap and parsing it to extract all available page URLs. This eliminates guesswork and ensures comprehensive coverage of the site's content structure.
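For context, here's a minimal standalone sketch of what the sitemap step does; in the workflow itself this is handled by the HTTP Request and XML nodes. The sketch assumes a standard `<urlset>` sitemap and Node 18+ (run as an ES module):

```javascript
// Standalone sketch of the sitemap step (the workflow uses the
// HTTP Request and XML nodes for this instead).
const res = await fetch('https://example.com/page-sitemap.xml');
const xml = await res.text();

// Collect every <loc> entry from a standard <urlset> sitemap.
const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1]);
console.log(`Found ${urls.length} URLs:`, urls.slice(0, 3));
```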
Robust HTTP Scraping: Each page is scraped using direct HTTP requests with realistic browser headers that mimic legitimate web traffic. The scraper includes comprehensive error handling and timeout protection to handle various website configurations gracefully.
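As an illustration of that fetch step, here is a rough sketch with browser-like headers and an abort-based timeout. The header values and the 30-second limit are illustrative assumptions, not the HTTP Request node's exact configuration:

```javascript
// Sketch of a single page fetch with browser-like headers and a timeout.
const pageUrl = 'https://example.com/about/';
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 30_000); // assumed 30 s cap

let html = '';
try {
  const res = await fetch(pageUrl, {
    signal: controller.signal,
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept': 'text/html,application/xhtml+xml',
      'Accept-Language': 'en-US,en;q=0.9',
    },
  });
  if (!res.ok) console.warn(`HTTP ${res.status} for ${pageUrl}`);
  html = await res.text();
} catch (err) {
  console.warn(`Failed to fetch ${pageUrl}:`, err.message);
} finally {
  clearTimeout(timer);
}
```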
Intelligent Content Extraction: The workflow uses sophisticated JavaScript parsing to extract meaningful content from raw HTML. It automatically identifies page titles through multiple methods (title tags, Open Graph metadata, H1 headers) and converts HTML structure into readable text format.
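A simplified sketch of that fallback chain and tag stripping might look like the following; the actual Code node's regexes and selectors may differ:

```javascript
// Title fallback chain: <title>, then og:title, then the first <h1>.
// These regexes are illustrative (e.g., og:title assumes property comes
// before content in the meta tag).
function extractTitle(html) {
  const candidates = [
    /<title[^>]*>([^<]+)<\/title>/i,
    /<meta[^>]+property=["']og:title["'][^>]+content=["']([^"']+)["']/i,
    /<h1[^>]*>([^<]+)<\/h1>/i,
  ];
  for (const re of candidates) {
    const m = html.match(re);
    if (m) return m[1].trim();
  }
  return 'Untitled';
}

// Convert raw HTML into readable text by dropping scripts, styles, and tags.
function htmlToText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}
```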
Framework Detection: Built-in detection identifies whether sites use WordPress, Divi themes, or heavy JavaScript frameworks. This helps explain content extraction quality and provides valuable insights about the site's technical architecture.
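In practice, detection like this usually comes down to marker strings in the raw HTML. The fingerprints below are common examples, not necessarily the workflow's exact list:

```javascript
// Sketch of the framework checks via common HTML fingerprints.
function detectFramework(html) {
  return {
    wordpress: /wp-content|wp-includes/i.test(html),
    divi: /et_pb_|Divi/i.test(html),
    // An empty root container often signals a client-rendered SPA.
    jsHeavy: /<div id=["'](root|app|__next)["']>\s*<\/div>/i.test(html),
  };
}
```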
Rich Metadata Collection: Each scraped page includes detailed metadata like word count, HTML size, response codes, and technical indicators. This data is formatted into comprehensive markdown files with YAML frontmatter for easy analysis and organization.
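As a sketch of the output format, each file might be assembled roughly like this; the frontmatter key names here are assumptions, not necessarily the workflow's exact schema:

```javascript
// Illustrative assembly of one markdown output file with YAML frontmatter.
const meta = {
  title: 'About Us',
  url: 'https://example.com/about/',
  wordCount: 842,
  htmlSize: 51234,
  statusCode: 200,
  framework: 'WordPress (Divi)',
};
const textContent = '...extracted page text...';

const markdownFile = [
  '---',
  `title: "${meta.title}"`,
  `url: ${meta.url}`,
  `word_count: ${meta.wordCount}`,
  `html_size: ${meta.htmlSize}`,
  `status_code: ${meta.statusCode}`,
  `framework: ${meta.framework}`,
  `scraped_at: ${new Date().toISOString()}`,
  '---',
  '',
  textContent,
].join('\n');
```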
Respectful Rate Limiting: The workflow includes a 3-second delay between page requests to respect server resources and avoid overwhelming target websites. The processing is sequential and controlled to maintain ethical scraping practices.
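In plain JavaScript terms, the pause between pages amounts to something like this (in the workflow it is the "Wait Between Pages" node):

```javascript
// Equivalent of the 3-second pause applied between page requests.
await new Promise((resolve) => setTimeout(resolve, 3000));
```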
Detailed Success Reporting: Every scraped page generates a report showing extraction success, potential issues (like JavaScript dependencies), and technical details about the site's structure and framework.
Setup Steps
Configure Google Drive Integration
- Connect your Google Drive account in the "Save to Google Drive" node
- Replace YOUR_GOOGLE_DRIVE_CREDENTIAL_ID with your actual Google Drive credential ID
- Create a dedicated folder for your scraped content in Google Drive
- Copy the folder ID from the Google Drive URL (the long string after /folders/)
- Replace YOUR_GOOGLE_DRIVE_FOLDER_ID_HERE with your actual folder ID in both the folderId field and cachedResultUrl
- Update YOUR_FOLDER_NAME_HERE with your folder's actual name
Set Your Target Website
In the "Set Sitemap URL" node, replace https://yourwebsitehere.com/page-sitemap.xml with your target website's sitemap URL Common sitemap locations include /sitemap.xml, /page-sitemap.xml, or /sitemap_index.xml Tip: Not sure where your sitemap is? Use a free online tool like https://seomator.com/sitemap-finder Verify the sitemap URL loads correctly in your browser before running the workflow
Update Workflow IDs (Automatic)
- When you import this workflow, n8n will automatically generate new IDs for YOUR_WORKFLOW_ID_HERE, YOUR_VERSION_ID_HERE, YOUR_INSTANCE_ID_HERE, and YOUR_WEBHOOK_ID_HERE
- No manual changes are needed for these placeholders
Adjust Processing Limits (Optional)
The "Limit URLs (Optional)" node is currently disabled for full site scraping Enable this node and set a smaller number (like 5-10) for initial testing For large websites, consider running in batches to manage processing time and storage
Customize Rate Limiting (Optional)
The "Wait Between Pages" node is set to 3 seconds by default Increase the delay for more respectful scraping of busy sites Decrease only if you have permission and the target site can handle faster requests
Test Your Configuration
Enable the "Limit URLs (Optional)" node and set it to 3-5 pages for testing Click "Test workflow" to verify the setup works correctly Check your Google Drive folder to confirm files are being created with proper content Review the generated markdown files to assess content extraction quality
Run Full Extraction
Disable the "Limit URLs (Optional)" node for complete site scraping Execute the workflow and monitor the execution log for any errors Large websites may take considerable time to process completely (plan for several hours for sites with hundreds of pages)
Review Results
- Each generated file includes technical metadata to help you assess extraction quality
- Look for indicators like "Limited Content" warnings on JavaScript-heavy pages
- Files include word counts and framework detection results to help you understand the site's structure
Framework Compatibility: This scraper is designed to work well with WordPress sites, Divi themes, and many JavaScript-heavy frameworks. The content extraction handles dynamic content effectively and reports detailed framework detection results. Single-page applications (SPAs) that render entirely through client-side JavaScript may yield limited content, but most modern websites, including those built with popular CMS platforms, extract well.

Important Notes: Always ensure you have permission to scrape your target website and respect its robots.txt guidelines. The workflow includes respectful delays and error handling, but monitor your usage to maintain ethical scraping practices.
Web Scraper: Extract Website Content from Sitemaps to Google Drive
This n8n workflow automates the process of extracting content from websites by first identifying URLs from a sitemap, then fetching the content of each URL, and finally saving the extracted data to Google Drive. It's designed to efficiently scrape website content for analysis, archiving, or other data processing needs.
What it does
- Triggers Manually: The workflow starts when manually executed.
- Fetches Sitemap: It makes an HTTP request to a specified sitemap URL (e.g., https://n8n.io/sitemap.xml) to retrieve the sitemap XML.
- Parses XML: The retrieved XML content is parsed to extract all the URLs listed within the sitemap.
- Limits URLs (Optional): It includes an optional "Limit" node to restrict the number of URLs processed, useful for testing or partial scrapes. This is currently set to process only the first 5 items.
- Loops Over URLs: It iterates through each extracted URL in batches (e.g., 5 at a time) to manage processing load.
- Fetches Website Content: For each URL, it makes another HTTP request to fetch the actual HTML content of the webpage.
- Extracts Text Content: A "Code" node is used to process the HTML content, extracting only the visible text from the webpage, removing HTML tags and other non-text elements.
- Prepares Data for Google Drive: An "Edit Fields (Set)" node prepares the extracted text content by assigning it a filename (based on the original URL) and the content itself (see the filename sketch after this list).
- Uploads to Google Drive: The extracted text content for each page is then uploaded as a new file to a specified folder in Google Drive.
- Waits (Optional): A "Wait" node introduces a delay (e.g., 1 second) between processing each batch of URLs, helping to prevent overwhelming the target website or API rate limits.
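As a rough illustration of the filename step, assuming a simple slug-style scheme (the actual "Edit Fields (Set)" expression may differ):

```javascript
// Sketch of deriving a safe filename from a page URL.
const url = 'https://example.com/blog/my-post/';
const filename = url
  .replace(/^https?:\/\//, '')    // drop the scheme
  .replace(/[^a-zA-Z0-9]+/g, '-') // collapse everything else to dashes
  .replace(/^-+|-+$/g, '')        // trim stray dashes
  + '.txt';
// -> "example-com-blog-my-post.txt"
```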
Prerequisites/Requirements
- n8n Instance: A running instance of n8n.
- Google Drive Account: A Google Drive account to store the extracted website content.
- Google Drive Credentials: OAuth2 credentials configured in n8n for Google Drive.
- Target Website Sitemap: The URL of the sitemap (e.g., sitemap.xml) of the website you wish to scrape.
Setup/Usage
- Import the Workflow: Import the provided JSON into your n8n instance.
- Configure Google Drive Credentials:
- Locate the "Google Drive" node.
- Click on "Credentials" and select or create new Google Drive OAuth2 credentials. Follow the n8n documentation for setting up Google Drive credentials if needed.
- Specify Sitemap URL:
- Locate the first "HTTP Request" node (ID 19).
- Update the `URL` field with the sitemap URL of the website you want to scrape (e.g., https://example.com/sitemap.xml).
- Configure Google Drive Folder:
- Locate the "Google Drive" node.
- Specify the `Folder ID` where you want the extracted content files to be saved. You can find the folder ID in the URL when browsing the folder in Google Drive.
- Adjust Limit (Optional):
- If you want to process more or fewer URLs, adjust the `Limit` value in the "Limit" node (ID 1237).
- Adjust Batch Size and Wait Time (Optional):
- Modify the `Batch Size` in the "Loop Over Items (Split in Batches)" node (ID 39) to control how many URLs are processed concurrently.
- Adjust the `Time to Wait` in the "Wait" node (ID 514) to control the delay between batches, which can help avoid rate limiting from the target website.
- Activate and Execute:
- Activate the workflow.
- Click "Execute Workflow" on the "Manual Trigger" node to run the workflow.
The workflow will then proceed to fetch sitemap URLs, scrape content, and upload it to your designated Google Drive folder.