
Web crawler: Convert websites to AI-ready markdown in Google Sheets

Daniel Nkencho
1090 views
2/3/2026

Transform any website into a structured knowledge repository with this crawler, which extracts hyperlinks from the homepage, separates image links from content pages, and aggregates the full page content as Markdown. It is well suited to feeding AI agents or building company dossiers without manual effort.

📋 What This Template Does

This workflow acts as a lightweight web crawler: it scrapes the homepage to discover internal links (similar to reading a sitemap), deduplicates and validates them, separates image assets from textual pages, then fetches the non-image pages and converts their content to clean Markdown. Results are appended to Google Sheets for analysis, export, or loading into a vector database.

  • Automatically discovers and processes subpage links from the homepage
  • Filters out duplicates and non-HTTP links for efficient crawling
  • Converts scraped content to Markdown for AI-ready formatting
  • Categorizes and stores images, links, and full content in a single sheet row per site
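
The link handling described above can be sketched in plain JavaScript. This is an illustration only, not the workflow's actual node code: the function name `discoverLinks`, the regex-based href extraction, and the list of image extensions are all assumptions made for the example.

```javascript
// Hypothetical helper; names and the extension list are assumptions,
// not taken from the workflow.
const IMAGE_EXTENSIONS = /\.(png|jpe?g|gif|webp|svg|ico|bmp)$/i;

function discoverLinks(html, baseUrl) {
  const pages = new Set();   // Sets drop duplicate URLs automatically
  const images = new Set();
  // Naive href extraction; a real HTML parser is more robust than a regex.
  const hrefPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;

  for (const match of html.matchAll(hrefPattern)) {
    let url;
    try {
      url = new URL(match[1], baseUrl); // resolves relative links against the homepage
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Drop mailto:, tel:, javascript: and other non-HTTP schemes.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // "/about" and "/about#team" count as the same page

    const target = IMAGE_EXTENSIONS.test(url.pathname) ? images : pages;
    target.add(url.href);
  }
  return { pages: [...pages], images: [...images] };
}
```

Calling `discoverLinks(homepageHtml, "https://example.com")` returns two arrays, one of page URLs to fetch and one of image URLs to store as-is. The sketch does not restrict links to the homepage's own host; add a hostname comparison if only internal links are wanted.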

🔧 Prerequisites

  • Google account with Sheets access for data storage
  • n8n instance (cloud or self-hosted)
  • Basic understanding of URLs and web links

🔑 Required Credentials

Google Sheets OAuth2 API Setup

  1. Go to console.cloud.google.com → APIs & Services → Credentials
  2. Click "Create Credentials" → Select "OAuth client ID" → Choose "Web application"
  3. Add authorized redirect URIs: https://your-n8n-instance.com/rest/oauth2-credential/callback (replace with your n8n URL)
  4. Download the client ID and secret, then add to n8n as "Google Sheets OAuth2 API" credential type
  5. During setup, grant access to Google Sheets scopes (e.g., spreadsheets) and test the connection by listing a sheet

⚙️ Configuration Steps

  1. Import the workflow JSON into your n8n instance
  2. In the "Set Website" node, update the website_url value to your target site (e.g., https://example.com)
  3. Assign your Google Sheets credential to the three "Add ... to Sheet" nodes
  4. Update the documentId and sheetName in those nodes to your target spreadsheet ID and sheet name/ID
  5. Ensure your sheet has columns: "Website", "Links", "Scraped Content", "Images"
  6. Activate the workflow and trigger manually to test scraping
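
As a rough illustration of step 5, the sketch below checks a header row against the four required column names and builds one row per site. Only the column names come from the steps above; the helper names (`missingColumns`, `buildSheetRow`) and the separators used to join values are assumptions made for the example.

```javascript
// Column names as listed in the configuration steps (matching is case-sensitive).
const REQUIRED_COLUMNS = ["Website", "Links", "Scraped Content", "Images"];

// Hypothetical check: returns the required columns absent from a header row.
function missingColumns(headerRow) {
  return REQUIRED_COLUMNS.filter((name) => !headerRow.includes(name));
}

// Hypothetical row builder; the newline and "---" separators are assumptions.
function buildSheetRow(websiteUrl, pageLinks, imageLinks, markdownPages) {
  return {
    Website: websiteUrl,
    Links: pageLinks.join("\n"),
    "Scraped Content": markdownPages.join("\n\n---\n\n"),
    Images: imageLinks.join("\n"),
  };
}
```

A header row such as `["Website", "links", "Scraped Content", "Images"]` would report `Links` as missing, because the lowercase `links` does not match.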

🎯 Use Cases

  • Knowledge base creation: Crawl a company's site to aggregate all content into Sheets, then export to Notion or a vector DB for internal wikis
  • AI agent training: Extract structured Markdown from industry sites to fine-tune LLMs on domain-specific data like legal docs or tech blogs
  • Competitor intelligence: Build dossiers by crawling rival websites, separating assets and text for SEO audits or market analysis
  • Content archiving: Preserve dynamic sites (e.g., news portals) as static knowledge dumps for compliance or historical research

⚠️ Troubleshooting

  • No links extracted: Verify the homepage has <a> tags; test with a simple site like example.com and check HTTP response in executions
  • Sheet update fails: Confirm column names match exactly (case-sensitive) and credential has edit permissions; try a new blank sheet
  • Content truncated: Google Sheets limits each cell to 50,000 characters; adjust the .slice(0, 50000) in "Add Scraped Content to Sheet" or split the content into multiple rows
  • Rate limiting errors: Add a "Wait" node after "Scrape Links" with 1-2s delay if the site blocks rapid requests
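
For the truncation case, one alternative to a hard `.slice(0, 50000)` is to split the Markdown into cell-sized pieces and write one piece per row. The function below is a minimal sketch of that idea; `splitForCells` is a hypothetical name, and cutting at newlines is a design choice made for this example rather than something the workflow does.

```javascript
const SHEETS_CELL_LIMIT = 50000; // per-cell character limit in Google Sheets

// Hypothetical helper: splits text into chunks no longer than `limit`.
function splitForCells(text, limit = SHEETS_CELL_LIMIT) {
  const chunks = [];
  let rest = text;
  while (rest.length > limit) {
    // Prefer the last newline within the limit so Markdown lines stay whole.
    let cut = rest.lastIndexOf("\n", limit);
    if (cut <= 0) cut = limit; // no usable newline: fall back to a hard cut
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut);
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Joining the returned chunks with an empty string reproduces the original text, so nothing is lost when the pieces are spread over several rows.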

Web Crawler to AI-Ready Markdown in Google Sheets

This n8n workflow automates the process of crawling websites, extracting their content, converting it into AI-ready Markdown format, and then storing the results in a Google Sheet. It's designed to help you quickly gather and prepare web content for various AI applications like content generation, summarization, or knowledge base creation.

What it does

  1. Triggers Manually: The workflow is initiated manually.
  2. Reads URLs from Google Sheets: It fetches a list of URLs from a specified Google Sheet.
  3. Removes Duplicate URLs: Ensures that each URL is processed only once to avoid redundant work.
  4. Crawls Each URL: For each unique URL, it performs an HTTP request to fetch the web page content.
  5. Extracts Main Content: Uses an HTML node to extract the primary content from the fetched web page, ignoring boilerplate elements like headers, footers, and sidebars.
  6. Converts to Markdown: Transforms the extracted HTML content into a clean Markdown format.
  7. Prepares Data for Google Sheets: Organizes the crawled URL and its Markdown content into a structured format suitable for appending to a Google Sheet.
  8. Appends to Google Sheet: Adds the processed data (URL and Markdown content) as new rows to a designated Google Sheet.
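
To make the HTML-to-Markdown step concrete, here is a deliberately small converter written with regular expressions. It is a sketch under stated assumptions, not how the workflow's conversion node is implemented: `htmlToMarkdown` is a hypothetical name, only headings, links, list items, and paragraph breaks are handled, and HTML entities are left undecoded.

```javascript
// Hypothetical, minimal converter for illustration; real pages need a proper parser.
function htmlToMarkdown(html) {
  return html
    // Remove script and style blocks together with their contents.
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, "")
    // <h1>..<h6> become "#".."######" headings.
    .replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi, (_, level, text) =>
      `\n\n${"#".repeat(Number(level))} ${text.trim()}\n\n`)
    // <a href="url">text</a> becomes [text](url).
    .replace(/<a\b[^>]*\bhref\s*=\s*["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // List items become "- item" lines.
    .replace(/<li\b[^>]*>([\s\S]*?)<\/li>/gi, "\n- $1")
    // Paragraph ends and <br> become blank lines.
    .replace(/<\/p>|<br\s*\/?>/gi, "\n\n")
    // Strip any remaining tags, then collapse extra blank lines.
    .replace(/<[^>]+>/g, "")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

For example, `htmlToMarkdown('<h1>Title</h1><p>Hello</p>')` returns `"# Title\n\nHello"`.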

Prerequisites/Requirements

  • n8n Instance: A running n8n instance (cloud or self-hosted).
  • Google Sheets Account: A Google account with access to Google Sheets.
  • Google Sheets Credential: An n8n credential configured for Google Sheets to allow the workflow to read from and write to your spreadsheets.

Setup/Usage

  1. Import the Workflow: Import the provided JSON into your n8n instance.
  2. Configure Google Sheets Credentials:
    • Locate the "Google Sheets" node.
    • Click on the "Credential" field and select an existing Google Sheets OAuth2 credential or create a new one. Ensure it has the necessary permissions to read and write to your Google Sheets.
  3. Specify Input Google Sheet:
    • In the "Google Sheets" node (the first one), configure it to read from your input Google Sheet. This sheet should contain a column with the URLs you want to crawl.
  4. Specify Output Google Sheet:
    • In the "Google Sheets" node (the last one), configure it to write to your output Google Sheet. This sheet will store the crawled URLs and their generated Markdown content.
  5. Adjust HTML Extraction (Optional):
    • The "HTML" node is configured to extract the main content. You might need to adjust the CSS selectors within this node based on the structure of the websites you intend to crawl for optimal content extraction.
  6. Execute the Workflow: Click "Execute Workflow" to run the process. The workflow will read URLs, crawl them, convert content to Markdown, and update your Google Sheet.
