From sitemap crawling to vector storage: Creating an efficient workflow for RAG
This template crawls a website from its sitemap, deduplicates URLs in Supabase, scrapes pages with Crawl4AI, cleans and validates the text, then stores content + metadata in a Supabase vector store using OpenAI embeddings. It’s a reliable, repeatable pipeline for building searchable knowledge bases, SEO research corpora, and RAG datasets.
Good to know
• Built-in de-duplication via a scrape_queue table (status: pending/completed/error).
• Resilient flow: waits, retries, and marks failed tasks.
• Costs depend on Crawl4AI usage and OpenAI embeddings.
• Replace any placeholders (API keys, tokens, URLs) before running.
• Respect website robots/ToS and applicable data laws when scraping.
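The wait/retry behavior noted above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the template's actual node logic: `poll_task`, its parameters, and the remote state names ("pending", "completed", "failed") are hypothetical. Only the queue statuses pending/completed/error come from the description.

```python
import time
from typing import Callable


def poll_task(check: Callable[[], str], max_attempts: int = 5, delay_s: float = 1.0) -> str:
    """Poll a scrape task and return the status to write to scrape_queue.

    `check` is a hypothetical callable that returns the remote task state
    ("pending", "completed", or "failed") and may raise on transient
    network errors.
    """
    for _ in range(max_attempts):
        try:
            state = check()
        except ConnectionError:
            state = "pending"  # treat a transient failure as "not done yet"
        if state == "completed":
            return "completed"
        if state == "failed":
            return "error"  # the task itself failed, so stop retrying
        time.sleep(delay_s)  # wait before the next poll
    return "error"  # attempts exhausted: mark the task as failed


# Example with a fake task that completes on the third poll.
fake_states = iter(["pending", "pending", "completed"])
final_status = poll_task(lambda: next(fake_states), max_attempts=5, delay_s=0)
```

Capping the number of attempts keeps a stuck task from blocking the rest of the queue; it is marked as error and can be retried in a later run.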
How it works
1. Sitemap fetch & parse — Load sitemap.xml, extract all URLs.
2. De-dupe — Normalize URLs, check Supabase scrape_queue; insert only new ones.
3. Scrape — Send URLs to Crawl4AI; poll task status until completed.
4. Clean & score — Remove boilerplate/markup, detect content type, compute quality metrics, extract metadata (title, domain, language, length).
5. Chunk & embed — Split text, create OpenAI embeddings.
6. Store — Upsert into Supabase vector store (documents) with metadata; update job status.
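Steps 1 and 2 above can be sketched in plain Python to show the idea. This is a minimal sketch under assumptions, not the workflow's own implementation: the function names, the normalization rules (lowercase host, drop fragment and trailing slash), and the sample sitemap are all hypothetical. In the workflow, the "already queued" set would come from the scrape_queue table.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def new_urls(candidates: list[str], already_queued: set[str]) -> list[str]:
    """Keep only URLs whose normalized form is not queued yet (order preserved)."""
    seen = set(already_queued)  # assumed to hold normalized URLs
    fresh = []
    for url in candidates:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://Example.com/docs/</loc></url>"
    "<url><loc>https://example.com/docs#intro</loc></url>"
    "<url><loc>https://example.com/blog</loc></url>"
    "</urlset>"
)

to_insert = new_urls(extract_urls(SAMPLE), {"https://example.com/blog"})
print(to_insert)  # ['https://example.com/docs']
```

The first two sample entries normalize to the same URL, so only one row would be inserted, and the blog URL is skipped because it is already queued.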
Requirements
• Supabase (Postgres with the pgvector extension enabled)
• Crawl4AI API key (or header auth)
• OpenAI API key (for embeddings)
• n8n credentials set for HTTP, Postgres/Supabase
How to use
1. Configure credentials (Supabase/Postgres, Crawl4AI, OpenAI).
2. (Optional) Run the provided SQL to create scrape_queue and documents.
3. Set your sitemap URL in the HTTP Request node.
4. Execute the workflow (manual trigger) and monitor Supabase statuses.
5. Query your documents table or vector store from your app/RAG stack.
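For step 5, it may help to see what a similarity lookup does conceptually. The sketch below ranks stored chunks by cosine similarity in plain Python. It is an illustration under assumptions: the three-dimensional vectors, the row layout, and the `top_k` helper are hypothetical, and in practice the ranking would run inside the database rather than in application code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[dict], k: int = 2) -> list[str]:
    """Return the content of the k rows most similar to the query embedding."""
    scored = [(cosine_similarity(query, row["embedding"]), row["content"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [content for _, content in scored[:k]]


# Toy rows standing in for the documents table (real embeddings are much longer).
rows = [
    {"content": "pricing page", "embedding": [1.0, 0.0, 0.0]},
    {"content": "careers page", "embedding": [0.0, 1.0, 0.0]},
    {"content": "plans overview", "embedding": [0.9, 0.1, 0.0]},
]

matches = top_k([1.0, 0.0, 0.0], rows, k=2)
print(matches)  # ['pricing page', 'plans overview']
```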
Potential Use Cases
This automation is ideal for:
- Market research teams collecting competitive data
- Content creators monitoring web trends
- SEO specialists tracking website content updates
- Analysts gathering structured data for insights
- Anyone needing reliable, structured web content for analysis
Need help customizing?
Contact me for consulting and support: LinkedIn
n8n Workflow: Sitemap Crawling to Vector Storage for RAG
This n8n workflow provides an efficient, automated pipeline for extracting content from a website's sitemap, processing it, and storing it in a Supabase vector database. This process is crucial for building a robust Retrieval-Augmented Generation (RAG) system, enabling AI models to retrieve relevant information from your website content.
What it does
This workflow automates the following steps:
- Manual Trigger: Initiates the workflow upon manual execution.
- HTTP Request (Get Sitemap): Fetches the XML content of a specified website sitemap.
- XML Parsing: Parses the retrieved sitemap XML to extract individual URLs.
- Edit Fields (Set URL): Transforms the extracted URL data into a usable format, setting the url field for subsequent steps.
- Loop Over Items (Split in Batches): Processes the URLs in batches to manage API limits and system resources efficiently.
- HTTP Request (Get Page Content): For each URL, it makes an HTTP request to fetch the actual web page content.
- Wait: Introduces a delay between page content requests to avoid overwhelming the target website or hitting rate limits.
- Default Data Loader: Loads the fetched HTML content into a document format suitable for text processing.
- Character Text Splitter: Splits the loaded document content into smaller, manageable chunks (e.g., paragraphs or sections) to optimize embedding and retrieval.
- Embeddings OpenAI: Generates vector embeddings for each text chunk using the OpenAI API.
- Supabase Vector Store: Stores the generated text chunks and their corresponding embeddings into a Supabase vector database.
- Postgres (Optional Logging): An optional node for logging or further processing of the data in a PostgreSQL database (currently not connected in the provided JSON).
- Supabase (Optional): An optional, unconnected Supabase node, potentially for other database operations.
- If (Conditional Logic): An unconnected conditional node, which could be used to implement branching logic based on data (e.g., checking for successful page fetches).
- Code (Custom Logic): An unconnected code node, allowing for custom JavaScript logic if needed.
- Sticky Note: Provides a note within the workflow for documentation or reminders.
- Split Out: An unconnected node, potentially for further data manipulation.
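The chunking step in the list above can be illustrated with a simplified fixed-width splitter. This is a sketch under assumptions and not the Character Text Splitter node's actual algorithm: `split_text`, `chunk_size`, and `overlap` are hypothetical names, and the node may split on a separator rather than at fixed offsets.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Smaller chunks tend to give more precise matches at retrieval time but produce more embedding calls, so the chunk size is worth tuning against your content.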
Prerequisites/Requirements
To use this workflow, you will need:
- n8n Instance: A running n8n instance.
- OpenAI API Key: For generating text embeddings. This should be configured as an n8n credential.
- Supabase Project: A Supabase project with a configured vectors table or similar, and appropriate API keys for connecting to the vector store. This should be configured as an n8n credential.
- Website Sitemap URL: The URL of the sitemap you wish to crawl (e.g., https://example.com/sitemap.xml).
- PostgreSQL Database (Optional): If you intend to use the Postgres node for logging or storage, you'll need a PostgreSQL database and its credentials.
Setup/Usage
- Import the workflow: Download the JSON provided and import it into your n8n instance.
- Configure Credentials:
  - Set up your OpenAI API Key as an n8n credential.
  - Set up your Supabase credentials (Project URL and Anon Key) as an n8n credential.
  - If using the Postgres node, configure your PostgreSQL credentials.
- Configure Nodes:
  - HTTP Request (Get Sitemap): Ensure the URL points to your target website's sitemap.
  - HTTP Request (Get Page Content): This node dynamically uses the URLs extracted from the sitemap.
  - Embeddings OpenAI: Select your configured OpenAI credential.
  - Supabase Vector Store: Select your configured Supabase credential and specify the table name and content column where vectors should be stored.
  - Adjust the Wait node's delay as needed to prevent rate limiting.
  - Review the Character Text Splitter configuration to ensure chunking is suitable for your content.
- Activate the workflow: Once configured, activate the workflow.
- Execute the workflow: Click "Execute workflow" in the Manual Trigger node to start the process.
This workflow provides a powerful foundation for populating your RAG system with up-to-date website content, making your AI applications more informed and accurate.