Multi-format document processing for RAG chatbot with Google Drive & Supabase

508 views

2/3/2026

PostgreSQL Backup AWS S3 Database Restore

This n8n workflow is the data ingestion pipeline for the "RAG System V2" chatbot. It automatically monitors a specific Google Drive folder for new files, processes them based on their type, and inserts their content into a Supabase vector database to make it searchable for the RAG agent.

Key Features & Workflow:

Google Drive Trigger: The workflow starts automatically when a new file is created in a designated folder (named "DOCUMENTS" in this template).

Smart File Handling: A Switch node routes the file based on its MIME type (e.g., PDF, Excel, Google Doc, Word Doc) for correct processing.

Multi-Format Extraction:

PDF: Text is extracted directly using the Extract PDF Text node.

Google Docs: Files are downloaded and converted to plain text (text/plain) and processed by the Extract from Text File node.

Excel: Data is extracted, aggregated, and concatenated into a single text block for embedding.

Word (.doc/.docx): Word files are automatically converted into Google Docs format using an HTTP Request. This newly created Google Doc will then trigger the entire workflow again, ensuring it's processed correctly.

Chunking & Metadata Enrichment: The extracted text is split into manageable chunks using the Recursive Character Text Splitter (set to 2000-character chunks). The Enhanced Default Data Loader then enriches these chunks with crucial metadata from the original file, such as file_name, creator, and created_at.

Vectorization & Storage: Finally, the workflow uses OpenAI Embeddings to create vector representations of the text chunks and inserts them into the Supabase Vector Store.

n8n Workflow: Multi-Format Document Processing for RAG Chatbot with Google Drive & Supabase

This n8n workflow automates the ingestion and processing of various document types from Google Drive into a Supabase Vector Store, making them ready for use in a Retrieval-Augmented Generation (RAG) chatbot. It handles different file formats, extracts their content, splits them into manageable chunks, generates embeddings using OpenAI, and stores these embeddings in Supabase for efficient retrieval.

What it does:

Monitors Google Drive: Triggers whenever a new file is added or an existing file is updated in a specified Google Drive folder.
Filters by File Type: Checks the file extension to determine if it's a supported document type (e.g., PDF, DOCX, TXT, CSV, JSON, HTML).
Downloads and Extracts Content:
- For supported file types, it downloads the file from Google Drive.
- It then extracts the text content from the binary file using a default data loader.
- If the file is a CSV, it aggregates the data into a single string.
Splits Text into Chunks: Uses a Recursive Character Text Splitter to break down the extracted document content into smaller, overlapping chunks suitable for embedding.
Generates Embeddings: Creates vector embeddings for each text chunk using OpenAI's embedding model.
Stores in Supabase: Upserts the text chunks along with their generated embeddings into a Supabase Vector Store, making them queryable for a RAG chatbot.

Prerequisites/Requirements:

n8n Instance: A running n8n instance.
Google Drive Account: With credentials configured in n8n to access the target folder.
OpenAI API Key: For generating text embeddings.
Supabase Project:
- A Supabase project with a configured pg_vector extension.
- A table set up to store documents and their embeddings (e.g., documents table with content and embedding columns).
- Supabase API URL and Service Role Key configured as credentials in n8n.

Setup/Usage:

Import the Workflow: Import the provided JSON into your n8n instance.
Configure Google Drive Trigger (Node 531):
- Select your Google Drive credential.
- Specify the "Folder ID" you want to monitor for new or updated documents.
- Choose the "Operation" (e.g., "Watch for new or updated files").
Configure Supabase Vector Store (Node 1231):
- Select your Supabase credential.
- Enter the "Table Name" where documents and embeddings will be stored (e.g., documents).
- Ensure the "Content Column" (e.g., content) and "Embedding Column" (e.g., embedding) match your Supabase table schema.
Configure OpenAI Embeddings (Node 1141):
- Select your OpenAI credential.
- Choose the desired "Model" for embeddings (e.g., text-embedding-ada-002).
Activate the Workflow: Once all credentials and configurations are set, activate the workflow.

Now, any new or updated documents in your specified Google Drive folder will be automatically processed and added to your Supabase Vector Store, ready to power your RAG chatbot!

Related Templates

Auto-reply & create Linear tickets from Gmail with GPT-5, gotoHuman & human review

This workflow automatically classifies every new email from your linked mailbox, drafts a personalized reply, and creates Linear tickets for bugs or feature requests. It uses a human-in-the-loop with gotoHuman and continuously improves itself by learning from approved examples. How it works The workflow triggers on every new email from your linked mailbox. Self-learning Email Classifier: an AI model categorizes the email into defined categories (e.g., Bug Report, Feature Request, Sales Opportunity, etc.). It fetches previously approved classification examples from gotoHuman to refine decisions. Self-learning Email Writer: the AI drafts a reply to the email. It learns over time by using previously approved replies from gotoHuman, with per-classification context to tailor tone and style (e.g., different style for sales vs. bug reports). Human Review in gotoHuman: review the classification and the drafted reply. Drafts can be edited or retried. Approved values are used to train the self-learning agents. Send approved Reply: the approved response is sent as a reply to the email thread. Create ticket: if the classification is Bug or Feature Request, a ticket is created by another AI agent in Linear. Human Review in gotoHuman: How to set up Most importantly, install the gotoHuman node before importing this template! (Just add the node to a blank canvas before importing) Set up credentials for gotoHuman, OpenAI, your email provider (e.g. Gmail), and Linear. In gotoHuman, select and create the pre-built review template "Support email agent" or import the ID: 6fzuCJlFYJtlu9mGYcVT. Select this template in the gotoHuman node. In the "gotoHuman: Fetch approved examples" http nodes you need to add your formId. It is the ID of the review template that you just created/imported in gotoHuman. Requirements gotoHuman (human supervision, memory for self-learning) OpenAI (classification, drafting) Gmail or your preferred email provider (for email trigger+replies) Linear (ticketing) How to customize Expand or refine the categories used by the classifier. Update the prompt to reflect your own taxonomy. Filter fetched training data from gotoHuman by reviewer so the writer adapts to their personalized tone and preferences. Add more context to the AI email writer (calendar events, FAQs, product docs) to improve reply quality.

By gotoHuman

353

Ai website scraper & company intelligence

AI Website Scraper & Company Intelligence Description This workflow automates the process of transforming any website URL into a structured, intelligent company profile. It's triggered by a form, allowing a user to submit a website and choose between a "basic" or "deep" scrape. The workflow extracts key information (mission, services, contacts, SEO keywords), stores it in a structured Supabase database, and archives a full JSON backup to Google Drive. It also features a secondary AI agent that automatically finds and saves competitors for each company, building a rich, interconnected database of company intelligence. --- Quick Implementation Steps Import the Workflow: Import the provided JSON file into your n8n instance. Install Custom Community Node: You must install the community node from: https://www.npmjs.com/package/n8n-nodes-crawl-and-scrape FIRECRAWL N8N Documentation https://docs.firecrawl.dev/developer-guides/workflow-automation/n8n Install Additional Nodes: n8n-nodes-crawl-and-scrape and n8n-nodes-mcp fire crawl mcp . Set up Credentials: Create credentials in n8n for FIRE CRAWL API,Supabase, Mistral AI, and Google Drive. Configure API Key (CRITICAL): Open the Web Search tool node. Go to Parameters → Headers and replace the hardcoded Tavily AI API key with your own. Configure Supabase Nodes: Assign your Supabase credential to all Supabase nodes. Ensure table names (e.g., companies, competitors) match your schema. Configure Google Drive Nodes: Assign your Google Drive credential to the Google Drive2 and save to Google Drive1 nodes. Select the correct Folder ID. Activate Workflow: Turn on the workflow and open the Webhook URL in the “On form submission” node to access the form. --- What It Does Form Trigger Captures user input: “Website URL” and “Scraping Type” (basic or deep). Scraping Router A Switch node routes the flow: Deep Scraping → AI-based MCP Firecrawler agent. Basic Scraping → Crawlee node. Deep Scraping (Firecrawl AI Agent) Uses Firecrawl and Tavily Web Search. Extracts a detailed JSON profile: mission, services, contacts, SEO keywords, etc. Basic Scraping (Crawlee) Uses Crawl and Scrape node to collect raw text. A Mistral-based AI extractor structures the data into JSON. Data Storage Stores structured data in Supabase tables (companies, company_basicprofiles). Archives a full JSON backup to Google Drive. Automated Competitor Analysis Runs after a deep scrape. Uses Tavily web search to find competitors (e.g., from Crunchbase). Saves competitor data to Supabase, linked by company_id. --- Who's It For Sales & Marketing Teams: Enrich leads with deep company info. Market Researchers: Build structured, searchable company databases. B2B Data Providers: Automate company intelligence collection. Developers: Use as a base for RAG or enrichment pipelines. --- Requirements n8n instance (self-hosted or cloud) Supabase Account: With tables like companies, competitors, social_links, etc. Mistral AI API Key Google Drive Credentials Tavily AI API Key (Optional) Custom Nodes: n8n-nodes-crawl-and-scrape --- How It Works Flow Summary Form Trigger: Captures “Website URL” and “Scraping Type”. Switch Node: deep → MCP Firecrawler (AI Agent). basic → Crawl and Scrape node. Scraping & Extraction: Deep path: Firecrawler → JSON structure. Basic path: Crawlee → Mistral extractor → JSON. Storage: Save JSON to Supabase. Archive in Google Drive. Competitor Analysis (Deep Only): Finds competitors via Tavily. Saves to Supabase competitors table. End: Finishes with a No Operation node. --- How To Set Up Import workflow JSON. Install community nodes (especially n8n-nodes-crawl-and-scrape from npm). Configure credentials (Supabase, Mistral AI, Google Drive). Add your Tavily API key. Connect Supabase and Drive nodes properly. Fix disconnected “basic” path if needed. Activate workflow. Test via the webhook form URL. --- How To Customize Change LLMs: Swap Mistral for OpenAI or Claude. Edit Scraper Prompts: Modify system prompts in AI agent nodes. Change Extraction Schema: Update JSON Schema in extractor nodes. Fix Relational Tables: Add Items node before Supabase inserts for arrays (social links, keywords). Enhance Automation: Add email/slack notifications, or replace form trigger with a Google Sheets trigger. --- Add-ons Automated Trigger: Run on new sheet rows. Notifications: Email or Slack alerts after completion. RAG Integration: Use the Supabase database as a chatbot knowledge source. --- Use Case Examples Sales Lead Enrichment: Instantly get company + competitor data from a URL. Market Research: Collect and compare companies in a niche. B2B Database Creation: Build a proprietary company dataset. --- WORKFLOW IMAGE --- Troubleshooting Guide | Issue | Possible Cause | Solution | |-------|----------------|-----------| | Form Trigger 404 | Workflow not active | Activate the workflow | | Web Search Tool fails | Missing Tavily API key | Replace the placeholder key | | FIRECRAWLER / find competitor fails | Missing MCP node | Install n8n-nodes-mcp | | Basic scrape does nothing | Switch node path disconnected | Reconnect “basic” output | | Supabase node error | Wrong table/column names | Match schema exactly | --- Need Help or More Workflows? Want to customize this workflow for your business or integrate it with your existing tools? Our team at Digital Biz Tech can tailor it precisely to your use case from automation logic to AI-powered enhancements. Contact: shilpa.raju@digitalbiz.tech For more such offerings, visit us: https://www.digitalbiz.tech ---

By DIGITAL BIZ TECH

923

Automate Reddit brand monitoring & responses with GPT-4o-mini, Sheets & Slack

How it Works This workflow automates intelligent Reddit marketing by monitoring brand mentions, analyzing sentiment with AI, and engaging authentically with communities. Every 24 hours, the system searches Reddit for posts containing your configured brand keywords across all subreddits, finding up to 50 of the newest mentions to analyze. Each discovered post is sent to OpenAI's GPT-4o-mini model for comprehensive analysis. The AI evaluates sentiment (positive/neutral/negative), assigns an engagement score (0-100), determines relevance to your brand, and generates contextual, helpful responses that add genuine value to the conversation. It also classifies the response type (educational/supportive/promotional) and provides reasoning for whether engagement is appropriate. The workflow intelligently filters posts using a multi-criteria system: only posts that are relevant to your brand, score above 60 in engagement quality, and warrant a response type other than "pass" proceed to engagement. This prevents spam and ensures every interaction is meaningful. Selected posts are processed one at a time through a loop to respect Reddit's rate limits. For each worthy post, the AI-generated comment is posted, and complete interaction data is logged to Google Sheets including timestamp, post details, sentiment, engagement scores, and success status. This creates a permanent audit trail and analytics database. At the end of each run, the workflow aggregates all data into a comprehensive daily summary report with total posts analyzed, comments posted, engagement rate, sentiment breakdown, and the top 5 engagement opportunities ranked by score. This report is automatically sent to Slack with formatted metrics, giving your team instant visibility into your Reddit marketing performance. --- Who is this for? Brand managers and marketing teams needing automated social listening and engagement on Reddit Community managers responsible for authentic brand presence across multiple subreddits Startup founders and growth marketers who want to scale Reddit marketing without hiring a team PR and reputation teams monitoring brand sentiment and responding to discussions in real-time Product marketers seeking organic engagement opportunities in product-related communities Any business that wants to build authentic Reddit presence while avoiding spammy marketing tactics --- Setup Steps Setup time: Approx. 30-40 minutes (credential configuration, keyword setup, Google Sheets creation, Slack integration) Requirements: Reddit account with OAuth2 application credentials (create at reddit.com/prefs/apps) OpenAI API key with GPT-4o-mini access Google account with a new Google Sheet for tracking interactions Slack workspace with posting permissions to a marketing/monitoring channel Brand keywords and subreddit strategy prepared Create Reddit OAuth Application: Visit reddit.com/prefs/apps, create a "script" type app, and obtain your client ID and secret Configure Reddit Credentials in n8n: Add Reddit OAuth2 credentials with your app credentials and authorize access Set up OpenAI API: Obtain API key from platform.openai.com and configure in n8n OpenAI credentials Create Google Sheet: Set up a new sheet with columns: timestamp, postId, postTitle, subreddit, postUrl, sentiment, engagementScore, responseType, commentPosted, reasoning Configure these nodes: Brand Keywords Config: Edit the JavaScript code to include your brand name, product names, and relevant industry keywords Search Brand Mentions: Adjust the limit (default 50) and sort preference based on your needs AI Post Analysis: Customize the prompt to match your brand voice and engagement guidelines Filter Engagement-Worthy: Adjust the engagementScore threshold (default 60) based on your quality standards Loop Through Posts: Configure max iterations and batch size for rate limit compliance Log to Google Sheets: Replace YOURSHEETID with your actual Google Sheets document ID Send Slack Report: Replace YOURCHANNELID with your Slack channel ID Test the workflow: Run manually first to verify all connections work and adjust AI prompts Activate for daily runs: Once tested, activate the Schedule Trigger to run automatically every 24 hours --- Node Descriptions (10 words each) Daily Marketing Check - Schedule trigger runs workflow every 24 hours automatically daily Brand Keywords Config - JavaScript code node defining brand keywords to monitor Reddit Search Brand Mentions - Reddit node searches all subreddits for brand keyword mentions AI Post Analysis - OpenAI analyzes sentiment, relevance, generates contextual helpful comment responses Filter Engagement-Worthy - Conditional node filters only high-quality relevant posts worth engaging Loop Through Posts - Split in batches processes each post individually respecting limits Post Helpful Comment - Reddit node posts AI-generated comment to worthy Reddit discussions Log to Google Sheets - Appends all interaction data to spreadsheet for permanent tracking Generate Daily Summary - JavaScript aggregates metrics, sentiment breakdown, generates comprehensive daily report Send Slack Report - Posts formatted daily summary with metrics to team Slack channel

By Daniel Shashko

679