Automate document ingestion & RAG system with Google Drive, Sheets & OpenAI
1. Overview
The IngestionDocs workflow is a fully automated document ingestion and knowledge management system built with n8n. Its purpose is to continuously ingest organizational documents from Google Drive, transform them into vector embeddings using OpenAI, store them in Pinecone, and make them searchable and retrievable through an AI-powered Q&A interface.
This ensures that employees always have access to the most up-to-date knowledge base without requiring manual intervention.
2. Key Objectives
- Automated Ingestion → Seamlessly process new and updated documents from Google Drive.
- Change Detection → Track and differentiate between new, updated, and previously processed documents.
- Knowledge Base Construction → Convert documents into embeddings for semantic search.
- AI-Powered Assistance → Provide an intelligent Q&A system for employees to query manuals.
- Scalable & Maintainable → Modular design using n8n, LangChain, and Pinecone.
3. Workflow Breakdown
A. Document Monitoring and Retrieval
- The workflow begins with two Google Drive triggers:
  - File Created Trigger → Fires when a new document is uploaded.
  - File Updated Trigger → Fires when an existing document is modified.
- A search operation lists the files in the designated Google Drive folder.
- Non-downloadable items (e.g., subfolders) are filtered out.
- For valid files:
  - The file is downloaded.
  - A SHA256 hash is generated to uniquely identify the file's content.
B. Record Management (Google Sheets Integration)
To keep track of ingestion states, the workflow uses a Google Sheets-based Record Manager:
- Each file entry contains:
  - Id (Google Drive file ID)
  - Name (file name)
  - hashId (SHA256 checksum)
- The workflow compares the current file's hash with the stored one:
  - New Document → File not found in records → Inserted into the Record Manager.
  - Already Processed → File exists and hash matches → Skipped.
  - Updated Document → File exists but hash differs → Record is updated.
This guarantees that only new or modified content is processed, avoiding duplication.
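The three-way decision above can be sketched as a small function. The record shape mirrors the Id/Name/hashId columns; `classifyFile` is a hypothetical helper, not a node in the workflow:

```javascript
// Classify an incoming file against the Record Manager rows:
// "insert" for new documents, "skip" for unchanged ones, and
// "update" when the stored hash no longer matches the content.
function classifyFile(records, file) {
  const existing = records.find((r) => r.Id === file.id);
  if (!existing) return "insert";                    // new document
  if (existing.hashId === file.hash) return "skip";  // already processed
  return "update";                                   // content changed
}

const records = [{ Id: "abc", Name: "manual.pdf", hashId: "h1" }];
classifyFile(records, { id: "abc", hash: "h1" }); // unchanged → "skip"
```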
C. Document Processing and Vectorization
Once a document is marked as new or updated:
- Default Data Loader extracts its content (binary files supported).
  - Pages are split into individual chunks.
  - Metadata such as file ID and name are attached.
- Recursive Character Text Splitter divides the content into manageable segments with overlap.
- OpenAI Embeddings (text-embedding-3-large) transform each text chunk into a semantic vector.
- Pinecone Vector Store stores these vectors in the configured index:
  - For new documents, embeddings are inserted into a namespace based on the file name.
  - For updated documents, the namespace is cleared first, then re-ingested with fresh embeddings.
This process builds a scalable and queryable knowledge base.
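A minimal stand-in for the chunking step, assuming plain fixed-size windows: the actual Recursive Character Text Splitter is smarter, preferring to break on paragraph and sentence boundaries before falling back to character counts, but the overlap idea is the same:

```javascript
// Simplified chunker: fixed-size windows with a configurable overlap
// so that context spanning a boundary appears in both neighboring
// chunks. Requires overlap < chunkSize.
function splitWithOverlap(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each resulting chunk would then be embedded individually, with the file ID and name attached as metadata so answers can cite their source document.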
D. Knowledge Base Q&A Interface
The workflow also provides an interactive form-based user interface:
- Form Trigger → Collects employee questions.
- LangChain AI Agent:
  - Receives the question.
  - Retrieves relevant context from Pinecone using vector similarity search.
  - Processes the response using the OpenAI Chat Model (gpt-4.1-mini).
- Answer Formatting:
  - Responses are returned in HTML format for readability.
  - A custom CSS theme ensures a modern, user-friendly design.
  - Answers may include references to page numbers when available.
This creates a self-service knowledge base assistant that employees can query in natural language.
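The retrieval the agent relies on can be illustrated in miniature. Pinecone performs this ranking server-side over high-dimensional embeddings; the sketch below uses toy 3-dimensional vectors and hypothetical helper names purely to show the logic:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunk vectors against the question's embedding and
// return the k best matches -- the context handed to the chat model.
function topK(queryVector, docs, k = 2) {
  return docs
    .map((d) => ({ ...d, score: cosine(queryVector, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

In the real workflow the query vector comes from embedding the employee's question with the same text-embedding-3-large model used at ingestion time, which is what makes the similarity scores meaningful.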
4. Technologies Used
- n8n → Orchestration of the entire workflow.
- Google Drive API → File monitoring, listing, and downloading.
- Google Sheets API → Record manager for tracking file states.
- OpenAI API:
  - text-embedding-3-large for semantic vector creation.
  - gpt-4.1-mini for conversational Q&A.
- Pinecone → Vector database for embedding storage and retrieval.
- LangChain → Document loaders, text splitters, vector store connectors, and agent logic.
- Crypto (SHA256) → File hash generation for change detection.
- Form Trigger + Form Node → Employee-facing Q&A submission and answer display.
- Custom CSS → Provides a modern, responsive, styled UI for the knowledge base.
5. End-to-End Data Flow
- Employee uploads or updates a document → Google Drive detects the change.
- Workflow downloads and hashes the file → Ensures uniqueness and detects modifications.
- Record Manager (Google Sheets) → Decides whether to skip, insert, or update the record.
- Document Processing → Splitting + Embedding + Storing into Pinecone.
- Knowledge Base Updated → The latest version of documents is indexed.
- Employee asks a question via the web form.
- AI Agent retrieves context from Pinecone and uses gpt-4.1-mini → Generates a contextual answer.
- Answer displayed in styled HTML → Delivered back to the employee through the form interface.
6. Benefits
- Always Up-to-Date → Automatically syncs documents when uploaded or changed.
- No Duplicates → Smart hashing ensures only relevant updates are reprocessed.
- Searchable Knowledge Base → Employees can query documents semantically, not just by keywords.
- Enhanced Productivity → Answers are immediate, reducing time spent browsing manuals.
- Scalable → New documents and users can be added without workflow redesign.
✅ In summary, IngestionDocs is a robust AI-driven document ingestion and retrieval system that integrates Google Drive, Google Sheets, OpenAI, and Pinecone within n8n. It continuously builds and maintains a knowledge base of manuals while offering employees an intelligent, user-friendly Q&A assistant for fast and accurate knowledge retrieval.
Automate Document Ingestion & RAG System with Google Drive, Sheets, and OpenAI
This n8n workflow automates the process of ingesting documents from Google Drive, processing them for a Retrieval-Augmented Generation (RAG) system using OpenAI embeddings and Pinecone, and managing the ingestion status in Google Sheets. It also provides a form interface to trigger the ingestion manually and query the RAG system.
What it does
This workflow streamlines the creation and querying of a RAG system by:
- Triggering Ingestion:
- Automatically monitors a specified Google Drive folder for new or updated files.
- Allows manual triggering of document ingestion via an n8n form, where users can specify a Google Drive folder ID.
- Document Processing:
- Downloads new or updated documents from Google Drive.
- Loads document content using a Default Data Loader.
- Splits the document content into smaller, manageable chunks using a Recursive Character Text Splitter.
- Generates vector embeddings for each text chunk using OpenAI Embeddings.
- Stores these embeddings in a Pinecone Vector Store for efficient retrieval.
- Status Tracking:
- Records the status of each document ingestion (e.g., "Ingested") in a Google Sheet, including the document ID, name, and the folder it was processed from.
- Updates the Google Sheet with the ingestion status, preventing reprocessing of already ingested documents.
- RAG System Querying:
- Provides a separate n8n form to submit queries to the RAG system.
- Utilizes an AI Agent (LangChain) with an OpenAI Chat Model and the Pinecone Vector Store to answer questions based on the ingested documents.
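The status-tracking check described above might look like the following sketch; the row keys mirror the suggested sheet columns, and `needsIngestion` is an illustrative name rather than a node in the workflow:

```javascript
// Skip any document whose ID already appears in the tracking sheet
// with status "Ingested"; everything else still needs processing.
function needsIngestion(rows, docId) {
  return !rows.some(
    (r) => r["Document ID"] === docId && r.Status === "Ingested"
  );
}

const rows = [
  { "Document ID": "d1", "Document Name": "a.pdf", "Folder ID": "f1", Status: "Ingested" },
];
needsIngestion(rows, "d1"); // already tracked → false
```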
Prerequisites/Requirements
To use this workflow, you will need:
- n8n Instance: A running n8n instance.
- Google Drive Account: With access to the folders containing the documents to be ingested.
- Google Sheets Account: To track document ingestion status.
- OpenAI API Key: For generating document embeddings and powering the AI chat model.
- Pinecone Account: A Pinecone API key and environment for storing and retrieving vector embeddings.
- Credentials: Configured n8n credentials for:
- Google Drive (OAuth2)
- Google Sheets (OAuth2)
- OpenAI (API Key)
- Pinecone (API Key and Environment)
Setup/Usage
- Import the Workflow:
- Download the provided JSON file.
- In your n8n instance, go to "Workflows" and click "New".
- Click the "Import from JSON" button and paste the workflow JSON or upload the file.
- Configure Credentials:
- Locate the Google Drive, Google Sheets, OpenAI Embeddings, OpenAI Chat Model, and Pinecone Vector Store nodes.
- Click on each node and configure your respective credentials. If you don't have them set up, n8n will guide you through creating new ones.
- Configure Google Drive Trigger:
- In the "Google Drive Trigger" node, specify the Google Drive folder ID you want to monitor for new documents.
- Configure Google Sheets:
- In the "Google Sheets" nodes, specify the spreadsheet name and sheet name where you want to track the ingestion status. Ensure the sheet has columns for at least "Document ID", "Document Name", "Folder ID", and "Status".
- Configure AI Agent and Vector Store:
- In the "AI Agent" node, ensure the "OpenAI Chat Model" and "Pinecone Vector Store" are correctly linked.
- In the "Pinecone Vector Store" node, specify your Pinecone index name.
- Activate the Workflow:
- Once all credentials and configurations are set, activate the workflow by toggling the "Active" switch in the top right corner of the n8n editor.
The workflow will now automatically ingest documents from your specified Google Drive folder and allow you to query your RAG system via the n8n form.