Parse and extract invoice data with Nanonets OCR and export to Excel

999 views

2/3/2026

Google Sheets Data Logging GET Request HTTP Request API

This workflow contains community nodes that are only compatible with the self-hosted version of n8n.

Description

This workflow automates document processing and structured table extraction using the Nanonets API. You can submit a PDF file via an n8n form trigger or webhook—the workflow then forwards the document to Nanonets, waits for asynchronous parsing to finish, retrieves the results (including header fields and line items/tables), and returns the output as an Excel file. Ideal for automating invoice, receipt, or order data extraction with downstream business use.

How It Works

A document is uploaded (via n8n form or webhook).
The PDF is sent to the Nanonets Workflow API for parsing.
The workflow waits until processing is complete.
Parsed results are fetched. Both top-level fields and any table rows/line items are extracted and restructured.
Data is exported to Excel format and delivered to the requester.

Setup Steps

Nanonets Account: Register for a Nanonets account and set up a workflow for your specific document type (e.g., invoice, receipt).
Credentials in n8n: Add HTTP Basic Auth credentials in n8n for the Nanonets API (never store credentials directly in node parameters).
Webhook/Form Configuration:
- Option 1: Configure and enable the included n8n Form Trigger node for document uploads.
- Option 2: Use the included Webhook node to accept external POSTs with a PDF file.
Adjust Workflow:
- Update any HTTP nodes to use your credential profile.
- Insert your Nanonets Workflow ID in all relevant nodes.
Test the Workflow: Enable the workflow and try with a sample document.

Features

Accepts documents via n8n Form Trigger or direct webhook POST.
Securely sends files to Nanonets for document parsing (credentials stored in n8n credentials manager).
Automatically waits for async processing, checking Nanonets until results are ready.
Extracts both header data and all table/line items into a tabular format.
Exports results as an Excel file download.
Modular nodes allow easy customization or extension.

Prerequisites

Nanonets account with workflow configured for your document type.
n8n instance with HTTP Request, Webhook/Form, Code, and Excel/Spreadsheet nodes enabled.
Valid HTTP Basic Auth credentials saved in n8n for API access.

Example Use Cases

| Scenario | Benefit | |-----------------------|--------------------------------------------------| | Invoice Processing | Automated extraction of line items and totals | | Receipt Digitization | Parse amounts and charges for expense reports | | Purchase Orders | Convert scanned POs into structured Excel sheets |

Notes

You must set up credentials in the n8n credentials manager—do not store API keys directly in nodes.
All configuration and endpoints are clearly explained with inline sticky notes in the workflow editor.
Easily adaptable for other document types or similar APIs—just modify endpoints and result mapping.

n8n Workflow: Parse and Extract Invoice Data with Nanonets OCR and Export to Excel

This n8n workflow demonstrates how to process invoice data using a simulated OCR service (Nanonets is implied by the directory name, but the JSON uses generic HTTP requests) and then export the extracted information into an Excel file. It provides a basic framework for automating the extraction of structured data from documents.

What it does

This workflow performs the following steps:

Receives a Webhook Trigger: The workflow starts when it receives an incoming HTTP request via a webhook. This webhook is expected to contain the data needed to initiate the OCR process.
Simulates OCR Request: An "HTTP Request" node is used to simulate sending data to an OCR service (e.g., Nanonets API). This node would typically send the invoice document (or a reference to it) and receive the parsed data in return.
Waits for Processing: A "Wait" node introduces a delay, simulating the time it takes for the OCR service to process the document and return results. This is useful for asynchronous OCR APIs.
Processes OCR Response (Code): A "Code" node is used to parse and transform the data received from the simulated OCR service. This is where you would extract specific fields like invoice number, date, line items, total amount, etc., from the OCR output.
Converts to Excel File: The processed data from the "Code" node is then converted into an Excel (.xlsx) file using the "Convert to File" node.
Responds to Webhook: Finally, the workflow sends a response back to the initial webhook caller, potentially including a link to the generated Excel file or a confirmation message.

Prerequisites/Requirements

n8n Instance: A running instance of n8n.
Webhook Integration: An external system or application that can send HTTP requests to the n8n webhook.
OCR Service (Conceptual): While the workflow uses a generic "HTTP Request" node, in a real-world scenario, you would need an account and API key for an OCR service like Nanonets, Google Document AI, Amazon Textract, etc. The HTTP Request node would be configured to interact with that specific service.
Basic JavaScript Knowledge: For customizing the data extraction logic within the "Code" node.

Setup/Usage

Import the Workflow:
- Copy the provided JSON code.
- In your n8n instance, click on "Workflows" in the left sidebar.
- Click "New" and then "Import from JSON".
- Paste the JSON code and click "Import".
Configure the Webhook:
- Click on the "Webhook" node.
- Note down the "Webhook URL" displayed in the node settings. This is the URL you will send your trigger requests to.
- You can set the "HTTP Method" and "Response Mode" as needed for your integration.
Configure the HTTP Request Node:
- Click on the "HTTP Request" node.
- Method: Set this to the appropriate HTTP method for your OCR API (e.g., POST).
- URL: Replace the placeholder with the actual API endpoint of your OCR service (e.g., Nanonets API endpoint).
- Headers/Body: Configure the headers (e.g., Authorization with your API key) and the request body to send the invoice document or its URL to the OCR service.
Customize the Code Node:
- Click on the "Code" node.
- Modify the JavaScript code to parse the specific JSON structure returned by your chosen OCR service.
- Extract the relevant invoice fields and structure them into an array of objects, which will be used to generate the Excel file.
Activate the Workflow:
- Once configured, click the "Activate" toggle in the top right corner of the n8n editor to enable the workflow.
Trigger the Workflow:
- Send an HTTP request to the Webhook URL you obtained in step 2. The body of this request should contain any initial data required by your workflow (e.g., a document URL or base64 encoded file).
- The workflow will then execute, process the invoice, and generate an Excel file. The "Respond to Webhook" node will send a response back to your calling system.

Related Templates

Automate RSS to social media pipeline with AI, Airtable & GetLate for multiple platforms

Overview Automates your complete social media content pipeline: sources articles from Wallabag RSS, generates platform-specific posts with AI, creates contextual images, and publishes via GetLate API. Built with 63 nodes across two workflows to handle LinkedIn, Instagram, and Bluesky—with easy expansion to more platforms. Ideal for: Content marketers, solo creators, agencies, and community managers maintaining a consistent multi-platform presence with minimal manual effort. How It Works Two-Workflow Architecture: Content Aggregation Workflow Monitors Wallabag RSS feeds for tagged articles (to-share-linkedin, to-share-instagram, etc.) Extracts and converts content from HTML to Markdown Stores structured data in Airtable with platform assignment AI Generation & Publishing Workflow Scheduled trigger queries Airtable for unpublished content Routes to platform-specific sub-workflows (LinkedIn, Instagram, Bluesky) LLM generates optimized post text and image prompts based on custom brand parameters Optionally generates AI images and hosts them on Imgbb CDN Publishes via GetLate API (immediate or draft mode) Updates Airtable with publication status and metadata Key Features: Tag-based content routing using Wallabag's native system Swappable AI providers (Groq, OpenAI, Anthropic) Platform-specific optimization (tone, length, hashtags, CTAs) Modular design—duplicate sub-workflows to add new platforms in \~30 minutes Centralized Airtable tracking with 17 data points per post Set Up Steps Setup time: \~45-60 minutes for initial configuration Create accounts and get API keys (\~15 min) Wallabag (with RSS feeds enabled) GetLate (social media publishing) Airtable (create base with provided schema—see sticky notes) LLM provider (Groq, OpenAI, or Anthropic) Image service (Hugging Face, Fal.ai, or Stability AI) Imgbb (image hosting) Configure n8n credentials (\~10 min) Add all API keys in n8n's credential manager Detailed credential setup instructions in workflow sticky notes Set up Airtable database (\~10 min) Create "RSS Feed - Content Store" base Add 19 required fields (schema provided in workflow sticky notes) Get Airtable base ID and API key Customize brand prompts (\~15 min) Edit "Set Custom SMCG Prompt" node for each platform Define brand voice, tone, goals, audience, and image preferences Platform-specific examples provided in sticky notes Configure platform settings (\~10 min) Set GetLate account IDs for each platform Enable/disable image generation per platform Choose immediate publish vs. draft mode Adjust schedule trigger frequency Test and deploy Tag test articles in Wallabag Monitor the first few executions in draft mode Activate workflows when satisfied with the output Important: This is a proof-of-concept template. Test thoroughly with draft mode before production use. Detailed setup instructions, troubleshooting tips, and customization guidance are in the workflow's sticky notes. Technical Details 63 nodes: 9 Airtable operations, 8 HTTP requests, 7 code nodes, 3 LangChain LLM chains, 3 RSS triggers, 3 GetLate publishers Supports: Multiple LLM providers, multiple image generation services, unlimited platforms via modular architecture Tracking: 17 metadata fields per post, including publish status, applied parameters, character counts, hashtags, image URLs Prerequisites n8n instance (self-hosted or cloud) Accounts: Wallabag, GetLate, Airtable, LLM provider, image generation service, Imgbb Basic understanding of n8n workflows and credential configuration Time to customize prompts for your brand voice Detailed documentation, Airtable schema, prompt examples, and troubleshooting guides are in the workflow's sticky notes. Category Tags social-media-automation, ai-content-generation, rss-to-social, multi-platform-posting, getlate-api, airtable-database, langchain, workflow-automation, content-marketing

By Mikal Hayden-Gates

188

Ai website scraper & company intelligence

AI Website Scraper & Company Intelligence Description This workflow automates the process of transforming any website URL into a structured, intelligent company profile. It's triggered by a form, allowing a user to submit a website and choose between a "basic" or "deep" scrape. The workflow extracts key information (mission, services, contacts, SEO keywords), stores it in a structured Supabase database, and archives a full JSON backup to Google Drive. It also features a secondary AI agent that automatically finds and saves competitors for each company, building a rich, interconnected database of company intelligence. --- Quick Implementation Steps Import the Workflow: Import the provided JSON file into your n8n instance. Install Custom Community Node: You must install the community node from: https://www.npmjs.com/package/n8n-nodes-crawl-and-scrape FIRECRAWL N8N Documentation https://docs.firecrawl.dev/developer-guides/workflow-automation/n8n Install Additional Nodes: n8n-nodes-crawl-and-scrape and n8n-nodes-mcp fire crawl mcp . Set up Credentials: Create credentials in n8n for FIRE CRAWL API,Supabase, Mistral AI, and Google Drive. Configure API Key (CRITICAL): Open the Web Search tool node. Go to Parameters → Headers and replace the hardcoded Tavily AI API key with your own. Configure Supabase Nodes: Assign your Supabase credential to all Supabase nodes. Ensure table names (e.g., companies, competitors) match your schema. Configure Google Drive Nodes: Assign your Google Drive credential to the Google Drive2 and save to Google Drive1 nodes. Select the correct Folder ID. Activate Workflow: Turn on the workflow and open the Webhook URL in the “On form submission” node to access the form. --- What It Does Form Trigger Captures user input: “Website URL” and “Scraping Type” (basic or deep). Scraping Router A Switch node routes the flow: Deep Scraping → AI-based MCP Firecrawler agent. Basic Scraping → Crawlee node. Deep Scraping (Firecrawl AI Agent) Uses Firecrawl and Tavily Web Search. Extracts a detailed JSON profile: mission, services, contacts, SEO keywords, etc. Basic Scraping (Crawlee) Uses Crawl and Scrape node to collect raw text. A Mistral-based AI extractor structures the data into JSON. Data Storage Stores structured data in Supabase tables (companies, company_basicprofiles). Archives a full JSON backup to Google Drive. Automated Competitor Analysis Runs after a deep scrape. Uses Tavily web search to find competitors (e.g., from Crunchbase). Saves competitor data to Supabase, linked by company_id. --- Who's It For Sales & Marketing Teams: Enrich leads with deep company info. Market Researchers: Build structured, searchable company databases. B2B Data Providers: Automate company intelligence collection. Developers: Use as a base for RAG or enrichment pipelines. --- Requirements n8n instance (self-hosted or cloud) Supabase Account: With tables like companies, competitors, social_links, etc. Mistral AI API Key Google Drive Credentials Tavily AI API Key (Optional) Custom Nodes: n8n-nodes-crawl-and-scrape --- How It Works Flow Summary Form Trigger: Captures “Website URL” and “Scraping Type”. Switch Node: deep → MCP Firecrawler (AI Agent). basic → Crawl and Scrape node. Scraping & Extraction: Deep path: Firecrawler → JSON structure. Basic path: Crawlee → Mistral extractor → JSON. Storage: Save JSON to Supabase. Archive in Google Drive. Competitor Analysis (Deep Only): Finds competitors via Tavily. Saves to Supabase competitors table. End: Finishes with a No Operation node. --- How To Set Up Import workflow JSON. Install community nodes (especially n8n-nodes-crawl-and-scrape from npm). Configure credentials (Supabase, Mistral AI, Google Drive). Add your Tavily API key. Connect Supabase and Drive nodes properly. Fix disconnected “basic” path if needed. Activate workflow. Test via the webhook form URL. --- How To Customize Change LLMs: Swap Mistral for OpenAI or Claude. Edit Scraper Prompts: Modify system prompts in AI agent nodes. Change Extraction Schema: Update JSON Schema in extractor nodes. Fix Relational Tables: Add Items node before Supabase inserts for arrays (social links, keywords). Enhance Automation: Add email/slack notifications, or replace form trigger with a Google Sheets trigger. --- Add-ons Automated Trigger: Run on new sheet rows. Notifications: Email or Slack alerts after completion. RAG Integration: Use the Supabase database as a chatbot knowledge source. --- Use Case Examples Sales Lead Enrichment: Instantly get company + competitor data from a URL. Market Research: Collect and compare companies in a niche. B2B Database Creation: Build a proprietary company dataset. --- WORKFLOW IMAGE --- Troubleshooting Guide | Issue | Possible Cause | Solution | |-------|----------------|-----------| | Form Trigger 404 | Workflow not active | Activate the workflow | | Web Search Tool fails | Missing Tavily API key | Replace the placeholder key | | FIRECRAWLER / find competitor fails | Missing MCP node | Install n8n-nodes-mcp | | Basic scrape does nothing | Switch node path disconnected | Reconnect “basic” output | | Supabase node error | Wrong table/column names | Match schema exactly | --- Need Help or More Workflows? Want to customize this workflow for your business or integrate it with your existing tools? Our team at Digital Biz Tech can tailor it precisely to your use case from automation logic to AI-powered enhancements. Contact: shilpa.raju@digitalbiz.tech For more such offerings, visit us: https://www.digitalbiz.tech ---

By DIGITAL BIZ TECH

923

Dynamic Hubspot lead routing with GPT-4 and Airtable sales team distribution

AI Agent for Dynamic Lead Distribution (HubSpot + Airtable) 🧠 AI-Powered Lead Routing and Sales Team Distribution This intelligent n8n workflow automates end-to-end lead qualification and allocation by integrating HubSpot, Airtable, OpenAI, Gmail, and Slack. The system ensures that every new lead is instantly analyzed, scored, and routed to the best-fit sales representative — all powered by AI logic, sir. --- 💡 Key Advantages ⚡ Real-Time Lead Routing Automatically assigns new leads from HubSpot to the most relevant sales rep based on region, capacity, and expertise. 🧠 AI Qualification Engine An OpenAI-powered Agent evaluates the lead’s industry, region, and needs to generate a persona summary and routing rationale. 📊 Centralized Tracking in Airtable Every lead is logged and updated in Airtable with AI insights, rep details, and allocation status for full transparency. 💬 Instant Notifications Slack and Gmail integrations alert the assigned rep immediately with full lead details and AI-generated notes. 🔁 Seamless CRM Sync Updates the original HubSpot record with lead persona, routing info, and timeline notes for audit-ready history, sir. --- ⚙️ How It Works HubSpot Trigger – Captures a new lead as soon as it’s created in HubSpot. Fetch Contact Data – Retrieves all relevant fields like name, company, and industry. Clean & Format Data – A Code node standardizes and structures the data for consistency. Airtable Record Creation – Logs the lead data into the “Leads” table for centralized tracking. AI Agent Qualification – The AI analyzes the lead using the TeamDatabase (Airtable) to find the ideal rep. Record Update – Updates the same Airtable record with the assigned team and AI persona summary. Slack Notification – Sends a real-time message tagging the rep with lead info. Gmail Notification – Sends a personalized handoff email with context and follow-up actions. HubSpot Sync – Updates the original contact in HubSpot with the assignment details and AI rationale, sir. --- 🛠️ Setup Steps Trigger Node: HubSpot → Detect new leads. HubSpot Node: Retrieve complete lead details. Code Node: Clean and normalize data. Airtable Node: Log lead info in the “Leads” table. AI Agent Node: Process lead and match with sales team. Slack Node: Notify the designated representative. Gmail Node: Email the rep with details. HubSpot Node: Update CRM with AI summary and allocation status, sir. --- 🔐 Credentials Required HubSpot OAuth2 API – To fetch and update leads. Airtable Personal Access Token – To store and update lead data. OpenAI API – To power the AI qualification and matching logic. Slack OAuth2 – For sending team notifications. Gmail OAuth2 – For automatic email alerts to assigned reps, sir. --- 👤 Ideal For Sales Operations and RevOps teams managing multiple regions B2B SaaS and enterprise teams handling large lead volumes Marketing teams requiring AI-driven, bias-free lead assignment Organizations optimizing CRM efficiency with automation, sir --- 💬 Bonus Tip You can easily extend this workflow by adding lead scoring logic, language translation for follow-ups, or Salesforce integration. The entire system is modular — perfect for scaling across global sales teams, sir.

By MANISH KUMAR

113