Extract and structure Thai documents to Google Sheets using Typhoon OCR and Llama 3.1
⚠️ Note: This template requires a community node and works only on self-hosted n8n installations. It uses the Typhoon OCR Python package and custom command execution. Make sure to install required dependencies locally. --- Who is this for? This template is for developers, operations teams, and automation builders in Thailand (or any Thai-speaking environment) who regularly process PDFs or scanned documents in Thai and want to extract structured text into a Google Sheet. It is ideal for: Local government document processing Thai-language enterprise paperwork AI automation pipelines requiring Thai OCR --- What problem does this solve? Typhoon OCR is one of the most accurate OCR tools for Thai text. However, integrating it into an end-to-end workflow usually requires manual scripting and data wrangling. This template solves that by: Running Typhoon OCR on PDF files Using AI to extract structured data fields Automatically storing results in Google Sheets --- What this workflow does Trigger: Run manually or from any automation source Read Files: Load local PDF files from a doc/ folder Execute Command: Run Typhoon OCR on each file using a Python command LLM Extraction: Send the OCR markdown to an AI model (e.g., GPT-4 or OpenRouter) to extract fields Code Node: Parse the LLM output as JSON Google Sheets: Append structured data into a spreadsheet --- Setup Install Requirements Python 3.10+ typhoon-ocr: pip install typhoon-ocr Install Poppler and add to system PATH (needed for pdftoppm, pdfinfo) Create folders Create a folder called doc in the same directory where n8n runs (or mount it via Docker) Google Sheet Create a Google Sheet with the following column headers: | book\id | date | subject | detail | signed\by | signed\by2 | contact | download\url | | -------- | ---- | ------- | ------ | ---------- | ----------- | ------- | ------------- | You can use this example Google Sheet as a reference. API Key Export your TYPHOONOCRAPIKEY and OPENAIAPI_KEY in your environment (or set inside the command string in Execute Command node). --- How to customize this workflow Replace the LLM provider in the Basic LLM Chain node (currently supports OpenRouter) Change output fields to match your data structure (adjust the prompt and Google Sheet headers) Add trigger nodes (e.g., Dropbox Upload, Webhook) to automate input --- About Typhoon OCR Typhoon is a multilingual LLM and toolkit optimized for Thai NLP. It includes typhoon-ocr, a Python OCR library designed for Thai-centric documents. It is open-source, highly accurate, and works well in automation pipelines. Perfect for government paperwork, PDF reports, and multilingual documents in Southeast Asia. ---
Index legal documents for hybrid search with Qdrant, OpenAI & BM25
Index Legal Dataset to Qdrant for Hybrid Retrieval *This pipeline is the first part of "Hybrid Search with Qdrant & n8n, Legal AI". The second part, "Hybrid Search with Qdrant & n8n, Legal AI: Retrieval", covers retrieval and simple evaluation.* Overview This pipeline transforms a Q&A legal corpus from Hugging Face (isaacus) into vector representations and indexes them to Qdrant, providing the foundation for running Hybrid Search, combining: Dense vectors (embeddings) for semantic similarity search; Sparse vectors for keyword-based exact search. After running this pipeline, you will have a Qdrant collection with your legal dataset ready for hybrid retrieval on BM25 and dense embeddings: either mxbai-embed-large-v1 or text-embedding-3-small. Options for Embedding Inference This pipeline equips you with two approaches for generating dense vectors: Using Qdrant Cloud Inference, conversion to vectors handled directly in Qdrant; Using external provider, e.g. OpenAI for generating embeddings. Prerequisites A cluster on Qdrant Cloud Paid cluster in the US region if you want to use Qdrant Cloud Inference Free Tier Cluster if using an external provider (here OpenAI) Qdrant Cluster credentials: You'll be guided on how to obtain both the URL and API_KEY from the Qdrant Cloud UI when setting up your cluster; An OpenAI API key (if you’re not using Qdrant’s Cloud Inference); P.S. To ask retrieval in Qdrant-related questions, join the Qdrant Discord. Star Qdrant n8n community node repo <3