
Index legal documents for hybrid search with Qdrant, OpenAI & BM25

Jenny
2/3/2026

Index Legal Dataset to Qdrant for Hybrid Retrieval

This pipeline is the first part of "Hybrid Search with Qdrant & n8n, Legal AI".
The second part, "Hybrid Search with Qdrant & n8n, Legal AI: Retrieval", covers retrieval and simple evaluation.

Overview

This pipeline transforms a Q&A legal corpus from Hugging Face (isaacus) into vector representations and indexes them in Qdrant, providing the foundation for hybrid search that combines BM25 keyword matching with dense-embedding semantic search.

After running this pipeline, you will have a Qdrant collection with your legal dataset ready for hybrid retrieval over BM25 and dense embeddings, the latter generated with either mxbai-embed-large-v1 or text-embedding-3-small.
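As a rough illustration of what such a collection could look like, the sketch below builds a request body for Qdrant's create-collection endpoint (PUT /collections/{collection_name}) with one dense and one sparse vector. The vector names "dense" and "bm25" and the Cosine distance are assumptions made for this example and are not taken from the workflow. The sizes are the output dimensions of the two models named above.

```python
import json

# Output dimensions of the two dense models named in this pipeline.
DENSE_DIMS = {
    "mxbai-embed-large-v1": 1024,
    "text-embedding-3-small": 1536,
}


def collection_config(dense_model: str) -> dict:
    """Body for PUT /collections/{collection_name}: one dense plus one sparse vector."""
    return {
        "vectors": {
            # "dense" is a hypothetical vector name; Cosine is an assumed distance.
            "dense": {"size": DENSE_DIMS[dense_model], "distance": "Cosine"},
        },
        "sparse_vectors": {
            # "bm25" is a hypothetical vector name. The "idf" modifier asks Qdrant
            # to apply inverse document frequency weighting at query time.
            "bm25": {"modifier": "idf"},
        },
    }


print(json.dumps(collection_config("text-embedding-3-small"), indent=2))
```

The dense vector size has to match the embedding model, so switching between the two models means creating the collection with a different size.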

Options for Embedding Inference

This pipeline offers two ways to generate dense vectors:

  1. Qdrant Cloud Inference, where text is converted to vectors directly in Qdrant;
  2. An external provider, e.g. OpenAI, that generates the embeddings before they are sent to Qdrant.
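The difference between the two options shows up in the shape of each point sent to Qdrant's upsert endpoint (PUT /collections/{collection_name}/points). The sketch below is illustrative only. The vector names "dense" and "bm25", the payload field "text", and the model identifiers "mixedbread-ai/mxbai-embed-large-v1" and "qdrant/bm25" are assumptions for this example, and it assumes the cluster can run BM25 inference server-side. Check the identifiers your cluster actually exposes.

```python
from typing import Optional


def build_point(point_id: int, text: str,
                dense_vector: Optional[list] = None) -> dict:
    """One entry for the "points" array of an upsert request."""
    if dense_vector is None:
        # Option 1: send raw text and let Qdrant Cloud Inference embed it.
        # The model identifier is an assumption for this sketch.
        dense = {"text": text, "model": "mixedbread-ai/mxbai-embed-large-v1"}
    else:
        # Option 2: the vector was computed by an external provider (e.g. OpenAI).
        dense = dense_vector
    return {
        "id": point_id,
        "vector": {
            "dense": dense,
            # Sparse side: assumed here to be computed by Qdrant from the text.
            "bm25": {"text": text, "model": "qdrant/bm25"},
        },
        "payload": {"text": text},
    }


cloud_point = build_point(1, "What is a force majeure clause?")
external_point = build_point(2, "What is a tort?", dense_vector=[0.1, 0.2, 0.3])
```

In both cases the original text stays in the payload so that retrieval can return it alongside the match.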

Prerequisites

  • A cluster on Qdrant Cloud
    • Paid cluster in the US region if you want to use Qdrant Cloud Inference
    • Free Tier Cluster if using an external provider (here OpenAI)
  • Qdrant cluster credentials:
    • The Qdrant Cloud UI guides you through obtaining both the URL and API_KEY when you set up your cluster.
  • An OpenAI API key (if you’re not using Qdrant Cloud Inference).


n8n Workflow: Index Legal Documents for Hybrid Search with Qdrant and OpenAI

This n8n workflow demonstrates how to process legal documents, generate embeddings using OpenAI, and index them into Qdrant for hybrid search capabilities. It's designed to prepare a dataset of legal documents for efficient semantic and keyword-based retrieval.

What it does

This workflow automates the following steps:

  1. Manual Trigger: Starts the workflow on demand.
  2. Edit Fields (Set): Defines or modifies input data, such as the source of the legal documents or processing parameters. The provided JSON does not specify the exact fields, so this describes the node's typical use.
  3. Loop Over Items (Split in Batches): Iterates over the legal documents in batches to stay within API limits and keep resource usage manageable.
  4. HTTP Request (OpenAI Embeddings): For each document or batch, sends a request to the OpenAI API to generate vector embeddings that capture the semantic meaning of the content.
  5. HTTP Request (Qdrant Upsert): Sends a second HTTP request to Qdrant to upsert (insert or update) each document, including the original content as payload and its generated embedding.
  6. Merge: Combines the output of the batch processing, likely to aggregate results or to ensure all items have been processed before continuing.
  7. If: Adds conditional logic. It likely checks whether the Qdrant upsert succeeded, or some other condition, to decide the next step.
  8. Limit: Present but not connected in the provided JSON, so it is probably a placeholder or unused. If connected, it would restrict the number of items passing through.
  9. Split Out: Present but not connected. If connected, it would split an array into individual items.
  10. Aggregate: Present but not connected. If connected, it would combine items into a single collection or perform aggregation operations.
  11. Sticky Note: Holds comments or documentation directly on the workflow canvas.
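Outside n8n, the batching and embedding steps (3 and 4) can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual logic. The helper names are hypothetical, the batch size is arbitrary, and no network call is made: the code only builds the JSON body for OpenAI's embeddings endpoint (POST https://api.openai.com/v1/embeddings) and reads vectors back from a response of the documented shape.

```python
from typing import Iterator


def batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embeddings_request(texts: list, model: str = "text-embedding-3-small") -> dict:
    """JSON body for the OpenAI embeddings endpoint; `input` may be a list of strings."""
    return {"model": model, "input": texts}


def vectors_from_response(response: dict) -> list:
    """Return the embeddings ordered by the `index` field of each data item."""
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


documents = ["doc one", "doc two", "doc three"]
for batch in batches(documents, size=2):
    body = embeddings_request(batch)
    # A real run would POST `body` here and pass the parsed JSON response
    # to vectors_from_response before upserting to Qdrant.
    print(len(body["input"]), "texts in this request")
```

Sorting by `index` keeps each vector aligned with the text that produced it, which matters when the vectors are later paired with document payloads for the upsert.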

Prerequisites/Requirements

To use this workflow, you will need:

  • OpenAI API Key: For generating document embeddings.
  • Qdrant Instance: A running Qdrant vector database instance where the documents will be indexed. This could be self-hosted or a cloud service.
  • n8n Instance: A running n8n instance to host and execute the workflow.

Setup/Usage

  1. Import the Workflow: Download the JSON provided and import it into your n8n instance.
  2. Configure Credentials:
    • OpenAI: In the HTTP Request node that calls OpenAI (likely node ID 19), add your OpenAI API key. This typically means setting an Authorization header with the value Bearer YOUR_OPENAI_API_KEY.
    • Qdrant: In the HTTP Request node that performs the Qdrant upsert (possibly also node ID 19, or a separate HTTP Request node if the workflow has two), set the URL of your Qdrant instance and any required API key or authentication headers.
  3. Define Input Data: Modify the "Edit Fields (Set)" node (node ID 38) to provide the legal documents you wish to index. This could be an array of objects, where each object represents a document with its content and any metadata.
  4. Activate and Execute:
    • Ensure the workflow is active.
    • Click "Execute Workflow" on the "Manual Trigger" node (node ID 838) to start the indexing process.
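For reference, the headers and URL that the two HTTP Request nodes need can be sketched as below. The helper names and the example cluster URL and collection name are hypothetical. The header formats follow the public OpenAI and Qdrant REST conventions (a Bearer token for OpenAI, an api-key header for Qdrant), and `wait=true` is an optional Qdrant query parameter added here as an assumption so that the call returns only after the write is applied.

```python
def openai_headers(api_key: str) -> dict:
    """Headers for requests to the OpenAI API."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


def qdrant_headers(api_key: str) -> dict:
    """Headers for requests to a Qdrant cluster protected by an API key."""
    return {
        "api-key": api_key,
        "Content-Type": "application/json",
    }


def upsert_url(base_url: str, collection: str) -> str:
    """URL for PUT /collections/{collection}/points on the given cluster."""
    return f"{base_url.rstrip('/')}/collections/{collection}/points?wait=true"


# Hypothetical values, for illustration only.
print(upsert_url("https://example-cluster.cloud.qdrant.io:6333/", "legal_docs"))
```

In n8n these values would normally live in stored credentials rather than in node parameters, so the keys do not end up in an exported workflow JSON.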

This workflow provides a foundation for building a hybrid search solution for legal documents, using embedding models for semantic understanding and a vector database for efficient retrieval.
