Evaluate tool usage accuracy in multi-agent AI workflows using evaluation nodes
Who's it for
This workflow is ideal for AI developers running multi-agent systems in n8n who need to quantitatively evaluate tool usage behavior. If you're building autonomous agents and want to verify their decisions against ground-truth expectations, this workflow gives you plug-and-play observability.
What it does
This template uses n8n's built-in Evaluation Trigger and Evaluation nodes to assess whether an AI agent correctly used all the expected tools. It supports:
-
Dataset-driven testing of agent behavior
-
Logging actual tools to compare them with the expected tools
-
Assigning performance metrics (tool_called = true/false)
-
Persisting output back to Google Sheets for further debugging
The workflow can be triggered by either the chat input or the dataset row evaluation. It routes through a multi-tool agent node powered by the best LLMs. The agent has access to tools such as web search, calculator, vector search, and summarizer tools. The workflow then aims to validate tool use decisions by extracting the intermediate steps from the agent (i.e., action + observation) and comparing the tools that were called with the expected tools. If the tools that were called during the workflow execution match, then it's a pass; otherwise, it's documented as a fail. The evaluation nodes take care of that process.
How to set it up
-
Connect your Google Sheets OAuth2 credential. Replace the document with your own test dataset.
-
Set your desired models and configure the different agent tools, such as the summarizer and vector store. The default vector store used is Qdrant, so the user must create this vector store with a few samples of queries + web search results.
-
Run from either the chat trigger or the evaluation trigger to test.
Requirements
-
Google Sheets OAuth2 credential
-
OpenRouter / OpenAI credentials for AI agents and embeddings
-
Firecrawl and Qdrant credentials for web + vector search
How to customize
-
Edit the Search Agent system message to define tool selection behavior
-
Add more metric columns in the Evaluation node for complex scoring
-
Add new tool nodes and link them to the agent block
-
Swap in your own summarizer
Evaluate Tool Usage Accuracy in Multi-Agent AI Workflows using Evaluation Nodes
This n8n workflow demonstrates how to evaluate the accuracy of tool usage within multi-agent AI workflows. It leverages n8n's AI agent capabilities, various tools, and the new Evaluation nodes to systematically test and measure the performance of AI agents in specific scenarios.
What it does
This workflow is designed to set up a framework for evaluating AI agent performance, specifically focusing on how accurately and effectively an AI agent utilizes its available tools.
- Triggers an Evaluation Run: The workflow starts with an "Evaluation Trigger" node, indicating that it's designed to be run as part of an evaluation process, likely fetching a dataset row for each test case.
- Prepares Input Data: An "Edit Fields (Set)" node is used to define or transform the input data for the AI agent, ensuring it's in the correct format for the agent to process.
- Configures AI Agent: An "AI Agent" node is set up with:
- Embeddings: Uses "Embeddings OpenAI" to convert text into numerical vectors, enabling the agent to understand semantic similarity.
- Vector Store: Integrates with "Qdrant Vector Store" to store and retrieve vector embeddings, providing a knowledge base for the agent.
- Language Model: Utilizes an "OpenRouter Chat Model" as the core language model for the agent's reasoning and response generation.
- Tools: Equips the agent with two specific tools:
- Calculator: Allows the AI agent to perform mathematical calculations.
- Call n8n Workflow Tool: Enables the AI agent to trigger and execute other n8n workflows, extending its capabilities.
- Evaluates Agent Performance: The "Evaluation" node at the end is crucial. It collects the output from the AI agent and compares it against expected results (which would typically be defined in the dataset row fetched by the trigger). This node is used to calculate metrics and assess the accuracy of the AI agent's tool usage in solving the given task.
- No Operation: A "No Operation, do nothing" node is present, which often serves as a placeholder or a way to end a branch of a workflow without performing any specific action.
Prerequisites/Requirements
To use this workflow, you will need:
- n8n Instance: A running n8n instance (self-hosted or cloud).
- OpenAI API Key: For the "Embeddings OpenAI" node.
- OpenRouter API Key: For the "OpenRouter Chat Model" node.
- Qdrant Instance: Access to a Qdrant vector database instance.
- Configured n8n Workflows: If the "Call n8n Workflow Tool" is intended to call specific workflows, those workflows must exist and be configured.
Setup/Usage
- Import the Workflow: Download the provided JSON and import it into your n8n instance.
- Configure Credentials:
- Set up your OpenAI API Key credential for the "Embeddings OpenAI" node.
- Set up your OpenRouter API Key credential for the "OpenRouter Chat Model" node.
- Configure your Qdrant connection details for the "Qdrant Vector Store" node.
- Customize Input Data: Adjust the "Edit Fields (Set)" node to provide the specific input data you want to test your AI agent with. This data should ideally come from the "Evaluation Trigger" node, representing a single test case from your evaluation dataset.
- Define Evaluation Criteria: Configure the "Evaluation" node to define how you want to measure the accuracy of the AI agent's tool usage. This typically involves comparing the agent's output (e.g., the tool it chose, the parameters it used, or the final answer) against a predefined "ground truth" or expected value.
- Run the Evaluation: Execute the workflow. If integrated with an evaluation framework, this workflow would be triggered repeatedly for each row in your evaluation dataset.
- Analyze Results: Review the output of the "Evaluation" node to understand the performance metrics and identify areas where the AI agent's tool usage can be improved.
Related Templates
Two-way property repair management system with Google Sheets & Drive
This workflow automates the repair request process between tenants and building managers, keeping all updates organized in a single spreadsheet. It is composed of two coordinated workflows, as two separate triggers are required — one for new repair submissions and another for repair updates. A Unique Unit ID that corresponds to individual units is attributed to each request, and timestamps are used to coordinate repair updates with specific requests. General use cases include: Property managers who manage multiple buildings or units. Building owners looking to centralize tenant repair communication. Automation builders who want to learn multi-trigger workflow design in n8n. --- ⚙️ How It Works Workflow 1 – New Repair Requests Behind the Scenes: A tenant fills out a Google Form (“Repair Request Form”), which automatically adds a new row to a linked Google Sheet. Steps: Trigger: Google Sheets rowAdded – runs when a new form entry appears. Extract & Format: Collects all relevant form data (address, unit, urgency, contacts). Generate Unit ID: Creates a standardized identifier (e.g., BUILDING-UNIT) for tracking. Email Notification: Sends the building manager a formatted email summarizing the repair details and including a link to a Repair Update Form (which activates Workflow 2). --- Workflow 2 – Repair Updates Behind the Scenes:\ Triggered when the building manager submits a follow-up form (“Repair Update Form”). Steps: Lookup by UUID: Uses the Unit ID from Workflow 1 to find the existing row in the Google Sheet. Conditional Logic: If photos are uploaded: Saves each image to a Google Drive folder, renames files consistently, and adds URLs to the sheet. If no photos: Skips the upload step and processes textual updates only. Merge & Update: Combines new data with existing repair info in the same spreadsheet row — enabling a full repair history in one place. --- 🧩 Requirements Google Account (for Forms, Sheets, and Drive) Gmail/email node connected for sending notifications n8n credentials configured for Google API access --- ⚡ Setup Instructions (see more detail in workflow) Import both workflows into n8n, then copy one into a second workflow. Change manual trigger in workflow 2 to a n8n Form node. Connect Google credentials to all nodes. Update spreadsheet and folder IDs in the corresponding nodes. Customize email text, sender name, and form links for your organization. Test each workflow with a sample repair request and a repair update submission. --- 🛠️ Customization Ideas Add Slack or Telegram notifications for urgent repairs. Auto-create folders per building or unit for photo uploads. Generate monthly repair summaries using Google Sheets triggers. Add an AI node to create summaries/extract relevant repair data from repair request that include long submissions.
Automate Dutch Public Procurement Data Collection with TenderNed
TenderNed Public Procurement What This Workflow Does This workflow automates the collection of public procurement data from TenderNed (the official Dutch tender platform). It: Fetches the latest tender publications from the TenderNed API Retrieves detailed information in both XML and JSON formats for each tender Parses and extracts key information like organization names, titles, descriptions, and reference numbers Filters results based on your custom criteria Stores the data in a database for easy querying and analysis Setup Instructions This template comes with sticky notes providing step-by-step instructions in Dutch and various query options you can customize. Prerequisites TenderNed API Access - Register at TenderNed for API credentials Configuration Steps Set up TenderNed credentials: Add HTTP Basic Auth credentials with your TenderNed API username and password Apply these credentials to the three HTTP Request nodes: "Tenderned Publicaties" "Haal XML Details" "Haal JSON Details" Customize filters: Modify the "Filter op ..." node to match your specific requirements Examples: specific organizations, contract values, regions, etc. How It Works Step 1: Trigger The workflow can be triggered either manually for testing or automatically on a daily schedule. Step 2: Fetch Publications Makes an API call to TenderNed to retrieve a list of recent publications (up to 100 per request). Step 3: Process & Split Extracts the tender array from the response and splits it into individual items for processing. Step 4: Fetch Details For each tender, the workflow makes two parallel API calls: XML endpoint - Retrieves the complete tender documentation in XML format JSON endpoint - Fetches metadata including reference numbers and keywords Step 5: Parse & Merge Parses the XML data and merges it with the JSON metadata and batch information into a single data structure. Step 6: Extract Fields Maps the raw API data to clean, structured fields including: Publication ID and date Organization name Tender title and description Reference numbers (kenmerk, TED number) Step 7: Filter Applies your custom filter criteria to focus on relevant tenders only. Step 8: Store Inserts the processed data into your database for storage and future analysis. Customization Tips Modify API Parameters In the "Tenderned Publicaties" node, you can adjust: offset: Starting position for pagination size: Number of results per request (max 100) Add query parameters for date ranges, status filters, etc. Add More Fields Extend the "Splits Alle Velden" node to extract additional fields from the XML/JSON data, such as: Contract value estimates Deadline dates CPV codes (procurement classification) Contact information Integrate Notifications Add a Slack, Email, or Discord node after the filter to get notified about new matching tenders. Incremental Updates Modify the workflow to only fetch new tenders by: Storing the last execution timestamp Adding date filters to the API query Only processing publications newer than the last run Troubleshooting No data returned? Verify your TenderNed API credentials are correct Check that you have setup youre filter proper Need help setting this up or interested in a complete tender analysis solution? Get in touch 🔗 LinkedIn – Wessel Bulte
Automate invoice processing with OCR, GPT-4 & Salesforce opportunity creation
PDF Invoice Extractor (AI) End-to-end pipeline: Watch Drive ➜ Download PDF ➜ OCR text ➜ AI normalize to JSON ➜ Upsert Buyer (Account) ➜ Create Opportunity ➜ Map Products ➜ Create OLI via Composite API ➜ Archive to OneDrive. --- Node by node (what it does & key setup) 1) Google Drive Trigger Purpose: Fire when a new file appears in a specific Google Drive folder. Key settings: Event: fileCreated Folder ID: google drive folder id Polling: everyMinute Creds: googleDriveOAuth2Api Output: Metadata { id, name, ... } for the new file. --- 2) Download File From Google Purpose: Get the file binary for processing and archiving. Key settings: Operation: download File ID: ={{ $json.id }} Creds: googleDriveOAuth2Api Output: Binary (default key: data) and original metadata. --- 3) Extract from File Purpose: Extract text from PDF (OCR as needed) for AI parsing. Key settings: Operation: pdf OCR: enable for scanned PDFs (in options) Output: JSON with OCR text at {{ $json.text }}. --- 4) Message a model (AI JSON Extractor) Purpose: Convert OCR text into strict normalized JSON array (invoice schema). Key settings: Node: @n8n/n8n-nodes-langchain.openAi Model: gpt-4.1 (or gpt-4.1-mini) Message role: system (the strict prompt; references {{ $json.text }}) jsonOutput: true Creds: openAiApi Output (per item): $.message.content → the parsed JSON (ensure it’s an array). --- 5) Create or update an account (Salesforce) Purpose: Upsert Buyer as Account using an external ID. Key settings: Resource: account Operation: upsert External Id Field: taxid_c External Id Value: ={{ $json.message.content.buyer.tax_id }} Name: ={{ $json.message.content.buyer.name }} Creds: salesforceOAuth2Api Output: Account record (captures Id) for downstream Opportunity. --- 6) Create an opportunity (Salesforce) Purpose: Create Opportunity linked to the Buyer (Account). Key settings: Resource: opportunity Name: ={{ $('Message a model').item.json.message.content.invoice.code }} Close Date: ={{ $('Message a model').item.json.message.content.invoice.issue_date }} Stage: Closed Won Amount: ={{ $('Message a model').item.json.message.content.summary.grand_total }} AccountId: ={{ $json.id }} (from Upsert Account output) Creds: salesforceOAuth2Api Output: Opportunity Id for OLI creation. --- 7) Build SOQL (Code / JS) Purpose: Collect unique product codes from AI JSON and build a SOQL query for PricebookEntry by Pricebook2Id. Key settings: pricebook2Id (hardcoded in script): e.g., 01sxxxxxxxxxxxxxxx Source lines: $('Message a model').first().json.message.content.products Output: { soql, codes } --- 8) Query PricebookEntries (Salesforce) Purpose: Fetch PricebookEntry.Id for each Product2.ProductCode. Key settings: Resource: search Query: ={{ $json.soql }} Creds: salesforceOAuth2Api Output: Items with Id, Product2.ProductCode (used for mapping). --- 9) Code in JavaScript (Build OLI payloads) Purpose: Join lines with PBE results and Opportunity Id ➜ build OpportunityLineItem payloads. Inputs: OpportunityId: ={{ $('Create an opportunity').first().json.id }} Lines: ={{ $('Message a model').first().json.message.content.products }} PBE rows: from previous node items Output: { body: { allOrNone:false, records:[{ OpportunityLineItem... }] } } Notes: Converts discount_total ➜ per-unit if needed (currently commented for standard pricing). Throws on missing PBE mapping or empty lines. --- 10) Create Opportunity Line Items (HTTP Request) Purpose: Bulk create OLIs via Salesforce Composite API. Key settings: Method: POST URL: https://<your-instance>.my.salesforce.com/services/data/v65.0/composite/sobjects Auth: salesforceOAuth2Api (predefined credential) Body (JSON): ={{ $json.body }} Output: Composite API results (per-record statuses). --- 11) Update File to One Drive Purpose: Archive the original PDF in OneDrive. Key settings: Operation: upload File Name: ={{ $json.name }} Parent Folder ID: onedrive folder id Binary Data: true (from the Download node) Creds: microsoftOneDriveOAuth2Api Output: Uploaded file metadata. --- Data flow (wiring) Google Drive Trigger → Download File From Google Download File From Google → Extract from File → Update File to One Drive Extract from File → Message a model Message a model → Create or update an account Create or update an account → Create an opportunity Create an opportunity → Build SOQL Build SOQL → Query PricebookEntries Query PricebookEntries → Code in JavaScript Code in JavaScript → Create Opportunity Line Items --- Quick setup checklist 🔐 Credentials: Connect Google Drive, OneDrive, Salesforce, OpenAI. 📂 IDs: Drive Folder ID (watch) OneDrive Parent Folder ID (archive) Salesforce Pricebook2Id (in the JS SOQL builder) 🧠 AI Prompt: Use the strict system prompt; jsonOutput = true. 🧾 Field mappings: Buyer tax id/name → Account upsert fields Invoice code/date/amount → Opportunity fields Product name must equal your Product2.ProductCode in SF. ✅ Test: Drop a sample PDF → verify: AI returns array JSON only Account/Opportunity created OLI records created PDF archived to OneDrive --- Notes & best practices If PDFs are scans, enable OCR in Extract from File. If AI returns non-JSON, keep “Return only a JSON array” as the last line of the prompt and keep jsonOutput enabled. Consider adding validation on parsing.warnings to gate Salesforce writes. For discounts/taxes in OLI: Standard OLI fields don’t support per-line discount amounts directly; model them in UnitPrice or custom fields. Replace the Composite API URL with your org’s domain or use the Salesforce node’s Bulk Upsert for simplicity.