Compare GPT-4, Claude & Gemini Responses with Contextual AI's LMUnit Evaluation
PROBLEM
Manually evaluating and comparing responses from multiple LLMs (OpenAI GPT-4, Anthropic Claude, Google Gemini) is slow and error-prone.
- Each model produces outputs that differ in clarity, tone, and reasoning structure.
- Traditional evaluation metrics like ROUGE or BLEU fail to capture nuanced quality differences.
- Human evaluations are inconsistent, slow, and difficult to scale.
This workflow automates LLM response quality evaluation using Contextual AI’s LMUnit, a natural language unit testing framework that provides systematic, fine-grained feedback on response clarity and conciseness.
> Note: LMUnit offers natural language-based evaluation with a 1–5 scoring scale, enabling consistent and interpretable results across different model outputs.
How it works
- A chat trigger node receives the user's prompt and fans it out to multiple LLMs such as OpenAI GPT-4.1, Claude 4.5 Sonnet, and Gemini 2.5 Flash.
- Each model receives the same input prompt to ensure a fair comparison; the responses are then aggregated and paired with each test case.
- We use Contextual AI's LMUnit node to evaluate each response using predefined quality criteria:
- “Is the response clear and easy to understand?” - Clarity
- “Is the response concise and free from redundancy?” - Conciseness
- LMUnit then produces an evaluation score (1–5) for each test; a sketch of the underlying call follows this list.
- Results are aggregated and formatted into a structured summary showing model-wise performance and overall averages.
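For illustration, each LMUnit check boils down to one HTTP call per response/criterion pair. The LMUnit node handles this for you, but a minimal sketch of the underlying request (endpoint and field names taken from Contextual AI's LMUnit API reference; verify against the current docs before relying on them) looks like this:

```javascript
// Minimal sketch of one LMUnit check: one HTTP call per (response, criterion)
// pair. Endpoint and field names follow Contextual AI's LMUnit API reference.
const LMUNIT_URL = 'https://api.contextual.ai/v1/lmunit';

async function scoreResponse(apiKey, prompt, modelResponse, unitTest) {
  const res = await fetch(LMUNIT_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: prompt,           // the shared input prompt
      response: modelResponse, // one model's answer
      unit_test: unitTest,     // e.g. "Is the response clear and easy to understand?"
    }),
  });
  if (!res.ok) throw new Error(`LMUnit request failed: ${res.status}`);
  const { score } = await res.json(); // a 1-5 rating
  return score;
}
```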
How to set up
- Create a free Contextual AI account and obtain your CONTEXTUALAI_API_KEY.
- In your n8n instance, add this key as a credential under “Contextual AI.”
- Obtain and add credentials for each model provider you wish to test:
- OpenAI API Key: platform.openai.com/account/api-keys
- Anthropic API Key: console.anthropic.com/settings/keys
- Gemini API Key: ai.google.dev/gemini-api/docs/api-key
- Start sending prompts through the chat interface to automatically generate model outputs and evaluations.
How to customize the workflow
- Add more evaluation criteria (e.g., factual accuracy, tone, completeness) in the LMUnit test configuration.
- Include additional LLM providers by duplicating the response generation nodes.
- Adjust thresholds and aggregation logic to suit your evaluation goals (see the sketch after this list).
- Enhance the final summary formatting for dashboards, tables, or JSON exports.
- For detailed API parameters, refer to the LMUnit API reference.
- If you have feedback or need support, please email feedback@contextual.ai.
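If you add criteria, one simple pattern is to keep them in a single array and treat any score at or above a threshold as a pass. The sketch below is illustrative only: the added criterion wording and the threshold of 4 are assumptions, not part of the template.

```javascript
// Illustrative criteria list and pass/fail threshold; adjust to your goals.
const criteria = [
  { name: 'Clarity',     test: 'Is the response clear and easy to understand?' },
  { name: 'Conciseness', test: 'Is the response concise and free from redundancy?' },
  // Example additions (hypothetical wording):
  { name: 'Factual accuracy', test: 'Are all factual claims in the response correct?' },
  { name: 'Completeness',     test: 'Does the response fully address every part of the question?' },
];

const PASS_THRESHOLD = 4; // scores are 1-5; >= 4 counts as a pass here

function summarize(scores) {
  // scores: per-criterion results for one model, e.g. { Clarity: 4.5, Conciseness: 3.8 }
  const values = Object.values(scores);
  const average = values.reduce((a, b) => a + b, 0) / values.length;
  const passed = values.filter((s) => s >= PASS_THRESHOLD).length;
  return { average, passed, total: values.length };
}
```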
Compare GPT-4, Claude & Gemini Responses with Contextual AI's LMUnit Evaluation
This n8n workflow provides a robust framework for comparing the responses of different Large Language Models (LLMs) such as GPT-4, Claude, and Gemini. It uses n8n's LangChain nodes to query each model and a custom Code node to evaluate their responses with Contextual AI's LMUnit, giving you a systematic, automated way to assess the quality and relevance of AI-generated content for a given prompt and context.
What it does
This workflow automates the following steps:
- Receives Chat Message: Initiates the workflow upon receiving a chat message, which serves as the user's prompt.
- Prepares Context (Edit Fields): Sets up the initial context and prompt for the LLMs. This can include system instructions or specific data points.
- Loops Over LLM Models: Iterates through a predefined list of LLMs (OpenAI, Google Gemini, Anthropic) to get responses from each.
- Generates OpenAI (GPT-4) Response: Sends the prompt and context to OpenAI's GPT-4 model and captures its response.
- Generates Google Gemini Response: Sends the prompt and context to Google Gemini and captures its response.
- Generates Anthropic (Claude) Response: Sends the prompt and context to Anthropic's Claude model and captures its response.
- Merges Responses: Combines the responses from all LLMs into a single data structure for unified processing.
- Evaluates Responses with LMUnit (Code): Executes custom JavaScript to evaluate each LLM's response against the original prompt and context using LMUnit, assigning each response a score based on predefined criteria (a sketch of this step follows the list).
- Waits (Optional Delay): Introduces a configurable delay, potentially to manage API rate limits or allow for external processing.
- Responds to Chat: Sends the evaluation results and the LLM responses back to the user via the chat interface.
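To make the merge-and-evaluate steps concrete, here is a hedged sketch of the aggregation a Code node might perform once per-criterion scores are available. The input shape and model labels are hypothetical; adapt them to how your Merge node structures its items.

```javascript
// Hypothetical input shape after the Merge node: one item per model response.
const results = [
  { model: 'gpt-4.1',           scores: { Clarity: 4.6, Conciseness: 4.1 } },
  { model: 'claude-sonnet-4-5', scores: { Clarity: 4.4, Conciseness: 4.7 } },
  { model: 'gemini-2.5-flash',  scores: { Clarity: 4.2, Conciseness: 4.5 } },
];

// Compute a per-model average across all unit tests.
const summary = results.map(({ model, scores }) => {
  const values = Object.values(scores);
  const average = values.reduce((a, b) => a + b, 0) / values.length;
  return { model, ...scores, average: Number(average.toFixed(2)) };
});

// Overall average across models, for the final report.
const overall = summary.reduce((a, r) => a + r.average, 0) / summary.length;

// An n8n Code node returns items as an array of { json } objects.
return summary.map((row) => ({ json: { ...row, overall: Number(overall.toFixed(2)) } }));
```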
Prerequisites/Requirements
To use this workflow, you will need:
- n8n Instance: A running instance of n8n.
- OpenAI API Key: For interacting with OpenAI's GPT-4 model.
- Google Gemini API Key: For interacting with Google Gemini.
- Anthropic API Key: For interacting with Anthropic's Claude model.
- Contextual AI API Key (LMUnit): The Code node calls LMUnit for evaluation. Ensure your Contextual AI credentials are configured, or adapt the custom code to your preferred evaluation method.
Setup/Usage
- Import the workflow: Download the JSON provided and import it into your n8n instance.
- Configure Credentials:
- For the "OpenAI" node, configure your OpenAI API credentials.
- For the "Google Gemini" node, configure your Google Gemini API credentials.
- For the "Anthropic" node, configure your Anthropic API credentials.
- Customize the "Edit Fields" node: Adjust the initial prompt and context as needed for your specific evaluation scenarios.
- Review the "Code" node: The LMUnit evaluation logic is within this node. You may need to modify the JavaScript code to define your specific evaluation criteria, metrics, and how LMUnit is called.
- Activate the workflow: Once configured, activate the workflow. It will now listen for incoming chat messages.
- Send a chat message: Interact with the "Chat Trigger" by sending a message to initiate the comparison and evaluation process. The results are returned via the "Chat" response node, as in the hypothetical example below.
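As a purely hypothetical example of the round trip (the exact layout depends on your summary formatting), sending a prompt such as "Explain the difference between TCP and UDP" might produce a reply like:

```
Model              Clarity  Conciseness  Average
gpt-4.1            4.6      4.1          4.35
claude-sonnet-4-5  4.4      4.7          4.55
gemini-2.5-flash   4.2      4.5          4.35
Overall average: 4.42
```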