
Templates by Jinash Rouniyar

Compare GPT-4, Claude & Gemini Responses with Contextual AI's LMUnit Evaluation

PROBLEM

Evaluating and comparing responses from multiple LLMs (OpenAI, Claude, Gemini) is challenging when done manually. Each model produces outputs that differ in clarity, tone, and reasoning structure. Traditional evaluation metrics like ROUGE or BLEU fail to capture nuanced quality differences, and human evaluations are inconsistent, slow, and difficult to scale. This workflow automates LLM response quality evaluation using Contextual AI's LMUnit, a natural language unit testing framework that provides systematic, fine-grained feedback on response clarity and conciseness.

> Note: LMUnit offers natural language-based evaluation on a 1–5 scoring scale, enabling consistent and interpretable results across different model outputs.

How it works

- A chat trigger node sends the same input prompt to multiple LLMs (OpenAI GPT-4.1, Claude 4.5 Sonnet, and Gemini 2.5 Flash) to ensure a fair comparison.
- The responses are aggregated and associated with each test case.
- Contextual AI's LMUnit node evaluates each response against predefined quality criteria:
  - "Is the response clear and easy to understand?" (clarity)
  - "Is the response concise and free from redundancy?" (conciseness)
- LMUnit produces an evaluation score (1–5) for each test; see the sketch after this section for what a single LMUnit call looks like.
- Results are aggregated and formatted into a structured summary showing per-model performance and overall averages.

How to set up

- Create a free Contextual AI account and obtain your CONTEXTUAL_AI_API_KEY. In your n8n instance, add this key as a credential under "Contextual AI."
- Obtain and add credentials for each model provider you wish to test:
  - OpenAI API key: platform.openai.com/account/api-keys
  - Anthropic API key: console.anthropic.com/settings/keys
  - Gemini API key: ai.google.dev/gemini-api/docs/api-key
- Start sending prompts through the chat interface to automatically generate model outputs and evaluations.

How to customize the workflow

- Add more evaluation criteria (e.g., factual accuracy, tone, completeness) in the LMUnit test configuration.
- Include additional LLM providers by duplicating the response generation nodes.
- Adjust thresholds and aggregation logic to suit your evaluation goals.
- Enhance the final summary formatting for dashboards, tables, or JSON exports.

For detailed API parameters, refer to the LMUnit API reference. If you have feedback or need support, please email feedback@contextual.ai.
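For reference, here is a minimal Python sketch of what a single LMUnit evaluation looks like outside n8n. The endpoint path and the query/response/unit_test field names are assumptions based on Contextual AI's public LMUnit documentation, and the prompt and model responses are placeholders; consult the LMUnit API reference above for the exact parameters.

```python
# Minimal sketch: one LMUnit call per (model response, criterion) pair.
# Endpoint path and request/response fields are assumptions; verify against
# the LMUnit API reference before relying on this.
import os
import requests

API_URL = "https://api.contextual.ai/v1/lmunit"  # assumed endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['CONTEXTUAL_AI_API_KEY']}"}

prompt = "Explain what a vector database is."  # example shared prompt

# Responses produced by each model for the same prompt (placeholder values).
responses = {
    "gpt-4.1": "<GPT-4.1 output>",
    "claude-4.5-sonnet": "<Claude output>",
    "gemini-2.5-flash": "<Gemini output>",
}

# Natural-language unit tests used by the workflow.
criteria = {
    "clarity": "Is the response clear and easy to understand?",
    "conciseness": "Is the response concise and free from redundancy?",
}

scores = {}
for model, answer in responses.items():
    for name, unit_test in criteria.items():
        payload = {"query": prompt, "response": answer, "unit_test": unit_test}
        result = requests.post(API_URL, json=payload, headers=HEADERS, timeout=30).json()
        scores[(model, name)] = result["score"]  # assumed field: LMUnit's 1-5 score

# Aggregate per-criterion scores into a per-model average, as the workflow does.
for model in responses:
    avg = sum(scores[(model, c)] for c in criteria) / len(criteria)
    print(f"{model}: {avg:.2f}")
```

In the template, the same calls are made by the Contextual AI LMUnit node once per response-criterion pair, and the final summary node averages the scores per model.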

By Jinash Rouniyar

Dynamic MCP server selection with OpenAI GPT-4.1 and Contextual AI reranker

PROBLEM

Thousands of MCP servers exist and many are updated daily, making server selection difficult for LLMs. Current approaches require manually downloading and configuring servers, which limits flexibility, and when multiple servers are pre-configured, LLMs get overwhelmed and confused about which server to use for a given task. This template enables dynamic server selection from a live PulseMCP directory of 5000+ servers.

How it works

- A user query goes to an LLM that decides whether MCP servers are needed to fulfill the query and provides reasoning for its decision.
- Next, we fetch MCP servers from the PulseMCP API and format them as documents for reranking.
- Finally, we use Contextual AI's reranker to score and rank all MCP servers against the query and instructions (a minimal sketch of this fetch-and-rerank step follows this section).

How to set up

- Sign up for a free trial of Contextual AI to obtain your CONTEXTUAL_AI_API_KEY.
- Click the variables option in the left panel and add a new environment variable named CONTEXTUAL_AI_API_KEY.
- For the baseline model we use GPT-4.1 mini; you can find your OpenAI API key at platform.openai.com/account/api-keys.

How to customize the workflow

- We use a chat trigger to initiate the workflow. Feel free to replace it with a webhook or another trigger as required.
- We use OpenAI's GPT-4.1 mini as the baseline model and reranker prompt generator. You can swap out this section to use the LLM of your choice.
- We fetch 5000 MCP servers from the PulseMCP directory as a baseline number; feel free to adjust this parameter as required.
- We use Contextual AI's ctxl-rerank-v2-instruct-multilingual reranker model, which can be swapped for any of the following rerankers:
  1) ctxl-rerank-v2-instruct-multilingual
  2) ctxl-rerank-v2-instruct-multilingual-mini
  3) ctxl-rerank-v1-instruct
- You can check out the Contextual AI blog to learn more about rerankers.

Good to know:

- Contextual AI reranker (with full MCP docs): ~$0.035/query (~$0.035 for reranking plus ~$0.0001 for OpenAI instruction generation).
- OpenAI baseline: ~$0.017/query.
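For orientation, here is a minimal Python sketch of the fetch-and-rerank step outside n8n. The PulseMCP endpoint, its response fields, and the shape of Contextual AI's rerank request shown here are assumptions based on the providers' public docs, and the query and instruction strings are placeholders; verify the details against the official references before reuse.

```python
# Minimal sketch of the fetch-and-rerank step, under assumed API shapes.
# PulseMCP endpoint/fields and the Contextual AI /rerank request format are
# assumptions; check both providers' docs before relying on this.
import os
import requests

CTX_KEY = os.environ["CONTEXTUAL_AI_API_KEY"]

# 1) Fetch MCP servers from the PulseMCP directory (assumed endpoint and params;
#    the workflow fetches up to 5000 servers).
servers = requests.get(
    "https://api.pulsemcp.com/v0beta/servers",
    params={"count_per_page": 100},
    timeout=30,
).json()["servers"]  # assumed response field

# 2) Format each server as a plain-text document for reranking
#    (field names are assumptions about the PulseMCP payload).
documents = [f"{s.get('name', '')}: {s.get('short_description', '')}" for s in servers]

# 3) Score and rank the documents with Contextual AI's reranker.
query = "Find an MCP server that can search and summarize GitHub issues."  # example query
rerank = requests.post(
    "https://api.contextual.ai/v1/rerank",  # assumed endpoint
    headers={"Authorization": f"Bearer {CTX_KEY}"},
    json={
        "model": "ctxl-rerank-v2-instruct-multilingual",
        "query": query,
        "instruction": "Prefer actively maintained servers relevant to the task.",  # example instruction
        "documents": documents,
    },
    timeout=60,
).json()

# Print the top 5 candidate servers by relevance score (assumed response shape).
for item in rerank["results"][:5]:
    print(f"{item['relevance_score']:.3f}  {documents[item['index']]}")
```

In the template itself, the instruction string is generated by GPT-4.1 mini before the rerank call, and the top-ranked servers are what the LLM then draws on to answer the user's query.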

By Jinash Rouniyar