Compare GPT-4, Claude & Gemini Responses with Contextual AI's LMUnit Evaluation
PROBLEM
Evaluating and comparing responses from multiple LLMs (OpenAI, Claude, Gemini) can be challenging when done manually.
- Each model produces outputs that differ in clarity, tone, and reasoning structure.
- Traditional evaluation metrics like ROUGE or BLEU fail to capture nuanced quality differences.
- Human evaluations are inconsistent, slow, and difficult to scale.
This workflow automates LLM response quality evaluation using Contextual AI’s LMUnit, a natural language unit testing framework that provides systematic, fine-grained feedback on response clarity and conciseness.
> Note: LMUnit offers natural language-based evaluation with a 1–5 scoring scale, enabling consistent and interpretable results across different model outputs.
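
For orientation, a single LMUnit check comes down to one API call per response/criterion pair. The TypeScript sketch below assumes the Contextual AI REST endpoint `POST https://api.contextual.ai/v1/lmunit` and the request fields `query`, `response`, and `unit_test`; verify both against the LMUnit API reference before reusing it.

```typescript
// Minimal sketch of one LMUnit check. The endpoint path and field names are
// assumptions; confirm them against the LMUnit API reference.
async function runUnitTest(
  apiKey: string,
  query: string,     // the prompt sent to every model
  response: string,  // one model's answer
  unitTest: string,  // e.g. "Is the response clear and easy to understand?"
): Promise<number> {
  const res = await fetch("https://api.contextual.ai/v1/lmunit", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query, response, unit_test: unitTest }),
  });
  if (!res.ok) throw new Error(`LMUnit request failed: ${res.status}`);
  const data = (await res.json()) as { score: number };
  return data.score; // 1-5, higher is better
}
```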
How it works
- A chat trigger node receives your prompt and fans it out to multiple LLMs, such as **OpenAI GPT-4.1, Claude 4.5 Sonnet, and Gemini 2.5 Flash**.
- Each model receives the same input prompt to ensure a fair comparison; the responses are then aggregated and paired with each test case.
- We use Contextual AI's LMUnit node to evaluate each response using predefined quality criteria:
- “Is the response clear and easy to understand?” - Clarity
- “Is the response concise and free from redundancy?” - Conciseness
- LMUnit then produces an evaluation score (1–5) for each test case.
- Results are aggregated and formatted into a structured summary showing model-wise performance and overall averages, as sketched below.
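
To make the aggregation step concrete, here is a hedged TypeScript sketch, adaptable to an n8n Code node, of pairing each model's response with every unit test and rolling the 1–5 scores up into per-model averages. The data shapes and helper names are illustrative and not taken from the workflow itself; `runUnitTest` is the single-check helper sketched above.

```typescript
// runUnitTest is the single LMUnit check sketched in the earlier snippet.
declare function runUnitTest(
  apiKey: string, query: string, response: string, unitTest: string,
): Promise<number>;

// Illustrative data shapes; the actual item structure inside the workflow may differ.
interface ModelResponse { model: string; response: string }
interface ScoredTest { model: string; unitTest: string; score: number }

const unitTests = [
  "Is the response clear and easy to understand?",      // Clarity
  "Is the response concise and free from redundancy?",  // Conciseness
];

// Build one LMUnit test per (model response, criterion) pair, then score it.
async function scoreAll(
  apiKey: string,
  query: string,
  responses: ModelResponse[],
): Promise<ScoredTest[]> {
  const results: ScoredTest[] = [];
  for (const { model, response } of responses) {
    for (const unitTest of unitTests) {
      const score = await runUnitTest(apiKey, query, response, unitTest);
      results.push({ model, unitTest, score });
    }
  }
  return results;
}

// Roll the individual 1-5 scores up into a per-model average.
function summarize(results: ScoredTest[]) {
  const byModel = new Map<string, number[]>();
  for (const r of results) {
    const scores = byModel.get(r.model) ?? [];
    scores.push(r.score);
    byModel.set(r.model, scores);
  }
  return [...byModel.entries()].map(([model, scores]) => ({
    model,
    average: Number((scores.reduce((a, b) => a + b, 0) / scores.length).toFixed(2)),
  }));
}
```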
How to set up
- Create a free Contextual AI account and obtain your CONTEXTUALAI_API_KEY.
- In your n8n instance, add this key as a credential under “Contextual AI” (you can smoke-test the key with the snippet after this list).
- Obtain and add credentials for each model provider you wish to test:
- OpenAI API Key: platform.openai.com/account/api-keys
- Anthropic API Key: console.anthropic.com/settings/keys
- Gemini API Key: ai.google.dev/gemini-api/docs/api-key
- Start sending prompts through the chat interface to automatically generate model outputs and evaluations.
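
Before wiring everything together, you can optionally smoke-test the Contextual AI credential with a one-off call. The snippet below reuses the `runUnitTest` helper sketched earlier; the CONTEXTUALAI_API_KEY environment variable, sample prompt, and sample response are illustrative assumptions.

```typescript
// runUnitTest is the single LMUnit check sketched in the PROBLEM section.
declare function runUnitTest(
  apiKey: string, query: string, response: string, unitTest: string,
): Promise<number>;

// One-off credential smoke test. CONTEXTUALAI_API_KEY is assumed to be set in
// the environment; the sample prompt and response are made up for illustration.
async function smokeTest(): Promise<void> {
  const apiKey = process.env.CONTEXTUALAI_API_KEY;
  if (!apiKey) throw new Error("CONTEXTUALAI_API_KEY is not set");

  const score = await runUnitTest(
    apiKey,
    "What is n8n?",
    "n8n is a workflow automation tool that connects apps and APIs with visual nodes.",
    "Is the response clear and easy to understand?",
  );
  console.log(`LMUnit clarity score: ${score}`); // expect a value between 1 and 5
}

smokeTest().catch(console.error);
```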
How to customize the workflow
- Add more evaluation criteria (e.g., factual accuracy, tone, completeness) in the LMUnit test configuration.
- Include additional LLM providers by duplicating the response generation nodes.
- Adjust thresholds and aggregation logic to suit your evaluation goals.
- Enhance the final summary formatting for dashboards, tables, or JSON exports (see the sketch at the end of this section).
- For detailed API parameters, refer to the LMUnit API reference.
- If you have feedback or need support, please email [email protected].
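
As one example of the last two customization points, the sketch below turns the per-model averages into a Markdown table suitable for a chat reply, dashboard, or export step, and flags each model against a pass threshold. The 4.0 threshold and the row shape are illustrative assumptions, not values from the workflow.

```typescript
// Illustrative summary row produced by the aggregation step sketched earlier.
interface ModelSummary { model: string; average: number }

const PASS_THRESHOLD = 4.0; // illustrative cut-off, not part of the workflow

// Render the per-model averages as a Markdown table, e.g. for a chat reply,
// a dashboard widget, or a JSON/CSV export step.
function formatSummary(summaries: ModelSummary[]): string {
  const header = "| Model | Avg. score (1-5) | Pass |\n|---|---|---|";
  const rows = [...summaries]
    .sort((a, b) => b.average - a.average)
    .map((s) =>
      `| ${s.model} | ${s.average.toFixed(2)} | ${s.average >= PASS_THRESHOLD ? "yes" : "no"} |`,
    );
  return [header, ...rows].join("\n");
}

// Example usage with made-up numbers:
console.log(formatSummary([
  { model: "GPT-4.1", average: 4.35 },
  { model: "Claude 4.5 Sonnet", average: 4.6 },
  { model: "Gemini 2.5 Flash", average: 4.1 },
]));
```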