Stop Guessing. Start Testing Your AI.
Merit is the first testing framework built specifically for LLMs and AI agents. Go beyond string matching with semantic assertions that understand what your AI actually says.

The Problem with Testing AI
Traditional testing tools weren’t built for AI systems:

- ❌ String matching fails - “Paris is France’s capital” ≠ “The capital of France is Paris” (see the sketch after this list)
- ❌ Can’t detect hallucinations - Your LLM adds made-up facts, tests still pass
- ❌ Manual verification - Checking 100s of outputs by hand isn’t scalable
- ❌ No root cause analysis - 50 test failures, but what’s actually broken?
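
To make the first point concrete, here is a plain-Python illustration (no Merit API involved) of why an exact-match check rejects a perfectly correct answer as soon as the wording changes:

```python
# Plain Python, no Merit involved: an exact-match check rejects a correct
# answer as soon as the wording changes.
expected = "Paris is France's capital"
llm_output = "The capital of France is Paris"

print(llm_output == expected)  # False: same fact, different phrasing, test fails
```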
Merit Solves This
Semantic Assertions
LLM-as-a-Judge checks facts, not strings. Detects hallucinations, missing info, and contradictions automatically.
Familiar Syntax
Pytest-like interface. If you know pytest, you know Merit. Resources, parametrization, async support (see the sketch below).
Intelligent Analysis
Auto-cluster failures. Find root causes fast. 50 failures → 3 actual issues to fix.
Production-Ready
Built for scale. Concurrent testing, tracing, CI/CD integration. Test 1000s of cases in minutes.
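
As an illustration of the pytest-style workflow, here is a hypothetical parametrized test. The `merit` import path, the `has_facts` call signature, and the stub chatbot are assumptions for the sketch, not confirmed Merit API:

```python
import pytest

# Hypothetical import: Merit's real module layout and signatures may differ.
from merit import has_facts


def my_chatbot(question: str) -> str:
    """Stand-in for your actual LLM or agent call."""
    return "The capital of France is Paris, and French is the official language."


@pytest.mark.parametrize(
    "question, expected_facts",
    [
        ("Tell me about France", ["Paris is the capital", "French is the official language"]),
    ],
)
def test_chatbot_answers(question: str, expected_facts: list[str]) -> None:
    answer = my_chatbot(question)
    # Semantic check: an LLM judge verifies the facts are present,
    # regardless of exact wording.
    assert has_facts(answer, expected_facts)
```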
See It In Action
Test your chatbot with semantic assertions.
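
The snippet below is an illustrative sketch rather than the exact Merit API: the import path, the `context` keyword, and the stub `summarize` function are assumptions built around the assertion names documented further down this page:

```python
# Sketch only: import path and parameter names are assumed, not confirmed.
from merit import has_facts, has_unsupported_facts

SOURCE = "Lyon's metropolitan area has about 2 million residents."


def summarize(document: str) -> str:
    """Stand-in for your chatbot or summarization pipeline."""
    return "Roughly two million people live in and around Lyon."


def test_summary_is_grounded() -> None:
    answer = summarize(SOURCE)
    # No invented facts: everything in the answer should be supported by SOURCE.
    assert not has_unsupported_facts(answer, context=SOURCE)
    # The key fact is present even though the wording differs from the source.
    assert has_facts(answer, ["about 2 million residents"], context=SOURCE)
```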
What You Can Test

Merit’s LLM-as-a-Judge assertions understand meaning, not just text:

Catch Every Failure Mode

✓ Hallucinations
has_unsupported_facts - Detects when your LLM invents information not in the source.
Example: Source says “2 million residents” but LLM outputs “50 million” ❌

✓ Missing Information
has_facts - Catches incomplete responses that skip critical details.
Example: Asked for capital, population, and language. Only mentions capital. ❌

✓ Wrong Information
has_conflicting_facts - Finds contradictions with your source material.
Example: Source says “Paris” but LLM says “Berlin is the capital of France” ❌

✓ Missing Topics
has_topics - Ensures all required subjects are covered.
Example: Travel guide must cover hotels, transport, and attractions. ✓

✓ Policy Violations
follows_policy - Validates compliance with brand guidelines and requirements.
Example: Must include disclaimer, use professional tone, avoid promises. ✓

✓ Wrong Style
matches_writing_style - Checks that tone, voice, and writing patterns match your brand.
Example: Casual brand voice vs formal corporate-speak. ✓
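
To show how the remaining checks might combine in a single test, here is a hedged sketch using the `has_topics`, `follows_policy`, and `matches_writing_style` names from the list above; the import path, keyword arguments, and stub generator are assumptions, not confirmed Merit signatures:

```python
# Sketch only: argument names and import path are assumed, not confirmed.
from merit import follows_policy, has_topics, matches_writing_style

BRAND_VOICE_SAMPLE = "Hey there! We keep travel planning simple, friendly, and jargon-free."
POLICY = (
    "Include a booking disclaimer, keep a professional tone, "
    "and never promise guaranteed availability."
)


def generate_travel_guide(city: str) -> str:
    """Stand-in for your content-generation pipeline."""
    return (
        f"Planning a trip to {city}? Here's the lowdown on hotels, getting around "
        "by metro and tram, and the attractions you shouldn't miss. "
        "Note: availability and prices can change, so double-check before booking."
    )


def test_travel_guide_quality() -> None:
    guide = generate_travel_guide("Lisbon")
    # Every required subject must be covered somewhere in the guide.
    assert has_topics(guide, ["hotels", "transport", "attractions"])
    # The output must respect the brand/compliance policy.
    assert follows_policy(guide, policy=POLICY)
    # Tone and voice should match the reference writing sample.
    assert matches_writing_style(guide, reference=BRAND_VOICE_SAMPLE)
```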
Scale from 1 to 1,000,000 Tests

Real-world AI testing means lots of test cases. Merit makes it manageable:

- Groups similar errors automatically
- Identifies problematic code
- Suggests fixes
- Generates interactive HTML reports
Who Uses Merit?
- Chatbot Developers - Test response accuracy, detect hallucinations, ensure brand voice consistency
- Document AI - Verify summaries, check extractions, validate transformations at scale
- AI Agent Teams - Test complex workflows, validate tool usage, ensure reliable behavior
- Content Generation - Check facts, verify style, ensure quality across 1000s of outputs
- RAG Systems - Validate groundedness, catch hallucinations, test retrieval quality
- AI Researchers - Evaluate models, compare outputs, reproduce results reliably
Ready to Start Testing?
- Quick Start Guide - Install Merit and write your first test in 2 minutes
- Example Tests - See real-world testing patterns and best practices
- AI Predicates - Learn about all 8 LLM-as-a-Judge assertions
- Configuration - Set up API keys and environment variables
Need help getting started? Join our community or contact us for support.