Stop Guessing. Start Testing Your AI.

Merit is the first testing framework built specifically for LLMs and AI agents. Go beyond string matching with semantic assertions that understand what your AI actually says.

The Problem with Testing AI

Traditional testing tools weren’t built for AI systems:
  • String matching fails - “Paris is France’s capital” ≠ “The capital of France is Paris” (see the sketch after this list)
  • Can’t detect hallucinations - Your LLM adds made-up facts, tests still pass
  • Manual verification - Checking 100s of outputs by hand isn’t scalable
  • No root cause analysis - 50 test failures, but what’s actually broken?
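
A minimal illustration of the first failure mode, in plain Python with no Merit involved (the strings are the ones from the list above):
def test_capital_exact_match():
    response = "The capital of France is Paris."
    # The answer is correct, but an exact-match assertion still fails
    # because the wording differs from the expected string.
    assert response == "Paris is France's capital"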

Merit Solves This

See It In Action

Test your chatbot with semantic assertions:
import merit
from merit.predicates import has_facts, has_unsupported_facts

def chatbot(prompt: str) -> str:
    """Your AI system"""
    return "Paris is the capital of France and home to the Eiffel Tower."

async def merit_chatbot_accuracy():
    response = chatbot("What is the capital of France?")
    
    # ✓ Semantic fact checking (not string matching)
    assert await has_facts(response, "Paris is the capital of France")
    
    # ✓ Hallucination detection
    assert not await has_unsupported_facts(
        response, 
        "Paris is the capital of France. Famous landmarks include the Eiffel Tower."
    )
Run it:
merit test
That’s it. Merit handles the complexity of semantic evaluation for you.

What You Can Test

Merit’s LLM-as-a-Judge assertions understand meaning, not just text.

Catch Every Failure Mode

has_unsupported_facts - Detects when your LLM invents information not in the source.
  Example: Source says “2 million residents” but LLM outputs “50 million” ❌
has_facts - Catches incomplete responses that skip critical details.
  Example: Asked for capital, population, and language. Only mentions capital. ❌
has_conflicting_facts - Finds contradictions with your source material.
  Example: Source says “Paris” but LLM says “Berlin is the capital of France” ❌
has_topics - Ensures all required subjects are covered.
  Example: Travel guide must cover hotels, transport, and attractions. ✓
follows_policy - Validates compliance with brand guidelines and requirements.
  Example: Must include disclaimer, use professional tone, avoid promises. ✓
matches_writing_style - Checks tone, voice, and writing patterns match your brand.
  Example: Casual brand voice vs formal corporate-speak. ✓
See all 8 predicates →
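
A hypothetical sketch combining three of these predicates in one test. The signatures are assumed to mirror has_facts from the example above; the topic list and policy wording are illustrative, not part of Merit's documented API:
from merit.predicates import has_conflicting_facts, has_topics, follows_policy

SOURCE = "Paris is the capital of France and has about 2 million residents."

async def merit_travel_guide_quality():
    # chatbot() is the system under test from the first example
    response = chatbot("Write a one-paragraph Paris travel guide.")

    # No contradictions with the source material (assumed signature)
    assert not await has_conflicting_facts(response, SOURCE)

    # All required subjects are covered (topic list is illustrative)
    assert await has_topics(response, ["hotels", "transport", "attractions"])

    # Brand guidelines hold (policy wording is illustrative)
    assert await follows_policy(response, "Use a professional tone and avoid promises.")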

Scale from 1 to 1,000,000 Tests

Real-world AI testing means lots of test cases. Merit makes it manageable:
# Test 1000 cases concurrently
merit test --concurrency 10

# Auto-analyze failures to find root causes  
merit-analyzer analyze failures.csv
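The suite itself is often a single parameterized test over a dataset. A hypothetical sketch reusing has_facts and the chatbot function from the first example (the CASES list stands in for your own data):
from merit.predicates import has_facts

CASES = [
    ("What is the capital of France?", "Paris is the capital of France"),
    ("What is the capital of Japan?", "Tokyo is the capital of Japan"),
    # ...load the rest from your dataset
]

async def merit_capitals_accuracy():
    # Each case is checked semantically, not by string comparison
    for prompt, expected_fact in CASES:
        response = chatbot(prompt)
        assert await has_facts(response, expected_fact)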
Merit Analyzer turns 100 failures into actionable insights:
  • Groups similar errors automatically
  • Identifies problematic code
  • Suggests fixes
  • Generates interactive HTML reports
Learn about error analysis →

Who Uses Merit?

Chatbot Developers - Test response accuracy, detect hallucinations, ensure brand voice consistency

Document AI - Verify summaries, check extractions, validate transformations at scale

AI Agent Teams - Test complex workflows, validate tool usage, ensure reliable behavior

Content Generation - Check facts, verify style, ensure quality across 1000s of outputs

RAG Systems - Validate groundedness, catch hallucinations, test retrieval quality

AI Researchers - Evaluate models, compare outputs, reproduce results reliably

Ready to Start Testing?


Need help getting started? Join our community or contact us for support.