Merit vs Evals vs Tests

Merit helps developers test AI projects for AI-specific bugs like hallucinations, missing context, and incorrect decisions. But shouldn’t evals handle that? Or pytest? In this section, we explain the differences and help you pick the best approach for your use case.

What’s wrong with evals?

Evals come from data science. The workflow is:

Define a metric (accuracy, F1, BLEU, etc.)
Run your model on a benchmark dataset
Tweak parameters until the metric improves
Repeat

This works when you’re training a model. But when you’re building a product, you have more control over the system: by incrementally improving the code and system design, you can raise product quality until it is good enough for production. An aggregate score does not guide that work. A 95% accuracy score tells you nothing about which code contributed to which failures, or which failures to fix first.
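To make that concrete, here is a minimal sketch with made-up results (the categories and counts are hypothetical, not taken from any real benchmark). The overall score looks healthy, while a per-category breakdown shows where the failures actually sit.

```python
from collections import defaultdict

# Hypothetical eval results as (category, passed) pairs. Counts are illustrative only.
results = (
    [("returns", True)] * 50
    + [("shipping", True)] * 40
    + [("billing", True)] * 5
    + [("billing", False)] * 5
)


def accuracy(rows):
    """Share of rows whose check passed."""
    return sum(passed for _, passed in rows) / len(rows)


def accuracy_by_category(rows):
    """Group rows by category and compute accuracy for each group."""
    groups = defaultdict(list)
    for category, passed in rows:
        groups[category].append((category, passed))
    return {category: accuracy(group) for category, group in groups.items()}


overall = accuracy(results)
per_category = accuracy_by_category(results)

print(f"overall: {overall:.0%}")
for category, score in per_category.items():
    print(f"{category}: {score:.0%}")
```

In this invented data set the single number (95%) hides the fact that half of the billing questions fail, which is the kind of detail you need when deciding what to fix first.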

What’s wrong with pytest?

Automated tests come from software engineering. The workflow is:

Write an assertion: assert result == expected
Run the test
If it passes once, ship it

This works when code is deterministic. Call add(2, 2) and you always get 4. AI systems are stochastic. The same input can produce different outputs:

# This might pass 9 out of 10 times
def test_greeting():
    response = chatbot("Hello!")
    assert "hello" in response.lower()  # Sometimes it says "Hi there!" instead

Even for something that looks deterministic (like arithmetic), LLM-based systems can behave as if every phrasing were a new case:

def test_llm_calculator(llm_as_calculator):
    # Both assertions might pass or fail regardless of each other
    assert llm_as_calculator("3 + 2") == "5"
    assert llm_as_calculator("2 + 3") == "5"

This is why a single passing test is weak evidence for AI behavior: you need repeated runs and broader case coverage.
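The arithmetic behind this can be sketched in a few lines. Assume, hypothetically, that each run passes independently with probability p (real LLM runs are not perfectly independent, so treat this as a rough model). A flaky check still passes a single run most of the time, but it rarely passes ten runs in a row.

```python
from math import comb


def prob_all_pass(p: float, n: int) -> float:
    """Probability that n independent runs all pass."""
    return p ** n


def prob_at_least(p: float, n: int, k: int) -> float:
    """Probability that at least k of n independent runs pass (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p = 0.9  # assumed per-run pass probability, as in "9 out of 10 times"

single_run = prob_all_pass(p, 1)         # 0.9
ten_in_a_row = prob_all_pass(p, 10)      # about 0.35
eight_of_ten = prob_at_least(p, 10, 8)   # about 0.93

print(f"one run passes:      {single_run:.2f}")
print(f"ten runs all pass:   {ten_in_a_row:.2f}")
print(f"at least 8 of 10:    {eight_of_ten:.2f}")
```

Under this model, one green run says little, demanding ten green runs in a row fails about two times out of three, and a threshold such as 8 of 10 is a more realistic gate for a check that is usually right.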

Comparison

Capability	Evals	Tests	Merit
Native Python syntax	No	Yes	Yes
Explicit test logic	No	Yes	Yes
Cases in datasets	Yes	No	Yes
Metrics aggregations	Yes	No	Yes
Determinism checks	Partial	Yes	Yes
LLM-as-a-Judge	Partial	No	Yes
CI/CD integration	Partial	Yes	Yes
Historical data	Partial	No	Yes

AI Predicates: Assert meaning, not strings

Merit’s AI predicates let you assert on complex properties:

from merit.predicates import has_unsupported_facts, follows_policy

async def merit_customer_support(support_bot):
    knowledge_base = "Returns accepted within 30 days with receipt."
    policy = "Always offer to help with other questions"
    response = support_bot.answer("What's the return policy?")

    # Assert the response doesn't hallucinate facts
    assert not await has_unsupported_facts(response, knowledge_base)

    # Assert the response follows company guidelines
    assert await follows_policy(response, policy)

Repeat: Measure consistency

Merit’s @merit.repeat runs the same merit multiple times:

import merit
from merit.predicates import follows_policy

@merit.repeat(10, min_passes=8)  # 8 out of 10 must pass
async def merit_consistent_greeting(chatbot):
    response = chatbot("Hello!")
    assert await follows_policy(response, "Greeting is friendly and professional")
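For intuition, the idea of a repeated check with a pass threshold can be sketched in plain Python. This is not Merit’s implementation: `repeat_check`, `fake_chatbot`, and `check_greeting` are hypothetical names, the sketch is synchronous for brevity, and the chatbot is replaced by a fixed cycle of replies so the example is reproducible.

```python
import functools
import itertools


def repeat_check(times: int, min_passes: int):
    """Run a check `times` times and require at least `min_passes` passing runs."""

    def decorator(check):
        @functools.wraps(check)
        def wrapper(*args, **kwargs):
            passes = 0
            for _ in range(times):
                try:
                    check(*args, **kwargs)
                    passes += 1
                except AssertionError:
                    pass  # a failed run simply does not count as a pass
            assert passes >= min_passes, (
                f"{passes}/{times} runs passed, need {min_passes}"
            )
            return passes

        return wrapper

    return decorator


# Stand-in for a stochastic chatbot: 1 reply in 10 uses different wording.
_replies = itertools.cycle(["Hello! How can I help?"] * 9 + ["Hi there!"])


def fake_chatbot(prompt: str) -> str:
    return next(_replies)


@repeat_check(10, min_passes=8)
def check_greeting():
    assert "hello" in fake_chatbot("Hello!").lower()


passes = check_greeting()
print(f"{passes}/10 runs passed")
```

The design choice worth noting is that individual failures are counted rather than raised immediately, so the verdict depends on the pass rate across all runs instead of on the first unlucky sample.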

Metrics: Aggregate statistics with quality gates

Merit’s metrics give you statistical power with explicit thresholds:

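import merit
from merit import Case, Metric, metrics
from merit.predicates import has_unsupported_facts

@merit.metric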
def hallucination_rate():
    metric = Metric()
    yield metric
    assert metric.distribution[False] >= 0.95  # 95% must pass

@merit.iter_cases(*qa_dataset)
async def merit_rag_accuracy(case: Case, rag_system, hallucination_rate: Metric):
    response = rag_system.query(**case.sut_input_values)

    with metrics(hallucination_rate):
        assert not await has_unsupported_facts(response, case.references["context"])

You get both individual pass/fail results and aggregate statistics.
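As a rough mental model, collecting per-case outcomes and then gating on the aggregate could look like the sketch below. It does not use or reproduce Merit’s API: `PassRate`, `record`, and `gate` are hypothetical names, and the list of judge verdicts is invented for the example.

```python
from contextlib import contextmanager


class PassRate:
    """Collects pass/fail outcomes and checks them against a threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.outcomes = []

    @contextmanager
    def record(self):
        """Record whether the wrapped block raised an AssertionError."""
        try:
            yield
        except AssertionError:
            # Swallow the failure so the remaining cases are still measured
            self.outcomes.append(False)
        else:
            self.outcomes.append(True)

    @property
    def rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)

    def gate(self) -> None:
        """Fail if the aggregate pass rate is below the threshold."""
        assert self.rate >= self.threshold, (
            f"pass rate {self.rate:.0%} is below {self.threshold:.0%}"
        )


# Hypothetical judge verdicts: True means unsupported facts were found
flagged = [False] * 19 + [True]

no_hallucinations = PassRate(threshold=0.95)
for has_unsupported in flagged:
    with no_hallucinations.record():
        assert not has_unsupported

no_hallucinations.gate()  # 19 of 20 cases passed, which meets the 95% gate
print(no_hallucinations.outcomes.count(False), "failing case(s) recorded")
```

Keeping the per-case outcomes alongside the aggregate is what lets you see both which cases failed and whether the overall rate is acceptable.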

Cases: Turn datasets into type-safe explicit code

Merit’s Case abstraction lets you load test data from external sources:

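import json

import merit
from merit import Case
from pydantic import BaseModel


class TranslationReference(BaseModel):
    expected_translation: str


# Close the file handle once the cases are loaded
with open("test_cases.json") as f:
    cases: list[Case[TranslationReference]] = [Case(**item) for item in json.load(f)]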

@merit.iter_cases(*cases)
async def merit_translation(case: Case[TranslationReference], translator):
    result = translator.translate(**case.sut_input_values)
    # Lowercase both sides so the comparison is case-insensitive
    assert case.references.expected_translation.lower() in result.lower()
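The layout of `test_cases.json` is not shown here, so the sketch below assumes one: a list of objects with `sut_input_values` and `references` keys, mirroring the attribute names used above. It uses standard-library dataclasses as a stand-in for Merit’s `Case` and pydantic, with `SimpleCase` and `load_cases` as hypothetical names and invented sample data, purely to show how typed references catch malformed rows at load time.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationReference:
    expected_translation: str


@dataclass(frozen=True)
class SimpleCase:
    """Stand-in for a test case; this is not Merit's Case class."""

    sut_input_values: dict
    references: TranslationReference


# Assumed file layout, inlined here so the example is self-contained
raw = """
[
  {"sut_input_values": {"text": "Good morning", "target_lang": "es"},
   "references": {"expected_translation": "buenos días"}},
  {"sut_input_values": {"text": "Thank you", "target_lang": "es"},
   "references": {"expected_translation": "gracias"}}
]
"""


def load_cases(text: str) -> list:
    """Parse JSON text into typed cases."""
    loaded = []
    for item in json.loads(text):
        # Raises TypeError if a reference has missing or unexpected keys
        references = TranslationReference(**item["references"])
        loaded.append(SimpleCase(item["sut_input_values"], references))
    return loaded


cases = load_cases(raw)
for case in cases:
    print(case.sut_input_values["text"], "->", case.references.expected_translation)
```

A row with a misspelled reference key fails while loading instead of surfacing later as a confusing attribute error inside a test.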

Example: Testing a Customer Support Bot

Let’s see how the same system would be tested with each approach.

The System

A customer support chatbot that answers questions using a knowledge base:

class SupportBot:
    def __init__(self, knowledge_base: str):
        self.knowledge_base = knowledge_base

    def answer(self, question: str) -> str:
        # LLM-powered response using knowledge_base as context
        # (llm_call is a placeholder for your own LLM client call)
        return llm_call(question, context=self.knowledge_base)

With pytest

# test_support_bot.py
import pytest

def test_returns_policy():
    bot = SupportBot("Returns accepted within 30 days.")
    response = bot.answer("What's your return policy?")

    # Brittle: fails if wording changes
    assert "30 days" in response

def test_shipping_info():
    bot = SupportBot("Free shipping on orders over $50.")
    response = bot.answer("Do you offer free shipping?")

    # Brittle: what if it says "$50" vs "fifty dollars"?
    assert "50" in response

def test_no_competitor_mentions():
    bot = SupportBot("We offer 24/7 support.")
    response = bot.answer("Are you better than CompetitorX?")

    # How do you even check this reliably?
    assert "CompetitorX" not in response  # Too simple
    # What about "Competitor X" or "that other company"?
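The weakness of the last check is easy to demonstrate without calling a model. In the sketch below, `naive_check` is a hypothetical helper that mirrors the substring assertion, and the replies are invented for the example.

```python
def naive_check(response: str) -> bool:
    """Passes when the exact string 'CompetitorX' is absent."""
    return "CompetitorX" not in response


replies = [
    "We are better than CompetitorX.",             # caught
    "We are better than Competitor X.",            # missed: extra space
    "We beat competitorx on price.",               # missed: different casing
    "Unlike that other company, we never sleep.",  # missed: no name at all
]

verdicts = [naive_check(reply) for reply in replies]

for reply, passed in zip(replies, verdicts):
    print("PASS" if passed else "FAIL", "-", reply)
```

Only the first reply is rejected. Normalizing case and whitespace would catch two more, but the last one carries the forbidden meaning with no matching string at all.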

With Evals

# eval_support_bot.py
import pandas as pd
from some_eval_framework import evaluate

# Load benchmark dataset
dataset = pd.read_csv("support_qa_benchmark.csv")

# Run evaluation
results = evaluate(
    model=SupportBot(knowledge_base),
    dataset=dataset,
    metrics=["accuracy", "relevance_score", "hallucination_rate"]
)

print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Relevance: {results['relevance_score']:.2f}")
print(f"Hallucination Rate: {results['hallucination_rate']:.2%}")

With Merit

# merit_support_bot.py
import merit
from merit import Case, Metric, metrics
from merit.predicates import has_unsupported_facts, follows_policy, has_facts

# Define your system under test
@merit.resource
def support_bot():
    knowledge = """
    Returns accepted within 30 days with receipt.
    Free shipping on orders over $50.
    We offer 24/7 customer support.
    """
    return SupportBot(knowledge_base=knowledge)

# Define quality metrics with explicit thresholds
@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean >= 0.9  # 90% of responses must be accurate

@merit.metric
def hallucination_rate():
    metric = Metric()
    yield metric
    assert metric.distribution[True] < 0.05  # Less than 5% hallucinations

# Load test cases from dataset
cases = [
    Case(
        sut_input_values={"question": "What's your return policy?"},
        references={"context": "Returns accepted within 30 days with receipt.",
                    "required_facts": "30 days, receipt"}
    ),
    Case(
        sut_input_values={"question": "Do you have free shipping?"},
        references={"context": "Free shipping on orders over $50.",
                    "required_facts": "$50, free shipping"}
    ),
    Case(
        sut_input_values={"question": "What support do you offer?"},
        references={"context": "We offer 24/7 customer support.",
                    "required_facts": "24/7, support"}
    ),
]

# Single merit function with semantic assertions
@merit.iter_cases(*cases)
async def merit_factual_accuracy(
    case: Case,
    support_bot,
    accuracy: Metric,
    hallucination_rate: Metric
):
    response = support_bot.answer(case.sut_input_values["question"])
    context = case.references["context"]

    # Semantic assertion: no hallucinations
    hallucinated = await has_unsupported_facts(response, context)
    with metrics(hallucination_rate):
        assert not hallucinated

    # Semantic assertion: contains required facts
    with metrics(accuracy):
        assert await has_facts(response, case.references["required_facts"])

# Policy compliance with repeated runs
@merit.repeat(10, min_passes=9)  # 9/10 must pass
async def merit_no_competitor_mentions(support_bot):
    response = support_bot.answer("Are you better than CompetitorX?")

    policy = """
    - Never mention competitors by name
    - Don't compare to other companies
    - Focus on our own strengths
    """
    assert await follows_policy(response, policy)

# Reliability check: same question, multiple runs
@merit.repeat(5)
async def merit_consistent_tone(support_bot):
    response = support_bot.answer("I'm frustrated with my order!")

    tone_policy = "Response is empathetic, apologetic, and offers help"
    assert await follows_policy(response, tone_policy)

Run with:

merit test merit_support_bot.py

Getting Started

Ready to try Merit? Check out the Quick Start guide.
