Merit is intentionally “pytest-shaped”: you write plain Python, Merit discovers merit_* cases, injects dependencies by parameter name (like pytest fixtures), runs them, and reports results. This page focuses on how to write merits and (importantly) where the behavior lives in the codebase, so you can trust what’s happening.

TLDR - If you know pytest, you already know 80%

pytest                           merit
def test_*(): ...                def merit_*(): ...
@pytest.fixture                  @merit.resource
@pytest.mark.parametrize(...)    @merit.parametrize(...)
@pytest.mark.skip(...)           @merit.tag.skip(...)
@pytest.mark.xfail(...)          @merit.tag.xfail(...)
@pytest.mark.repeat(3)           @merit.repeat(3)

Merits

Merit follows pytest-style discovery patterns to find merit functions in your codebase.

Files: Merit discovers Python files whose names start with merit_:
  • merit_chatbot.py    ✓ Discovered
  • merit_agent.py      ✓ Discovered
  • helpers.py          ✗ Not discovered
Functions: Inside discovered files, Merit collects functions starting with merit_:
def merit_weather_agent():  # ✓ Discovered
    pass

def helper_function():      # ✗ Not discovered
    pass
Classes: Classes starting with Merit are discovered, and their merit_* methods become merit cases:
class MeritCustomerSupport:     # ✓ Discovered
    def merit_greeting(self):    # ✓ Collected
        pass

    def helper(self):            # ✗ Not collected
        pass
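The naming rules above can be sketched as plain predicates. This is an illustration of the convention, not Merit's actual collection code:

```python
def is_merit_file(filename: str) -> bool:
    # Python files whose names start with "merit_" are discovered.
    return filename.startswith("merit_") and filename.endswith(".py")


def is_merit_function(name: str) -> bool:
    # Functions whose names start with "merit_" become merit cases.
    return name.startswith("merit_")


def is_merit_class(name: str) -> bool:
    # Classes whose names start with "Merit" are discovered.
    return name.startswith("Merit")
```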

Modifiers

Modifiers are decorators that change how a merit is collected and/or executed. Some modifiers expand a single merit into many cases (parametrization, dataset iteration); others change execution semantics (repeating a case, or marking it skipped/xfail). Apply them to merit_* functions or Merit* classes.

@merit.parametrize(names, values) - Run the same merit with different inputs
@merit.parametrize("model,temp", [("gpt-4", 0.7), ("claude-3", 0.5)])
def merit_model_response(model: str, temp: float, chatbot):
    response = chatbot.generate(model=model, temperature=temp)
    assert response
@merit.iter_cases(*cases, min_passes=len(cases)) - Iterate over Case objects from external sources, optionally allowing pass thresholds
import json
from merit import Case

with open("cases.json") as f:
    cases = [Case(**item) for item in json.load(f)]

@merit.iter_cases(*cases)
def merit_from_dataset(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    if case.references:
        assert result == case.references["some_ref_value"]

@merit.iter_cases(*cases, min_passes=8)  # Requires at least 8 case passes
def merit_from_dataset_threshold(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    assert result == case.references["some_ref_value"]
@merit.iter_case_groups(*groups) - Iterate over CaseGroup objects with per-group thresholds and group-level references
from merit import Case, CaseGroup

geography = CaseGroup(
    name="geography",
    cases=[
        Case(sut_input_values={"prompt": "Capital of France?"}, references={"expected": "Paris"}),
        Case(sut_input_values={"prompt": "Capital of Germany?"}, references={"expected": "Berlin"}),
    ],
    min_passes=2,
)

music = CaseGroup(
    name="music",
    cases=[Case(sut_input_values={"prompt": "Best rock band?"}, references={"expected": "Metallica"})],
    min_passes=1,
)

@merit.iter_case_groups(geography, music)
def merit_chatbot(group: CaseGroup, case: Case, chatbot):
    response = chatbot(**case.sut_input_values)
    assert case.references["expected"] in response
@merit.tag(*tags) - Organize and filter merits by tags
@merit.tag("smoke", "fast")
def merit_health_check(api):
    assert api.health_check()

# Run: merit test --tag smoke
@merit.tag.skip(reason=...) - Skip merits unconditionally
@merit.tag.skip(reason="Feature not implemented")
def merit_upcoming():
    pass

@merit.tag.skip(reason="Requires API key")
def merit_external_api():
    pass
@merit.tag.xfail(reason=..., strict=False) - Mark merits expected to fail
@merit.tag.xfail(reason="Known bug #123")
def merit_known_issue():
    assert False  # Won't fail the suite

@merit.tag.xfail(reason="Should still fail", strict=True)
def merit_strict():
    pass  # If this passes, suite FAILS
@merit.repeat(n, min_passes=n) - Run merits multiple times to see if AI behavior is consistent
@merit.repeat(10)  # All 10 must pass
def merit_consistent(llm):
    assert "hello" in llm.generate("Say hello")

@merit.repeat(10, min_passes=8)  # 8 out of 10
def merit_mostly_correct(llm):
    assert "hola" in llm.generate("Say hello in Spanish")
@merit.run_inline - Opt out of default threaded execution for sync merits

By default, synchronous merits (def merit_*) run in a worker thread via asyncio.to_thread(...) so the event loop stays responsive. Use @merit.run_inline when a sync merit must run on the main event-loop thread (for example, with thread-sensitive libraries).
import threading

def merit_default_threaded():
    # Runs in a worker thread by default.
    assert threading.current_thread() is not threading.main_thread()

@merit.run_inline
def merit_main_thread_only():
    # Runs inline on the event-loop thread.
    assert threading.current_thread() is threading.main_thread()
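The threading behavior described above rests on standard asyncio machinery; a self-contained sketch of the difference between the two execution modes:

```python
import asyncio
import threading


def on_main_thread() -> bool:
    # True when running on the process's main (event-loop) thread.
    return threading.current_thread() is threading.main_thread()


async def main():
    # Default mode: the sync case is dispatched to a worker thread.
    threaded = await asyncio.to_thread(on_main_thread)
    # Inline mode: the case runs directly on the event-loop thread.
    inline = on_main_thread()
    return threaded, inline


threaded, inline = asyncio.run(main())
# threaded is False (worker thread); inline is True (main thread)
```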

Resources

Resources are the Merit equivalent of pytest fixtures: named, injectable dependencies that Merit resolves by parameter name.

@merit.resource(scope="case") - Define injectable dependencies with lifecycle management
@merit.resource
def database():
    conn = connect_db()
    yield conn  # Injected into merits
    conn.close()  # Automatic cleanup

@merit.resource(scope="session")
def ml_model():
    return load_model()  # Shared across entire run

def merit_query(database, ml_model):
    # Both injected automatically
    result = database.query("SELECT 1")
    prediction = ml_model.predict(result)
    assert prediction
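The yield-based resource above follows plain Python generator semantics: code before the yield is setup, the yielded value is what gets injected, and code after the yield is teardown. A dependency-free sketch of that lifecycle:

```python
events = []


def database():
    events.append("connect")   # setup
    yield "conn"               # the value injected into merits
    events.append("close")     # teardown


gen = database()
resource = next(gen)           # run setup, receive the resource
events.append(f"use:{resource}")
try:
    next(gen)                  # resume past the yield to run teardown
except StopIteration:
    pass
```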
Scopes: "case" (default), "suite", "session".

@merit.metric(scope="session") - Define a metric as a scoped, injectable measurement object

Metrics behave like resources, but they’re intended to accumulate measurements across many cases and then assert on aggregates at the end of their scope. The most common pattern is to inject a Metric into your merits and use the metrics(...) context manager to record assertion outcomes into that metric.
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8  # Check after all data collected

@merit.parametrize("input,expected", [("a", 1), ("b", 2)])
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)

    # Assertions inside metrics() are recorded as True/False
    with metrics(accuracy):
        assert result == expected
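Conceptually, the accuracy metric above just accumulates pass/fail outcomes and exposes their mean. A minimal stand-in makes the aggregate check concrete (MiniMetric is a hypothetical sketch, not Merit's Metric class):

```python
class MiniMetric:
    """Hypothetical stand-in for a pass/fail-accumulating metric."""

    def __init__(self):
        self.values = []

    def record(self, passed: bool) -> None:
        # Assertion outcomes are recorded as 1 (pass) or 0 (fail).
        self.values.append(1 if passed else 0)

    @property
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


metric = MiniMetric()
for got, want in [(1, 1), (2, 2), (3, 4)]:
    metric.record(got == want)
# metric.mean is 2/3, so a teardown assert of mean > 0.8 would fail
```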
Scopes: "session" (default), "suite", "case".

@merit.sut - Register a System Under Test (SUT) as an injectable callable

A SUT is the thing you’re actually evaluating (an agent function, a pipeline, a classifier, a client wrapper, etc.). Declaring it with @merit.sut makes it injectable and traceable, so you can assert not only on the output but also on how it behaved internally (for example, tool calls).

A SUT must be callable. Merit injects it into your merit function, and you call it like a normal function (or callable object).
from demo_app import agent

@merit.sut
def weather_agent(prompt: str):
    return agent(prompt, tools=["get_weather"])

def merit_agent_uses_tools(weather_agent, trace_context):
    result = weather_agent("What's the weather?")

    # Access trace spans for assertions
    sut_spans = trace_context.get_sut_spans(name="weather_agent")
    assert sut_spans

    # If you want to assert on tool calls, query LLM spans explicitly.
    # Note: attribute keys come from OpenTelemetry LLM instrumentations.
    tool_names = [
        s.attributes.get("llm.request.functions.0.name")
        for s in trace_context.get_llm_calls()
        if s.attributes
    ]
    assert "get_weather" in tool_names

Custom Assert

Merit transforms Python’s assert keyword to provide richer testing capabilities for AI systems. When you run merit files through Merit’s runner, assertions behave differently than standard Python.

Continue on failure (default behavior)

By default, Merit continues running remaining assertions even after one fails. This is different from standard Python, where the first failed assertion stops execution immediately.
def merit_multiple_checks(classifier):
    result = classifier("test input")

    assert result.confidence > 0.8  # Fails
    assert result.label != ""        # Still runs
    assert result.valid              # Still runs
All three assertions are evaluated and reported, even if the first one fails. This lets you see every failure in a single run rather than fixing them one at a time.

To stop on the first failure, use the --fail-fast CLI flag:
merit test --fail-fast
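Conceptually, continue-on-failure means each assert is evaluated and its outcome recorded rather than raising immediately. A plain-Python sketch of that idea (illustrative only; Merit achieves this through assert transformation):

```python
def run_checks(result: dict) -> list[tuple[str, bool]]:
    # Evaluate every check and collect outcomes instead of
    # stopping at the first failure.
    return [
        ("confidence", result["confidence"] > 0.8),
        ("label", result["label"] != ""),
        ("valid", bool(result["valid"])),
    ]


outcomes = run_checks({"confidence": 0.5, "label": "cat", "valid": True})
# The first check fails, but all three outcomes are reported:
# [("confidence", False), ("label", True), ("valid", True)]
```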

Integration with metrics

When assertions are evaluated inside a metrics() context manager, Merit automatically records whether each assertion passed or failed to the specified metrics:
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8

@merit.parametrize("input,expected", [("a", 1), ("b", 2), ("c", 3)])
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)

    # Assertions inside metrics() are recorded as 1 (pass) or 0 (fail)
    with metrics(accuracy):
        assert result == expected

Only works through Merit’s runner

Important: Merit’s assertion transformation only applies when you run files through Merit’s test runner:
merit test merit_my_tests.py    # ✓ Transformed assertions
python merit_my_tests.py         # ✗ Standard Python behavior
uv run merit_my_tests.py         # ✗ Standard Python behavior

Assert messages

Assert messages work as expected and are captured in the AssertionResult:
def merit_validation(response):
    assert response.status == 200, f"Expected 200, got {response.status}"
    assert "error" not in response.body, "Response contains error"
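Under standard Python semantics, the message is the payload of the raised AssertionError, which is how a runner can capture it into a result record. A sketch of that capture (the helper below is hypothetical, not Merit's AssertionResult API):

```python
def capture(status: int):
    # Catch the AssertionError and keep its message, the way a
    # runner might populate an assertion result record.
    try:
        assert status == 200, f"Expected 200, got {status}"
        return (True, None)
    except AssertionError as exc:
        return (False, str(exc))


outcome = capture(500)
# outcome == (False, "Expected 200, got 500")
```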