Merit is intentionally “pytest-shaped”: you write plain Python, Merit discovers merit_* cases, injects dependencies by parameter name (like pytest fixtures), runs them, and reports results.
This page focuses on how to write merits and (importantly) where the behavior lives in the codebase, so you can trust what’s happening.
## TL;DR: if you know pytest, you already know 80%
| pytest | merit |
|---|---|
| `def test_*(): ...` | `def merit_*(): ...` |
| `@pytest.fixture` | `@merit.resource` |
| `@pytest.mark.parametrize(...)` | `@merit.parametrize(...)` |
| `@pytest.mark.skip(...)` | `@merit.tag.skip(...)` |
| `@pytest.mark.xfail(...)` | `@merit.tag.xfail(...)` |
| `@pytest.mark.repeat(3)` | `@merit.repeat(3)` |
## Merits
Merit follows pytest-style discovery patterns to find merit functions in your codebase:
**Files:** Merit discovers Python files whose names start with `merit_`:

```
merit_chatbot.py  ✓
merit_agent.py    ✓
helpers.py        ✗
```
**Functions:** Inside discovered files, Merit collects functions whose names start with `merit_`:

```python
def merit_weather_agent():  # ✓ Discovered
    pass

def helper_function():  # ✗ Not discovered
    pass
```
**Classes:** Classes whose names start with `Merit` are discovered, and their `merit_*` methods become merit cases:

```python
class MeritCustomerSupport:  # ✓ Discovered
    def merit_greeting(self):  # ✓ Collected
        pass

    def helper(self):  # ✗ Not collected
        pass
```
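The naming rules above are simple enough to sketch in plain Python. This is an illustration of the discovery contract, not Merit's actual implementation:

```python
# Sketch of Merit-style discovery rules (illustration only, not Merit's code).
import inspect
import types


def is_merit_file(filename: str) -> bool:
    """Only merit_*.py files are scanned."""
    return filename.startswith("merit_") and filename.endswith(".py")


def collect(module: types.ModuleType) -> list[str]:
    """Collect merit_* functions and merit_* methods of Merit* classes."""
    found = []
    for name, obj in vars(module).items():
        if inspect.isfunction(obj) and name.startswith("merit_"):
            found.append(name)
        elif inspect.isclass(obj) and name.startswith("Merit"):
            for mname, meth in vars(obj).items():
                if inspect.isfunction(meth) and mname.startswith("merit_"):
                    found.append(f"{name}.{mname}")
    return sorted(found)
```

Anything that doesn't match both the file rule and the function/class rule is silently ignored, which is why helpers can live next to merits in the same file.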
## Modifiers
Modifiers are decorators that change how a merit is collected and/or executed. Some modifiers expand a single merit into many cases (like parametrization or dataset iteration), while others change execution semantics (like repeating a case, or marking it as skipped/xfail). Apply them to merit_* functions or Merit* classes.
### `@merit.parametrize(names, values)` - Run the same merit with different inputs

```python
@merit.parametrize("model,temp", [("gpt-4", 0.7), ("claude-3", 0.5)])
def merit_model_response(model: str, temp: float, chatbot):
    response = chatbot.generate(model=model, temperature=temp)
    assert response
```
### `@merit.iter_cases(*cases, min_passes=len(cases))` - Iterate over `Case` objects from external sources, with an optional pass threshold

```python
import json

from merit import Case

cases = [Case(**item) for item in json.load(open("cases.json"))]

@merit.iter_cases(*cases)
def merit_from_dataset(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    if case.references:
        assert result == case.references["some_ref_value"]

@merit.iter_cases(*cases, min_passes=8)  # Requires at least 8 cases to pass
def merit_from_dataset_threshold(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    assert result == case.references["some_ref_value"]
```
### `@merit.iter_case_groups(*groups)` - Iterate over `CaseGroup` objects with per-group thresholds and group-level references

```python
from merit import Case, CaseGroup

geography = CaseGroup(
    name="geography",
    cases=[
        Case(sut_input_values={"prompt": "Capital of France?"}, references={"expected": "Paris"}),
        Case(sut_input_values={"prompt": "Capital of Germany?"}, references={"expected": "Berlin"}),
    ],
    min_passes=2,
)

music = CaseGroup(
    name="music",
    cases=[Case(sut_input_values={"prompt": "Best rock band?"}, references={"expected": "Metallica"})],
    min_passes=1,
)

@merit.iter_case_groups(geography, music)
def merit_chatbot(group: CaseGroup, case: Case, chatbot):
    response = chatbot(**case.sut_input_values)
    assert case.references["expected"] in response
```
### `@merit.tag(*tags)` - Organize and filter merits by tags

```python
@merit.tag("smoke", "fast")
def merit_health_check(api):
    assert api.health_check()

# Run: merit test --tag smoke
```
### `@merit.tag.skip(reason=...)` - Skip merits unconditionally

```python
@merit.tag.skip(reason="Feature not implemented")
def merit_upcoming():
    pass

@merit.tag.skip(reason="Requires API key")
def merit_external_api():
    pass
```
### `@merit.tag.xfail(reason=..., strict=False)` - Mark merits expected to fail

```python
@merit.tag.xfail(reason="Known bug #123")
def merit_known_issue():
    assert False  # Won't fail the suite

@merit.tag.xfail(reason="Should still fail", strict=True)
def merit_strict():
    pass  # If this passes, the suite FAILS
```
### `@merit.repeat(n, min_passes=n)` - Run a merit multiple times to check that AI behavior is consistent

```python
@merit.repeat(10)  # All 10 runs must pass
def merit_consistent(llm):
    assert "hello" in llm.generate("Say hello")

@merit.repeat(10, min_passes=8)  # At least 8 of 10 runs must pass
def merit_mostly_correct(llm):
    assert "hola" in llm.generate("Say hello in Spanish")
```
### `@merit.run_inline` - Opt out of default threaded execution for sync merits

By default, synchronous merits (`def merit_*`) run in a worker thread via `asyncio.to_thread(...)` so the event loop stays responsive. Use `@merit.run_inline` when a sync merit must run on the main event-loop thread (for example, with thread-sensitive libraries).

```python
import threading

def merit_default_threaded():
    # Runs in a worker thread by default.
    assert threading.current_thread() is not threading.main_thread()

@merit.run_inline
def merit_main_thread_only():
    # Runs inline on the event-loop thread.
    assert threading.current_thread() is threading.main_thread()
```
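The offloading described above is plain `asyncio.to_thread` behavior and can be observed without Merit at all; this stdlib-only snippet shows the blocking call landing on a worker thread:

```python
import asyncio
import threading
import time


def blocking_work() -> bool:
    # A blocking call; in a merit this would be your sync test body.
    time.sleep(0.05)
    return threading.current_thread() is not threading.main_thread()


async def main() -> bool:
    # The event loop stays responsive while the blocking call
    # runs in an executor worker thread.
    return await asyncio.to_thread(blocking_work)


print(asyncio.run(main()))  # True: the work ran off the main thread
```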
## Resources
Resources are the Merit equivalent of pytest fixtures: named, injectable dependencies that Merit resolves by parameter name.
### `@merit.resource(scope="case")` - Define injectable dependencies with lifecycle management

```python
@merit.resource
def database():
    conn = connect_db()
    yield conn    # Injected into merits
    conn.close()  # Automatic cleanup

@merit.resource(scope="session")
def ml_model():
    return load_model()  # Shared across the entire run

def merit_query(database, ml_model):
    # Both injected automatically by parameter name
    result = database.query("SELECT 1")
    prediction = ml_model.predict(result)
    assert prediction
```

Scopes: `"case"` (default), `"suite"`, `"session"`.
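The yield-based setup/teardown shown above behaves like a generator context manager. Here is a self-contained sketch of that lifecycle using only the stdlib, with a stand-in string in place of a real connection:

```python
from contextlib import contextmanager

events = []


@contextmanager
def database():
    events.append("setup")     # Runs before the merit uses the resource
    yield "fake-conn"          # The value injected into the merit
    events.append("teardown")  # Cleanup after the scope ends


with database() as conn:
    events.append(f"use:{conn}")

print(events)  # ['setup', 'use:fake-conn', 'teardown']
```

Scopes control how often this setup/teardown cycle runs: once per case, once per suite, or once per session.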
### `@merit.metric(scope="session")` - Define a metric as a scoped, injectable measurement object

Metrics behave like resources, but they are intended to accumulate measurements across many cases and then assert on aggregates at the end of their scope.

The most common pattern is to inject a `Metric` into your merits and use the `metrics(...)` context manager to record assertion outcomes into that metric.

```python
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8  # Checked after all data is collected

@merit.parametrize("input,expected", [("a", 1), ("b", 2)])
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)
    # Assertions inside metrics() are recorded as True/False
    with metrics(accuracy):
        assert result == expected
```

Scopes: `"session"` (default), `"suite"`, `"case"`.
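To make the aggregate check concrete, here is a minimal stand-in for a mean-accumulating metric. This is an assumption about the shape of `Metric.mean` (recorded pass/fail outcomes averaged at the end of the scope), not Merit's actual class:

```python
class MeanMetric:
    """Records pass/fail outcomes and exposes their mean."""

    def __init__(self):
        self.values: list[int] = []

    def record(self, passed: bool) -> None:
        self.values.append(1 if passed else 0)

    @property
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


metric = MeanMetric()
for result, expected in [(1, 1), (2, 2), (3, 4)]:
    metric.record(result == expected)

print(round(metric.mean, 3))  # 0.667: 2 of 3 checks passed
```

The final `assert metric.mean > 0.8` in the resource-style generator then turns this per-case data into a single suite-level verdict.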
### `@merit.sut` - Register a System Under Test (SUT) as an injectable callable

A SUT is the thing you are actually evaluating (an agent function, a pipeline, a classifier, a client wrapper, etc.). Declaring it with `@merit.sut` makes it injectable and traceable, so you can assert not only on the output but also on how it behaved internally (for example, which tools it called).

A SUT must be callable. Merit injects it into your merit function, and you call it like a normal function (or callable object).

```python
from demo_app import agent

@merit.sut
def weather_agent(prompt: str):
    return agent(prompt, tools=["get_weather"])

def merit_agent_uses_tools(weather_agent, trace_context):
    result = weather_agent("What's the weather?")

    # Access trace spans for assertions
    sut_spans = trace_context.get_sut_spans(name="weather_agent")
    assert sut_spans

    # To assert on tool calls, query LLM spans explicitly.
    # Note: attribute keys come from OpenTelemetry LLM instrumentations.
    tool_names = [
        s.attributes.get("llm.request.functions.0.name")
        for s in trace_context.get_llm_calls()
        if s.attributes
    ]
    assert "get_weather" in tool_names
```
## Custom Assert
Merit transforms Python’s `assert` keyword to provide richer testing capabilities for AI systems. When you run merit files through Merit’s runner, assertions behave differently from standard Python.
### Continue on failure (default behavior)
By default, Merit continues running remaining assertions even after one fails. This is different from standard Python, where the first failed assertion stops execution immediately.
```python
def merit_multiple_checks(classifier):
    result = classifier("test input")
    assert result.confidence > 0.8  # Fails
    assert result.label != ""       # Still runs
    assert result.valid             # Still runs
```
All three assertions will be evaluated and reported, even if the first one fails. This behavior lets you see all test failures in a single run rather than fixing them one at a time.
To stop on the first failure, pass the `--fail-fast` flag to `merit test`.
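For example (the filename matches the runner examples later on this page):

```shell
# Stop at the first failed assertion instead of evaluating the rest
merit test merit_my_tests.py --fail-fast
```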
### Integration with metrics
When assertions are evaluated inside a metrics() context manager, Merit automatically records whether each assertion passed or failed to the specified metrics:
```python
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8

@merit.parametrize("input,expected", [("a", 1), ("b", 2), ("c", 3)])
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)
    # Assertions inside metrics() are recorded as 1 (pass) or 0 (fail)
    with metrics(accuracy):
        assert result == expected
```
### Only works through Merit’s runner
Important: Merit’s assertion transformation only applies when you run files through Merit’s test runner:
```shell
merit test merit_my_tests.py   # ✓ Transformed assertions
python merit_my_tests.py       # ✗ Standard Python behavior
uv run merit_my_tests.py       # ✗ Standard Python behavior
```
### Assert messages
Assert messages work as expected and are captured in the AssertionResult:
```python
def merit_validation(response):
    assert response.status == 200, f"Expected 200, got {response.status}"
    assert "error" not in response.body, "Response contains error"
```
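The combination of continue-on-failure and captured messages can be emulated in plain Python. This is a stand-in for Merit's assertion transformation; the `AssertionResult` here is a hypothetical simplification, not Merit's actual type:

```python
from dataclasses import dataclass


@dataclass
class AssertionResult:
    # Hypothetical simplified shape: pass/fail plus the captured message.
    passed: bool
    message: str


def check(results: list, condition: bool, message: str = "") -> None:
    # Record the outcome instead of raising, so later checks still run.
    results.append(AssertionResult(bool(condition), message))


results: list[AssertionResult] = []
status, body = 500, "error: boom"
check(results, status == 200, f"Expected 200, got {status}")
check(results, "error" not in body, "Response contains error")

print([r.message for r in results if not r.passed])
# ['Expected 200, got 500', 'Response contains error']
```

Both failures are reported in one run, each with its message attached, which is the behavior the transformed `assert` gives you for free.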