Merit is intentionally “pytest-shaped”: you write plain Python, Merit discovers merit_* cases, injects dependencies by parameter name (like pytest fixtures), runs them, and reports results. This page focuses on how to write merits and (importantly) where the behavior lives in the codebase, so you can trust what’s happening.
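For orientation, here is a minimal sketch (the config resource and its contents are illustrative, not part of Merit's API):

```python
import merit

@merit.resource
def config():
    # Illustrative resource: any merit naming a "config" parameter receives this value.
    yield {"model": "demo"}

def merit_config_has_model(config):
    # Discovered because the function name starts with merit_;
    # "config" is injected by parameter name, like a pytest fixture.
    assert "model" in config
```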
Modifiers are decorators that change how a merit is collected and/or executed. Some modifiers expand a single merit into many cases (like parametrization or dataset iteration), while others change execution semantics (like repeating a case, or marking it as skipped/xfail). Apply them to merit_* functions or Merit* classes.
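For example, @merit.parametrize (used in the metric examples later on this page) expands a single merit into one collected case per argument tuple. A minimal sketch, with illustrative parameter names:

```python
import merit

@merit.parametrize("text,expected_len", [("ab", 2), ("abc", 3)])
def merit_text_length(text, expected_len):
    # Collected as two cases, one per (text, expected_len) tuple.
    assert len(text) == expected_len
```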
@merit.iter_cases(*cases, min_passes=len(cases)) - Iterate over Case objects from external sources, with an optional pass threshold (min_passes)
```python
import json

import merit
from merit import Case

cases = [Case(**item) for item in json.load(open("cases.json"))]

@merit.iter_cases(*cases)
def merit_from_dataset(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    if case.references:
        assert result == case.references["some_ref_value"]

@merit.iter_cases(*cases, min_passes=8)  # Requires at least 8 case passes
def merit_from_dataset_threshold(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    assert result == case.references["some_ref_value"]
```
@merit.iter_case_groups(*groups) - Iterate over CaseGroup objects with per-group thresholds and group-level references
```python
import merit
from merit import Case, CaseGroup

geography = CaseGroup(
    name="geography",
    cases=[
        Case(sut_input_values={"prompt": "Capital of France?"}, references={"expected": "Paris"}),
        Case(sut_input_values={"prompt": "Capital of Germany?"}, references={"expected": "Berlin"}),
    ],
    min_passes=2,
)

music = CaseGroup(
    name="music",
    cases=[Case(sut_input_values={"prompt": "Best rock band?"}, references={"expected": "Metallica"})],
    min_passes=1,
)

@merit.iter_case_groups(geography, music)
def merit_chatbot(group: CaseGroup, case: Case, chatbot):
    response = chatbot(**case.sut_input_values)
    assert case.references["expected"] in response
```
@merit.tag(*tags) - Organize and filter merits by tags
@merit.tag("smoke", "fast")def merit_health_check(api): assert api.health_check()# Run: merit test --tag smoke
@merit.tag.skip(reason=...) - Skip merits with a documented reason

```python
@merit.tag.skip(reason="Feature not implemented")
def merit_upcoming():
    pass

@merit.tag.skip(reason="Requires API key")
def merit_external_api():
    pass
```
@merit.tag.xfail(reason=..., strict=False) - Mark merits expected to fail
```python
@merit.tag.xfail(reason="Known bug #123")
def merit_known_issue():
    assert False  # Won't fail the suite

@merit.tag.xfail(reason="Should still fail", strict=True)
def merit_strict():
    pass  # If this passes, the suite FAILS
```
@merit.repeat(n, min_passes=n) - Run merits multiple times to see if AI behavior is consistent
```python
@merit.repeat(10)  # All 10 must pass
def merit_consistent(llm):
    assert "hello" in llm.generate("Say hello")

@merit.repeat(10, min_passes=8)  # 8 out of 10 must pass
def merit_mostly_correct(llm):
    assert "hola" in llm.generate("Say hello in Spanish")
```
@merit.run_inline - Opt out of default threaded execution for sync merits

By default, synchronous merits (def merit_*) run in a worker thread via asyncio.to_thread(...) so the event loop stays responsive. Use @merit.run_inline when a sync merit must run on the main event-loop thread (for example, with thread-sensitive libraries).
```python
import threading

import merit

def merit_default_threaded():
    # Runs in a worker thread by default.
    assert threading.current_thread() is not threading.main_thread()

@merit.run_inline
def merit_main_thread_only():
    # Runs inline on the event-loop thread.
    assert threading.current_thread() is threading.main_thread()
```
@merit.resource(scope="case") - Define injectable dependencies with lifecycle management
```python
@merit.resource
def database():
    conn = connect_db()
    yield conn    # Injected into merits
    conn.close()  # Automatic cleanup

@merit.resource(scope="session")
def ml_model():
    return load_model()  # Shared across the entire run

def merit_query(database, ml_model):
    # Both injected automatically
    result = database.query("SELECT 1")
    prediction = ml_model.predict(result)
    assert prediction
```
Scopes: "case" (default), "suite", "session".@merit.metric(scope="session") - Define a metric as a scoped, injectable measurement objectMetrics behave like resources, but they’re intended to accumulate measurements across many cases and then assert on aggregates at the end of their scope.The most common pattern is to inject a Metric into your merits and use the metrics(...) context manager to record assertion outcomes into that metric.
```python
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8  # Checked after all data is collected

@merit.parametrize("input,expected", [("a", 1), ("b", 2)])
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)
    # Assertions inside metrics() are recorded as True/False
    with metrics(accuracy):
        assert result == expected
```
Scopes: "session" (default), "suite", "case".@merit.sut - Register a System Under Test (SUT) as an injectable callableA SUT is the thing you’re actually evaluating (an agent function, a pipeline, a classifier, a client wrapper, etc.). Declaring it with @merit.sut makes it injectable and traceable, so you can assert not only on the output, but also on how it behaved internally (for example: tool calls).
A SUT must be callable. Merit will inject it into your merit function and you will call it like a normal function (or callable object).
```python
import merit
from demo_app import agent

@merit.sut
def weather_agent(prompt: str):
    return agent(prompt, tools=["get_weather"])

def merit_agent_uses_tools(weather_agent, trace_context):
    result = weather_agent("What's the weather?")

    # Access trace spans for assertions
    sut_spans = trace_context.get_sut_spans(name="weather_agent")
    assert sut_spans

    # If you want to assert on tool calls, query LLM spans explicitly.
    # Note: attribute keys come from OpenTelemetry LLM instrumentations.
    tool_names = [
        s.attributes.get("llm.request.functions.0.name")
        for s in trace_context.get_llm_calls()
        if s.attributes
    ]
    assert "get_weather" in tool_names
```
Merit transforms Python’s assert keyword to provide richer testing capabilities for AI systems. When you run merit files through Merit’s runner, assertions behave differently than they do in standard Python.
By default, Merit continues running remaining assertions even after one fails. This is different from standard Python, where the first failed assertion stops execution immediately.
```python
def merit_multiple_checks(classifier):
    result = classifier("test input")
    assert result.confidence > 0.8  # Fails
    assert result.label != ""       # Still runs
    assert result.valid             # Still runs
```
All three assertions will be evaluated and reported, even if the first one fails. This behavior lets you see all test failures in a single run rather than fixing them one at a time.

To stop on the first failure, use the --fail-fast CLI flag:
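```bash
# File name and flag placement are illustrative.
merit test merit_my_tests.py --fail-fast
```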
When assertions are evaluated inside a metrics() context manager, Merit automatically records each assertion's pass/fail outcome to the specified metrics:
```python
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8

@merit.parametrize("input,expected", [("a", 1), ("b", 2), ("c", 3)])
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)
    # Assertions inside metrics() are recorded as 1 (pass) or 0 (fail)
    with metrics(accuracy):
        assert result == expected
```
Important: Merit’s assertion transformation only applies when you run files through Merit’s test runner:
```bash
merit test merit_my_tests.py   # ✓ Transformed assertions
python merit_my_tests.py       # ✗ Standard Python behavior
uv run merit_my_tests.py       # ✗ Standard Python behavior
```