Merits are the core building blocks of your AI system evaluations. Like pytest discovers test_* functions, Merit discovers and runs merit_* functions - each one checking how well your AI system performs using Merit’s APIs and components.
The simplest merit function is a function whose name starts with merit_. Merit automatically discovers and executes these functions.
```python
import merit
from merit.predicates import has_unsupported_facts

# Define your AI system (or import from your codebase)
def chatbot(prompt: str) -> str:
    return call_llm(prompt)

# Merit function: discovered and run automatically
async def merit_chatbot_no_hallucinations():
    context = "Our store hours are 9 AM to 6 PM Monday-Saturday."
    response = chatbot("When are you open?")
    # Use semantic predicates to check output quality
    assert not await has_unsupported_facts(response, context)
```
Run all merit functions in your project:
```bash
merit test
```
Merit discovers all merit_* functions, executes them, and generates a report - just like pytest but for AI system evaluation.
AI systems are inherently non-deterministic, making it essential to test them across multiple scenarios and runs. Merit provides three distinct approaches to iterate the same merit definition, each optimized for different use cases: quick parametrization for a few variations, structured cases for large datasets, and repeated execution to assess consistency.
When you need to run the same merit with a small set of different inputs, @merit.parametrize offers the most concise syntax. It’s ideal for testing a handful of variations without the overhead of defining structured case objects.
```python
import merit

@merit.parametrize("city,state", [
    ("Boston", "Massachusetts"),
    ("Austin", "Texas"),
])
def merit_geography_bot(city: str, state: str, geography_bot):
    result = geography_bot.ask(f"What state is {city} in?")
    assert state in result
```
When evaluating against tens or hundreds of examples, or when you need consistent typing and structure, use @merit.iter_cases with Case objects. This approach provides type safety through Pydantic validation and enables loading test cases from external sources like JSON files or databases.
```python
import json

import merit
from merit import Case

# Load merit cases from file
with open("merit_cases.json") as f:
    cases = [Case(**item) for item in json.load(f)]

@merit.iter_cases(*cases)
def merit_from_dataset(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    expected = case.references["expected_label"]
    assert result == expected

@merit.iter_cases(*cases, min_passes=8)
def merit_from_dataset_pass_at_k(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    expected = case.references["expected_label"]
    assert result == expected
```
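The loader shown here assumes each JSON entry maps directly onto a `Case`'s fields: `sut_input_values` (the inputs fed to the system under test) and `references` (the expected values the merit asserts against). A minimal sketch of that file shape, with hypothetical classifier inputs:

```python
import json

# Hypothetical contents of merit_cases.json; the field names come from the
# Case usage in the examples on this page, the inputs are illustrative.
entries = [
    {"sut_input_values": {"text": "Great service!"},
     "references": {"expected_label": "positive"}},
    {"sut_input_values": {"text": "Never again."},
     "references": {"expected_label": "negative"}},
]

# Round-trip through JSON, as the loading example does
loaded = json.loads(json.dumps(entries))
print(all({"sut_input_values", "references"} <= set(item) for item in loaded))  # True
```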
min_passes on @merit.iter_cases works the same way as on @merit.repeat: by default every case execution must pass, but you can set a lower threshold when evaluating large or noisy datasets.
When your cases naturally fall into groups (e.g. topics, difficulty tiers, languages), use @merit.iter_case_groups with CaseGroup objects. Each group carries its own group-level references and a min_passes threshold, giving you hierarchical reporting (run → groups → cases) and per-group pass/fail semantics.
```python
import merit
from merit import Case, CaseGroup

geography = CaseGroup(
    name="geography",
    cases=[
        Case(sut_input_values={"prompt": "Capital of France?"}, references={"expected": "Paris"}),
        Case(sut_input_values={"prompt": "Capital of Germany?"}, references={"expected": "Berlin"}),
    ],
    min_passes=2,  # strict: both must pass
)

music = CaseGroup(
    name="music",
    cases=[
        Case(sut_input_values={"prompt": "Best rock band?"}, references={"expected": "Metallica"}),
        Case(sut_input_values={"prompt": "Best pop artist?"}, references={"expected": "Lady Gaga"}),
    ],
    min_passes=1,  # tolerant: at least one must pass
)

@merit.iter_case_groups(geography, music)
def merit_chatbot(group: CaseGroup, case: Case, chatbot):
    response = chatbot(**case.sut_input_values)
    assert case.references["expected"] in response
```
The merit passes only if every group meets its own min_passes. Inside the merit function, group and case are injected automatically — use group.references for group-level data and case.references for case-level data.
Use CaseGroup when you need per-group thresholds or group-level metadata. If all cases are flat and share the same threshold, stick with @merit.iter_cases(*cases, min_passes=k).
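The per-group pass/fail rule described above can be sketched in plain Python. This illustrates the semantics only, not Merit's implementation:

```python
def group_passes(case_outcomes, min_passes):
    # A group passes when at least min_passes of its cases pass.
    return sum(case_outcomes) >= min_passes

def merit_passes(groups):
    # The merit passes only if every group meets its own threshold.
    return all(group_passes(outcomes, k) for outcomes, k in groups)

# geography (min_passes=2) with both cases passing,
# music (min_passes=1) with one of two cases passing:
print(merit_passes([([True, True], 2), ([True, False], 1)]))   # True
# geography falls below its threshold, so the whole merit fails:
print(merit_passes([([True, False], 2), ([True, False], 1)]))  # False
```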
AI systems can produce different outputs for identical inputs due to their non-deterministic nature. Use @merit.repeat to run the same merit multiple times with the same data, measuring consistency and reliability of your AI component.
```python
import merit

@merit.repeat(count=5)
def merit_chatbot_consistent_greeting(chatbot):
    """Run 5 times - all must pass."""
    response = chatbot.ask("Hello")
    assert "hi" in response.lower() or "hello" in response.lower()

@merit.repeat(count=10, min_passes=8)
def merit_sentiment_mostly_accurate(classifier):
    """Run 10 times - at least 8 must pass."""
    result = classifier("This product is amazing!")
    assert result.sentiment == "positive"
```
The min_passes parameter is sometimes referred to as “pass@k” in the AI evaluation community. For example, @merit.repeat(count=10, min_passes=8) checks if your system achieves the desired behavior in at least 8 out of 10 attempts (pass@8/10).
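The threshold check amounts to counting passing runs. A plain-Python sketch of the rule (meets_min_passes is an illustrative helper, not part of Merit's API):

```python
def meets_min_passes(outcomes, min_passes=None):
    """Return True if enough runs passed.

    outcomes: one boolean per execution.
    min_passes: required number of passing runs; defaults to all of them.
    """
    required = len(outcomes) if min_passes is None else min_passes
    return sum(outcomes) >= required

# 8 passes out of 10 meets a min_passes=8 threshold (pass@8/10)
print(meets_min_passes([True] * 8 + [False] * 2, min_passes=8))  # True
# Without min_passes, every run must pass
print(meets_min_passes([True] * 8 + [False] * 2))  # False
```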
Call merit.skip() inside a merit function to skip it at runtime, for example when a required dependency is missing:

```python
import os

import merit

def merit_conditional_skip():
    if not os.getenv("API_KEY"):
        merit.skip("API_KEY not configured")
    # Merit continues if the skip condition is not met
    assert True
```
You can also use merit.skip() inside resources to conditionally skip merits when dependencies aren’t available. This centralizes skip logic where the resource is defined rather than in every merit that uses it.
Mark merits expected to fail with @merit.tag.xfail:
Copy
@merit.tag.xfail(reason="Known bug #123")def merit_known_issue(): # This failure won't fail the merit suite assert False@merit.tag.xfail(reason="Model not accurate yet", strict=True)def merit_strict_xfail(): # If this passes, the merit suite FAILS (unexpected pass) pass
Use strict=True when the merit passing would be surprising and worth investigating.
Merit’s dependency injection system enables better resource management and merit isolation. Inject dependencies as parameters instead of importing them globally. Don’t do this:
```python
# merit_agent.py
from app import agent  # Global import

def merit_weather_queries():
    # Using global - can't control lifecycle or swap implementations
    response = agent("What's the weather?")
    assert response
```
Do this:
```python
# merit_agent.py
import merit
from app import agent as production_agent

@merit.resource
def agent():
    """Evaluation instance of agent with controlled lifecycle."""
    instance = production_agent.create(env="test")
    yield instance
    instance.cleanup()

def merit_weather_queries(agent):
    # Injected - Merit manages lifecycle and can track usage
    response = agent("What's the weather?")
    assert response
```
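The yield-based resource follows the same setup/teardown shape as a Python context manager: code before the yield runs before the merit, code after it runs afterwards. A standalone sketch of that lifecycle (the names here are illustrative, not Merit APIs):

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_agent():
    # Setup runs before the body; cleanup runs after, even on failure.
    events.append("setup")
    try:
        yield "agent-instance"
    finally:
        events.append("cleanup")

with managed_agent() as agent:
    events.append(f"use {agent}")

print(events)  # ['setup', 'use agent-instance', 'cleanup']
```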