Merits are the core building blocks of your AI system evaluations. Like pytest discovers test_* functions, Merit discovers and runs merit_* functions - each one checking how well your AI system performs using Merit’s APIs and components.

Basic Usage

The simplest merit function is a function whose name starts with merit_. Merit automatically discovers and executes these functions.
import merit
from merit.predicates import has_unsupported_facts

# Define your AI system (or import from your codebase)
def chatbot(prompt: str) -> str:
    return call_llm(prompt)  # call_llm is a stand-in for your model call

# Merit function: discovered and run automatically
async def merit_chatbot_no_hallucinations():
    context = "Our store hours are 9 AM to 6 PM Monday-Saturday."
    response = chatbot("When are you open?")

    # Use semantic predicates to check output quality
    assert not await has_unsupported_facts(response, context)
Run all merit functions in your project:
merit test
Merit discovers all merit_* functions, executes them, and generates a report - just like pytest but for AI system evaluation.

Merit Discovery

Merit follows pytest’s discovery patterns, finding merits in files, functions, and classes that follow naming conventions.

Files

Merit discovers Python files starting with merit_:
project/
├── merit_chatbot.py    ✓ Discovered
├── merit_agent.py      ✓ Discovered
├── tests/
│   ├── merit_rag.py    ✓ Discovered
│   └── helpers.py      ✗ Not discovered
└── src/
    └── agent.py        ✗ Not discovered

Functions

Within discovered files, Merit collects functions starting with merit_:
# merit_agents.py

def merit_weather_agent():     # ✓ Collected
    pass

def merit_calculator():        # ✓ Collected
    pass

def helper_function():         # ✗ Not collected (no merit_ prefix)
    pass

def test_something():          # ✗ Not collected (pytest convention)
    pass

Classes

Classes starting with Merit are discovered, and their merit_* methods become merit cases:
# merit_agents.py

class MeritCustomerSupport:    # ✓ Class discovered

    def merit_greeting(self):   # ✓ Method collected
        pass

    def merit_farewell(self):   # ✓ Method collected
        pass

    def helper(self):           # ✗ Not collected (no merit_ prefix)
        pass

class TestSomething:           # ✗ Not discovered (pytest convention)
    pass

Dependency Injection

Merit automatically injects dependencies by matching parameter names to registered resources, SUTs (systems under test), and metrics.
import merit
from merit import Metric, metrics

# Define resources (dependencies)
@merit.resource
def chatbot():
    return ChatBot(model="gpt-4")

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8

# Merit function with injected dependencies
def merit_chatbot_accuracy(chatbot, accuracy: Metric):
    # chatbot and accuracy automatically injected by name

    test_cases = [
        ("What's 2+2?", "4"),
        ("Capital of France?", "Paris"),
    ]

    for question, expected in test_cases:
        answer = chatbot.ask(question)
        with metrics(accuracy):
            assert expected.lower() in answer.lower()
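To picture how name-based injection can work, here is a plain-Python sketch (assumed mechanics, not Merit's actual implementation) that uses inspect.signature to match a merit function's parameter names against a registry of providers:

```python
import inspect

# Illustrative registry of providers, looked up by parameter name.
# The names and values here are made up for the sketch.
registry = {
    "chatbot": lambda: "chatbot-instance",
    "accuracy": lambda: "accuracy-metric",
}

def run_with_injection(merit_fn):
    # Resolve each declared parameter by name against the registry
    kwargs = {
        name: registry[name]()
        for name in inspect.signature(merit_fn).parameters
        if name in registry
    }
    return merit_fn(**kwargs)

def merit_example(chatbot, accuracy):
    return (chatbot, accuracy)

print(run_with_injection(merit_example))  # ('chatbot-instance', 'accuracy-metric')
```

This is why parameter names matter: the injector has nothing else to match on.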

Async Support

Merit automatically detects and runs async functions:
from merit.predicates import follows_policy

# Sync merit function
def merit_sync_test(calculator):
    result = calculator.add(2, 2)
    assert result == 4

# Async merit function - automatically detected
async def merit_async_test(chatbot):
    response = await chatbot.ask_async("Hello")

    # Many predicates are async
    policy = "Agent is friendly and professional"
    assert await follows_policy(response, policy)
Resources can be async too:
@merit.resource
async def async_database():
    conn = await connect_async()
    yield conn
    await conn.close()

async def merit_query(async_database):
    result = await async_database.query("SELECT 1")
    assert result
Merit handles the async execution automatically - no asyncio.run() needed.
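One way to picture that detection, as a hedged plain-Python sketch (not Merit's real runner): check inspect.iscoroutinefunction and drive the event loop on the caller's behalf.

```python
import asyncio
import inspect

# Hypothetical runner sketch: detect coroutine functions and run them on an
# event loop, so merit authors never call asyncio.run() themselves.
def run_merit(fn):
    if inspect.iscoroutinefunction(fn):
        return asyncio.run(fn())
    return fn()

def merit_sync():
    return "sync ok"

async def merit_async():
    await asyncio.sleep(0)  # stand-in for an awaited model call
    return "async ok"

print(run_merit(merit_sync))   # sync ok
print(run_merit(merit_async))  # async ok
```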

Iterate Merits

AI systems are inherently non-deterministic, making it essential to evaluate them across multiple scenarios and runs. Merit provides several ways to iterate the same merit definition, each optimized for a different use case: quick parametrization for a few variations, structured cases (flat or grouped) for large datasets, and repeated execution to assess consistency.

Iterate with different parameters

When you need to run the same merit with a small set of different inputs, @merit.parametrize offers the most concise syntax. It’s ideal for testing a handful of variations without the overhead of defining structured case objects.
import merit

@merit.parametrize("city,state", [
    ("Boston", "Massachusetts"),
    ("Austin", "Texas"),
])
def merit_geography_bot(city: str, state: str, geography_bot):
    result = geography_bot.ask(f"What state is {city} in?")
    assert state in result
This creates 2 merit cases:
  • merit_geography_bot(city='Boston', state='Massachusetts')
  • merit_geography_bot(city='Austin', state='Texas')
Multiple parameters can be stacked:
@merit.parametrize("model", ["gpt-4", "claude-3"])
@merit.parametrize("temperature", [0.0, 0.7, 1.0])
def merit_model_combinations(model: str, temperature: float):
    # Runs 6 times: 2 models × 3 temperatures
    pass
Parametrization works best when you have a small number of input variations (typically fewer than 10) and don’t require strict type definitions.
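The stacking behavior shown above can be pictured as a Cartesian product; this plain-Python sketch (illustrative only, not Merit's expansion code) reproduces the 2 × 3 = 6 case count:

```python
from itertools import product

models = ["gpt-4", "claude-3"]
temperatures = [0.0, 0.7, 1.0]

# Stacked parametrize decorators behave like a Cartesian product of values
cases = list(product(models, temperatures))
print(len(cases))  # 6
```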

Iterate with different cases

When evaluating against tens or hundreds of examples, or when you need consistent typing and structure, use @merit.iter_cases with Case objects. This approach provides type safety through Pydantic validation and enables loading test cases from external sources like JSON files or databases.
from merit import Case
import json

# Load merit cases from file
with open("merit_cases.json") as f:
    cases = [Case(**item) for item in json.load(f)]

@merit.iter_cases(*cases)
def merit_from_dataset(case: Case, classifier):
    result = classifier(**case.sut_input_values)

    expected = case.references["expected_label"]
    assert result == expected

@merit.iter_cases(*cases, min_passes=8)
def merit_from_dataset_pass_at_k(case: Case, classifier):
    result = classifier(**case.sut_input_values)

    expected = case.references["expected_label"]
    assert result == expected
min_passes on @merit.iter_cases works the same way it does on @merit.repeat: by default every case execution must pass, but you can set a lower threshold when evaluating large or noisy datasets.
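For reference, a merit_cases.json compatible with the loader above might look like the following. The item keys mirror the Case fields used in the merits (sut_input_values, references); the concrete inputs and labels are made up for illustration:

```python
import json
import os
import tempfile

# Hypothetical merit_cases.json contents: one object per Case(**item)
items = [
    {"sut_input_values": {"text": "Great service!"},
     "references": {"expected_label": "positive"}},
    {"sut_input_values": {"text": "Terrible experience."},
     "references": {"expected_label": "negative"}},
]

path = os.path.join(tempfile.mkdtemp(), "merit_cases.json")
with open(path, "w") as f:
    json.dump(items, f, indent=2)

# The loading pattern from above reads the same structure back
with open(path) as f:
    cases = json.load(f)
print(cases[0]["references"]["expected_label"])  # positive
```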

Iterate with grouped cases

When your cases naturally fall into groups (e.g. topics, difficulty tiers, languages), use @merit.iter_case_groups with CaseGroup objects. Each group carries its own group-level references and a min_passes threshold, giving you hierarchical reporting (run → groups → cases) and per-group pass/fail semantics.
import merit
from merit import Case, CaseGroup

geography = CaseGroup(
    name="geography",
    cases=[
        Case(sut_input_values={"prompt": "Capital of France?"}, references={"expected": "Paris"}),
        Case(sut_input_values={"prompt": "Capital of Germany?"}, references={"expected": "Berlin"}),
    ],
    min_passes=2,  # strict: both must pass
)

music = CaseGroup(
    name="music",
    cases=[
        Case(sut_input_values={"prompt": "Best rock band?"}, references={"expected": "Metallica"}),
        Case(sut_input_values={"prompt": "Best pop artist?"}, references={"expected": "Lady Gaga"}),
    ],
    min_passes=1,  # tolerant: at least one must pass
)


@merit.iter_case_groups(geography, music)
def merit_chatbot(group: CaseGroup, case: Case, chatbot):
    response = chatbot(**case.sut_input_values)
    assert case.references["expected"] in response
The merit passes only if every group meets its own min_passes. Inside the merit function, group and case are injected automatically — use group.references for group-level data and case.references for case-level data.
Use CaseGroup when you need per-group thresholds or group-level metadata. If all cases are flat and share the same threshold, stick with @merit.iter_cases(*cases, min_passes=k).
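The per-group semantics described above can be sketched in plain Python (assumed semantics, not Merit's internals): count passing cases per group, compare against that group's min_passes, and require every group to pass.

```python
# Each group passes when its passing-case count meets its own min_passes;
# the merit as a whole passes only if every group does.
groups = {
    "geography": {"results": [True, True], "min_passes": 2},    # strict: both passed
    "music": {"results": [False, False], "min_passes": 1},      # tolerant, but both failed
}

group_passed = {
    name: sum(g["results"]) >= g["min_passes"] for name, g in groups.items()
}
merit_passed = all(group_passed.values())
print(group_passed)  # {'geography': True, 'music': False}
print(merit_passed)  # False
```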

Repeat with same data

AI systems can produce different outputs for identical inputs due to their non-deterministic nature. Use @merit.repeat to run the same merit multiple times with the same data, measuring consistency and reliability of your AI component.
import merit

@merit.repeat(count=5)
def merit_chatbot_consistent_greeting(chatbot):
    """Run 5 times - all must pass."""
    response = chatbot.ask("Hello")
    assert "hi" in response.lower() or "hello" in response.lower()

@merit.repeat(count=10, min_passes=8)
def merit_sentiment_mostly_accurate(classifier):
    """Run 10 times - at least 8 must pass."""
    result = classifier("This product is amazing!")
    assert result.sentiment == "positive"
The min_passes parameter generalizes the “pass@k” idea from the AI evaluation community. For example, @merit.repeat(count=10, min_passes=8) checks whether your system achieves the desired behavior in at least 8 out of 10 attempts (pass@8/10).
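As a plain-Python sketch of the assumed threshold semantics (not Merit's implementation): collect the outcome of each run and compare the pass count to min_passes.

```python
# With run outcomes in hand, min_passes is a simple threshold on the pass count
def run_repeated(outcomes, min_passes):
    return sum(outcomes) >= min_passes

print(run_repeated([True] * 8 + [False] * 2, min_passes=8))  # True
print(run_repeated([True] * 7 + [False] * 3, min_passes=8))  # False
```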

Organizing Merits with Tags

Running only specific Merits

Use @merit.tag to organize and filter merits:
import merit

@merit.tag("smoke", "fast")
def merit_health_check(api_client):
    response = api_client.get("/health")
    assert response.status_code == 200

@merit.tag("integration", "slow")
def merit_end_to_end_workflow(system):
    # Long-running integration merit
    pass

# Tag entire classes
@merit.tag("customer-support")
class MeritSupportBot:

    @merit.tag("greeting")
    def merit_hello(self, support_bot):
        pass

    @merit.tag("farewell")
    def merit_goodbye(self, support_bot):
        pass
Run specific tags from CLI:
merit test --tag smoke       # Only smoke merits
merit test --tag slow        # Only slow merits

Skipping Merits unconditionally

Skip merits with @merit.tag.skip:
import merit

@merit.tag.skip(reason="Feature not implemented yet")
def merit_upcoming_feature():
    pass

@merit.tag.skip(reason="Requires API key")
def merit_external_api():
    pass

Skipping Merits conditionally

Call merit.skip() inside a merit to skip at runtime based on the environment:
import os
import merit

def merit_conditional_skip():
    if not os.getenv("API_KEY"):
        merit.skip("API_KEY not configured")
    # Execution continues only when API_KEY is set
    assert True
You can also use merit.skip() inside resources to conditionally skip merits when dependencies aren’t available. This centralizes skip logic where the resource is defined rather than in every merit that uses it.
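One way to picture why a skip raised inside a resource propagates to every merit that uses it (a mechanics sketch with a made-up SkipMerit exception and runner, not Merit's internals):

```python
import os

# Hypothetical model: merit.skip() behaves like raising a dedicated
# exception that the runner catches and reports as "skipped".
class SkipMerit(Exception):
    pass

def external_api_resource():
    # Centralized skip logic: any merit depending on this resource is
    # skipped when the dependency is unavailable.
    if not os.getenv("API_KEY"):
        raise SkipMerit("API_KEY not configured")
    return "api-client"

def run(merit_fn):
    try:
        merit_fn(external_api_resource())
        return "passed"
    except SkipMerit as exc:
        return f"skipped: {exc}"

os.environ.pop("API_KEY", None)
print(run(lambda api: None))  # skipped: API_KEY not configured
```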

Expected Failures

Mark merits expected to fail with @merit.tag.xfail:
@merit.tag.xfail(reason="Known bug #123")
def merit_known_issue():
    # This failure won't fail the merit suite
    assert False

@merit.tag.xfail(reason="Model not accurate yet", strict=True)
def merit_strict_xfail():
    # If this passes, the merit suite FAILS (unexpected pass)
    pass
Use strict=True when the merit passing would be surprising and worth investigating.
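The outcome matrix implied above can be sketched as a small decision function (assumed semantics, mirroring pytest's xfail conventions):

```python
# xfail outcomes: expected failures keep the suite green; unexpected
# passes fail the suite only under strict=True.
def xfail_outcome(merit_failed, strict):
    if merit_failed:
        return "xfailed"  # expected failure, suite stays green
    return "suite fails" if strict else "xpassed"

print(xfail_outcome(merit_failed=True, strict=False))   # xfailed
print(xfail_outcome(merit_failed=False, strict=True))   # suite fails
```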

Recommendations

1. Name functions descriptively

Merit function names become merit case identifiers in reports. Use descriptive names that explain what’s being evaluated. Don’t do this:
def merit_test1():
    pass

def merit_test2():
    pass

def merit_chatbot():  # Too vague
    pass
Do this:
def merit_chatbot_responds_to_greetings():
    """Check that chatbot handles basic greetings appropriately."""
    pass

def merit_chatbot_no_hallucinations_in_faq():
    """Verify chatbot doesn't invent facts when answering FAQ questions."""
    pass

def merit_chatbot_follows_brand_voice():
    """Ensure chatbot responses match company's tone and style guidelines."""
    pass
Descriptive names make reports self-documenting and help team members understand merit failures.

2. Use dependency injection over global imports

Merit’s dependency injection system enables better resource management and merit isolation. Inject dependencies as parameters instead of importing globally. Don’t do this:
# merit_agent.py
from app import agent  # Global import

def merit_weather_queries():
    # Using global - can't control lifecycle or swap implementations
    response = agent("What's the weather?")
    assert response
Do this:
# merit_agent.py
import merit
from app import agent as production_agent

@merit.resource
def agent():
    """Evaluation instance of agent with controlled lifecycle."""
    instance = production_agent.create(env="test")
    yield instance
    instance.cleanup()

def merit_weather_queries(agent):
    # Injected - Merit manages lifecycle and can track usage
    response = agent("What's the weather?")
    assert response
This pattern enables:
  • Automatic setup and teardown
  • Resource scoping and reuse
  • Merit isolation
  • Better reporting and analytics
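The setup/teardown behavior of yield-style resources can be pictured with a plain generator (assumed mechanics, mirroring pytest fixtures): code before the yield runs as setup, code after it as teardown.

```python
events = []

def agent_resource():
    events.append("setup")      # before yield: setup
    yield "agent-instance"
    events.append("teardown")   # after yield: teardown

gen = agent_resource()
agent = next(gen)               # run setup, receive the resource
events.append(f"use {agent}")
next(gen, None)                 # resume past the yield to run teardown

print(events)  # ['setup', 'use agent-instance', 'teardown']
```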

3. Group related merits in classes

Group related merits in classes for better organization and shared tags/setup. Don’t do this:
# merit_support.py - flat functions with repeated tags

@merit.tag("customer-support", "greeting")
def merit_support_greeting_casual():
    pass

@merit.tag("customer-support", "greeting")
def merit_support_greeting_formal():
    pass

@merit.tag("customer-support", "farewell")
def merit_support_farewell_casual():
    pass

@merit.tag("customer-support", "farewell")
def merit_support_farewell_formal():
    pass
Do this:
# merit_support.py - organized in classes

@merit.tag("customer-support")
class MeritSupportGreetings:
    """Evaluate support bot greeting scenarios."""

    @merit.tag("casual")
    def merit_greeting_casual(self, support_bot):
        pass

    @merit.tag("formal")
    def merit_greeting_formal(self, support_bot):
        pass

@merit.tag("customer-support")
class MeritSupportFarewells:
    """Evaluate support bot farewell scenarios."""

    @merit.tag("casual")
    def merit_farewell_casual(self, support_bot):
        pass

    @merit.tag("formal")
    def merit_farewell_formal(self, support_bot):
        pass
Classes provide:
  • Logical grouping in reports
  • Shared tags that cascade to methods
  • Better code organization
  • Easier navigation in IDEs