Merits are the core building blocks of your AI system evaluations. Like pytest discovers test_* functions, Merit discovers and runs merit_* functions - each one checking how well your AI system performs using Merit’s APIs and components.
The simplest merit function is a function whose name starts with merit_. Merit automatically discovers and executes these functions.
```python
import merit
from merit.predicates import has_unsupported_facts

# Define your AI system (or import from your codebase)
def chatbot(prompt: str) -> str:
    return call_llm(prompt)

# Merit function: discovered and run automatically
async def merit_chatbot_no_hallucinations():
    context = "Our store hours are 9 AM to 6 PM Monday-Saturday."
    response = chatbot("When are you open?")
    # Use semantic predicates to check output quality
    assert not await has_unsupported_facts(response, context)
```
Run all merit functions in your project:
```bash
merit test
```
Merit discovers all merit_* functions, executes them, and generates a report - just like pytest but for AI system evaluation.
AI systems are inherently non-deterministic, making it essential to test them across multiple scenarios and runs. Merit provides three distinct approaches to iterate the same merit definition, each optimized for different use cases: quick parametrization for a few variations, structured cases for large datasets, and repeated execution to assess consistency.
When you need to run the same merit with a small set of different inputs, @merit.parametrize offers the most concise syntax. It’s ideal for testing a handful of variations without the overhead of defining structured case objects.
```python
import merit

@merit.parametrize("city,state", [
    ("Boston", "Massachusetts"),
    ("Austin", "Texas"),
])
def merit_geography_bot(city: str, state: str, geography_bot):
    result = geography_bot.ask(f"What state is {city} in?")
    assert state in result
```
When evaluating against tens or hundreds of examples, or when you need consistent typing and structure, use @merit.iter_cases with Case objects. This approach provides type safety through Pydantic validation and enables loading test cases from external sources like JSON files or databases.
```python
import json

import merit
from merit import Case

# Load merit cases from file
with open("merit_cases.json") as f:
    cases = [Case(**item) for item in json.load(f)]

@merit.iter_cases(*cases)
def merit_from_dataset(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    expected = case.references["expected_label"]
    assert result == expected

@merit.iter_cases(*cases, min_passes=8)
def merit_from_dataset_pass_at_k(case: Case, classifier):
    result = classifier(**case.sut_input_values)
    expected = case.references["expected_label"]
    assert result == expected
```
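The loader shown here assumes each JSON entry maps directly onto a `Case`'s fields: `sut_input_values` (the inputs fed to the system under test) and `references` (the expected values the merit asserts against). A minimal sketch of that file shape, with hypothetical classifier inputs:

```python
import json

# Hypothetical contents of merit_cases.json; the field names come from the
# Case usage in the examples on this page, the inputs are illustrative.
entries = [
    {"sut_input_values": {"text": "Great service!"},
     "references": {"expected_label": "positive"}},
    {"sut_input_values": {"text": "Never again."},
     "references": {"expected_label": "negative"}},
]

# Round-trip through JSON, as the loading example does
loaded = json.loads(json.dumps(entries))
print(all({"sut_input_values", "references"} <= set(item) for item in loaded))  # True
```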
min_passes on @merit.iter_cases works the same way as on @merit.repeat: by default every case execution must pass, but you can set a lower threshold when evaluating large or noisy datasets.
When your cases naturally fall into groups (e.g. topics, difficulty tiers, languages), use @merit.iter_case_groups with CaseGroup objects. Each group carries its own group-level references and a min_passes threshold, giving you hierarchical reporting (run → groups → cases) and per-group pass/fail semantics.
```python
import merit
from merit import Case, CaseGroup

geography = CaseGroup(
    name="geography",
    cases=[
        Case(sut_input_values={"prompt": "Capital of France?"}, references={"expected": "Paris"}),
        Case(sut_input_values={"prompt": "Capital of Germany?"}, references={"expected": "Berlin"}),
    ],
    min_passes=2,  # strict: both must pass
)

music = CaseGroup(
    name="music",
    cases=[
        Case(sut_input_values={"prompt": "Best rock band?"}, references={"expected": "Metallica"}),
        Case(sut_input_values={"prompt": "Best pop artist?"}, references={"expected": "Lady Gaga"}),
    ],
    min_passes=1,  # tolerant: at least one must pass
)

@merit.iter_case_groups(geography, music)
def merit_chatbot(group: CaseGroup, case: Case, chatbot):
    response = chatbot(**case.sut_input_values)
    assert case.references["expected"] in response
```
The merit passes only if every group meets its own min_passes. Inside the merit function, group and case are injected automatically — use group.references for group-level data and case.references for case-level data.
Use CaseGroup when you need per-group thresholds or group-level metadata. If all cases are flat and share the same threshold, stick with @merit.iter_cases(*cases, min_passes=k).
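The per-group pass/fail rule described above can be sketched in plain Python. This illustrates the semantics only, not Merit's implementation:

```python
def group_passes(case_outcomes, min_passes):
    # A group passes when at least min_passes of its cases pass.
    return sum(case_outcomes) >= min_passes

def merit_passes(groups):
    # The merit passes only if every group meets its own threshold.
    return all(group_passes(outcomes, k) for outcomes, k in groups)

# geography (min_passes=2) with both cases passing,
# music (min_passes=1) with one of two cases passing:
print(merit_passes([([True, True], 2), ([True, False], 1)]))   # True
# geography falls below its threshold, so the whole merit fails:
print(merit_passes([([True, False], 2), ([True, False], 1)]))  # False
```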
AI systems can produce different outputs for identical inputs due to their non-deterministic nature. Use @merit.repeat to run the same merit multiple times with the same data, measuring consistency and reliability of your AI component.
```python
import merit

@merit.repeat(count=5)
def merit_chatbot_consistent_greeting(chatbot):
    """Run 5 times - all must pass."""
    response = chatbot.ask("Hello")
    assert "hi" in response.lower() or "hello" in response.lower()

@merit.repeat(count=10, min_passes=8)
def merit_sentiment_mostly_accurate(classifier):
    """Run 10 times - at least 8 must pass."""
    result = classifier("This product is amazing!")
    assert result.sentiment == "positive"
```
The min_passes parameter is sometimes referred to as “pass@k” in the AI evaluation community. For example, @merit.repeat(count=10, min_passes=8) checks if your system achieves the desired behavior in at least 8 out of 10 attempts (pass@8/10).
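The threshold check amounts to counting passing runs. A plain-Python sketch of the rule (meets_min_passes is an illustrative helper, not part of Merit's API):

```python
def meets_min_passes(outcomes, min_passes=None):
    """Return True if enough runs passed.

    outcomes: one boolean per execution.
    min_passes: required number of passing runs; defaults to all of them.
    """
    required = len(outcomes) if min_passes is None else min_passes
    return sum(outcomes) >= required

# 8 passes out of 10 meets a min_passes=8 threshold (pass@8/10)
print(meets_min_passes([True] * 8 + [False] * 2, min_passes=8))  # True
# Without min_passes, every run must pass
print(meets_min_passes([True] * 8 + [False] * 2))  # False
```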
Call merit.skip() inside a merit function to skip it at runtime, for example when a required dependency is missing:

```python
import os

import merit

def merit_conditional_skip():
    if not os.getenv("API_KEY"):
        merit.skip("API_KEY not configured")
    # Merit continues if the skip condition is not met
    assert True
```
You can also use merit.skip() inside resources to conditionally skip merits when dependencies aren’t available. This centralizes skip logic where the resource is defined rather than in every merit that uses it.
Mark merits expected to fail with @merit.tag.xfail:
Copy
@merit.tag.xfail(reason="Known bug #123")def merit_known_issue(): # This failure won't fail the merit suite assert False@merit.tag.xfail(reason="Model not accurate yet", strict=True)def merit_strict_xfail(): # If this passes, the merit suite FAILS (unexpected pass) pass
Use strict=True when the merit passing would be surprising and worth investigating.
Merit’s dependency injection system enables better resource management and merit isolation. Inject dependencies as parameters instead of importing them globally. Don’t do this:
```python
# merit_agent.py
from app import agent  # Global import

def merit_weather_queries():
    # Using global - can't control lifecycle or swap implementations
    response = agent("What's the weather?")
    assert response
```
Do this:
```python
# merit_agent.py
import merit
from app import agent as production_agent

@merit.resource
def agent():
    """Evaluation instance of agent with controlled lifecycle."""
    instance = production_agent.create(env="test")
    yield instance
    instance.cleanup()

def merit_weather_queries(agent):
    # Injected - Merit manages lifecycle and can track usage
    response = agent("What's the weather?")
    assert response
```
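The yield-based resource follows the same setup/teardown shape as a Python context manager: code before the yield runs before the merit, code after it runs afterwards. A standalone sketch of that lifecycle (the names here are illustrative, not Merit APIs):

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_agent():
    # Setup runs before the body; cleanup runs after, even on failure.
    events.append("setup")
    try:
        yield "agent-instance"
    finally:
        events.append("cleanup")

with managed_agent() as agent:
    events.append(f"use {agent}")

print(events)  # ['setup', 'use agent-instance', 'cleanup']
```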