Decorator

@metric

Register a metric resource that yields a Metric instance and optionally a final value. Signature:
@metric(
    fn: Callable | None = None,
    *,
    scope: Scope | str = Scope.SESSION,
)
Parameters:

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| fn | Callable \| None | None | Generator or async generator function |
| scope | Scope \| str | Scope.SESSION | Lifecycle scope: "case", "suite", or "session" |

Returns: Decorated function registered as a metric resource.

Generator Requirements:
  1. Must yield a Metric instance first (this gets injected)
  2. Optionally yield a final calculated value (becomes MetricResult.value)
  3. Can assert on metric properties after all data is collected
Example:
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    """Track accuracy across all test cases."""
    metric = Metric()
    yield metric  # This gets injected into merits

    # After all merits run, compute final value
    final_accuracy = metric.mean
    yield final_accuracy  # Captured in MetricResult

    # Assert on aggregate quality
    assert metric.mean > 0.8, f"Accuracy too low: {metric.mean}"

@merit.metric(scope="suite")
def latency_ms():
    """Track latency within a suite."""
    metric = Metric()
    yield metric

    # Check 95th percentile
    assert metric.p95 < 500, f"P95 latency too high: {metric.p95}ms"

# Use in merit functions
def merit_test(accuracy: Metric, latency_ms: Metric):
    result = model.predict("input")

    with metrics(accuracy, latency_ms):
        assert result == "expected"

Classes

Metric

Thread-safe class for recording data and computing statistical properties.

Attributes:

| Name | Type | Description |
| ---- | ---- | ----------- |
| name | str \| None | Metric name (auto-set by decorator) |
| metadata | MetricMetadata | Collection metadata (scope, contributors, timestamps) |

Methods:

| Method | Parameters | Description |
| ------ | ---------- | ----------- |
| add_record | value: int \| float \| bool \| list[int \| float \| bool] \| tuple[int \| float \| bool, ...] | Record one or more data points (numeric/bool only) |

Statistical Properties: All properties are computed lazily and cached. Recording happens when you call add_record(...) manually or when you use with metrics(...) to record assertion pass/fail outcomes.

| Property | Type | Description |
| -------- | ---- | ----------- |
| raw_values | list[int \| float \| bool] | All recorded values |
| len | int | Number of values |
| sum | float | Sum of all values |
| min | float | Minimum value |
| max | float | Maximum value |
| mean | float | Arithmetic mean |
| median | float | Median (50th percentile) |
| std | float | Sample standard deviation |
| variance | float | Sample variance |
| pstd | float | Population standard deviation |
| pvariance | float | Population variance |
| p25 | float | 25th percentile |
| p50 | float | 50th percentile (median) |
| p75 | float | 75th percentile |
| p90 | float | 90th percentile |
| p95 | float | 95th percentile |
| p99 | float | 99th percentile |
| percentiles | list[float] | All 99 percentiles (p1 to p99) |
| ci_90 | tuple[float, float] | 90% confidence interval (lower, upper) |
| ci_95 | tuple[float, float] | 95% confidence interval (lower, upper) |
| ci_99 | tuple[float, float] | 99% confidence interval (lower, upper) |
| counter | Counter[int \| float \| bool] | Frequency count of each unique value |
| distribution | dict[int \| float \| bool, float] | Share of each unique value (0-1) |
Example:
import merit
from merit import Metric, metric, metrics

@metric
def response_quality():
    metric = Metric()
    yield metric

    # Access statistical properties
    print(f"Mean quality: {metric.mean}")
    print(f"Std deviation: {metric.std}")
    print(f"95th percentile: {metric.p95}")
    print(f"95% CI: {metric.ci_95}")

    # Distribution analysis
    print(f"Pass rate: {metric.distribution.get(True, 0)}")
    print(f"Fail count: {metric.counter[False]}")

    # Quality gate
    assert metric.mean > 0.8

@merit.parametrize("input,expected", test_cases)
def merit_quality_check(input, expected, response_quality: Metric):
    result = model.predict(input)
    is_correct = result == expected

    # Record to metric
    with metrics(response_quality):
        assert is_correct
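The statistical properties follow the conventional stdlib definitions (sample vs. population variance, value-frequency distribution). As a rough illustration of those semantics in plain Python, not merit's actual implementation, with `values` standing in for `raw_values`:

```python
import statistics
from collections import Counter

values = [0.8, 0.9, 1.0, 0.7, 0.9]  # stand-in for metric.raw_values

mean = statistics.mean(values)    # corresponds to Metric.mean
std = statistics.stdev(values)    # sample std -> Metric.std
pstd = statistics.pstdev(values)  # population std -> Metric.pstd

# counter / distribution: frequency and share of each unique value
counter = Counter(values)
distribution = {v: c / len(values) for v, c in counter.items()}

print(mean)                # 0.86
print(distribution[0.9])   # 0.4 (2 of 5 values)
```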

MetricMetadata

Metadata tracking metric lifecycle and contributors.

Attributes:

| Name | Type | Description |
| ---- | ---- | ----------- |
| last_item_recorded_at | datetime \| None | Timestamp of most recent value |
| first_item_recorded_at | datetime \| None | Timestamp of first value |
| scope | Scope | Lifecycle scope (SESSION, SUITE, CASE) |
| collected_from_merits | set[str] | Names of merit functions that contributed |
| collected_from_resources | set[str] | Names of resources that contributed |
| collected_from_cases | set[str] | Case IDs that contributed |
Example:
@metric
def my_metric():
    metric = Metric()
    yield metric

    # Inspect metadata
    meta = metric.metadata
    print(f"Scope: {meta.scope}")
    print(f"Contributing merits: {meta.collected_from_merits}")
    print(f"Case count: {len(meta.collected_from_cases)}")
    print(f"First recorded: {meta.first_item_recorded_at}")
    print(f"Last recorded: {meta.last_item_recorded_at}")

MetricResult

Result captured when a metric resource completes.

Attributes:

| Name | Type | Description |
| ---- | ---- | ----------- |
| name | str | Metric name |
| metadata | MetricMetadata | Snapshot of metadata at completion |
| assertion_results | list[AssertionResult] | Assertions evaluated in metric teardown |
| value | CalculatedValue | Final yielded value (or NaN if none) |

Note: MetricResult instances are automatically collected and included in merit run reports.

Example:
@metric
def accuracy():
    metric = Metric()
    yield metric

    final_accuracy = metric.mean
    yield final_accuracy  # Becomes MetricResult.value

    assert metric.mean > 0.8  # Captured in assertion_results

# MetricResult is created automatically after metric completes
# and includes:
# - name: "accuracy"
# - value: final_accuracy (the second yielded value)
# - assertion_results: [assertion about metric.mean]
# - metadata: snapshot of metric collection info

Context Manager

metrics()

Attach metrics to assertions for automatic data collection. Signature:
@contextmanager
def metrics(*metrics: Metric) -> Iterator[None]
Parameters:

| Name | Type | Description |
| ---- | ---- | ----------- |
| metrics | Metric | Metrics to record assertion outcomes into |

Returns: Context manager that captures assertion results.

Behavior:
  • When an assertion passes inside the context, records True to all metrics
  • When an assertion fails, records False to all metrics
  • Multiple assertions in one context each contribute a data point
  • Works with both standard assertions and predicate assertions
Example:
import merit
from merit import Metric, metric, metrics

@metric
def accuracy():
    metric = Metric()
    yield metric

    # After all tests: mean should be > 80%
    assert metric.mean > 0.8

@metric
def hallucination_rate():
    metric = Metric()
    yield metric

    # After all tests: less than 5% should be False
    false_rate = metric.counter[False] / metric.len if metric.len > 0 else 0
    assert false_rate < 0.05

@merit.parametrize("question,context", test_cases)
async def merit_bot_quality(question, context, bot, accuracy: Metric, hallucination_rate: Metric):
    answer = bot.ask(question, context=context)

    # Multiple metrics in one context
    with metrics(accuracy, hallucination_rate):
        # Each assertion records True/False to all metrics
        assert len(answer) > 0
        assert not await has_unsupported_facts(answer, context)
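Conceptually, the context manager turns each assertion outcome into a data point on every attached metric. A much-simplified sketch of that mechanism (hypothetical, using a stand-in metric class; the real implementation hooks individual assertions, whereas this records one pass/fail per context):

```python
from contextlib import contextmanager

class FakeMetric:
    """Minimal stand-in for merit.Metric, for illustration only."""
    def __init__(self):
        self.raw_values = []

    def add_record(self, value):
        self.raw_values.append(value)

@contextmanager
def simple_metrics(*ms):
    try:
        yield
    except AssertionError:
        for m in ms:
            m.add_record(False)  # failed assertion -> record False
        raise
    else:
        for m in ms:
            m.add_record(True)   # clean exit -> record True

acc = FakeMetric()
try:
    with simple_metrics(acc):
        assert 1 + 1 == 3  # fails, records False
except AssertionError:
    pass
with simple_metrics(acc):
    assert 1 + 1 == 2      # passes, records True

print(acc.raw_values)  # [False, True]
```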

Usage Patterns

Basic Metric Collection

from merit import Metric, metric, metrics

@metric
def latency_ms():
    metric = Metric()
    yield metric

    # Quality gates
    assert metric.p95 < 500
    assert metric.mean < 200

def merit_performance(api_client, latency_ms: Metric):
    import time

    start = time.time()
    result = api_client.call()
    duration_ms = (time.time() - start) * 1000

    # Manual recording
    latency_ms.add_record(duration_ms)

    assert result.ok

Assertion-Based Collection

import merit
from merit import Metric, metric, metrics

@metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8

@merit.parametrize("input,expected", test_cases)
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)

    # Automatic recording based on assertion outcome
    with metrics(accuracy):
        assert result == expected  # Records True/False

Multiple Metrics

from merit import Metric, metric, metrics

@metric
def precision():
    metric = Metric()
    yield metric
    assert metric.mean > 0.85

@metric
def recall():
    metric = Metric()
    yield metric
    assert metric.mean > 0.80

def merit_evaluation(model, precision: Metric, recall: Metric):
    predictions = model.predict(test_data)

    for pred, actual in zip(predictions, ground_truth):
        # Track precision
        if pred == "positive":
            with metrics(precision):
                assert pred == actual

        # Track recall
        if actual == "positive":
            with metrics(recall):
                assert pred == actual

Distribution Analysis

import merit
from merit import Metric, metric, metrics

SENTIMENT_TO_SCORE = {"negative": -1, "neutral": 0, "positive": 1}

@metric
def sentiment_score_distribution():
    metric = Metric()
    yield metric

    # Analyze distribution
    dist = metric.distribution

    print(f"Positive: {dist.get(1, 0) * 100:.1f}%")
    print(f"Neutral: {dist.get(0, 0) * 100:.1f}%")
    print(f"Negative: {dist.get(-1, 0) * 100:.1f}%")

    # Ensure balanced distribution
    for score in (-1, 0, 1):
        rate = dist.get(score, 0)
        assert 0.2 < rate < 0.5, f"score {score} rate out of range: {rate}"

@merit.parametrize("text", test_texts)
def merit_sentiment(text, classifier, sentiment_score_distribution: Metric):
    label = classifier.classify(text)  # e.g. "positive" | "neutral" | "negative"
    score = SENTIMENT_TO_SCORE[label]

    # Record numeric values
    sentiment_score_distribution.add_record(score)

Percentile Analysis

import merit
from merit import Metric, metric, metrics

@metric
def response_time():
    metric = Metric()
    yield metric

    # SLA checks
    assert metric.p50 < 100, f"Median too high: {metric.p50}ms"
    assert metric.p95 < 500, f"P95 too high: {metric.p95}ms"
    assert metric.p99 < 1000, f"P99 too high: {metric.p99}ms"

    # Final value for reporting
    yield metric.p95

@merit.repeat(100)
def merit_latency_test(api, response_time: Metric):
    import time

    start = time.time()
    api.call()
    duration_ms = (time.time() - start) * 1000

    response_time.add_record(duration_ms)
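For reference, p50/p95/p99 are standard percentiles: p95 is the value below which 95% of samples fall. The stdlib can compute the same 99 cut points as the percentiles property; merit's exact interpolation method is not documented here, so treat this as an approximation:

```python
import statistics

# Stand-in for recorded latencies: 1..100 ms
durations = list(range(1, 101))

# 99 cut points (p1..p99), analogous to Metric.percentiles
pct = statistics.quantiles(durations, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]

print(p50, p95, p99)  # 50.5 95.95 99.99
```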

Confidence Intervals

import merit
from merit import Metric, metric, metrics

@metric
def success_rate():
    metric = Metric()
    yield metric

    # Check confidence interval
    lower, upper = metric.ci_95

    print(f"Success rate: {metric.mean:.2%}")
    print(f"95% CI: [{lower:.2%}, {upper:.2%}]")

    # Lower bound must exceed threshold
    assert lower > 0.75, f"95% CI lower bound too low: {lower}"

@merit.repeat(50, min_passes=40)
def merit_reliability(llm, success_rate: Metric):
    response = llm.generate("Hello")

    with metrics(success_rate):
        assert "hello" in response.lower()
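The docs do not state which interval method ci_95 uses. A common choice for the mean is the normal (Wald) approximation, mean ± z · stderr; this sketch assumes that method and is illustrative only:

```python
import statistics

def normal_ci(values, z=1.96):
    """Approximate CI for the mean: mean +/- z * standard error.
    z=1.96 corresponds to a two-sided 95% interval."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / len(values) ** 0.5
    return mean - z * stderr, mean + z * stderr

# 45 passes out of 50 runs -> 90% observed success rate
passes = [True] * 45 + [False] * 5
lower, upper = normal_ci(passes)

print(f"[{lower:.3f}, {upper:.3f}]")  # roughly [0.816, 0.984]
```

Note that for pass/fail data near 0 or 1 the normal approximation is loose; a Wilson or bootstrap interval would be tighter, but the gating pattern (assert on the lower bound) is the same.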