A Metric is an object that records data points and calculates statistical values. Developers use metrics to track system behavior across merit cases and make data-driven assertions about performance, accuracy, or other measurable properties. Using Metric enables:
  • Recording assertion results automatically as True/False values
  • Calculating statistics (mean, std, percentiles) on collected data
  • Tracking metrics at different scopes (session, suite, case)
  • Composing metrics via dependency injection for hierarchical analysis
  • Generating quality reports with measurable insights

Basic Usage

The most common pattern is to define a metric as a generator function that yields a Metric instance, then use the metrics() context manager to automatically record assertion results.
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric

    # After all tests run, assert on the calculated statistics
    assert metric.mean > 0.8  # 80% accuracy threshold

@merit.parametrize("input,expected", [("a", 1), ("b", 2)])
def merit_classifier_accuracy(my_classifier, input, expected, accuracy: Metric):
    result = my_classifier(input)

    # Assertions inside metrics() are recorded as True/False
    with metrics(accuracy):
        assert result == expected

Recording Assertions with Selected Metrics

When you use with metrics(metric1, metric2):, any assertions inside that block are automatically recorded:
  • Passing assertions record True
  • Failing assertions record False
  • Recorded boolean values are available in metric1.raw_values and metric2.raw_values
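The recording behavior can be sketched in plain Python. This is an illustrative stand-in, not merit's actual implementation: SketchMetric and sketch_metrics are hypothetical names, and merit presumably re-raises the failure so the case is still reported, whereas this sketch swallows it to stay runnable.

```python
from contextlib import contextmanager

class SketchMetric:
    """Minimal stand-in for merit's Metric, for illustration only."""
    def __init__(self):
        self.raw_values = []

@contextmanager
def sketch_metrics(*metrics_to_record):
    """Record True on a clean exit, False if an AssertionError escapes."""
    try:
        yield
    except AssertionError:
        for m in metrics_to_record:
            m.raw_values.append(False)
    else:
        for m in metrics_to_record:
            m.raw_values.append(True)

m1, m2 = SketchMetric(), SketchMetric()
with sketch_metrics(m1, m2):
    assert 1 + 1 == 2   # passes -> True recorded in both metrics
with sketch_metrics(m1, m2):
    assert "a" == "b"   # fails  -> False recorded in both metrics

print(m1.raw_values)  # [True, False]
print(m2.raw_values)  # [True, False]
```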

Metric Properties

The Metric class provides statistical calculations on demand:
@merit.metric
def latency():
    metric = Metric()
    yield metric

    # Access calculated properties after data collection
    print(f"Mean: {metric.mean}")
    print(f"Median: {metric.median}")
    print(f"P95: {metric.p95}")
    print(f"Std Dev: {metric.std}")
    print(f"Min/Max: {metric.min}/{metric.max}")
    print(f"95% CI: {metric.ci_95}")
    print(f"Distribution: {metric.distribution}")
Available properties:
  • Basic stats: len, sum, min, max, mean, median
  • Variability: variance, std, pvariance, pstd
  • Percentiles: p25, p50, p75, p90, p95, p99, plus percentiles for the full p1-p99 range
  • Confidence intervals: ci_90, ci_95, ci_99
  • Distributions: counter (frequency counts), distribution (proportions)
  • Raw data: raw_values (all recorded values)
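These properties follow conventional statistical definitions. As a rough sketch of what they compute, using only the standard library (merit's exact conventions, e.g. percentile interpolation or the confidence-interval method, may differ):

```python
import math
import statistics
from collections import Counter

values = [12.0, 15.0, 11.0, 30.0, 14.0]

mean = statistics.mean(values)                  # 16.4
median = statistics.median(values)              # 14.0
std = statistics.stdev(values)                  # sample standard deviation
p95 = statistics.quantiles(values, n=100)[94]   # 95th percentile (one common convention)

counter = Counter(values)                                        # frequency counts
distribution = {v: c / len(values) for v, c in counter.items()}  # proportions summing to 1

# 95% confidence interval for the mean via the normal approximation
half_width = 1.96 * std / math.sqrt(len(values))
ci_95 = (mean - half_width, mean + half_width)
```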

Recording Data Manually

While metrics() automatically records assertion results, you can also manually record data:
@merit.metric
def response_time():
    metric = Metric()
    yield metric

def merit_latency_test(response_time: Metric):
    import time
    # perf_counter is monotonic and better suited to timing than time.time()
    start = time.perf_counter()
    result = call_api()
    elapsed = time.perf_counter() - start

    # Manually record a value
    response_time.add_record(elapsed * 1000)  # milliseconds

    assert result

Scopes: Session, Suite, Case

Metrics can be scoped to different lifecycle levels. This enables tracking both local statistics (per merit case) and global statistics (across all merit cases).
# Session scope: collects data across the entire merit run
@merit.metric(scope="session")
def average_hallucinations_per_case():
    metric = Metric()
    yield metric

    # Assert on overall performance after all cases complete
    assert metric.mean < 2  # Average hallucinations should be less than 2

# Case scope: creates a new metric instance for each merit case
@merit.metric(scope="case")
def case_hallucinations_count(average_hallucinations_per_case: Metric):
    metric = Metric()
    yield metric

    # Write case-level data to session-level metric
    hallucinations_for_case = metric.counter[False]
    average_hallucinations_per_case.add_record(hallucinations_for_case)

@merit.parametrize("city,expected_state", [("Boston", "Massachusetts"), ("Miami", "Florida")])
def merit_geography_bot(
    city: str,
    expected_state: str,
    case_hallucinations_count: Metric
):
    result = geography_bot(city)

    # Each case tracks its own hallucinations
    with metrics(case_hallucinations_count):
        assert expected_state in result
Available scopes:
  • "session": One metric instance for the entire merit run (default)
  • "suite": One instance per merit file/module
  • "case": New instance for each parametrized merit case

Composite Metrics via Dependency Injection

Metrics can depend on other metrics, enabling hierarchical analysis. This pattern is useful for tracking components separately while calculating aggregate statistics.
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    """Overall accuracy from both false positives and negatives"""
    metric = Metric()
    yield metric

    # After child metrics write their data, check overall accuracy
    assert metric.distribution[True] >= 0.8  # 80% correct

@merit.metric
def false_positives(accuracy: Metric):
    """Track false positives and contribute to overall accuracy"""
    metric = Metric()
    yield metric

    # Propagate each recorded value to the parent metric
    for value in metric.raw_values:
        accuracy.add_record(value)

    # Check this specific metric
    assert metric.counter[False] < 5  # Less than 5 false positives

@merit.metric
def false_negatives(accuracy: Metric):
    """Track false negatives and contribute to overall accuracy"""
    metric = Metric()
    yield metric

    # Propagate each recorded value to the parent metric
    for value in metric.raw_values:
        accuracy.add_record(value)

    # Check this specific metric
    assert metric.counter[False] < 3  # Less than 3 false negatives

@merit.parametrize("input", ["good1", "good2", "good3"])
def merit_positive_cases(input: str, false_negatives: Metric):
    result = classifier(input)
    with metrics(false_negatives):
        assert result is True

@merit.parametrize("input", ["bad1", "bad2", "bad3"])
def merit_negative_cases(input: str, false_positives: Metric):
    result = classifier(input)
    with metrics(false_positives):
        assert result is False
This creates a hierarchy:
accuracy (session-level)
├── false_positives (tracks FP, writes to accuracy)
└── false_negatives (tracks FN, writes to accuracy)

Recommendations

1. Use metrics() for automatic assertion tracking

The metrics() context manager automatically records assertion results, eliminating manual bookkeeping. Don’t do this:
@merit.metric
def accuracy():
    metric = Metric()
    yield metric

def merit_test(accuracy: Metric):
    result = classifier("input")
    is_correct = result == "expected"

    # Manual tracking is error-prone and verbose
    accuracy.add_record(True if is_correct else False)
    assert is_correct
Do this:
@merit.metric
def accuracy():
    metric = Metric()
    yield metric

def merit_test(accuracy: Metric):
    result = classifier("input")

    # Automatic tracking is cleaner and safer
    with metrics(accuracy):
        assert result == "expected"

2. Scope metrics appropriately for your analysis needs

Choose scope based on what you’re measuring. Use case-level metrics for per-merit statistics and session-level metrics for aggregate analysis.
# Case scope: Track each merit case individually
@merit.metric(scope="case")
def case_latency():
    metric = Metric()
    yield metric

    # Assert each case completes quickly
    assert metric.p95 < 1000  # Each case's p95 under 1 second

# Session scope: Track overall system performance
@merit.metric(scope="session")
def overall_latency():
    metric = Metric()
    yield metric

    # Assert on aggregate performance
    assert metric.mean < 500  # Average across all cases

3. Build composite metrics for hierarchical insights

Use dependency injection to create parent-child metric relationships. This enables drilling down from aggregate metrics to specific failure modes.
@merit.metric
def overall_quality():
    """Top-level quality metric"""
    metric = Metric()
    yield metric
    assert metric.mean > 0.9

@merit.metric
def accuracy(overall_quality: Metric):
    """Tracks correctness, contributes to quality"""
    metric = Metric()
    yield metric
    overall_quality.add_record(metric.mean)
    yield metric.mean  # Optionally yield final value for reports

@merit.metric
def fluency(overall_quality: Metric):
    """Tracks readability, contributes to quality"""
    metric = Metric()
    yield metric
    overall_quality.add_record(metric.mean)
    yield metric.mean

4. Yield final values for report generation

After data collection completes, the metric generator can yield a second, final calculated value; it is captured in MetricResult for reports.
@merit.metric
def error_rate():
    metric = Metric()
    yield metric

    # Calculate final error rate
    if metric.len > 0:
        final_error_rate = metric.counter[False] / metric.len
        assert final_error_rate < 0.1

        # Yield final value for reporting
        yield final_error_rate
    else:
        yield 0.0

5. Use raw_values for custom calculations

Access raw_values to perform calculations beyond the built-in statistical properties.
@merit.metric
def custom_analysis():
    metric = Metric()
    yield metric

    # Custom calculations on raw data
    values = metric.raw_values
    outliers = [v for v in values if abs(v - metric.mean) > 3 * metric.std]

    assert len(outliers) < len(values) * 0.01  # Less than 1% outliers