A Metric is an object that records data points and calculates statistics over them. Use metrics to track system behavior across multiple merit runs and to make data-driven assertions about performance, accuracy, or other measurable properties.
Using Metric enables:
- Recording assertion results automatically as True/False values
- Calculating statistics (mean, std, percentiles) on collected data
- Tracking metrics at different scopes (session, suite, case)
- Composing metrics via dependency injection for hierarchical analysis
- Generating quality reports with measurable insights
Basic Usage
The most common pattern is to define a metric as a generator function, yield a Metric instance, then use the metrics() context manager to automatically track assertion results.
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    metric = Metric()
    yield metric
    # After all tests run, assert on the calculated statistics
    assert metric.mean > 0.8  # 80% accuracy threshold

@merit.parametrize("input,expected", [("a", 1), ("b", 2)])
def merit_classifier_accuracy(my_classifier, input, expected, accuracy: Metric):
    result = my_classifier(input)
    # Assertions inside metrics() are recorded as True/False
    with metrics(accuracy):
        assert result == expected
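The generator shape matters because the code after yield only runs once the generator is resumed. The sketch below imitates that lifecycle in plain Python, without Merit. FakeMetric and run_with_metric are hypothetical stand-ins introduced for illustration; Merit's actual runner is not described here and may work differently.

```python
from statistics import fmean


class FakeMetric:
    """Hypothetical stand-in for Metric: stores values and exposes a mean."""

    def __init__(self):
        self.raw_values = []

    def add_record(self, value):
        self.raw_values.append(value)

    @property
    def mean(self):
        return fmean(self.raw_values)


def accuracy():
    metric = FakeMetric()
    yield metric                 # setup ends here; the runner hands out `metric`
    assert metric.mean > 0.8     # runs only when the generator is resumed


def run_with_metric(factory, outcomes):
    """Toy runner: set up the metric, feed it outcomes, then resume the generator."""
    gen = factory()
    metric = next(gen)           # run up to the first yield
    for outcome in outcomes:
        metric.add_record(outcome)
    try:
        next(gen)                # resume: executes the post-yield assertion
    except StopIteration:
        pass                     # the generator finished normally
    return metric


# Nine passes and one failure give a mean of 0.9, so the post-yield check passes.
passing = run_with_metric(accuracy, [True] * 9 + [False])
```

Calling run_with_metric(accuracy, [True, False]) instead raises AssertionError from the post-yield check, because the mean is only 0.5.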
Recording assertions with selected metrics
Any assertion inside a with metrics(metric1, metric2): block is recorded automatically:
- Passing assertions record True
- Failing assertions record False
- The recorded bool values are available in metric1.raw_values and metric2.raw_values
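To make the recording behavior concrete, here is a minimal sketch of a context manager that writes one True/False per block into several targets. ToyMetric and toy_metrics are hypothetical names, and this is not Merit's implementation: the toy records a single result per with block and re-raises the failure, whereas Merit may record each assertion individually and may handle failures differently.

```python
from contextlib import contextmanager


class ToyMetric:
    """Hypothetical stand-in exposing add_record and raw_values."""

    def __init__(self):
        self.raw_values = []

    def add_record(self, value):
        self.raw_values.append(value)


@contextmanager
def toy_metrics(*targets):
    """Record True if the block passes, False if it raises AssertionError."""
    try:
        yield
    except AssertionError:
        for target in targets:
            target.add_record(False)
        raise                      # assumption: the failure still propagates
    else:
        for target in targets:
            target.add_record(True)


metric1, metric2 = ToyMetric(), ToyMetric()

with toy_metrics(metric1, metric2):
    assert 1 + 1 == 2              # passes: both targets record True

try:
    with toy_metrics(metric1, metric2):
        assert 1 + 1 == 3          # fails: both targets record False
except AssertionError:
    pass
```

After both blocks run, each target holds the same two records, one per block.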
Metric Properties
The Metric class provides statistical calculations on demand:
@merit.metric
def latency():
    metric = Metric()
    yield metric
    # Access calculated properties after data collection
    print(f"Mean: {metric.mean}")
    print(f"Median: {metric.median}")
    print(f"P95: {metric.p95}")
    print(f"Std Dev: {metric.std}")
    print(f"Min/Max: {metric.min}/{metric.max}")
    print(f"95% CI: {metric.ci_95}")
    print(f"Distribution: {metric.distribution}")
Available properties:
- Basic stats: len, sum, min, max, mean, median
- Variability: variance, std, pvariance, pstd
- Percentiles: p25, p50, p75, p90, p95, p99, or percentiles for p1-p99
- Confidence intervals: ci_90, ci_95, ci_99
- Distributions: counter (frequency counts), distribution (proportions)
- Raw data: raw_values (all recorded values)
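For readers who want to see what these quantities mean, the sketch below computes comparable values with Python's standard library only. It is an illustration, not Merit's code: the percentile interpolation method and the normal-approximation confidence interval are assumptions, and Merit's own results may differ slightly.

```python
import math
from collections import Counter
from statistics import (
    NormalDist, fmean, median, pstdev, pvariance, quantiles, stdev, variance,
)

values = [2, 4, 4, 4, 5, 5, 7, 9]

basic = {
    "len": len(values),
    "sum": sum(values),
    "min": min(values),
    "max": max(values),
    "mean": fmean(values),
    "median": median(values),
}

# Sample statistics divide by n - 1; population statistics divide by n.
sample_var, sample_std = variance(values), stdev(values)
pop_var, pop_std = pvariance(values), pstdev(values)

# 99 cut points: index 0 is p1, index 49 is p50, index 94 is p95.
# Assumption: "inclusive" interpolation, which never leaves the [min, max] range.
cuts = quantiles(values, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]

# Assumption: normal-approximation 95% confidence interval for the mean.
z = NormalDist().inv_cdf(0.975)
half_width = z * sample_std / math.sqrt(len(values))
ci_95 = (basic["mean"] - half_width, basic["mean"] + half_width)

# Frequency counts and proportions for boolean records.
outcomes = [True, True, False, True]
counter = Counter(outcomes)
distribution = {key: count / len(outcomes) for key, count in counter.items()}
```

For boolean records the mean and distribution[True] carry the same information, which is why both appear as accuracy checks in this guide.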
Recording Data Manually
While metrics() automatically records assertion results, you can also manually record data:
import time

@merit.metric
def response_time():
    metric = Metric()
    yield metric

def merit_latency_test(response_time: Metric):
    start = time.time()
    result = call_api()
    elapsed = time.time() - start
    # Manually record a value
    response_time.add_record(elapsed * 1000)  # milliseconds
    assert result
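If several merits time their calls this way, the measurement can be factored into a small helper. The sketch below is one possible shape, using time.perf_counter, which is better suited to measuring intervals than time.time. The names timed_ms and fake_api are hypothetical; fake_api stands in for call_api.

```python
import time


def timed_ms(func, *args, **kwargs):
    """Call func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def fake_api():
    """Hypothetical stand-in for call_api()."""
    time.sleep(0.01)
    return {"ok": True}


result, elapsed_ms = timed_ms(fake_api)
```

Inside a merit, the returned elapsed_ms would then be passed to response_time.add_record.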
Scopes: Session, Suite, Case
Metrics can be scoped to different lifecycle levels. This enables tracking both local statistics (per merit case) and global statistics (across all merits).
# Session scope: collects data across the entire merit run
@merit.metric(scope="session")
def average_hallucinations_per_case():
    metric = Metric()
    yield metric
    # Assert on overall performance after all cases complete
    assert metric.mean < 2  # Average hallucinations should be less than 2

# Case scope: creates a new metric instance for each merit case
@merit.metric(scope="case")
def case_hallucinations_count(average_hallucinations_per_case: Metric):
    metric = Metric()
    yield metric
    # Write case-level data to session-level metric
    hallucinations_for_case = metric.counter[False]
    average_hallucinations_per_case.add_record(hallucinations_for_case)

@merit.parametrize("city,expected_state", [("Boston", "Massachusetts"), ("Miami", "Florida")])
def merit_geography_bot(
    city: str,
    expected_state: str,
    case_hallucinations_count: Metric
):
    result = geography_bot(city)
    # Each case tracks its own hallucinations
    with metrics(case_hallucinations_count):
        assert expected_state in result
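One way to picture scoping is as a cache key that decides how widely a single instance is shared. The toy below counts how many instances each scope would create for three cases spread over two files. ScopedCache is hypothetical and the file names are made up for illustration; Merit's resolver is not described here and may be organized differently.

```python
class ScopedCache:
    """Toy illustration: the cache key determines how widely an instance is shared."""

    def __init__(self, factory):
        self.factory = factory
        self._instances = {}

    def get(self, scope, module, case_id):
        if scope == "session":
            key = ("session",)                 # one instance for the whole run
        elif scope == "suite":
            key = ("suite", module)            # one instance per file/module
        elif scope == "case":
            key = ("case", module, case_id)    # one instance per case
        else:
            raise ValueError(f"unknown scope: {scope!r}")
        if key not in self._instances:
            self._instances[key] = self.factory()
        return self._instances[key]


cache = ScopedCache(list)  # a plain list stands in for a Metric

cases = [("geo.py", "Boston"), ("geo.py", "Miami"), ("math.py", "sum")]
counts = {}
for scope in ("session", "suite", "case"):
    instances = {id(cache.get(scope, module, case_id)) for module, case_id in cases}
    counts[scope] = len(instances)
```

With these three cases the toy creates one session-scoped instance, two suite-scoped instances, and three case-scoped instances.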
Available scopes:
- "session": One metric instance for the entire merit run (default)
- "suite": One instance per merit file/module
- "case": New instance for each parametrized merit case
Composite Metrics via Dependency Injection
Metrics can depend on other metrics, enabling hierarchical analysis. This pattern is useful for tracking components separately while calculating aggregate statistics.
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    """Overall accuracy from both false positives and negatives"""
    metric = Metric()
    yield metric
    # After child metrics write their data, check overall accuracy
    assert metric.distribution[True] >= 0.8  # 80% correct

@merit.metric
def false_positives(accuracy: Metric):
    """Track false positives and contribute to overall accuracy"""
    metric = Metric()
    yield metric
    # Propagate values to parent metric
    accuracy.add_record(metric.raw_values)
    # Check this specific metric
    assert metric.counter[False] < 5  # Less than 5 false positives

@merit.metric
def false_negatives(accuracy: Metric):
    """Track false negatives and contribute to overall accuracy"""
    metric = Metric()
    yield metric
    # Propagate values to parent metric
    accuracy.add_record(metric.raw_values)
    # Check this specific metric
    assert metric.counter[False] < 3  # Less than 3 false negatives

@merit.parametrize("input", ["good1", "good2", "good3"])
def merit_positive_cases(input: str, false_negatives: Metric):
    result = classifier(input)
    with metrics(false_negatives):
        assert result == True

@merit.parametrize("input", ["bad1", "bad2", "bad3"])
def merit_negative_cases(input: str, false_positives: Metric):
    result = classifier(input)
    with metrics(false_positives):
        assert result == False
This creates a hierarchy:
accuracy (session-level)
├── false_positives (tracks FP, writes to accuracy)
└── false_negatives (tracks FN, writes to accuracy)
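The hierarchy only works if the children finish before the parent is checked. The plain-Python sketch below drives three generators by hand to show that ordering. Records is a hypothetical stand-in for Metric, and its list-flattening add_record is an assumption made so the example mirrors the add_record(metric.raw_values) calls above; the real behavior should be confirmed against Merit itself.

```python
from collections import Counter


class Records:
    """Hypothetical stand-in for Metric."""

    def __init__(self):
        self.raw_values = []

    def add_record(self, value):
        # Assumption: a list is flattened into individual records.
        if isinstance(value, list):
            self.raw_values.extend(value)
        else:
            self.raw_values.append(value)

    @property
    def distribution(self):
        counts = Counter(self.raw_values)
        return {key: n / len(self.raw_values) for key, n in counts.items()}


def accuracy():
    metric = Records()
    yield metric
    assert metric.distribution[True] >= 0.8


def false_positives(accuracy):
    metric = Records()
    yield metric
    accuracy.add_record(metric.raw_values)   # propagate before the parent is checked


def false_negatives(accuracy):
    metric = Records()
    yield metric
    accuracy.add_record(metric.raw_values)


# Set up the parent first, then the children that depend on it.
parent_gen = accuracy()
parent = next(parent_gen)
fp_gen, fn_gen = false_positives(parent), false_negatives(parent)
fp, fn = next(fp_gen), next(fn_gen)

# Cases record into the children only.
for ok in [True, True, True, True, False]:
    fp.add_record(ok)
for ok in [True, True, True, True, True]:
    fn.add_record(ok)

# Finish in reverse order: children first, parent last.
for gen in (fn_gen, fp_gen, parent_gen):
    next(gen, None)   # resume past the yield; the default absorbs StopIteration
```

If the parent were resumed first in this toy, it would have no records at all and its check could not be evaluated.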
Recommendations
1. Use metrics() for automatic assertion tracking
The metrics() context manager automatically records assertion results, eliminating manual bookkeeping.
Don’t do this:
@merit.metric
def accuracy():
    metric = Metric()
    yield metric

def merit_test(accuracy: Metric):
    result = classifier("input")
    is_correct = result == "expected"
    # Manual tracking is error-prone and verbose
    accuracy.add_record(True if is_correct else False)
    assert is_correct
Do this:
@merit.metric
def accuracy():
    metric = Metric()
    yield metric

def merit_test(accuracy: Metric):
    result = classifier("input")
    # Automatic tracking is cleaner and safer
    with metrics(accuracy):
        assert result == "expected"
2. Scope metrics appropriately for your analysis needs
Choose scope based on what you’re measuring. Use case-level metrics for per-merit statistics and session-level metrics for aggregate analysis.
# Case scope: Track each merit case individually
@merit.metric(scope="case")
def case_latency():
    metric = Metric()
    yield metric
    # Assert each case completes quickly
    assert metric.p95 < 1000  # Each case's p95 under 1 second

# Session scope: Track overall system performance
@merit.metric(scope="session")
def overall_latency():
    metric = Metric()
    yield metric
    # Assert on aggregate performance
    assert metric.mean < 500  # Average across all cases
3. Build composite metrics for hierarchical insights
Use dependency injection to create parent-child metric relationships. This enables drilling down from aggregate metrics to specific failure modes.
@merit.metric
def overall_quality():
    """Top-level quality metric"""
    metric = Metric()
    yield metric
    assert metric.mean > 0.9

@merit.metric
def accuracy(overall_quality: Metric):
    """Tracks correctness, contributes to quality"""
    metric = Metric()
    yield metric
    overall_quality.add_record(metric.mean)
    yield metric.mean  # Optionally yield final value for reports

@merit.metric
def fluency(overall_quality: Metric):
    """Tracks readability, contributes to quality"""
    metric = Metric()
    yield metric
    overall_quality.add_record(metric.mean)
    yield metric.mean
4. Yield final values for report generation
After the metric completes, you can yield a final calculated value that will be captured in MetricResult for reports.
@merit.metric
def error_rate():
    metric = Metric()
    yield metric
    # Calculate final error rate
    if metric.len > 0:
        final_error_rate = metric.counter[False] / metric.len
        assert final_error_rate < 0.1
        # Yield final value for reporting
        yield final_error_rate
    else:
        yield 0.0
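The two-yield shape can be tried out in plain Python: the first yield hands out the container, and the second hands back a final value. The sketch below is a toy driver under that reading. drive is a hypothetical helper, a plain list stands in for the metric, and how Merit stores the value in MetricResult is not shown.

```python
def error_rate():
    """Generator with two yields: first the container, then a final value."""
    records = []
    yield records                                   # first yield: shared with cases
    if records:
        yield records.count(False) / len(records)   # second yield: final value
    else:
        yield 0.0


def single_yield():
    """A generator with no final value, for comparison."""
    yield []


def drive(factory, outcomes):
    """Toy runner (hypothetical): collect outcomes, then capture the final value."""
    gen = factory()
    records = next(gen)              # advance to the first yield
    records.extend(outcomes)
    final_value = next(gen, None)    # None when there is no second yield
    gen.close()
    return final_value


rate = drive(error_rate, [True, True, True, False])
empty_rate = drive(error_rate, [])
no_value = drive(single_yield, [True])
```

One consequence of this shape is that any assertion placed before the second yield, as in the error_rate example above, prevents the final value from being produced when it fails.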
5. Use raw_values for custom calculations
Access raw_values to perform calculations beyond the built-in statistical properties.
@merit.metric
def custom_analysis():
    metric = Metric()
    yield metric
    # Custom calculations on raw data
    values = metric.raw_values
    outliers = [v for v in values if abs(v - metric.mean) > 3 * metric.std]
    assert len(outliers) < len(values) * 0.01  # Less than 1% outliers
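The same three-sigma rule can be prototyped on a plain list before wiring it into a metric. The sketch below uses the standard library's sample standard deviation and guards the edge cases; three_sigma_outliers is a hypothetical helper, and whether Merit's std is the sample or population form should be checked against the property list.

```python
from statistics import fmean, stdev


def three_sigma_outliers(values):
    """Return the values lying more than three sample standard deviations from the mean."""
    if len(values) < 2:
        return []                    # stdev needs at least two data points
    mean, std = fmean(values), stdev(values)
    if std == 0:
        return []                    # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) > 3 * std]


latencies = [10.0] * 99 + [1000.0]
outliers = three_sigma_outliers(latencies)
```

Keep sample size in mind: with ten or fewer values no point can lie more than three sample standard deviations from the mean, so this rule only becomes meaningful on larger collections.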