Decorator
@metric
Register a metric resource that yields a Metric instance and optionally a final value.
Signature:
@metric(
    fn: Callable | None = None,
    *,
    scope: Scope | str = Scope.SESSION,
)
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| `fn` | `Callable \| None` | `None` | Generator or async generator function |
| `scope` | `Scope \| str` | `Scope.SESSION` | Lifecycle scope: "case", "suite", or "session" |
Returns: Decorated function registered as a metric resource
Generator Requirements:
- Must yield a Metric instance first (this gets injected)
- Optionally yield a final calculated value (becomes MetricResult.value)
- Can assert on metric properties after all data is collected
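The two-yield protocol above can be pictured with plain Python generators. The sketch below is a hypothetical driver, not merit's actual runner: `FakeMetric` and `drive_metric_generator` are illustrative names, and the sketch assumes the runner advances the generator once to obtain the injected object, once more for the optional final value, and then exhausts it so that trailing assertions run.

```python
import math
from typing import Any, Callable, Generator


class FakeMetric:
    """Hypothetical stand-in for merit's Metric; it only stores values."""

    def __init__(self) -> None:
        self.values: list[float] = []

    @property
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


def drive_metric_generator(
    fn: Callable[[], Generator[Any, None, None]],
    record: Callable[[Any], None],
) -> tuple[float, list[str]]:
    """Run a metric generator the way a runner might (an assumption).

    Returns the final value (NaN if none was yielded) and the messages of
    any assertions that failed after data collection.
    """
    gen = fn()
    injected = next(gen)      # first yield: the object handed to tests
    record(injected)          # stands in for "all merits run"
    final_value = math.nan
    failures: list[str] = []
    try:
        final_value = next(gen)   # optional second yield
        next(gen)                 # run the code after the second yield
    except StopIteration:
        pass                      # generator finished normally
    except AssertionError as exc:
        failures.append(str(exc))
    return final_value, failures


def accuracy():
    metric = FakeMetric()
    yield metric
    yield metric.mean
    assert metric.mean > 0.8, f"Accuracy too low: {metric.mean}"


value, failures = drive_metric_generator(
    accuracy, lambda m: m.values.extend([1, 1, 0, 1])
)
```

With three passes out of four, the final value is captured before the assertion fails, so both the value and the failure message are available to a report.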
Example:
import merit
from merit import Metric, metrics

@merit.metric
def accuracy():
    """Track accuracy across all test cases."""
    metric = Metric()
    yield metric  # This gets injected into merits
    # After all merits run, compute final value
    final_accuracy = metric.mean
    yield final_accuracy  # Captured in MetricResult
    # Assert on aggregate quality
    assert metric.mean > 0.8, f"Accuracy too low: {metric.mean}"

@merit.metric(scope="suite")
def latency_ms():
    """Track latency within a suite."""
    metric = Metric()
    yield metric
    # Check 95th percentile
    assert metric.p95 < 500, f"P95 latency too high: {metric.p95}ms"

# Use in merit functions
def merit_test(accuracy: Metric, latency_ms: Metric):
    result = model.predict("input")
    with metrics(accuracy, latency_ms):
        assert result == "expected"
Classes
Metric
Thread-safe class for recording data and computing statistical properties.
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str \| None` | Metric name (auto-set by decorator) |
| `metadata` | `MetricMetadata` | Collection metadata (scope, contributors, timestamps) |
Methods:
| Method | Parameters | Description |
|---|---|---|
| `add_record` | `value: int \| float \| bool \| list[int \| float \| bool] \| tuple[int \| float \| bool, ...]` | Record one or more data points (numeric/bool only) |
Statistical Properties:
All properties are computed lazily and cached. Values are recorded either manually, by calling add_record(...), or automatically, by wrapping assertions in with metrics(...), which records each assertion's pass/fail outcome.
| Property | Type | Description |
|---|---|---|
| `raw_values` | `list[int \| float \| bool]` | All recorded values |
| `len` | `int` | Number of values |
| `sum` | `float` | Sum of all values |
| `min` | `float` | Minimum value |
| `max` | `float` | Maximum value |
| `mean` | `float` | Arithmetic mean |
| `median` | `float` | Median (50th percentile) |
| `std` | `float` | Sample standard deviation |
| `variance` | `float` | Sample variance |
| `pstd` | `float` | Population standard deviation |
| `pvariance` | `float` | Population variance |
| `p25` | `float` | 25th percentile |
| `p50` | `float` | 50th percentile (median) |
| `p75` | `float` | 75th percentile |
| `p90` | `float` | 90th percentile |
| `p95` | `float` | 95th percentile |
| `p99` | `float` | 99th percentile |
| `percentiles` | `list[float]` | All 99 percentiles (p1 to p99) |
| `ci_90` | `tuple[float, float]` | 90% confidence interval (lower, upper) |
| `ci_95` | `tuple[float, float]` | 95% confidence interval (lower, upper) |
| `ci_99` | `tuple[float, float]` | 99% confidence interval (lower, upper) |
| `counter` | `Counter[int \| float \| bool]` | Frequency count of each unique value |
| `distribution` | `dict[int \| float \| bool, float]` | Share of each unique value (0-1) |
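For orientation, Python's standard `statistics` module and `collections.Counter` compute quantities with the same names as the properties above. This sketch assumes the `Metric` properties follow the usual textbook definitions. The interpolation method merit uses for percentiles is not stated on this page, so exact percentile values may differ from the ones `statistics.quantiles` produces.

```python
import statistics
from collections import Counter

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)               # arithmetic mean
median = statistics.median(values)           # average of the two middle values
sample_std = statistics.stdev(values)        # divides by n - 1
population_std = statistics.pstdev(values)   # divides by n

# quantiles(n=100) returns the 99 cut points p1..p99
percentiles = statistics.quantiles(values, n=100)
p95 = percentiles[94]

# Frequency of each unique value, and its share of all values (0-1)
counter = Counter(values)
distribution = {value: count / len(values) for value, count in counter.items()}
```

The sample statistics (`std`, `variance`) are always at least as large as their population counterparts (`pstd`, `pvariance`) because they divide by n - 1 rather than n.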
Example:
import merit
from merit import Metric, metric, metrics

@metric
def response_quality():
    metric = Metric()
    yield metric
    # Access statistical properties
    print(f"Mean quality: {metric.mean}")
    print(f"Std deviation: {metric.std}")
    print(f"95th percentile: {metric.p95}")
    print(f"95% CI: {metric.ci_95}")
    # Distribution analysis
    print(f"Pass rate: {metric.distribution.get(True, 0)}")
    print(f"Fail count: {metric.counter[False]}")
    # Quality gate
    assert metric.mean > 0.8

@merit.parametrize("input,expected", test_cases)
def merit_quality_check(input, expected, response_quality: Metric):
    result = model.predict(input)
    is_correct = result == expected
    # Record to metric
    with metrics(response_quality):
        assert is_correct
MetricMetadata
Metadata tracking metric lifecycle and contributors.
Attributes:
| Name | Type | Description |
|---|---|---|
| `last_item_recorded_at` | `datetime \| None` | Timestamp of most recent value |
| `first_item_recorded_at` | `datetime \| None` | Timestamp of first value |
| `scope` | `Scope` | Lifecycle scope (SESSION, SUITE, CASE) |
| `collected_from_merits` | `set[str]` | Names of merit functions that contributed |
| `collected_from_resources` | `set[str]` | Names of resources that contributed |
| `collected_from_cases` | `set[str]` | Case IDs that contributed |
Example:
from merit import Metric, metric

@metric
def my_metric():
    metric = Metric()
    yield metric
    # Inspect metadata
    meta = metric.metadata
    print(f"Scope: {meta.scope}")
    print(f"Contributing merits: {meta.collected_from_merits}")
    print(f"Case count: {len(meta.collected_from_cases)}")
    print(f"First recorded: {meta.first_item_recorded_at}")
    print(f"Last recorded: {meta.last_item_recorded_at}")
MetricResult
Result captured when a metric resource completes.
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Metric name |
| `metadata` | `MetricMetadata` | Snapshot of metadata at completion |
| `assertion_results` | `list[AssertionResult]` | Assertions evaluated in metric teardown |
| `value` | `CalculatedValue` | Final yielded value (or NaN if none) |
Note: MetricResult instances are automatically collected and included in merit run reports.
Example:
from merit import Metric, metric

@metric
def accuracy():
    metric = Metric()
    yield metric
    final_accuracy = metric.mean
    yield final_accuracy  # Becomes MetricResult.value
    assert metric.mean > 0.8  # Captured in assertion_results

# MetricResult is created automatically after the metric completes
# and includes:
# - name: "accuracy"
# - value: final_accuracy (the second yielded value)
# - assertion_results: [assertion about metric.mean]
# - metadata: snapshot of metric collection info
Context Manager
metrics()
Attach metrics to assertions for automatic data collection.
Signature:
@contextmanager
def metrics(*metrics: Metric) -> Iterator[None]
Parameters:
| Name | Type | Description |
|---|---|---|
| `metrics` | `Metric` | Metrics to record assertion outcomes into |
Returns: Context manager that captures assertion results
Behavior:
- When an assertion passes inside the context, records True to all metrics
- When an assertion fails, records False to all metrics
- Multiple assertions in one context each contribute a data point
- Works with both standard assertions and predicate assertions
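Because each assertion contributes one boolean, the mean of the recorded values is the pass rate. The sketch below mimics that bookkeeping with a plain list. `record_outcome` is a hypothetical helper for illustration only and is not part of merit.

```python
from collections import Counter

recorded: list[bool] = []


def record_outcome(condition: bool) -> None:
    """Hypothetical stand-in for what metrics(...) records per assertion."""
    recorded.append(bool(condition))


answers = ["Paris", "", "Berlin", "Rome"]
for answer in answers:
    record_outcome(len(answer) > 0)   # one data point per assertion

# True counts as 1 and False as 0, so the mean is the pass rate
pass_rate = sum(recorded) / len(recorded)
fail_count = Counter(recorded)[False]
```

This is why quality gates such as `metric.mean > 0.8` read naturally as "more than 80% of assertions passed".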
Example:
import merit
from merit import Metric, metric, metrics

@metric
def accuracy():
    metric = Metric()
    yield metric
    # After all tests: mean should be > 80%
    assert metric.mean > 0.8

@metric
def hallucination_rate():
    metric = Metric()
    yield metric
    # After all tests: less than 5% should be False
    false_rate = metric.counter[False] / metric.len if metric.len > 0 else 0
    assert false_rate < 0.05

@merit.parametrize("question,context", test_cases)
async def merit_bot_quality(question, context, bot, accuracy: Metric, hallucination_rate: Metric):
    answer = bot.ask(question, context=context)
    # Multiple metrics in one context
    with metrics(accuracy, hallucination_rate):
        # Each assertion records True/False to all metrics
        assert len(answer) > 0
        assert not await has_unsupported_facts(answer, context)
Usage Patterns
Basic Metric Collection
import time

from merit import Metric, metric

@metric
def latency_ms():
    metric = Metric()
    yield metric
    # Quality gates
    assert metric.p95 < 500
    assert metric.mean < 200

def merit_performance(api_client, latency_ms: Metric):
    # perf_counter() is monotonic, so it suits duration measurement better than time.time()
    start = time.perf_counter()
    result = api_client.call()
    duration_ms = (time.perf_counter() - start) * 1000
    # Manual recording
    latency_ms.add_record(duration_ms)
    assert result.ok
Assertion-Based Collection
import merit
from merit import Metric, metric, metrics

@metric
def accuracy():
    metric = Metric()
    yield metric
    assert metric.mean > 0.8

@merit.parametrize("input,expected", test_cases)
def merit_classifier(input, expected, classifier, accuracy: Metric):
    result = classifier(input)
    # Automatic recording based on assertion outcome
    with metrics(accuracy):
        assert result == expected  # Records True/False
Multiple Metrics
from merit import Metric, metric, metrics

@metric
def precision():
    metric = Metric()
    yield metric
    assert metric.mean > 0.85

@metric
def recall():
    metric = Metric()
    yield metric
    assert metric.mean > 0.80

def merit_evaluation(model, precision: Metric, recall: Metric):
    predictions = model.predict(test_data)
    for pred, actual in zip(predictions, ground_truth):
        # Track precision
        if pred == "positive":
            with metrics(precision):
                assert pred == actual
        # Track recall
        if actual == "positive":
            with metrics(recall):
                assert pred == actual
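To see why the two conditions above produce precision and recall, the following library-free sketch applies the same filtering to a small, made-up set of labels. Plain lists stand in for the two metrics, and the data is illustrative only.

```python
# Illustrative data; not taken from any real evaluation
predictions = ["positive", "positive", "negative", "positive", "positive"]
ground_truth = ["positive", "negative", "positive", "positive", "negative"]

precision_outcomes: list[bool] = []
recall_outcomes: list[bool] = []

for pred, actual in zip(predictions, ground_truth):
    # Precision looks only at items the model predicted as positive
    if pred == "positive":
        precision_outcomes.append(pred == actual)
    # Recall looks only at items that are actually positive
    if actual == "positive":
        recall_outcomes.append(pred == actual)

# The mean of each boolean list is the corresponding rate
precision = sum(precision_outcomes) / len(precision_outcomes)
recall = sum(recall_outcomes) / len(recall_outcomes)
```

Here two of the four positive predictions are correct (precision 0.5), and two of the three actual positives are found (recall about 0.67).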
Distribution Analysis
import merit
from merit import Metric, metric

SENTIMENT_TO_SCORE = {"negative": -1, "neutral": 0, "positive": 1}

@metric
def sentiment_score_distribution():
    metric = Metric()
    yield metric
    # Analyze distribution
    dist = metric.distribution
    print(f"Positive: {dist.get(1, 0) * 100:.1f}%")
    print(f"Neutral: {dist.get(0, 0) * 100:.1f}%")
    print(f"Negative: {dist.get(-1, 0) * 100:.1f}%")
    # Ensure balanced distribution
    for score in (-1, 0, 1):
        rate = dist.get(score, 0)
        assert 0.2 < rate < 0.5, f"score {score} rate out of range: {rate}"

@merit.parametrize("text", test_texts)
def merit_sentiment(text, classifier, sentiment_score_distribution: Metric):
    label = classifier.classify(text)  # e.g. "positive" | "neutral" | "negative"
    score = SENTIMENT_TO_SCORE[label]
    # Record numeric values
    sentiment_score_distribution.add_record(score)
Percentile Analysis
import time

import merit
from merit import Metric, metric

@metric
def response_time():
    metric = Metric()
    yield metric
    # SLA checks
    assert metric.p50 < 100, f"Median too high: {metric.p50}ms"
    assert metric.p95 < 500, f"P95 too high: {metric.p95}ms"
    assert metric.p99 < 1000, f"P99 too high: {metric.p99}ms"
    # Final value for reporting
    yield metric.p95

@merit.repeat(100)
def merit_latency_test(api, response_time: Metric):
    # perf_counter() is monotonic, so it suits duration measurement better than time.time()
    start = time.perf_counter()
    api.call()
    duration_ms = (time.perf_counter() - start) * 1000
    response_time.add_record(duration_ms)
Confidence Intervals
import merit
from merit import Metric, metric, metrics

@metric
def success_rate():
    metric = Metric()
    yield metric
    # Check confidence interval
    lower, upper = metric.ci_95
    print(f"Success rate: {metric.mean:.2%}")
    print(f"95% CI: [{lower:.2%}, {upper:.2%}]")
    # Lower bound must exceed threshold
    assert lower > 0.75, f"95% CI lower bound too low: {lower}"

@merit.repeat(50, min_passes=40)
def merit_reliability(llm, success_rate: Metric):
    response = llm.generate("Hello")
    with metrics(success_rate):
        assert "hello" in response.lower()
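This page does not state how `ci_95` is calculated. As a rough guide to what such an interval means, the sketch below uses the normal approximation for the mean (mean ± z × standard error). This method is an assumption, `normal_ci` is a hypothetical helper, and merit's actual interval may be computed differently.

```python
import math
import statistics


def normal_ci(values, confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for the mean (assumed method)."""
    data = [float(v) for v in values]   # booleans become 1.0 / 0.0
    mean = statistics.fmean(data)
    std_err = statistics.stdev(data) / math.sqrt(len(data))
    # Two-sided critical value, roughly 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


outcomes = [True] * 45 + [False] * 5   # 45 passes out of 50 runs
lower, upper = normal_ci(outcomes)
```

With 45 passes in 50 runs, the observed rate is 90% and the lower bound lands a little above 81%, so a gate such as `lower > 0.75` would pass under this approximation. Gating on the lower bound rather than the mean makes the check stricter when the sample is small.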