What’s wrong with evals?
Evals come from data science. The workflow is:

- Define a metric (accuracy, F1, BLEU, etc.)
- Run your model on a benchmark dataset
- Tweak parameters until the metric improves
- Repeat
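The loop above can be sketched in plain Python. Everything here is illustrative: a toy stand-in model, a tiny hand-written benchmark, and accuracy as the metric, not any particular eval framework.

```python
# Toy eval loop: score a stand-in "model" on a small benchmark with accuracy.
# `model` and `benchmark` are illustrative, not a real framework or dataset.

def model(prompt: str) -> str:
    # Stand-in model: naive keyword lookup.
    return "positive" if "good" in prompt else "negative"

benchmark = [
    ("this movie was good", "positive"),
    ("this movie was bad", "negative"),
    ("good acting", "positive"),
]

def accuracy(model, dataset) -> float:
    # Fraction of examples where the model's output matches the label.
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)

score = accuracy(model, benchmark)
print(f"accuracy = {score:.2f}")  # tweak the model, rerun, repeat
```

The output of the loop is a single aggregate number, which is exactly what makes it useful for tuning and useless as a pass/fail gate.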
What’s wrong with pytest?
Automated tests come from software engineering. The workflow is:

- Write an assertion: `assert result == expected`
- Run the test
- If it passes once, ship it
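In pytest style, the whole workflow is one function with one assertion (the `add` function and test name are illustrative):

```python
# A deterministic unit test in pytest style: one assertion, binary pass/fail.
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    # Passes on every run, because add() is deterministic.
    assert add(2, 2) == 4

test_add()  # pytest would collect and run this automatically
```

This model is sound precisely because running the test once tells you everything: the same inputs always produce the same outputs.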
Traditional software is deterministic: call `add(2, 2)` and you always get 4.
AI systems are stochastic. The same input can produce different outputs:
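A minimal way to see the problem, using a random stand-in where a real model call would go (`flaky_model` is hypothetical):

```python
import random

# Stand-in for a nondeterministic AI call: same input, varying output.
def flaky_model(prompt: str) -> str:
    return random.choice(["4", "four", "The answer is 4."])

# Ask the same question 20 times and collect the distinct answers.
outputs = {flaky_model("What is 2 + 2?") for _ in range(20)}
print(outputs)  # usually more than one distinct output
# A strict `assert output == "4"` would pass on some runs and fail on others.
```

All three outputs are correct answers, but an exact-match assertion only accepts one of them, so the test result depends on the run, not on the model's quality.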
Comparison
| Capability | Evals | Tests | Merit |
|---|---|---|---|
| Native Python syntax | No | Yes | Yes |
| Explicit test logic | No | Yes | Yes |
| Cases in datasets | Yes | No | Yes |
| Metric aggregations | Yes | No | Yes |
| Determinism checks | Partial | Yes | Yes |
| LLM-as-a-Judge | Partial | No | Yes |
| CI/CD integration | Partial | Yes | Yes |
| Historical data | Partial | No | Yes |
AI Predicates: Assert meaning, not strings
Merit’s AI predicates let you assert on complex properties.

Repeat: Measure consistency
Merit’s `@merit.repeat` runs the same merit multiple times:
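The decorator itself is Merit-specific, but the idea behind it can be sketched in plain Python: run a meaning-level check many times against a stochastic call and report a pass rate instead of a single boolean. Everything below (`flaky_model`, `mentions_four`, the 50-run count) is an illustrative stand-in, not the Merit API.

```python
import random

# Stand-in for a stochastic model call (illustrative, not the Merit API).
def flaky_model(prompt: str) -> str:
    return random.choice(["4", "four", "The answer is 4.", "5"])

def mentions_four(text: str) -> bool:
    # A meaning-level predicate: accept any phrasing of "four".
    return "4" in text or "four" in text.lower()

def pass_rate(check, prompt: str, runs: int = 50) -> float:
    # Repeat the same check many times; return the fraction that passed.
    passes = sum(1 for _ in range(runs) if check(flaky_model(prompt)))
    return passes / runs

rate = pass_rate(mentions_four, "What is 2 + 2?")
print(f"pass rate over 50 runs: {rate:.0%}")
```

The pass rate is the interesting number: a single run can mislead in either direction, while a rate over many runs measures how consistently the system behaves.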