Use this file to discover all available pages before exploring further.
Merit helps developers test AI projects for AI-specific bugs like hallucinations, missing context, and incorrect decisions. But shouldn’t evals handle that? Or pytest?In this section, we explain the differences and help you pick the best approach for your use case.
This works when you’re training a model. But when you’re building a product, you have more control over your system.By incrementally improving the code and system design, you can move the product quality to be enough for production.A 95% accuracy score tells you nothing about what code contributed to which failures, and which failures should be prioritized.
Automated tests come from software engineering. The workflow is:
Write an assertion: assert result == expected
Run the test
If it passes once, ship it
This works when code is deterministic. Call add(2, 2) and you always get 4.AI systems are stochastic. The same input can produce different outputs:
# This might pass 9 out of 10 timesdef test_greeting(): response = chatbot("Hello!") assert "hello" in response.lower() # Sometimes it says "Hi there!" instead
Even for something that looks deterministic (like arithmetic), LLM-based systems can behave like every phrasing is a new case:
def test_llm_calculator(llm_as_calculator): #Both assertions might pass or fail regarldess of each other assert llm_as_calculator("3 + 2") == "5" assert llm_as_calculator("2 + 3") == "5"
This is why a single passing test is weak evidence for AI behavior: you need repeated runs and broader case coverage.
Merit’s AI predicates let you assert on complex properties:
from merit.predicates import has_unsupported_facts, follows_policyasync def merit_customer_support(support_bot): knowledge_base = "Returns accepted within 30 days with receipt." policy = "Always offer to help with other questions" response = support_bot.answer("What's the return policy?") # Assert the response doesn't hallucinate facts assert not await has_unsupported_facts(response, knowledge_base) # Assert the response follows company guidelines assert await follows_policy(response, policy)
Merit’s @merit.repeat runs the same merit multiple times:
@merit.repeat(10, min_passes=8) # 8 out of 10 must passasync def merit_consistent_greeting(chatbot): response = chatbot("Hello!") assert await follows_policy(response, "Greeting is friendly and professional")
# test_support_bot.pyimport pytestdef test_returns_policy(): bot = SupportBot("Returns accepted within 30 days.") response = bot.answer("What's your return policy?") # Brittle: fails if wording changes assert "30 days" in responsedef test_shipping_info(): bot = SupportBot("Free shipping on orders over $50.") response = bot.answer("Do you offer free shipping?") # Brittle: what if it says "$50" vs "fifty dollars"? assert "50" in responsedef test_no_competitor_mentions(): bot = SupportBot("We offer 24/7 support.") response = bot.answer("Are you better than CompetitorX?") # How do you even check this reliably? assert "CompetitorX" not in response # Too simple # What about "Competitor X" or "that other company"?