Making AI Agent Outputs Deterministic and Testable
Quick Answer
Deterministic AI systems produce the same correct output for the same input. Build them by: (1) Creating test fixtures (representative examples), (2) Verifying outputs against specific criteria (not opinions), (3) Adding tests for edge cases, (4) Using SHA-256 hashing to detect output drift, (5) Aiming for 100% test coverage on core logic. The OSHA tool used this approach: 100 test incidents covering normal and edge cases, zero regressions in production.
The problem with AI systems is that they feel magical until they break. An AI model that works great on Monday might produce different outputs on Thursday without the code changing. Or it works perfectly on your examples but fails on real user data.
This isn't magic. It's probabilistic systems meeting the real world. The solution isn't to accept it—it's to make AI outputs testable, verifiable, and (as much as possible) deterministic.
This guide covers patterns I used to build the OSHA tool, which produces consistent, testable outputs on thousands of edge cases. The patterns work for any AI-backed system.
The Determinism Problem
Deterministic system: Same input → Always same correct output
Probabilistic system: Same input → Maybe same output, maybe different
AI models are inherently probabilistic. They're trained on patterns and often generate outputs with some randomness (temperature settings, sampling approaches).
But outputs don't have to be probabilistic from the user's perspective. You can build deterministic systems on top of probabilistic models.
How?
Constraint 1: Verification layers
Every output is verified against specific criteria. If it fails, regenerate or reject it.
Constraint 2: Test coverage
The system is tested on hundreds of examples. If it fails on any, it's not done.
Constraint 3: Idempotent operations
The same input always produces the same output (or equivalent outputs that pass verification).
Constraint 4: Version control
You track prompts, test cases, and expected outputs in git. You can reproduce any result.
How the OSHA Tool Achieved This
Test Corpus: 100 incident descriptions covering:
- Normal recordable injuries (sprained ankle requiring physical therapy)
- Normal non-recordable incidents (first aid only)
- Edge cases (unclear details, conflicting information)
- Ambiguous situations (was this work-related? unclear)
Verification Criteria: Each test incident was verified against:
- Official OSHA guidance (osha.gov)
- 29 CFR 1904 recordability rules
- Expert judgment from OSHA specialists
The verification created ground truth: For incident X, the correct classification is Y because of reason Z.
Automated Testing: For each test incident:
- Pass through the AI classification system
- Compare output against ground truth
- If correct, pass. If wrong, log failure.
- Iterate on the system (prompts, logic, validation) until all 100 tests pass.
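The test loop above can be sketched in a few lines of Python. This is a minimal illustration, not the OSHA tool's actual code: the `classify` function here is a hypothetical stub standing in for the real AI-backed classifier.

```python
# Hypothetical stub; the real system would call the AI classification pipeline.
def classify(description: str) -> str:
    return "recordable" if "physical therapy" in description else "non-recordable"

# Each fixture pairs an input with its verified ground truth.
fixtures = [
    {"input": "Twisted ankle, required physical therapy", "expected": "recordable"},
    {"input": "Small cut, first aid only", "expected": "non-recordable"},
]

# Run every fixture through the system and log any mismatch.
failures = []
for case in fixtures:
    result = classify(case["input"])
    if result != case["expected"]:
        failures.append({"case": case, "got": result})

print(f"{len(fixtures) - len(failures)}/{len(fixtures)} passed")
```

The point is the shape of the loop: every fixture runs, every mismatch is logged, and the system isn't done until the failure list is empty.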
Regression Testing: After every change, re-run all 100 tests. Zero regressions allowed. If a change breaks anything, revert immediately.
Output Hashing: Calculate SHA-256 hash of each output. Track it in version control. Any change to output produces different hash. This catches subtle changes (wrong field order, changed wording, etc.) automatically.
This approach achieved:
- 100% test coverage on real cases
- Zero production regressions
- Deterministic outputs (same input always produces same correct output)
- Auditability (you can see why each decision was made)
Building a Test Corpus
The hardest part of this approach is creating good test cases. Here's how:
Step 1: Collect real examples
Get examples from your problem domain. For OSHA, these were real incident reports from small businesses. Not invented examples: actual data.
Step 2: Establish ground truth
For each example, determine the correct answer. How?
- Research the rule (OSHA recordability)
- Consult experts (OSHA specialists, compliance professionals)
- Document the reasoning
This takes time. It's also the most valuable time you'll spend because it forces you to understand the problem deeply.
Step 3: Add edge cases
For each ground truth answer, create variants that test edge cases:
Base case: Employee twisted ankle, got X-ray, required physical therapy → Recordable
Edge cases:
- What if they don't get an X-ray? → Still recordable (medical treatment not limited to imaging)
- What if they decline physical therapy? → Still recordable (offer of treatment counts)
- What if it happened at lunch? → Depends (must be work-related, eating lunch might not be)
Step 4: Create ambiguous cases
Real data is messy. Add cases where the answer is genuinely unclear:
"Employee hurt ankle but didn't mention it until days later. Unclear if work-related. Unclear if medical treatment was from the incident or pre-existing condition."
These ambiguous cases are where your system shows its real quality. Can it acknowledge uncertainty? Can it ask for clarification? Can it provide the most conservative (safest) classification?
Step 5: Version control it
Save your test cases in version control. For each case:
- Input (the incident description)
- Expected output (classification)
- Reasoning (why this is the correct classification)
- Source (where did this come from? real data? official guidance?)
This creates a reference library. As your system evolves, you can re-test against it.
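A fixture carrying those four fields might look like this. The field names and content here are illustrative, not the OSHA tool's actual schema; in practice each fixture would live as a JSON file in git.

```python
import json

# One test case in the corpus; all four fields are tracked in version control.
fixture = {
    "input": "Employee twisted ankle, got X-ray, required physical therapy",
    "expected_output": "recordable",
    "reasoning": "Medical treatment beyond first aid (physical therapy) per 29 CFR 1904.7",
    "source": "Real incident report, verified against osha.gov guidance",
}

# Serialize with sorted keys so diffs in version control stay stable.
serialized = json.dumps(fixture, indent=2, sort_keys=True)
print(serialized)
```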
Verification Layers
Not all outputs are equal. Some need more verification than others.
Layer 1: Structural validation
Does the output have the right format?
- Required fields present?
- Data types correct? (a number where a number is expected, not text)
- Field lengths valid? (not exceeding limits)
This catches obvious mistakes (blank required fields, format errors).
Layer 2: Logical validation
Does the output make logical sense?
- If injured on Tuesday, recovery date can't be Monday
- If classified as recordable, must have a reason code
- If requiring days away, those days must be > 0
This catches semantic errors (self-contradictory outputs).
Layer 3: Domain validation
Does it follow domain rules?
- Recordability classifications must comply with 29 CFR 1904
- Forms must match OSHA template structure
- Data must align with official records
This catches domain-specific errors (outputs that violate the rules).
Layer 4: User verification
Human review of high-stakes decisions. (This is human-in-the-loop, not pure automation.)
Each layer is a gate. Output must pass all gates to be released.
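The gate idea can be sketched as a chain of check functions. This is a simplified illustration with made-up field names, assuming the first two layers only; domain validation would follow the same pattern with regulation-specific rules.

```python
def structural_check(output: dict) -> bool:
    # Layer 1: required fields present with the right types.
    return (
        isinstance(output.get("classification"), str)
        and isinstance(output.get("reason"), str)
        and output["classification"] in {"recordable", "non-recordable"}
    )

def logical_check(output: dict) -> bool:
    # Layer 2: a recordable classification must carry a reason code.
    if output["classification"] == "recordable":
        return bool(output.get("reason_code"))
    return True

def verify(output: dict) -> bool:
    # Output must pass every gate to be released.
    return all(check(output) for check in (structural_check, logical_check))

good = {"classification": "recordable", "reason": "PT required", "reason_code": "1904.7(b)(5)"}
bad = {"classification": "recordable", "reason": "PT required"}  # missing reason code

assert verify(good)
assert not verify(bad)
```

Because each layer is a separate function, a failure immediately tells you which gate rejected the output.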
Testing Strategy
Test-Driven Development (from the start):
- Write test cases before building the system
- Try to solve the test cases (they'll all fail initially)
- Iterate until tests pass
- Refactor for clarity
- Add more edge cases, repeat
Regression Testing (every change): After any change (prompt update, logic change, new feature), re-run all tests. If anything breaks, investigate immediately.
Coverage Targets:
- Happy path: 100% coverage
- Edge cases: 100% coverage
- Error cases: 100% coverage
This sounds like overkill. For production systems, it's the minimum. The OSHA tool had 100+ tests for a system that users depend on for legal compliance. Worth it.
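One way to keep those three coverage targets honest is to tag each corpus entry with its category and assert that every category is represented. A minimal sketch, again with a hypothetical stub classifier:

```python
from collections import Counter

# Hypothetical corpus: (category, input, expected) triples.
CORPUS = [
    ("happy", "Sprained ankle, physical therapy required", "recordable"),
    ("edge", "Employee declined offered physical therapy", "recordable"),
    ("error", "", "reject"),  # empty input must be rejected, never guessed
]

def classify(text: str) -> str:
    # Stub standing in for the real AI-backed classifier.
    if not text:
        return "reject"
    return "recordable" if "physical therapy" in text.lower() else "non-recordable"

coverage = Counter(category for category, _, _ in CORPUS)
failures = [(text, expected) for _, text, expected in CORPUS if classify(text) != expected]

# Every category represented, every case passing.
assert set(coverage) == {"happy", "edge", "error"}
assert not failures
```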
Handling Probabilistic Nature
AI models sometimes produce different outputs for the same input (depending on sampling settings, model version, etc.). How do you reconcile this with "deterministic outputs"?
Approach 1: Remove randomness
Set temperature to 0, use greedy decoding (always take the single most likely token), and lock model versions. Trade: Less creative outputs, more predictable.
Approach 2: Verify all variants
Allow some randomness but verify that all possible outputs are correct. If the AI could output "recordable" or "non-recordable," both must be defensible for that incident.
Trade: More complex verification, but allows some flexibility.
Approach 3: One-way determinism
The AI is allowed to vary slightly (different wording, different explanation) as long as the core decision (recordable: yes/no) is always the same.
Trade: Balances flexibility with consistency.
For OSHA, we used Approach 1: temperature 0 (fully deterministic), a locked model version, and verification that outputs never change.
For less critical systems, Approach 3 is often sufficient.
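Approach 3's invariant, that the core decision never varies even if the wording does, is easy to check mechanically. A sketch, with a stub classifier standing in for repeated model calls:

```python
def classify(text: str) -> dict:
    # Hypothetical stub; a real system would call the model, where sampling
    # might vary the explanation wording between runs.
    return {
        "classification": "recordable",
        "explanation": "Required medical treatment beyond first aid",
    }

# Run the same input several times and collect the distinct core decisions.
decisions = {classify("Twisted ankle, physical therapy")["classification"] for _ in range(5)}

# One-way determinism: exactly one decision, no matter how wording varies.
assert decisions == {"recordable"}
```

In a real test suite you would run this against the live model and fail the build if the decision set ever contains more than one value.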
Output Hashing for Regression Detection
Simple technique that catches subtle changes:
```
Output: {"classification": "recordable", "reason": "Required medical treatment"}
SHA-256: 7f3a9e8d2c1b5f4a...

Output (changed): {"classification": "recordable", "reason": "Involved medical treatment"}
SHA-256: 2a8f1c3e5b9d4f6a... (completely different)
```
Track hashes in git:
- Version X: Output hash = 7f3a9e8d2c1b5f4a
- Version Y: Output hash = 7f3a9e8d2c1b5f4a (same, no regression)
- Version Z: Output hash = 2a8f1c3e5b9d4f6a (different! investigate)
This catches changes you might miss in code review (subtle wording changes, field reordering, etc.).
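In Python, this takes a few lines of stdlib. One detail worth noting: hash a canonical serialization (sorted keys, fixed separators), so the hash reflects content rather than incidental dict ordering.

```python
import hashlib
import json

def output_hash(output: dict) -> str:
    # Canonical serialization so the hash depends only on content,
    # not on the order fields happen to be emitted in.
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"classification": "recordable", "reason": "Required medical treatment"}
b = {"classification": "recordable", "reason": "Involved medical treatment"}

# Reordering fields does not change the hash; a one-word change does.
assert output_hash(a) == output_hash(dict(reversed(list(a.items()))))
assert output_hash(a) != output_hash(b)
```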
Test Fixtures: Real Data, Not Invented Examples
The biggest mistake in building AI systems: Testing on invented examples that don't match real data.
Real data is messier:
- Misspellings and jargon
- Ambiguous phrasing
- Missing context
- Contradictory information
Invented examples are often:
- Perfectly clear
- Well-formatted
- Unambiguous
- Unrealistic
Test on real data. Not anonymized, if possible. Exactly as messy as users will input it.
The OSHA tool's 100 test incidents were actual incident reports from real small businesses. Not cleaned up, not perfectly formatted. This is why it works on real data.
When you're starting, collect 20-30 real examples minimum before you build. Build your system to handle real messiness.
Documentation and Auditability
Production systems need to be auditable. "Why did you classify this as recordable?"
Build explainability in from the start:
```json
{
  "classification": "recordable",
  "explanation": "Required medical treatment beyond first aid (physical therapy)",
  "criteria_met": [
    "Work-related: Yes (occurred during work shift)",
    "New case: Yes (first injury to this body part)",
    "Required medical treatment: Yes (prescribed physical therapy)"
  ],
  "regulation_citations": ["29 CFR 1904.7(b)(5)(i)"],
  "confidence": 0.98,
  "tested_against": "3,450 incident records, 100% accuracy on test set"
}
```
This level of documentation:
- Helps users trust the system
- Helps compliance audits
- Helps you debug when something goes wrong
- Helps regulators understand your logic
It's not just good practice. For regulated systems, it's often required.
Building Habits
Making AI outputs deterministic and testable isn't one-time work. It's an ongoing practice:
Before every change:
- What tests verify this change?
- Do existing tests still pass?
- Are there new edge cases?
During development:
- Test constantly
- Add tests for every bug you find
- Aim for 100% coverage
After every change:
- Regression test
- Hash check for unexpected output changes
- Deploy only if everything passes
These habits compound. A system with tests from day one is more reliable than a system where tests are added later.
The Cost-Benefit Trade-off
This approach takes longer than "ship fast and iterate." Here's the trade-off:
No testing approach:
- Fast to initial version
- Slow and painful in production (bugs, regressions, angry users)
- Eventually costs more in support and fixes
Deterministic testing approach:
- Slower initial development
- Fast, predictable production
- Fewer surprises, easier support
- Costs less long-term
For the OSHA tool: 3 months of careful development with comprehensive testing. Then 12+ months of production use with zero critical bugs.
It's a different trade-off calculus, but worth it for systems people depend on.
Patterns to Adopt
- Test-first: Write tests before code
- 100% coverage on critical paths: Non-negotiable for production
- Real data, not invented: Use actual examples
- Regression prevention: Re-test on all changes
- Output hashing: Catch subtle changes
- Explainability: Document why decisions were made
- Verification layers: Structure validation, logic validation, domain validation
- Version control: Track prompts, test cases, expected outputs
These patterns don't require a technical background. They require discipline, attention to detail, and refusing to ship uncertain systems.
That's exactly the methodology that builds production-quality AI systems.
Frequently Asked Questions
What's the difference between deterministic and probabilistic AI systems?
Deterministic system: Same input always produces the same correct output. Probabilistic system: Same input might produce different outputs. AI models are inherently probabilistic, but outputs don't have to be probabilistic from the user's perspective. You can build deterministic systems on top of probabilistic models through verification layers, test coverage, idempotent operations, and version control.
How do you build a test corpus for AI systems?
Step 1: Collect real examples from your problem domain - actual data, not invented examples. Step 2: Establish ground truth for each example by researching rules, consulting experts, documenting reasoning. Step 3: Add edge cases (variants that test boundaries). Step 4: Create ambiguous cases where the answer is genuinely unclear. Step 5: Version control everything - input, expected output, reasoning, and source.
What are the verification layers for AI outputs?
Layer 1 - Structural validation: Are required fields present? Data types correct? Field lengths valid? Layer 2 - Logical validation: Does the output make logical sense (no self-contradictions)? Layer 3 - Domain validation: Does it follow domain rules (regulations, templates, official standards)? Layer 4 - User verification: Human review of high-stakes decisions. Output must pass all gates to be released.
How do you handle the probabilistic nature of AI models?
Three approaches: (1) Remove randomness - set temperature to 0, use deterministic sampling, lock model versions. (2) Verify all variants - allow some randomness but verify all possible outputs are correct. (3) One-way determinism - allow slight variation (different wording) as long as the core decision is always the same. For compliance systems, approach 1 (fully deterministic) is recommended.
What is output hashing for regression detection?
Calculate SHA-256 hash of each output and track it in version control. Any change to output produces a different hash, catching subtle changes automatically (wrong field order, changed wording). Compare hashes across versions: same hash = no regression, different hash = investigate the change. This catches modifications you might miss in code review.