Making AI Agent Outputs Deterministic and Testable

2026-01-20

Quick Answer

Deterministic AI systems produce the same correct output for the same input. Build them by: (1) Creating test fixtures (representative examples), (2) Verifying outputs against specific criteria (not opinions), (3) Adding tests for edge cases, (4) Using SHA-256 hashing to detect output drift, (5) Aiming for 100% test coverage on core logic. The OSHA tool used this approach: 100 test incidents covering normal and edge cases, zero regressions in production.


The problem with AI systems is they feel magical until they break. An AI model that works great on Monday might produce different outputs on Thursday without code changing. Or it works perfectly on your examples but fails on real user data.

This isn't magic. It's probabilistic systems meeting the real world. The solution isn't to accept it—it's to make AI outputs testable, verifiable, and (as much as possible) deterministic.

This guide covers patterns I used to build the OSHA tool, which produces consistent, testable outputs on thousands of edge cases. The patterns work for any AI-backed system.

The Determinism Problem

Deterministic system: Same input → always the same correct output.
Probabilistic system: Same input → maybe the same output, maybe different.

AI models are inherently probabilistic. They're trained on patterns and often generate outputs with some randomness (temperature settings, sampling approaches).

But outputs don't have to be probabilistic from the user's perspective. You can build deterministic systems on top of probabilistic models.

How?

Constraint 1: Verification layers
Every output is verified against specific criteria. If it fails, regenerate or reject it.

Constraint 2: Test coverage
The system is tested on hundreds of examples. If it fails on any, it's not done.

Constraint 3: Idempotent operations
The same input always produces the same output (or equivalent outputs that pass verification).

Constraint 4: Version control
You track prompts, test cases, and expected outputs in git. You can reproduce any result.
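Taken together, the four constraints form a generate-verify loop. Here is a minimal Python sketch; `call_model` is a hypothetical stand-in for any probabilistic model call, and the verification rules are illustrative:

```python
def call_model(incident: str, attempt: int) -> dict:
    # Hypothetical stand-in for a real (probabilistic) model call.
    # Returns a fixed answer here so the sketch is runnable.
    return {"classification": "recordable", "reason": "Required medical treatment"}

def verify(output: dict) -> bool:
    # Constraint 1: every output must pass specific criteria.
    return (
        output.get("classification") in {"recordable", "non-recordable"}
        and bool(output.get("reason"))
    )

def classify(incident: str, max_attempts: int = 3) -> dict:
    # Constraint 3: the caller only ever sees verified outputs, so the
    # same input yields the same (or an equivalent, verified) result.
    for attempt in range(max_attempts):
        output = call_model(incident, attempt)
        if verify(output):
            return output
    raise RuntimeError(f"No verifiable output after {max_attempts} attempts")
```

Constraints 2 and 4 live outside this loop: the test corpus exercises `classify`, and the prompts and fixtures are tracked in git.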

How the OSHA Tool Achieved This

Test Corpus: 100 incident descriptions covering normal cases, edge cases, and genuinely ambiguous cases, drawn from real small-business incident reports.

Verification Criteria: Each test incident was verified against specific, documented criteria, not opinions.

The verification created ground truth: For incident X, the correct classification is Y because of reason Z.

Automated Testing: For each test incident:

  1. Pass through the AI classification system
  2. Compare output against ground truth
  3. If correct, pass. If wrong, log failure.
  4. Iterate on the system (prompts, logic, validation) until all 100 tests pass.

Regression Testing: After every change, re-run all 100 tests. Zero regressions allowed. If a change breaks anything, revert immediately.
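The four-step test loop can be written as a small harness. A sketch, where `FIXTURES` stands in for the 100-incident corpus and `classify` is a keyword-based placeholder for the real classification system:

```python
# Each fixture pairs an input with its verified ground truth.
FIXTURES = [
    {"input": "Twisted ankle, physical therapy", "expected": "recordable"},
    {"input": "Paper cut, adhesive bandage only", "expected": "non-recordable"},
]

def classify(incident: str) -> str:
    # Placeholder for the AI classification system under test.
    return "recordable" if "therapy" in incident else "non-recordable"

def run_regression(fixtures) -> list:
    # Returns an empty list only if every fixture passes: zero regressions.
    failures = []
    for case in fixtures:
        actual = classify(case["input"])
        if actual != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return failures

failures = run_regression(FIXTURES)
assert not failures, f"Regressions detected: {failures}"
```

Running this after every change enforces the zero-regressions rule mechanically rather than by convention.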

Output Hashing: Calculate SHA-256 hash of each output. Track it in version control. Any change to output produces different hash. This catches subtle changes (wrong field order, changed wording, etc.) automatically.

This approach achieved zero regressions in production and consistent outputs across thousands of edge cases.

Building a Test Corpus

The hardest part of this approach is creating good test cases. Here's how:

Step 1: Collect real examples
Get examples from your problem domain. For OSHA, these were real incident reports from small businesses. Not invented examples—actual data.

Step 2: Establish ground truth
For each example, determine the correct answer: research the governing rules, consult domain experts, and document the reasoning.

This takes time. It's also the most valuable time you'll spend because it forces you to understand the problem deeply.

Step 3: Add edge cases
For each ground truth answer, create variants that test edge cases:

Base case: Employee twisted ankle, got X-ray, required physical therapy → Recordable

Edge cases: variants of the base case that push on the boundaries of the rule, such as treatment that stops at first aid, delayed reporting, or a possible pre-existing condition.

Step 4: Create ambiguous cases
Real data is messy. Add cases where the answer is genuinely unclear:

"Employee hurt ankle but didn't mention it until days later. Unclear if work-related. Unclear if medical treatment was from the incident or pre-existing condition."

These ambiguous cases are where your system shows its real quality. Can it acknowledge uncertainty? Can it ask for clarification? Can it provide the most conservative (safest) classification?
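For ambiguous inputs, one concrete policy is to fall back to the most conservative classification whenever confidence drops below a threshold. A sketch; the 0.85 threshold and the labels are illustrative assumptions:

```python
# Most conservative first: when unsure, prefer the classification that is
# safest to be wrong about (over-recording beats under-recording).
CONSERVATIVE_ORDER = ["recordable", "non-recordable"]

def final_classification(predicted: str, confidence: float,
                         threshold: float = 0.85) -> str:
    if confidence >= threshold:
        return predicted
    # Below threshold: return the safest label (and, in a real system,
    # flag the case for human review).
    return CONSERVATIVE_ORDER[0]

final_classification("non-recordable", 0.60)  # low confidence -> "recordable"
final_classification("non-recordable", 0.95)  # confident -> "non-recordable"
```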

Step 5: Version control it
Save your test cases in version control. For each case, record the input, the expected output, the reasoning behind it, and its source.

This creates a reference library. As your system evolves, you can re-test against it.
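A reference library can be as simple as one JSON file per test case in the repository. Here is a sketch of a loader that refuses incomplete fixtures; the `input`/`expected`/`reasoning`/`source` field names are an assumed convention, not a standard:

```python
import json
from pathlib import Path

def load_fixtures(fixture_dir: str) -> list:
    # Each file stores input, expected output, reasoning, and source,
    # so any result can be reproduced and audited later.
    cases = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        case = json.loads(path.read_text())
        # Fail fast on incomplete fixtures rather than testing against them.
        for field in ("input", "expected", "reasoning", "source"):
            if field not in case:
                raise ValueError(f"{path.name} is missing required field {field!r}")
        cases.append(case)
    return cases
```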

Verification Layers

Not all outputs are equal. Some need more verification than others.

Layer 1: Structural validation
Does the output have the right format? Required fields present, data types correct, field lengths valid?

This catches stupid mistakes (blank required fields, obvious format errors).

Layer 2: Logical validation
Does the output make logical sense, with no self-contradictions?

This catches semantic errors (self-contradictory outputs).

Layer 3: Domain validation
Does it follow domain rules (regulations, templates, official standards)?

This catches domain-specific errors (outputs that don't comply with the rules).

Layer 4: User verification
Human review of high-stakes decisions. (This is human-in-the-loop, not pure automation.)

Each layer is a gate. Output must pass all gates to be released.
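Chaining the automated gates might look like the sketch below; the individual checks are deliberately simplified stand-ins, not the actual OSHA logic:

```python
def structural_ok(output: dict) -> bool:
    # Layer 1: required fields present with the right types.
    return (isinstance(output.get("classification"), str)
            and isinstance(output.get("reason"), str))

def logical_ok(output: dict) -> bool:
    # Layer 2: no self-contradictions, e.g. "non-recordable" paired with
    # a reason asserting medical treatment was required.
    if output.get("classification") == "non-recordable":
        return "required medical treatment" not in output.get("reason", "").lower()
    return True

def domain_ok(output: dict) -> bool:
    # Layer 3: only classifications the domain actually allows.
    return output.get("classification") in {"recordable", "non-recordable"}

def passes_all_gates(output: dict) -> bool:
    # Layer 4 (human review) happens after these automated gates.
    return all(gate(output) for gate in (structural_ok, logical_ok, domain_ok))

passes_all_gates({"classification": "recordable",
                  "reason": "Required medical treatment"})  # True
```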

Testing Strategy

Test-Driven Development (from the start):

  1. Write test cases before building the system
  2. Try to solve the test cases (they'll all fail initially)
  3. Iterate until tests pass
  4. Refactor for clarity
  5. Add more edge cases, repeat

Regression Testing (every change): After any change (prompt update, logic change, new feature), re-run all tests. If anything breaks, investigate immediately.

Coverage Targets: 100% coverage on critical paths (non-negotiable for production), high coverage everywhere else.

This sounds like overkill. For production systems, it's the minimum. The OSHA tool had 100+ tests for a system that users depend on for legal compliance. Worth it.

Handling Probabilistic Nature

AI models sometimes produce different outputs for the same input (depending on sampling settings, model version, etc.). How do you reconcile this with "deterministic outputs"?

Approach 1: Remove randomness
Set temperature to 0, use deterministic sampling (top-1 instead of top-k), lock model versions.
Trade: Less creative outputs, more predictable.

Approach 2: Verify all variants
Allow some randomness but verify that all possible outputs are correct. If the AI could output "recordable" or "non-recordable," both must be defensible for that incident.

Trade: More complex verification, but allows some flexibility.

Approach 3: One-way determinism
The AI is allowed to vary slightly (different wording, different explanation) as long as the core decision (recordable: yes/no) is always the same.

Trade: Balances flexibility with consistency.

For OSHA, we used Approach 1: temperature 0 (fully deterministic), a locked model version, and verification that outputs never change.

For less critical systems, Approach 3 is often sufficient.
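If you adopt Approach 3, the regression check should compare only the core decision across runs, ignoring wording differences. A minimal sketch:

```python
def core_decision(output: dict) -> str:
    # Only the decision must be stable; wording in "reason" may vary.
    return output["classification"]

run_a = {"classification": "recordable",
         "reason": "Required medical treatment"}
run_b = {"classification": "recordable",
         "reason": "Treatment beyond first aid was needed"}

assert core_decision(run_a) == core_decision(run_b)  # same decision, different wording: OK
```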

Output Hashing for Regression Detection

Simple technique that catches subtle changes:

```
Output:  {"classification": "recordable", "reason": "Required medical treatment"}
SHA-256: 7f3a9e8d2c1b5f4a...

Output (changed):  {"classification": "recordable", "reason": "Involved medical treatment"}
SHA-256: 2a8f1c3e5b9d4f6a... (completely different)
```

Track hashes in git: the same hash means no regression; a different hash means investigate the change before shipping.

This catches changes you might miss in code review (subtle wording changes, field reordering, etc.).
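A sketch of the hashing step in Python, serializing the output canonically (sorted keys, fixed separators) so incidental dict ordering never changes the hash while content changes always do:

```python
import hashlib
import json

def output_hash(output: dict) -> str:
    # Canonical serialization: sorted keys and fixed separators, so the
    # hash depends only on content, never on insertion order.
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"classification": "recordable", "reason": "Required medical treatment"}
b = {"reason": "Required medical treatment", "classification": "recordable"}
c = {"classification": "recordable", "reason": "Involved medical treatment"}

output_hash(a) == output_hash(b)  # True: same content, different key order
output_hash(a) == output_hash(c)  # False: wording changed
```

If field order is itself meaningful in your output format, hash the raw serialized string instead of canonicalizing.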

Test Fixtures: Real Data, Not Invented Examples

The biggest mistake in building AI systems: Testing on invented examples that don't match real data.

Real data is messier: inconsistent formatting, missing details, typos, and genuinely ambiguous wording.

Invented examples are often too clean: well-structured, complete, and unambiguous in ways real input never is.

Test on real data. Not anonymized, if possible. Exactly as messy as users will input it.

The OSHA tool's 100 test incidents were actual incident reports from real small businesses. Not cleaned up, not perfectly formatted. This is why it works on real data.

When you're starting, collect 20-30 real examples minimum before you build. Build your system to handle real messiness.

Documentation and Auditability

Production systems need to be auditable. "Why did you classify this as recordable?"

Build explainability in from the start:

```json
{
  "classification": "recordable",
  "explanation": "Required medical treatment beyond first aid (physical therapy)",
  "criteria_met": [
    "Work-related: Yes (occurred during work shift)",
    "New case: Yes (first injury to this body part)",
    "Required medical treatment: Yes (prescribed physical therapy)"
  ],
  "regulation_citations": ["29 CFR 1904.7(b)(5)(i)"],
  "confidence": 0.98,
  "tested_against": "3,450 incident records, 100% accuracy on test set"
}
```

This level of documentation answers the audit question directly, supports regulatory review, and builds user trust.

It's not just good practice. For regulated systems, it's often required.

Building Habits

Making AI outputs deterministic and testable isn't one-time work. It's an ongoing practice:

Before every change: run the full test suite to confirm a clean baseline.

During development: write tests for new behavior before the code, and keep every output under verification.

After every change: re-run all tests and compare output hashes. Zero regressions allowed.

These habits compound. A system with tests from day one is more reliable than a system where tests are added later.

The Cost-Benefit Trade-off

This approach takes longer than "ship fast and iterate." Here's the trade-off:

No testing approach: faster to ship, but failures surface in production and every change risks silent regressions.

Deterministic testing approach: slower upfront, but regressions are caught before users ever see them.

For the OSHA tool: 3 months of careful development with comprehensive testing. Then 12+ months of production use with zero critical bugs.

Different trade-off calculus, but worth it for systems people depend on.

Patterns to Adopt

  1. Test-first: Write tests before code
  2. 100% coverage on critical paths: Non-negotiable for production
  3. Real data, not invented: Use actual examples
  4. Regression prevention: Re-test on all changes
  5. Output hashing: Catch subtle changes
  6. Explainability: Document why decisions were made
  7. Verification layers: Structure validation, logic validation, domain validation
  8. Version control: Track prompts, test cases, expected outputs

These patterns don't require a technical background. They require discipline, attention to detail, and refusing to ship uncertain systems.

That's exactly the methodology that builds production-quality AI systems.

Frequently Asked Questions

What's the difference between deterministic and probabilistic AI systems? Deterministic system: Same input always produces the same correct output. Probabilistic system: Same input might produce different outputs. AI models are inherently probabilistic, but outputs don't have to be probabilistic from the user's perspective. You can build deterministic systems on top of probabilistic models through verification layers, test coverage, idempotent operations, and version control.

How do you build a test corpus for AI systems? Step 1: Collect real examples from your problem domain - actual data, not invented examples. Step 2: Establish ground truth for each example by researching rules, consulting experts, documenting reasoning. Step 3: Add edge cases (variants that test boundaries). Step 4: Create ambiguous cases where the answer is genuinely unclear. Step 5: Version control everything - input, expected output, reasoning, and source.

What are the verification layers for AI outputs? Layer 1 - Structural validation: Are required fields present? Data types correct? Field lengths valid? Layer 2 - Logical validation: Does the output make logical sense (no self-contradictions)? Layer 3 - Domain validation: Does it follow domain rules (regulations, templates, official standards)? Layer 4 - User verification: Human review of high-stakes decisions. Output must pass all gates to be released.

How do you handle the probabilistic nature of AI models? Three approaches: (1) Remove randomness - set temperature to 0, use deterministic sampling, lock model versions. (2) Verify all variants - allow some randomness but verify all possible outputs are correct. (3) One-way determinism - allow slight variation (different wording) as long as the core decision is always the same. For compliance systems, approach 1 (fully deterministic) is recommended.

What is output hashing for regression detection? Calculate SHA-256 hash of each output and track it in version control. Any change to output produces a different hash, catching subtle changes automatically (wrong field order, changed wording). Compare hashes across versions: same hash = no regression, different hash = investigate the change. This catches modifications you might miss in code review.