Building with LLMs: design the harness
An LLM is a probabilistic material, not a deterministic component. Building product with it means wrapping the model in a harness: the evals, guardrails, routing, and audit journals that hold it to a standard.
The model is a material, not a component
A normal software component is a contract. You give it inputs, it returns the same outputs, you wire it in and move on. An LLM does not work like that. Send it the same prompt twice and you can get two different answers, one of them confidently wrong. That is not a bug to patch out; it is the grain of the material. AI-native product design starts when you stop pretending the model is a component and start treating it as a probabilistic material you build structure around.
The name for that structure is a harness. Not the verb, the noun: the rig that holds the model, takes the load, and keeps it from drifting off course. On Articos, an AI user-research platform, the harness is the actual product. The model is the implementation detail inside it.
The eval harness gates the pipeline
The first piece is the one most teams skip. Before I wrote a prompt I wrote the rubric: six criteria a senior researcher would judge a report on, from evidence specificity to whether the model ever saw the hypothesis it was supposed to test. Then I made that rubric executable. It became an eval harness that every draft has to pass before it ships.
The numbers are not decoration; they are the gate. Forty-seven studies went through the harness: 36 passed the citation-traceability check end to end, and the other 11 failed on exactly the issues the rubric named, before any human read them. Theme recall across that set was 86%. The point of the eval harness is that the standard is external to the model. The model does not get to grade its own homework.
Writing the eval first felt slow. It cost me about two weeks of demo time while stakeholders wanted a working pipeline. But without an external target the model optimises for fluency, and fluency is the thing a senior reader dismisses in the first paragraph. The eval harness is how you find that out in a test instead of in front of a customer.
If the system cannot defend a report against its own rubric, the right behaviour is to refuse to ship it. The refusal is a feature, not a failure.
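As a rough illustration of a rubric made executable, here is a minimal sketch in Python. It is not the Articos code: the `Claim` and `Draft` types, the two check functions, and the `gate` function are all hypothetical names, and only two of the six criteria are represented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Claim:
    text: str
    source_id: Optional[str] = None  # None means the claim has no traceable source


@dataclass
class Draft:
    claims: List[Claim] = field(default_factory=list)
    hypothesis_in_prompt: bool = False  # did the model ever see the hypothesis?


Check = Callable[[Draft], bool]


def citations_traceable(draft: Draft) -> bool:
    # Every claim must point at a source; an empty draft does not pass.
    return bool(draft.claims) and all(c.source_id for c in draft.claims)


def hypothesis_was_seen(draft: Draft) -> bool:
    return draft.hypothesis_in_prompt


# Hypothetical subset of a rubric; the names and checks are assumptions.
RUBRIC: Dict[str, Check] = {
    "citation_traceability": citations_traceable,
    "hypothesis_seen": hypothesis_was_seen,
}


def gate(draft: Draft, rubric: Dict[str, Check] = RUBRIC) -> Tuple[bool, List[str]]:
    """Return (ship?, failed criteria). The checks live outside the model."""
    failures = [name for name, check in rubric.items() if not check(draft)]
    return (not failures, failures)
```

A draft with one unsourced claim comes back as `(False, ["citation_traceability"])`, and the caller refuses to ship it. The refusal carries the name of the criterion that failed, which is what makes it useful rather than merely strict.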
Guardrails and citation thresholds
Inside the harness the guardrails are specific and numeric, not vibes. A section has to hit a citation coverage threshold before it counts as done, so prose that drifts away from its sources gets caught and sent back rather than shipped. A synthesis stage caps confidence when it detects everyone agreeing, because unanimity is usually a sign the model collapsed nuance, not that the truth is settled.
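To make the two guardrails above concrete, here is a small Python sketch. The threshold values, the per-sentence `source_ids` field, and the idea of representing agreement as a list of stance labels are all assumptions for illustration, not the real Articos settings.

```python
from typing import Dict, List

COVERAGE_THRESHOLD = 0.8  # hypothetical value
UNANIMITY_CAP = 0.7       # hypothetical value


def citation_coverage(sentences: List[Dict]) -> float:
    """Fraction of sentences that carry at least one source id."""
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if s.get("source_ids"))
    return cited / len(sentences)


def section_done(sentences: List[Dict], threshold: float = COVERAGE_THRESHOLD) -> bool:
    # A section only counts as done once coverage reaches the threshold.
    return citation_coverage(sentences) >= threshold


def capped_confidence(raw: float, stances: List[str], cap: float = UNANIMITY_CAP) -> float:
    """Cap confidence when every stance is identical (suspicious unanimity)."""
    unanimous = len(stances) > 1 and len(set(stances)) == 1
    return min(raw, cap) if unanimous else raw
```

With three cited sentences out of four, coverage is 0.75 and the section is sent back under a 0.8 threshold. A raw confidence of 0.95 over three identical stances is reported as 0.7; the same score over mixed stances is left alone.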
These thresholds are where designing with LLMs gets concrete. You are not hoping the model behaves. You are setting the bar in numbers the system can check on every run, then deciding what happens when a draft falls short. Most of the engineering is in that second half: the recovery path, not the happy path.
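The recovery path can be sketched as a bounded loop: check, repair, check again, and refuse when the budget runs out. This is one possible shape, not the one Articos uses; `run_with_recovery`, `passes`, `repair`, and `max_repairs` are hypothetical names.

```python
from typing import Callable, Optional


def run_with_recovery(
    draft: str,
    passes: Callable[[str], bool],
    repair: Callable[[str], str],
    max_repairs: int = 2,
) -> Optional[str]:
    """Return a draft that passes the check, or None to signal a refusal."""
    for attempt in range(max_repairs + 1):
        if passes(draft):
            return draft
        if attempt < max_repairs:
            draft = repair(draft)  # one more chance, within the budget
    return None  # falling short is an explicit outcome, not an exception
```

Returning `None` instead of the last draft forces the caller to handle the refusal explicitly, so a failing section cannot slip through by default.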
Articos runs an 8-stage report pipeline, and each stage exists because of one specific failure the model produces alone. The blueprint stage injects a refuting-source pass so disconfirming evidence has to be looked for, not skipped. The repair stage is allowed to rewrite weak prose but is forbidden from dropping a source pin. Each rule is small. Stacked together they are what make the output something a researcher will put their name on.
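The repair rule is easy to check mechanically. The sketch below assumes source pins appear inline as `[src:ID]`, which is an invented format for illustration; the function names are hypothetical too.

```python
import re
from typing import Set

# Assumed pin format: [src:ID], where ID is letters, digits, "_" or "-".
PIN = re.compile(r"\[src:([A-Za-z0-9_-]+)\]")


def source_pins(text: str) -> Set[str]:
    return set(PIN.findall(text))


def dropped_pins(original: str, repaired: str) -> Set[str]:
    """Pins present before the repair but missing after it."""
    return source_pins(original) - source_pins(repaired)


def accept_repair(original: str, repaired: str) -> bool:
    # The repair may rewrite prose freely, but must keep every source pin.
    return not dropped_pins(original, repaired)
```

A rewrite that keeps both pins is accepted however much the wording changes. One that quietly loses `[src:p30]` is rejected, and `dropped_pins` names the missing source.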
Model routing: no single model is best at everything
The other thing you learn fast is that no one model is best at every job. Some are stronger at structured extraction, some at long-form synthesis, some at terse critique. So the harness routes across models. A given stage goes to whichever model is strongest for that specific task, and the routing is a design decision, not an accident of which API key was handy.
Cross-model routing also keeps the system honest. A critique written by a different model than the one that drafted the section is less likely to wave its own work through. Diversity of model becomes a cheap check against the failure mode of a model loving its own output.
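Routing as a design decision can be as plain as an explicit table plus one invariant. In this sketch the stage names and the `model-a`/`model-b`/`model-c` identifiers are placeholders, not real models or the actual Articos stages.

```python
from typing import Dict

# Placeholder stage and model names; the mapping itself is the design decision.
ROUTES: Dict[str, str] = {
    "extract": "model-a",
    "draft": "model-b",
    "critique": "model-c",
}


def route(stage: str, routes: Dict[str, str] = ROUTES) -> str:
    """Look up the model for a stage; an unrouted stage is an error, not a default."""
    if stage not in routes:
        raise ValueError(f"no model routed for stage {stage!r}")
    return routes[stage]


def critique_is_independent(
    routes: Dict[str, str], drafter: str = "draft", critic: str = "critique"
) -> bool:
    # The critic must not be the same model that wrote the draft.
    return routes[drafter] != routes[critic]
```

Checking `critique_is_independent(ROUTES)` at startup turns the cross-model rule into something the system enforces instead of something the team remembers.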
Audit journals: the run has to be inspectable
The last piece of the harness is memory. Every run keeps an audit journal: which model ran each stage, which guardrail fired, which check failed and what the recovery did, which claim maps to which source paragraph. When a reviewer asks "how do you know?", there is a record to point at instead of a shrug.
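A journal like this does not need much machinery. The sketch below uses one record per stage, serialised as JSON Lines; the field names mirror the list above, but the `StageRecord` and `AuditJournal` classes and the id formats are assumptions, not the Articos schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class StageRecord:
    stage: str
    model: str
    guardrails_fired: List[str] = field(default_factory=list)
    failed_checks: List[str] = field(default_factory=list)
    recovery: Optional[str] = None  # what the recovery path did, if anything
    claim_sources: Dict[str, str] = field(default_factory=dict)  # claim id -> source paragraph id


class AuditJournal:
    def __init__(self) -> None:
        self.records: List[StageRecord] = []

    def record(self, entry: StageRecord) -> None:
        self.records.append(entry)

    def source_for(self, claim_id: str) -> Optional[str]:
        """Answer "how do you know?" using the most recent mapping for a claim."""
        for entry in reversed(self.records):
            if claim_id in entry.claim_sources:
                return entry.claim_sources[claim_id]
        return None

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order the stages ran.
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.records)
```

Append-only records keep the journal honest: a later stage can add a newer mapping for a claim, but it cannot erase what an earlier stage did.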
That journal is what turns a probabilistic system into something a team can trust. You cannot make an LLM deterministic. You can make it accountable. Build with LLMs and the work is mostly this: the eval harness that sets the bar, the guardrails and thresholds that enforce it, the routing that picks the right tool, and the audit journal that proves what happened. The model is the easy part. The harness is the product.
For the design side of the same system, Designing AI Behavior covers the behaviour the harness holds in place: confidence caps, the critique-repair loop, and the AI-proposes-person-decides handoff. The full build is in the Articos case study.