The Harness Behind My Consulting Agent

For the past several months I’ve been doing a particular kind of freelance work: helping train frontier AI models on real consulting tasks. The arrangement is simple to describe and surprisingly deep in practice. I build realistic consulting engagements — a company, its data, its people, a genuine problem — and an AI agent attempts the work the way an analyst would. Then I evaluate how well it did, in detail, so the model can learn from the gap.

What I didn’t expect is how much of the job became building my own agent to help me do it. Designing a convincing engagement, producing the supporting data, and grading an output to a consistent standard is itself multi-step knowledge work — exactly the kind of thing an agent can carry if you give it the right scaffolding. Over time I’ve assembled a setup I keep reaching for. This post is a synthesized, blinded look at that setup and what it’s taught me. I’ve kept all client and platform specifics out; the interesting part is the architecture, not the accounts.

What a harness is — and why it matters

When people talk about AI getting better, they usually mean the model. But the model is only one part of what makes an agent useful. The other part — the part I spend most of my time on — is the harness.

A harness is the scaffolding around the model: the reasoning loop that turns a single clever response into sustained, reliable work. It simulates how a person actually moves through a problem. You reason about what’s being asked. You recognize which playbook applies and pull it up. You do the work with real tools. You step back and check it. You feed what you learned back in and go again. The model supplies raw intelligence; the harness supplies process.

This matters because of how errors compound. A model that’s 90% reliable on a single step is only about 59% reliable across five dependent steps (0.9^5 ≈ 0.59, assuming each step fails independently); accuracy decays fast when work chains together. The fix isn’t a smarter model so much as a better-structured loop: offload execution to deterministic tools that don’t make probabilistic mistakes, codify the recurring procedures so they run the same way every time, and reserve the model’s judgment for the decisions that actually need it. In my experience the distance between a smart model and trustworthy output is almost entirely harness.
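The arithmetic behind that decay is easy to check. Here is a minimal sketch, assuming every step succeeds independently and with the same probability (a simplification; real steps are rarely that tidy). The function name chain_reliability is made up for illustration.

```python
def chain_reliability(step_reliability: float, steps: int) -> float:
    """Chance that all `steps` dependent steps succeed.

    Assumption: each step succeeds independently with the same probability.
    """
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_reliability ** steps


for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 90% each: {chain_reliability(0.9, n):.1%}")
# Prints 90.0%, 59.0%, 34.9% and 12.2% for 1, 5, 10 and 20 steps.
```

Under the same assumption, lifting each step from 90% to 99% moves the five-step figure to roughly 95%, which is why making individual steps deterministic pays off so disproportionately.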

The loop, as a framework

Here is the whole thing in one picture, and the rest of this post is a walk through each move.

The loop has four moves (reason, select a skill, act, review), grounded in three layers that persist across tasks: the skills it can run, the tools it can call, and the memory it carries forward. The moves are where the work happens; the layers are where my judgment is stored. What follows is what I’ve learned at each.
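For readers who think in code, the four moves can be sketched as a small loop. This is an illustration only, not the actual setup: the Harness class, its field names, and the stub functions are all hypothetical, and in a real harness each callable would wrap a model or tool call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    # Hypothetical stand-ins for model and tool calls.
    plan: Callable[[str, list[str]], str]   # reason about the task, given memory
    select_skill: Callable[[str], str]      # pick the playbook that fits the plan
    act: Callable[[str, str], str]          # do the work with tools
    review: Callable[[str], list[str]]      # return the problems found, if any
    memory: list[str] = field(default_factory=list)
    max_rounds: int = 3

    def run(self, task: str) -> str:
        draft = ""
        for _ in range(self.max_rounds):
            plan = self.plan(task, self.memory)
            skill = self.select_skill(plan)
            draft = self.act(plan, skill)
            problems = self.review(draft)
            if not problems:
                break
            # Feed what was learned back in and go again.
            self.memory.extend(problems)
        return draft


# Stubs so the loop can run without a model behind it.
def stub_plan(task: str, memory: list[str]) -> str:
    return f"{task} (avoiding {len(memory)} known issues)"


def stub_act(plan: str, skill: str) -> str:
    return f"[{skill}] {plan}"


def stub_review(draft: str) -> list[str]:
    return [] if "avoiding 1" in draft else ["chart did not render"]


harness = Harness(stub_plan, lambda plan: "slide-deck", stub_act, stub_review)
print(harness.run("build deck"))
print(harness.memory)
```

With these stubs the first round fails review, the finding lands in memory, and the second round passes. The cap on rounds is a design choice: a loop that can retry forever will eventually do so.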

Reason & plan: how much to steer

The hardest dial on the whole setup is how much to reason a task out in advance versus how much freedom to leave the model. Early on I over-specified everything, narrating the exact path I wanted. What I’ve learned is that the model keeps surprising me: it routinely finds an approach I wouldn’t have scripted, and a tightly specified plan often just caps it at my own imagination. So over time I’ve let go more and more: I give it the destination and the judgment calls, not the turn-by-turn route. A useful test is whether a capable agent could produce the deliverable from my instructions alone, without opening any of the source material. If so, I’ve over-specified and done the thinking it was supposed to do.

But letting go has a hard limit, and it’s worth being precise about why. You cannot treat the model like a junior analyst you trust to “use common sense and fill the gaps.” It is brilliant in places and then falls down in unexpected ones — often basic ones a person never would. The failures don’t land where a human’s would, which is exactly what makes them dangerous. So the freedom I give on the front end has to be paid for on the back end. The more latitude I leave the reasoning, the harder I lean on the review step, and the more often I stop trusting the transcript and open the actual artifact to look at it myself. Creativity in, rigor out — the two move together.

Skills: the playbooks I reach for

Skills are the standard operating procedures — written in plain language, each with triggers that tell the agent when it applies — so a recurring task runs the same considered way every time instead of being reinvented. The ones I reach for most:

Building slide decks — structure, layout discipline, and a consistent visual identity, so the output doesn’t need a manual cleanup pass.
Building spreadsheet models — a standard inputs / calc / summary / sensitivity structure I can audit and adjust later.
Online research — a defined source hierarchy and citation discipline, with a sub-agent spun up when the search needs real breadth.
Prompt refinement — turning a vague ask into a structured spec before any building starts, which is where a lot of wasted work gets avoided.
Quality review — both a mechanical pass and a deeper qualitative one (more on this below).
House-formatting and delivery — applying a visual identity to a file, and getting finished work out the door (email, calendar, cloud).

Each is a few hundred words of accumulated lessons. The value isn’t any single instruction; it’s that the hard-won detail is captured once and reused, instead of living in my head and degrading every time I re-explain it.
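To make the idea of triggers concrete, here is a rough sketch of a skill registry. Everything in it is hypothetical: the Skill class, the trigger phrases, and the keyword matching itself, which is a deliberately crude stand-in. In practice the model may well pick a skill by reading its description rather than by substring match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: tuple[str, ...]   # phrases that suggest this skill applies
    instructions: str           # the plain-language playbook (abbreviated here)


# Illustrative entries only; real playbooks run to a few hundred words each.
SKILLS = [
    Skill("slide-deck", ("deck", "slides", "presentation"),
          "Structure, layout discipline, consistent visual identity."),
    Skill("spreadsheet-model", ("model", "spreadsheet", "sensitivity"),
          "Inputs / calc / summary / sensitivity structure."),
    Skill("online-research", ("research", "sources", "citation"),
          "Source hierarchy and citation discipline."),
]


def select_skill(request: str, skills: list[Skill] = SKILLS) -> Optional[Skill]:
    """Return the skill whose triggers match the request most often, or None."""
    if not skills:
        return None
    text = request.lower()
    scored = [(sum(t in text for t in s.triggers), s) for s in skills]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > 0 else None


chosen = select_skill("Build a sensitivity model in a spreadsheet")
print(chosen.name if chosen else "no skill matched")
```

Returning None when nothing matches matters as much as the match itself: a playbook applied to the wrong task is worse than no playbook.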

Tools: what does the actual work

Tools are the deterministic layer — the things the agent calls to execute rather than reason. The rule of thumb is simple: code should do what code is good at, and the model shouldn’t hand-calculate what a script can compute exactly. What I lean on:

Office automation — PowerPoint, Word, and Excel through their respective integrations, with scripting libraries as a fallback when I need finer control.
Web search and fetch — for grounding research in real, current sources.
Workspace clients — Sheets, Slides, and Drive, for anything that needs to live in the cloud rather than on my machine.
Render-to-image — converting a finished slide or sheet to a picture so it can actually be looked at, not just parsed as text. This one earns its keep in review.
Sub-agents — isolated workers for parallel research or independent checking, each with a clean context.
Small one-off scripts — for the transform that doesn’t justify a permanent tool but shouldn’t be done by hand.
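As a small illustration of "code should do what code is good at", here is a sketch of a tool registry with two deterministic tools. The registry, the decorator, and both tools (cagr and sum_column) are hypothetical examples, not part of any particular agent framework.

```python
from typing import Callable

# Hypothetical registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[..., object]] = {}


def tool(name: str):
    """Register a function under a name the agent can call."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("cagr")
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate, computed exactly rather than estimated."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


@tool("sum_column")
def sum_column(rows: list[dict], column: str) -> float:
    """Total one column across a list of row dictionaries."""
    return sum(row[column] for row in rows)


def call_tool(name: str, **kwargs):
    """Dispatch a named tool call; unknown names fail loudly."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


growth = call_tool("cagr", start_value=100.0, end_value=121.0, years=2)
print(f"CAGR: {growth:.1%}")
```

The model decides which tool to call and with what arguments; the number itself comes from the function, so it is the same every time.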

Review & fix: where I spend the most time

This is the part that has grown the most, and where I now spend the majority of my time. It’s also where the junior-analyst comparison breaks down hardest, because the thing that makes review easy with a person — shared, unspoken standards — is exactly what the agent lacks.

The core realization is that the checks which are innate to me are not common sense to an agent. I’d automatically notice that a slide’s elements are misaligned or unevenly spaced, that a chart isn’t actually rendering, that the language in a report is pitched wrong for the client, that a number on a slide doesn’t reconcile with the model behind it. The agent won’t, unless I’ve said so. So I’ve worked to itemize and document those instincts — to turn “I’d know it when I see it” into an explicit checklist. Written down, my reflexes become something the agent can run on every deliverable, and something I can run against its work without missing the obvious.

A few things make this step actually work:

The checklist reads like a senior manager, not a rubric. It anchors on concrete, checkable things — a count that should land in a known range, a layout that has to clear a basic floor — and it accepts well-argued alternatives instead of demanding one blessed method. It maps cause and effect, too, so one upstream mistake that cascades into five downstream numbers gets noted once, not penalized five times.
I check every deliverable against it, no exceptions. Running the list is cheap and fast, so there’s no reason to skip it — and “cheap and thorough every time” beats “rigorous when I remember.”
Sometimes a second agent audits the QA. Because the check is so quick, I’ll occasionally have an independent agent review the review — isolated, and blind to what the “right” answer is, so it can’t simply flatter the work in front of it. Independence is what keeps a check honest.
I open things and look. Extracted text hides layout problems. From time to time I render the actual artifact and inspect it with my own eyes — some failures are only visible that way.
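One way to picture the mechanical half of that checklist is as a list of named checks, where a check can name the upstream check that would explain its failure. This is a sketch under assumptions: the Check class, the field names in the deliverable, and the tolerances are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Check:
    name: str
    passed: Callable[[dict], bool]
    caused_by: Optional[str] = None   # upstream check that would explain a failure


def run_checklist(deliverable: dict, checks: list[Check]) -> dict:
    """Run every check, then separate root causes from downstream effects."""
    failed = {c.name for c in checks if not c.passed(deliverable)}
    root_causes, downstream = [], []
    for c in checks:
        if c.name not in failed:
            continue
        if c.caused_by in failed:
            downstream.append(c.name)   # noted, but traced to its upstream cause
        else:
            root_causes.append(c.name)
    return {"root_causes": root_causes, "downstream": downstream}


# Hypothetical deliverable summary; the values are invented for the example.
deck = {"slide_count": 14, "slide_revenue": 118.0, "model_revenue": 120.0,
        "slide_margin": 0.30, "model_margin": 0.31}

checks = [
    Check("slide count in expected range", lambda d: 8 <= d["slide_count"] <= 20),
    Check("revenue reconciles to model",
          lambda d: abs(d["slide_revenue"] - d["model_revenue"]) < 0.5),
    Check("margin reconciles to model",
          lambda d: abs(d["slide_margin"] - d["model_margin"]) < 0.005,
          caused_by="revenue reconciles to model"),
]

print(run_checklist(deck, checks))
```

Here the revenue mismatch is reported as the root cause and the margin mismatch as its consequence, so one upstream mistake is counted once. Judgment calls, such as whether the language is pitched right for the client, do not reduce to a lambda and stay with the qualitative pass.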

Where each instruction lives

A recurring question underneath all of this is where a given instruction should live, because there are three homes for it and they trade off against each other:

The always-on config (the agent’s main instruction file) — the small set of rules that apply to everything: how I want it to operate, my standing preferences, where things belong. It’s loaded every session, so it has to stay lean.
Memory — durable facts and lessons that aren’t universal but shouldn’t have to be re-learned: gotchas, preferences discovered through feedback, project context. Pulled in when relevant.
Skill-specific files — the bulk of the procedure: the detailed how-to for a particular task, loaded only when that task comes up.

The tension is performance versus latency versus cost. Everything in the always-on config is paid for on every single turn; everything tucked into a skill is free until the moment it’s needed — but only helps if the right skill actually triggers. Where I’ve landed for now is a rough split along those lines, but I’m treating it as a balance to monitor, not a settled answer. My suspicion is that more will migrate into the skill-specific layer over time, as models get better at pulling the right context in on their own and need less held permanently in front of them.
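The tradeoff is easier to reason about with a toy model of what gets loaded when. In this sketch the Instruction class, the tag scheme, and the four-characters-per-token estimate are all assumptions made for illustration; real token counts depend on the model’s tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    text: str
    home: str                       # "config", "memory", or "skill"
    tags: frozenset = frozenset()   # topics (memory) or skill names (skill)


def rough_tokens(text: str) -> int:
    # Crude assumption: about one token per four characters.
    return max(1, len(text) // 4)


def build_context(instructions, task_tags: set, active_skill=None) -> list:
    """Config always loads; memory loads on a tag match; skills load on trigger."""
    loaded = []
    for item in instructions:
        if item.home == "config":
            loaded.append(item)
        elif item.home == "memory" and item.tags & task_tags:
            loaded.append(item)
        elif item.home == "skill" and active_skill in item.tags:
            loaded.append(item)
    return loaded


def context_cost(loaded) -> int:
    return sum(rough_tokens(item.text) for item in loaded)


# Invented example instructions.
library = [
    Instruction("Save all deliverables to the project folder.", "config"),
    Instruction("Pricing charts need the currency in the axis label.",
                "memory", frozenset({"pricing"})),
    Instruction("Deck playbook: structure, layout, visual identity. " * 20,
                "skill", frozenset({"slide-deck"})),
]

idle = build_context(library, task_tags=set())
deck_task = build_context(library, task_tags={"pricing"}, active_skill="slide-deck")
print(context_cost(idle), context_cost(deck_task))
```

The long playbook costs nothing on turns where its skill is inactive, which is the whole argument for pushing detail out of the always-on config. The failure mode is also visible: if active_skill is never set, the playbook never loads.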

It’s coaching, not prompting

Step back and the whole exercise feels far more like coaching a junior analyst than like writing prompts. You’re developing a collaborator: explaining what good looks like, catching failures, codifying the lesson so the same mistake doesn’t recur. The difference is that you have to be much more articulate and specific than you would with a person, because the agent won’t quietly fill the gaps with assumed context the way a junior would — every standard you leave implicit is a standard it can miss.

The upside is real, though. This analyst isn’t moody. It doesn’t have a bad day, doesn’t need the feedback softened, and applies the same standard at 11pm as at 9am. Once a lesson is captured, it stays captured. On consistency and sheer throughput, it’s hard to beat.

The part I haven’t cracked: my own time

The thing I’m still adjusting to isn’t the agent — it’s how I schedule myself around it. The work now comes in ten- to fifteen-minute chunks: I set the agent off on something, then come back to review what it produced. That interval is awkward. It’s too short to fully context-switch into something demanding, and too long to just sit and watch the work happen. I haven’t yet figured out how to string those intervals together into a genuinely productive day — what to interleave, what to batch, how to stay in flow when my attention is being sliced into quarter-hours. It’s the open question of working this way, and for now I’m still experimenting with the answer.

The models will keep improving, and a lot of what I hand-build today will eventually be handled natively. I’m not worried about that. The harness is where my judgment lives — in knowing what good looks like, where the reasoning has to stay human, and how to structure a loop so the work comes out reliable. That’s the part that transfers, whatever the model underneath is doing.