
DRIFT: A Framework for Estimating LLM Projects Before You Commit



title: "DRIFT: A Framework for Estimating LLM Projects Before You Commit"
date: "2026-03-28"
description: "Every AI project gets misestimated. DRIFT is a 7-dimension complexity framework that gives your team a defensible number before the build starts, not a guess dressed up as a story point."
tags: ["AI/Agentic", "Platform"]

There is a specific conversation that happens in almost every AI project kickoff I've been part of: someone asks how long the LLM integration will take, an engineer gives a number based on vibes, a PM adds 20% for buffer, and the estimate goes into the plan. Three months later, everyone is surprised.

This isn't incompetence. It's a structural problem. LLM and agentic work has failure modes that don't exist in conventional software — and our estimation frameworks were built for conventional software.

DRIFT is my attempt to fix that.

Why Standard Estimation Breaks on LLM Work

Story points capture effort. Function points capture scope. Both assume that the primary variable is how much code you need to write, and that code complexity is the main source of uncertainty.

LLM projects have a different uncertainty profile:

  • Non-deterministic outputs require evaluation loops that conventional tasks don't
  • Context window constraints create emergent architecture decisions mid-build
  • Tool call chains multiply in complexity non-linearly — each additional tool in an agentic loop can double the number of failure paths
  • Retrieval quality determines whether RAG-based features work at all, and it's hard to know before you build
  • Agentic depth — whether the model is answering a question or executing a multi-step autonomous workflow — is the single biggest driver of engineering effort, and most teams don't explicitly estimate for it
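To make the tool-chain point concrete, here is a minimal sketch of how outcome combinations grow. It assumes, purely for illustration, that every tool call has exactly two outcomes (success or failure) and that calls fail independently; real tools have more failure modes than that. The function name `outcome_combinations` is hypothetical.

```python
def outcome_combinations(num_tools: int) -> int:
    """Count success/failure combinations for a chain of tool calls.

    Assumption: each call either succeeds or fails, independently of the
    others, so a chain of n calls has 2**n possible outcome combinations.
    """
    if num_tools < 0:
        raise ValueError("num_tools must be non-negative")
    return 2 ** num_tools


# Each added tool doubles the number of combinations to reason about.
for n in range(1, 6):
    print(f"{n} tools -> {outcome_combinations(n)} outcome combinations")
```

Under these assumptions the count doubles with every tool, which is the sense in which an extra tool "can double" the paths you have to design and test for.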

Story points miss all of this. You end up with a 3-pointer that turns into three weeks of work because nobody scoped the evaluation harness, the embedding pipeline, or the fallback behavior for tool call failures.

The Seven Dimensions

DRIFT scores a project across seven dimensions. Each dimension captures a specific kind of complexity that drives LLM engineering effort.

Cx (Context complexity): How much the model needs to reason over. A single-turn Q&A over a small corpus is low. A multi-turn agent that needs to maintain coherent state across a 50,000-token conversation window is high. This is the dimension most teams underestimate because it feels like a prompt engineering problem, not an architecture problem.

A (Agentic depth): Is the model answering, or acting? A classification pipeline is depth 0. A single tool call is depth 1. A multi-step autonomous workflow with branching, retries, and human-in-the-loop checkpoints is depth 3+. Agentic depth has a multiplicative effect: higher A makes everything else harder, and in the formula below it directly multiplies the Cx and V terms.

C (Corpus size): The volume of content the system needs to index, chunk, embed, and retrieve. Scales predictably until it doesn't: at certain thresholds you hit embedding pipeline architecture questions, storage decisions, and re-indexing strategies that are non-trivial to scope.

O (Orchestration): How many models, agents, or sub-systems need to coordinate. Single-model pipelines are cheap. Multi-agent systems with shared state, parallel execution, and handoff protocols are expensive. Every new coordination point introduces failure modes that need to be designed and tested.

Rd (Retrieval depth): For RAG-based features, how sophisticated does retrieval need to be? Simple cosine similarity over a flat vector store is low. Hybrid search, re-ranking, query expansion, and multi-hop retrieval across heterogeneous sources are high. The difference in build time is measured in weeks.

V (Validation rigor): How do you know the system is producing correct output? Subjective human evaluation is cheap and unreliable. Automated evaluation with ground truth datasets, LLM-as-judge pipelines, and regression suites is expensive and necessary for production. Most projects skip this and pay for it later.

Tc (Tool calls): The count of distinct tools the agent can invoke. Each tool is a capability, a failure path, a security surface, and a testing requirement. Tool call count scales the agentic surface area of the system directly.
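One way to keep a team honest about scoring all seven dimensions is to record them in a structure that cannot be built with one missing. The sketch below is an illustration, not part of DRIFT itself: the class name `DriftScores`, the field names, and the non-negativity check are assumptions, and the sample values are the ones used in the worked example later in this post.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DriftScores:
    """One score per DRIFT dimension. No defaults, so none can be skipped."""

    cx: float  # Cx: context complexity
    a: float   # A: agentic depth
    c: float   # C: corpus size
    o: float   # O: orchestration
    rd: float  # Rd: retrieval depth
    v: float   # V: validation rigor
    tc: float  # Tc: count of distinct tools

    def __post_init__(self) -> None:
        # Assumption: scores are non-negative; the post does not fix a scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if value < 0:
                raise ValueError(f"{f.name} must be non-negative, got {value}")


scores = DriftScores(cx=2, a=2, c=10, o=3, rd=2, v=2, tc=5)
print(scores)
```

Leaving out any field raises a `TypeError` at construction time, which mirrors the point of the exercise: every dimension gets an explicit number.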

The Formula

T_total = 800·Cx·A + 0.75·C + O·(18 + 3·Rd + 4.8·V·A) + 150·Tc

This is not a magic formula. It's a structured way to make the complexity visible before you commit.

The coefficients come from calibrated estimates across real agentic projects — they represent relative weights, not absolute hours. The value is in the exercise: forcing every dimension to be scored explicitly, discussed, and agreed on before the build starts.
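For teams that would rather script the calculation than do it by hand, the formula translates directly into a few lines of Python. This is a minimal sketch: the function name `drift_total` and the lowercase parameter names are assumptions, while the coefficients are copied from the formula above. The loop varies A with the other six scores held at illustrative values.

```python
def drift_total(cx: float, a: float, c: float, o: float,
                rd: float, v: float, tc: float) -> float:
    """Compute T_total = 800*Cx*A + 0.75*C + O*(18 + 3*Rd + 4.8*V*A) + 150*Tc."""
    context_agentic = 800 * cx * a                    # context scaled by agentic depth
    corpus = 0.75 * c                                 # corpus size
    orchestration = o * (18 + 3 * rd + 4.8 * v * a)   # coordination, retrieval, validation
    tools = 150 * tc                                  # distinct tools
    return context_agentic + corpus + orchestration + tools


# Vary agentic depth while holding the other six scores fixed.
for depth in range(4):
    score = drift_total(cx=2, a=depth, c=10, o=3, rd=2, v=2, tc=5)
    print(f"A={depth}: {score:.1f}")
```

One thing the script makes visible: with A = 0 the formula as written drops the Cx term entirely, which may or may not match how you think about a depth-0 classification pipeline. That is worth discussing when you score one.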

A project with Cx=2, A=2, C=10, O=3, Rd=2, V=2, Tc=5 looks like this:

T_total = 800·2·2 + 0.75·10 + 3·(18 + 3·2 + 4.8·2·2) + 150·5
        = 3200 + 7.5 + 3·(18 + 6 + 19.2) + 750
        = 3200 + 7.5 + 129.6 + 750
        = 4087.1 ≈ 4087
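The same arithmetic can be checked term by term, which also shows where the score comes from. The sketch below is illustrative; the function name `drift_breakdown` and the term labels are assumptions, not DRIFT terminology.

```python
def drift_breakdown(cx: float, a: float, c: float, o: float,
                    rd: float, v: float, tc: float) -> dict:
    """Return each additive term of the DRIFT formula separately."""
    return {
        "context_x_agentic": 800 * cx * a,
        "corpus": 0.75 * c,
        "orchestration": o * (18 + 3 * rd + 4.8 * v * a),
        "tools": 150 * tc,
    }


terms = drift_breakdown(cx=2, a=2, c=10, o=3, rd=2, v=2, tc=5)
total = sum(terms.values())

# Print each term with its share of the total score.
for name, value in terms.items():
    print(f"{name:>18}: {value:8.1f}  ({value / total:5.1%})")
print(f"{'total':>18}: {total:8.1f}")
```

For these inputs the 800·Cx·A term accounts for roughly 78% of the total and the tool term for about 18%, so the score is far more sensitive to how Cx and A are scored than to corpus size or orchestration.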

That number isn't hours in a vacuum — it's a complexity score that anchors the conversation. "We're treating this as a complexity-4000 project. Here's what that means for timeline, staffing, and risk." It forces the estimate to be a negotiation about scope, not a guess about speed.

How to Use It

Before scoping: Run DRIFT on the proposed system to produce a complexity score. If it's higher than your team expected, that's a conversation to have before anyone writes code.

During architecture: DRIFT dimensions map directly to architecture decisions. High Rd means you need to spec the retrieval pipeline early. High O means you need to design the inter-agent state protocol before building individual agents. Use the scores to prioritize what needs to be designed vs. what can be figured out later.
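As a rough illustration of turning scores into a design checklist, the sketch below encodes only the two mappings stated above. The threshold `HIGH = 3` is a hypothetical cut-off (the post does not define what counts as "high"), and the names `EARLY_DESIGN` and `early_design_items` are made up for this example.

```python
# Hypothetical threshold: the post does not define what counts as "high".
HIGH = 3

# Only the two mappings stated in the text; extend with your own.
EARLY_DESIGN = {
    "rd": "Spec the retrieval pipeline early",
    "o": "Design the inter-agent state protocol before building individual agents",
}


def early_design_items(scores: dict, high: float = HIGH) -> list:
    """List the design tasks triggered by dimensions scored at or above `high`."""
    return [task for dim, task in EARLY_DESIGN.items() if scores.get(dim, 0) >= high]


example = {"cx": 2, "a": 2, "c": 10, "o": 3, "rd": 2, "v": 2, "tc": 5}
print(early_design_items(example))
```

With the example scores only the orchestration item is flagged, because O meets the assumed threshold and Rd does not.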

When re-scoping: When requirements change mid-project, rescore the affected dimensions. A feature addition that seems simple often changes the Tc count or the V requirements significantly. Running DRIFT on the delta gives you a defensible argument for why the timeline shifted.
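A rescoring pass can be as small as recomputing the formula with the changed dimensions and reporting the difference. In the sketch below the change request (two extra tools and a stricter validation bar) is hypothetical, and `drift_total` is an assumed helper name that restates the formula from earlier in the post.

```python
def drift_total(cx: float, a: float, c: float, o: float,
                rd: float, v: float, tc: float) -> float:
    """T_total = 800*Cx*A + 0.75*C + O*(18 + 3*Rd + 4.8*V*A) + 150*Tc."""
    return 800 * cx * a + 0.75 * c + o * (18 + 3 * rd + 4.8 * v * a) + 150 * tc


before = dict(cx=2, a=2, c=10, o=3, rd=2, v=2, tc=5)

# Hypothetical change request: two new tools and a higher validation score.
after = {**before, "tc": 7, "v": 3}

delta = drift_total(**after) - drift_total(**before)
print(f"before={drift_total(**before):.1f} "
      f"after={drift_total(**after):.1f} delta={delta:+.1f}")
```

Keeping the before and after scores side by side gives you the specific dimensions that moved, which is usually the more persuasive part of the argument.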

With stakeholders: The most valuable use of DRIFT is translating technical complexity into a number that non-technical stakeholders can reason about. "This is a complexity-4000 project, our benchmark complexity-2000 project took 6 weeks, so we're looking at 10-12 weeks" is a better conversation than "it depends."
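The benchmark comparison in that quote can be sketched as a simple proportional scaling. Treating time as linear in the DRIFT score is an assumption made here for illustration; the post does not claim it, and the function name `estimate_weeks` is hypothetical.

```python
def estimate_weeks(score: float, benchmark_score: float,
                   benchmark_weeks: float) -> float:
    """Scale a benchmark project's duration by the ratio of DRIFT scores.

    Assumption: duration grows linearly with the score. Treat the result
    as a starting point for discussion, not a commitment.
    """
    if benchmark_score <= 0:
        raise ValueError("benchmark_score must be positive")
    return benchmark_weeks * score / benchmark_score


# A complexity-4000 project against a complexity-2000 benchmark that took 6 weeks.
print(estimate_weeks(4000, 2000, 6))
```

Pure linear scaling lands on 12 weeks, the top of the 10-12 week range quoted above. Anything below that comes from judgment about the team and the project rather than from the score alone.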

What DRIFT Doesn't Do

DRIFT doesn't replace judgment. It structures it.

It won't tell you whether a project is worth building. It won't catch requirements that haven't been written yet. It won't account for team-specific context — a team that has already built three RAG pipelines will move faster on Rd=3 than a team doing it for the first time.

What it does: it makes the complexity visible before you commit. It gives your team a shared vocabulary for the dimensions that drive LLM engineering effort. And it produces a number that can be reviewed, challenged, and revised — instead of a gut feeling that can't.

The Short Version

LLM and agentic projects fail at estimation because the complexity drivers are different from conventional software. Context size, agentic depth, retrieval sophistication, tool call count — these aren't captured by story points or function points. DRIFT is a framework that makes those dimensions explicit before the build starts.

The goal isn't a precise prediction. It's a structured conversation about complexity before you're three months into a build you didn't fully scope.


Where This Connects

DRIFT is the estimation layer inside the AI Workflow Sprint at Brain Breaking — every engagement starts with a DRIFT score so we have ground truth on complexity before we commit to scope. If you want a DRIFT score on your current AI project, get an estimate.

Also in this series: The Sprint Is the New Technical Debt · MCP Is Reshaping Microservices · Steering Files and the Prompt Quality Frontier