Engineering teams are frequently asked to estimate “when will X be done?”
Unfortunately, most teams suck at estimating delivery dates. And that’s a problem. Businesses need good estimates to make informed decisions. Scenarios like this come up all the time:
There’s a huge conference on October 23. Most of our customers will be there. If we can announce X at the conference, we think it will generate $Ymm in revenue. We’d love to do this — can your team ship X by October 23rd?
A wrong answer here has material consequences for the business. Engineering teams can’t afford to suck at estimates. And yet, they do. Projects get delayed all the time.
Why? It’s because their estimation methods suck. This post covers why the two most popular estimation methods set teams up for failure and what you should do instead.
Popular method 1: gut check × padding
It’s ironic.
Software companies love to talk about being data-driven. But, the most common estimation technique used by engineering teams in 2025 is:
- Pull a number out of an engineer’s ass (usually the most senior engineer)
- Multiply that number by a “padding factor” (usually 1.5-3x).
That’s not a data-driven estimate. That’s a wild guess. And it’s based on whatever the pseudorandom number generator in “Chad’s gut” feels like spitting out today:
Unicorns and demons
Amazingly, some senior engineers are magical unicorns — somehow the pseudorandom number generator that lives in their “gut” consistently nails ship dates. The problem is that the vast majority of engineers aren’t magical unicorns. Some engineers are even the opposite of magical unicorns — they’re cursed.
For every blessed unicorn whose gut consistently nails ship dates, there’s another engineer who has been possessed by a demon of over-optimism. They’re convinced every project is trivial (likely because the demon incessantly whispers they can ‘crush it’ given a weekend and enough beer to reach Ballmer Peak).
Is your team full of magical unicorns? Unlikely. Are some people on your team permanently possessed[1] by over-optimism demons? Probably.
[1] It’s very hard to exorcise an over-optimism demon — bathing the possessed engineer in more experience rarely works.
Why this method sucks
This method doesn’t produce consistent, reliable estimates that scale across teams, much less organizations. It sets teams up for failure. In short, it sucks.
Should the business plan a major feature launch based on a number “Chad” conjured out of thin air and then multiplied by another number that he felt was ‘big enough’? Absolutely not.
And, what happens when this method is used on a “large” project?
Somebody creates “milestones”. Each milestone gets an “estimate”. The wild guesses proliferate. Then, some poor soul (usually an EM) inevitably opens a gateway to hell by creating a Gantt chart. That chart somehow winds up in a VP’s inbox (who most certainly is possessed by an over-optimism demon). And now the team is f***ed — executive expectations have been set and the planets need to align to ship this thing on time.
If you see somebody create a Gantt chart using this method…run.
Popular method 2: average velocity
The 2nd most popular estimation method teams use is based on average velocity. It looks like this (there’s a quick sketch in code just after the list):
- Assign some form of story points to each of a project’s tasks
- Sum the story points
- Divide by the team’s average velocity (calculated over some historical time window)
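To make the mechanics concrete, here’s a minimal sketch of the calculation — all of the numbers below are made up for illustration:

```python
# Minimal sketch of the average-velocity method (illustrative numbers, not real data).
story_points = [1, 3, 5, 8, 2, 3, 5, 1, 8, 3]     # points assigned to the project's tasks
historical_velocities = [18, 22, 15, 25, 20, 17]   # points completed in recent sprints
sprint_length_days = 14

total_points = sum(story_points)                                   # step 2: sum the points
avg_velocity = sum(historical_velocities) / len(historical_velocities)

sprints_needed = total_points / avg_velocity                       # step 3: divide by average velocity
print(f"Estimate: {sprints_needed:.1f} sprints (~{sprints_needed * sprint_length_days:.0f} days)")
```

Both of the problems below are baked into those last two lines.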
This method at least attempts to be data-driven. But it still sucks. Why?
Problem 1: you can’t sum story points
Story points look like cute little numerical values — e.g. 1, 3, 5, 8. So, it’s tempting to sum them. The problem is you can’t.
Story points aren’t numerical, they’re categorical — they’re labels used to group tasks of similar complexity. That’s why teams that are wise to this use non-additive values like t-shirt sizes (e.g. small, medium, large and x-large).
What happens when you try to compute the total amount of work required by summing story points? You get a number that doesn’t accurately reflect the total amount of work required — you just added unnecessary error to your estimate.
Why?
Suppose you have three sets of tasks whose point totals all sum to the same value, e.g. eight 1-point tasks, four 2-point tasks and one 8-point task.
Will each set of tasks take the same amount of time to deliver? No! But by summing story points we’re implicitly assuming they will. And, the more this assumption fails to hold, the more error we introduce.
Ok, fine — but how much error do we actually introduce?
Nobody knows! And that’s terrifying. Most teams don’t look at their historical cycle-time distributions by point value. And so, they have no hope of calibrating their point assignments in a way that makes eight 1-point tasks equal one 8-point task.
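If you want to sanity-check this on your own data, a quick look at cycle times grouped by point value is enough to see how badly the “points are additive” assumption holds. A rough sketch, assuming you can export (points, cycle time) pairs from your tracker — the sample data below is invented:

```python
from collections import defaultdict
from statistics import median

# Hypothetical export from an issue tracker: (story points, cycle time in days).
completed_tasks = [
    (1, 0.5), (1, 2.0), (1, 4.0), (1, 1.0), (1, 1.5),
    (2, 1.5), (2, 6.0), (2, 3.0), (2, 2.5),
    (8, 9.0), (8, 21.0), (8, 12.0),
]

def percentile(values, pct):
    """Nearest-rank percentile; good enough for a quick look at small samples."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[k]

by_points = defaultdict(list)
for points, days in completed_tasks:
    by_points[points].append(days)

for points, days in sorted(by_points.items()):
    print(f"{points}-point tasks: n={len(days)}, "
          f"median={median(days):.1f}d, p85={percentile(days, 85):.1f}d")
```

If eight 1-point tasks don’t add up to anything like one 8-point task in that output, every summed estimate inherits that error.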
Problem 2: averages produce overly optimistic estimates
An average is a measure of central tendency.
What happens when you generate a delivery date estimate by dividing story points by average velocity? You get an overly optimistic estimate.
Why?
Imagine a curve that represents the true probability of shipping a project on or before a given date. The x-axis represents time and the y-axis represents the probability of shipping. Our chances of shipping improve as time elapses. So, our curve looks like this:
This curve is typically referred to as an s-curve. When we generate an estimate by dividing total story points by average velocity, where do we end up on our s-curve?
That depends on how your team’s historical velocities are distributed and how well past velocity predicts future velocity. But if they’re anything close to symmetric around the mean (and velocity distributions usually are), we’ll be close-ish to y = 50%!
This is because many (roughly half, if normally distributed) of the observed velocities will be worse than our average — e.g. if your team completes an average of 20 points per sprint, they’ll complete fewer than 20 points per sprint ~50% of the time.
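You can see this with a toy simulation. Assuming, for illustration, that velocity is roughly normal with a mean of 20 points per sprint:

```python
import random

random.seed(42)
# Hypothetical team: velocity ~ normal(mean=20, std dev=5) points per sprint.
sprints = [random.gauss(20, 5) for _ in range(100_000)]

avg = sum(sprints) / len(sprints)
below_avg = sum(v < avg for v in sprints) / len(sprints)
print(f"Share of sprints slower than the average velocity: {below_avg:.0%}")  # ~50%
```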
Do you want to provide the business with an estimated delivery date that you only have a ~50% chance of hitting? Hell no! That’s a coin flip’s chance of success. You want to give estimates that have high chances of success — e.g. 90% confidence.
Using a measure of central tendency to generate estimates sets teams up for failure.
Problem 3: story points and velocity aren’t the droids we’re looking for
What the business actually cares about estimating is how long it takes to complete work.
So why are we using story points and velocity? Story points were invented for capacity planning (i.e. how much work should we take on?). Velocity measures throughput (i.e. how much work passes through the system?).
Neither are direct measures of cycle time (how long it takes to complete work from start to finish). And so we end up with a roundabout calculation to estimate it by summing story points and dividing by average velocity — i.e. total work / throughput. This introduces two major problems:
- It assumes the team works at a constant rate throughout the project (they likely won’t).
- It outputs a single number that hides the uncertainty in the delivery date estimate because it ignores the underlying variance in task cycle times (e.g. not all 1-point tasks take the same amount of time to deliver — there’s a range of completion times).
As a result, the estimate ends up hiding risk.
This method sucks
This method unnecessarily inflates estimation error, produces overly optimistic estimates and hides risk, which sets teams up for failure. It’s significantly better than gut checks, but it still sucks.
I don’t want my team to suck at estimates, what should I do?
Let’s return to our project s-curve:
Good estimation is all about selecting points on this curve intentionally. And, clearly communicating to your stakeholders which point was selected.
Often, the business doesn’t care about predicting the exact date a project will ship. They care about predicting a date that we’re likely to ship on or before — e.g. if there’s a big customer conference on October 23rd, it doesn’t matter if we ship on October 1st or 21st. That’s good, because it gives us a wide margin for error. And, all estimates have error.
So, what we want to do is select points on our s-curve that correspond to a high level of confidence that we’ll be able to ship on or before a given date. And we want to clearly communicate that confidence level to our stakeholders.
How confident do we need to be?
That depends on the consequences for shipping late — maybe the business is fine with an 85% confidence level (y=85%), or maybe it requires 95% confidence (y=95%). The point is you need to be intentional about it.
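Mechanically, once you have a distribution of possible finish times for a project (however you produce it), picking a point on the s-curve is just a percentile lookup. A sketch, with made-up simulated durations:

```python
from datetime import date, timedelta

# Hypothetical output of a forecasting run: simulated days-to-ship for the project.
simulated_days = [34, 36, 38, 38, 40, 41, 43, 45, 48, 55]

def date_at_confidence(day_samples, confidence, start=None):
    """Date we'd expect to ship on or before, at the given confidence level (0-1)."""
    start = start or date.today()
    ordered = sorted(day_samples)
    k = min(len(ordered) - 1, int(confidence * len(ordered)))
    return start + timedelta(days=ordered[k])

print("85% confidence:", date_at_confidence(simulated_days, 0.85))
print("95% confidence:", date_at_confidence(simulated_days, 0.95))
```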
Short term
If your team uses gut checks, stop — you have no idea what point on the s-curve your estimates correspond to.
If your team uses average velocity, you’re picking points on the s-curve that are too optimistic, points with only a moderate degree of confidence. You’re better off replacing average velocity with a percentile-based velocity (e.g. a conservative velocity the team hits or exceeds in ~90% of sprints). You’ll still have the other problems mentioned earlier in this post. But, this will at least shift your estimates to points on the s-curve that correspond to higher levels of confidence. And, you’ll be less likely to ship late as a result.
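A sketch of that swap, reading “percentile-based” as a conservative velocity the team hits or exceeds in roughly 90% of sprints (the numbers are illustrative):

```python
# Hypothetical historical sprint velocities (points completed per sprint).
velocities = [14, 17, 18, 19, 20, 21, 22, 23, 25, 28]
total_points = 120

def conservative_velocity(history, beat_rate=0.90):
    """Velocity the team hits or exceeds in roughly beat_rate of sprints."""
    ordered = sorted(history)
    k = max(0, min(len(ordered) - 1, round((1 - beat_rate) * (len(ordered) - 1))))
    return ordered[k]

avg = sum(velocities) / len(velocities)
print(f"average velocity:      {avg:.1f} -> {total_points / avg:.1f} sprints")
print(f"conservative velocity: {conservative_velocity(velocities)} -> "
      f"{total_points / conservative_velocity(velocities):.1f} sprints")
```

The conservative number will look worse on a roadmap, but it’s the one you’re far more likely to actually hit.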
Longer term
You should develop the capability to directly construct s-curves for your projects.
Good estimates become a lot easier once you have this. You can show the curve to stakeholders and jointly select a point on the curve that corresponds with the organization’s desired level of confidence.
It gets even better when you’re able to automatically update your s-curves across multiple initiatives as soon as something changes. Shit happens all the time in software. And you want to know the impact right away.
Maybe a PM has a great idea in the shower that will change everything but it increases project scope by 25%. Or, maybe Chad (the tech-lead for a new feature) loses a week to a Zarkon-bug that explodes your Voltron modular-monolith because mounting technical-debt has rendered it completely incapable of handling its sh*t anymore.
What’s your probability of shipping given the changes — i.e. how has the s-curve changed? Are there levers you can pull to improve your chances — e.g. cut scope, reprioritize work, reassign tasks, etc.? If you do that, how does it impact other initiatives you’ve committed to in your quarterly roadmap?
It’s easy to find out. Make the changes and watch your s-curves update automatically.
Sounds nice, how do I do the long-term thing?
One way you can do this is by running a Monte Carlo simulation over your backlog using historical task cycle times (we’ll cover how to do this in a future post).
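As a rough idea of what that looks like, here’s a toy sketch. It assumes a fixed number of parallel work streams and that historical cycle times are representative of future ones; it is not Empirical’s actual model:

```python
import random

# Toy Monte Carlo forecast over a backlog (illustrative only).
historical_cycle_times = [0.5, 1, 1, 2, 2, 3, 3, 4, 5, 8, 13]  # days per completed task
remaining_tasks = 24
parallel_streams = 3        # e.g. three engineers pulling tasks independently
runs = 10_000

random.seed(7)
totals = []
for _ in range(runs):
    # Resample a cycle time for every remaining task from the team's history.
    sampled = [random.choice(historical_cycle_times) for _ in range(remaining_tasks)]
    totals.append(sum(sampled) / parallel_streams)   # crude parallelism model

totals.sort()
for confidence in (0.50, 0.85, 0.95):
    days = totals[int(confidence * (runs - 1))]
    print(f"{confidence:.0%} chance of shipping within {days:.0f} days")
```

Plot confidence against days and you have your s-curve; rerun it whenever scope or history changes and the curve updates with it.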
But, the good news is you don’t need to build this yourself. Over the course of 15 years working in software, I’ve had all the same frustrations with estimation.
So, I built Empirical to give teams a better way to answer “when will we be done?”.
Empirical automatically creates “probability of shipping” s-curves for all your epics using per-developer, per-estimate historical data. They look like this:
[Example forecast: Authentication System Overhaul, 100% chance to deliver by Jan 31]
Empirical’s auto-generated forecasts are up to 90% more accurate than velocity-based estimates[1]. And, it automatically updates forecasts as soon as things change so you can roll with the punches and keep your stakeholders in the loop at all times.
[1] Based on historical backtesting using real teams’ data with CRPS as the scoring metric
Do you use Jira?
You can get a free trial of Empirical here on the Atlassian Marketplace.
Do you use Linear?
You can sign up for Empirical’s Linear waitlist here.