The Canonical Rework Studies: NIST, IBM, Boehm, Capers Jones, DORA

Updated 17 April 2026

Every blog post on software rework cost cites IBM's 1-10-100 rule or Boehm's cost-of-change curve in isolation. Nobody has published all the major studies side by side with their specific figures, sample sizes, methodology notes, and caveats. This page does exactly that. It exists to be cited, linked to, and used as a reference -- by engineering leaders, consultants, and researchers who need a single source for the primary data.

Each study section follows the same structure: who, when, sample, methodology, headline figure, caveats, how it has been misquoted, and how to cite correctly.

NIST Planning Report 02-3 (2002)

The Economic Impacts of Inadequate Infrastructure for Software Testing

Who

Research Triangle Institute (RTI International), commissioned by NIST

When

2002

Sample

US software industry; mixed-method study combining survey data, industry interviews, and economic modeling across multiple sectors

Methodology

Estimated economic losses from software defects using two approaches: (1) survey of industry defect costs, (2) economic impact modeling of defect-related productivity losses. Unlike pure survey studies, the mixed-method approach partially corrects for self-reporting bias.

Headline Finding

$59.5 billion in annual economic losses attributable to inadequate software testing infrastructure in the US economy. 80% of development costs are spent on identifying and correcting defects. Prevention investment of $1 reduces failure costs by approximately $40 (the '1% finding').

Caveats

2002 data. Software costs and testing infrastructure have changed substantially. Cloud infrastructure and CI/CD pipelines have reduced some categories of defect cost. The $59.5B figure, inflation-adjusted, would be substantially higher today. The 80% defect cost figure is the most debated -- it includes all identification and correction activity, not just rework in the strict definition.

How It Is Often Misquoted

Often cited as if $59.5B is the cost of software bugs generally. The precise claim is the cost of inadequate testing infrastructure specifically -- the economic argument for investing in testing tooling and practice.

How to Cite

RTI International. NIST Planning Report 02-3: The Economic Impacts of Inadequate Infrastructure for Software Testing. National Institute of Standards and Technology, May 2002. Available: https://www.nist.gov/document/report02-3pdf

IBM Systems Sciences Institute (commonly cited as 'IBM SSI 1995')

Relative Costs of Fixing Defects by Phase

Who

IBM Systems Sciences Institute, internal research

When

Primary data circa 1981-1994; popularised in IBM SSI internal publications from the 1990s

Sample

IBM internal software projects. Sample size and precise methodology are not fully public. The study is often cited second-hand.

Methodology

Phase-by-phase defect cost analysis on IBM development projects. Measured the average cost to fix a defect depending on when it was discovered: design, coding, unit test, integration test, system test, field use.

Headline Finding

Relative defect fix cost by phase: Design ($1), Coding ($5), Testing ($10), Production ($100). The '1-10-100 rule' as commonly cited. Original IBM figures show a wider range (1-6.5x from design to integration, up to 100x for field defects).
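The practical use of these multipliers is estimating how much the *blended* cost per defect drops when detection shifts earlier. The sketch below uses the commonly cited 1-5-10-100 multipliers from above; the two detection distributions are hypothetical, chosen only to illustrate the arithmetic.

```python
# Phase multipliers are the commonly cited 1-10-100 figures from the text.
# The detection-share distributions below are hypothetical illustrations.

PHASE_COST = {"design": 1, "coding": 5, "testing": 10, "production": 100}

def blended_cost(detection_share: dict) -> float:
    """Average relative fix cost per defect, weighted by where defects are caught."""
    assert abs(sum(detection_share.values()) - 1.0) < 1e-6
    return sum(PHASE_COST[p] * share for p, share in detection_share.items())

late_heavy  = {"design": 0.05, "coding": 0.20, "testing": 0.45, "production": 0.30}
early_heavy = {"design": 0.25, "coding": 0.40, "testing": 0.30, "production": 0.05}

print(f"{blended_cost(late_heavy):.2f}")   # 35.55
print(f"{blended_cost(early_heavy):.2f}")  # 10.25
```

Even with only 5% of defects reaching production, the blended multiplier stays well above 1x -- which is why the 100x tail dominates these calculations.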

Caveats

The specific multipliers vary in different citations. The most precise version of the data shows a range, not a single multiplier per phase. Modern codebases with strong CI/CD pipelines have smaller multipliers because deployment is faster and rollback is easier. The 100x figure for production defects is conservative for high-traffic consumer systems, and an underestimate for safety-critical systems, where a field defect can be catastrophic. Barry Boehm's 1981 data (Software Engineering Economics) is the primary academic source that IBM SSI built on.

How It Is Often Misquoted

Widely cited as '100x' with the implication that all production bugs cost 100x vs. design-phase fixes. The accurate claim is that the range is 1-100x depending on the phase and the type of defect. Many bugs are caught in production quickly and cheaply; the 100x figure applies to defects that cause significant customer impact.

How to Cite

IBM Systems Sciences Institute. Relative Costs of Fixing Defects. IBM, 1995. (Internal document; widely cited in secondary literature including Boehm & Basili, 'Software Defect Reduction Top 10 List', IEEE Computer, 2001.)

Boehm (1981) -- Software Engineering Economics

The Cost-of-Change Curve

Who

Barry Boehm, TRW Systems; later USC

When

1981 (Software Engineering Economics); updated in numerous subsequent papers

Sample

Data collected from TRW software projects in the 1970s. Medium-to-large embedded systems and commercial software.

Methodology

Phase-by-phase analysis of defect detection and correction costs. Boehm plotted the cost of changing a software requirement as a function of lifecycle phase, generating the exponential cost-of-change curve.

Headline Finding

The cost of changing a software requirement increases by roughly an order of magnitude per lifecycle phase: the earlier in development a requirement change is made, the cheaper. The curve shows roughly 1x (requirements phase) to 200x (post-delivery) for large systems.
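The curve is often reproduced as a smooth exponential between the 1x and 200x endpoints quoted above. A minimal sketch of that interpretation, assuming five lifecycle phases and geometric growth between phases (both simplifying assumptions -- Boehm's actual data points do not fall exactly on a geometric series):

```python
# Endpoints (1x to 200x) are from the text; the five-phase breakdown and
# the geometric interpolation between them are simplifying assumptions.

PHASES = ["requirements", "design", "code", "test", "post-delivery"]

def change_cost(phase: str, end_multiplier: float = 200.0) -> float:
    """Relative cost of a requirement change, assuming geometric growth per phase."""
    steps = len(PHASES) - 1
    ratio = end_multiplier ** (1 / steps)  # ~3.76x per phase for 200x overall
    return ratio ** PHASES.index(phase)

for p in PHASES:
    print(f"{p:14s} {change_cost(p):7.1f}x")
```

Under these assumptions each phase roughly triples-to-quadruples the cost of a change, which is the "order of magnitude per phase" shorthand rounded down.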

Caveats

Boehm's data was collected from large, complex embedded systems in the 1970s-80s. The multipliers are larger for more complex systems and smaller for simpler web applications. The original 200x figure for post-delivery changes is specific to large-scale embedded systems with long deployment cycles; web applications with CI/CD and feature flags have fundamentally different cost profiles. Boehm himself updated the numbers in the 2000 'Spiral Model' work.

How It Is Often Misquoted

Often conflated with IBM SSI's figures and presented as if they are the same study. Boehm's curve is about the cost of changing requirements; IBM SSI is about the cost of fixing defects. They support the same directional conclusion (fix earlier = cheaper) but measure different things.

How to Cite

Boehm, B.W. Software Engineering Economics. Prentice-Hall, 1981. ISBN 0-13-822122-7.

Capers Jones (2008, updated)

Applied Software Measurement, 3rd edition

Who

Capers Jones, Software Productivity Research (SPR)

When

Data collected across 1970s-2000s; Applied Software Measurement first published 1991; 3rd edition 2008

Sample

Data from thousands of software projects across hundreds of client organisations, primarily North American. Strong representation of enterprise software, defence, and financial services. Less representation of web/SaaS startups.

Methodology

Function point based analysis. Jones and SPR tracked function points (a size metric), defect rates, defect removal efficiency (DRE), and productivity across project types, industries, and development methodologies. The Defect Removal Efficiency metric is Jones' primary contribution to rework measurement.

Headline Finding

Typical Defect Removal Efficiency (DRE) for average software organisations: 85% (15% of defects reach production). Best-in-class: 95%+. Requirements defects account for approximately 45% of all defects by cost but only ~15% by count -- requirements defects are the most expensive category. Rework consumes 20-40% of total development effort in typical organisations.
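Defect Removal Efficiency has a simple definition: the share of total defects removed before release. The defect counts in this sketch are hypothetical, chosen to land on the 85% typical and 95%+ best-in-class thresholds quoted above.

```python
# DRE = defects removed before release / total defects (Jones' definition).
# The defect counts below are hypothetical illustrations.

def dre(pre_release_defects: int, post_release_defects: int) -> float:
    """Defect Removal Efficiency as a fraction of total defects."""
    total = pre_release_defects + post_release_defects
    return pre_release_defects / total

print(dre(850, 150))  # 0.85 -- Jones' typical organisation
print(dre(960, 40))   # 0.96 -- best-in-class territory
```

Note the denominator requires knowing post-release defect counts, which is why DRE can only be computed retrospectively, some months after release.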

Caveats

Sample skews toward enterprise. Web/SaaS companies with agile CI/CD practices are underrepresented in earlier editions. Function point methodology is unfamiliar to most development teams today. Some of Jones' industry averages have not been updated to reflect the shift to cloud, microservices, and DevOps practices. The enterprise skew means the 85% DRE figure is probably optimistic for smaller organisations.

How It Is Often Misquoted

The '40% of effort is rework' figure is sometimes cited without the distinction between industries or team maturity levels. Jones' data shows a range of 15-60% depending on organisation type, with 20-40% as the typical enterprise range.

How to Cite

Jones, C. Applied Software Measurement: Global Analysis of Productivity and Quality. 3rd ed. McGraw-Hill, 2008. ISBN 978-0-07-150244-3.

DORA State of DevOps Report 2024

Accelerate State of DevOps Report, Google DORA Research Program

Who

DORA (DevOps Research and Assessment), acquired by Google. Research led by Dr. Nicole Forsgren (through 2019), now maintained by Google Cloud DORA team.

When

Annual; 2024 edition. The research program began in 2013.

Sample

39,000+ technology professionals globally. Self-selected survey. Strong representation from software-first companies; weaker from traditional enterprises. All industries and company sizes, but weighted toward US and European respondents.

Methodology

Structural equation modelling linking practices (technical, cultural, organizational) to outcomes (software delivery performance, organizational performance, well-being). The four key metrics are: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service.

Headline Finding

Change Failure Rate (CFR) by tier: Elite <5%, High 5-10%, Medium 10-15%, Low >30%. Elite teams deploy multiple times per day; low performers deploy monthly or less. Rework-relevant finding: elite teams have MTTR below 1 hour; low performers above 1 day. Test automation is one of the strongest predictors of elite CFR.
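Mapping a measured CFR onto the tiers quoted above is mechanical, with one wrinkle: the quoted bands leave a gap between 15% and 30%. This sketch mirrors the bands exactly as stated rather than inventing a boundary for the gap.

```python
# Tier thresholds are the 2024 bands quoted in the text. The bands as
# quoted leave 15-30% unassigned; this sketch reports that gap explicitly
# rather than guessing a cutoff.

def cfr_tier(cfr: float) -> str:
    """Classify a change failure rate (as a fraction, e.g. 0.04 for 4%)."""
    if cfr < 0.05:
        return "elite"
    if cfr <= 0.10:
        return "high"
    if cfr <= 0.15:
        return "medium"
    if cfr > 0.30:
        return "low"
    return "unclassified (between quoted bands)"

print(cfr_tier(0.04))  # elite
print(cfr_tier(0.12))  # medium
print(cfr_tier(0.40))  # low
```

The gap is a good reminder to check the year-specific report before classifying a team, since the band boundaries have shifted across editions.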

Caveats

Self-selected survey means response bias toward engaged, performance-tracking organisations. Teams with the worst metrics are least likely to complete surveys. The elite/high/medium/low classification has changed over the years -- 2024 numbers are not directly comparable to 2019 numbers without adjusting for the revised classification methodology. The DORA metrics measure delivery performance, not rework specifically -- CFR is the closest proxy.

How It Is Often Misquoted

Change failure rate is sometimes presented as equivalent to rework rate. They are correlated but not the same: a high CFR team has high production rework, but may have low pre-production rework (caught by CI/CD). A team can have low CFR but high sprint rework ratio from unclear requirements.

How to Cite

Google DORA. 2024 State of DevOps Report. Google LLC, 2024. Available: https://dora.dev/research/2024/dora-report/

Stripe Developer Coefficient (2018, updated 2020)

The Developer Coefficient: The Hidden Cost of Technical Debt

Who

Stripe, in partnership with Harris Poll

When

2018; methodology update 2020

Sample

500 CTOs and VPs of Engineering; 850 software developers across US, UK, France, Germany, Singapore, Japan. Enterprise and growth-stage companies.

Methodology

Structured survey measuring time allocation across work categories. Respondents reported how many hours per week they spent on 'bad code' (technical debt-related work, including rework), new features, maintenance, and meetings.

Headline Finding

17.3 hours per week lost per developer to bad code, including technical debt and rework. 88% of developers report their company's tech debt has reached an unacceptable level. $300 billion global annual cost estimate (extrapolated from the hourly figure and developer population).

Caveats

Self-reported time allocation is subject to recall bias. The 17.3 hours figure includes all 'bad code' work -- rework, tech debt navigation, inefficient processes -- not rework in the strict definition. The $300B extrapolation uses aggressive developer population estimates. The 2020 update showed similar figures, suggesting consistency, but both are surveys rather than observational data.
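The caveat about the $300B extrapolation is easiest to see by reproducing the arithmetic. Only the 17.3 hours/week is from the report; working weeks, loaded hourly cost, and global developer population below are hypothetical placeholders, and small changes to any of them swing the total by hundreds of billions -- which is exactly the fragility being flagged.

```python
# Only HOURS_PER_WEEK comes from the Stripe report. The other three
# parameters are hypothetical placeholders for illustration.

HOURS_PER_WEEK = 17.3       # reported figure
WORK_WEEKS = 46             # assumption
HOURLY_COST = 40.0          # assumption: loaded cost in USD
DEVELOPERS = 10_000_000     # assumption: global developer population

annual_cost = HOURS_PER_WEEK * WORK_WEEKS * HOURLY_COST * DEVELOPERS
print(f"${annual_cost / 1e9:.0f}B per year")  # $318B per year
```

Doubling the assumed developer population or hourly cost doubles the headline figure, so any global estimate built this way inherits the uncertainty of its population and cost inputs.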

How It Is Often Misquoted

Often cited as if 17.3 hours per week is purely rework. The accurate claim is that 17.3 hours is all time spent on 'bad code' broadly -- a category that includes tech debt navigation, inefficient tooling, and undocumented systems as well as rework events.

How to Cite

Stripe. The Developer Coefficient. Stripe, Inc., 2018. Available via stripe.com/reports/developer-coefficient (may require archived access as of 2026).

McKinsey Developer Velocity Index 2023

Developer Velocity: How Software Excellence Fuels Business Performance

Who

McKinsey Digital

When

2023 (updated from 2020 original)

Sample

Hundreds of companies, primarily large enterprises ($500M+ revenue). Proprietary data from McKinsey client engagements plus survey data.

Methodology

Correlation analysis between developer velocity practices (tooling, culture, product management, security) and business outcomes (revenue growth, innovation rate, operating leverage). The Developer Velocity Index (DVI) is a composite of practice adoption scores.

Headline Finding

Top-quartile DVI companies report 4-5x more innovation (new products, services), 60% higher revenue growth, and significantly higher developer productivity than bottom-quartile peers. Rework reduction is identified as one of five key velocity drivers.

Caveats

McKinsey data is proprietary and methodology is not fully public. Sample skews heavily toward large enterprises. The 4-5x innovation figure is a composite of multiple practices, not attributable to rework reduction alone. Correlation vs. causation is a persistent limitation: high-performing companies may adopt good practices AND reduce rework as a consequence of being high-performing, rather than the practices causing the performance.

How It Is Often Misquoted

The 4-5x figure is sometimes cited as if it is attributable to rework reduction specifically. It is a composite of all five Developer Velocity dimensions, of which rework reduction is one component.

How to Cite

McKinsey & Company. Developer Velocity Index: How Software Excellence Fuels Business Performance. McKinsey Digital, 2023. Available: mckinsey.com/capabilities/mckinsey-digital.

GitHub Octoverse 2024

The state of open source and developer productivity

Who

GitHub (Microsoft subsidiary)

When

Annual; 2024 edition

Sample

GitHub platform data (100M+ developers, 420M+ repositories) plus developer survey. Survey sample: 11,000+ developers globally.

Methodology

Platform telemetry combined with developer survey. Measures activity patterns, tool adoption, AI coding tool usage, and self-reported productivity. Does not directly measure rework but tracks bug-fix commit rates, PR review patterns, and deployment frequency.

Headline Finding

AI coding tools (GitHub Copilot) show 55% faster task completion in controlled studies. PR review time is the largest single contributor to developer waiting time. Bug-fix commits account for a significant share of total commit volume (exact percentage varies by repository).

Caveats

Platform telemetry measures activity, not quality. Higher commit volume is not necessarily better. Bug-fix commit rate is a proxy for rework but undercounts rework done without a distinct commit (inline fixes during PR review). The AI productivity claims are from controlled lab studies, not production observational data.

How It Is Often Misquoted

The 55% faster task completion claim for AI tools is frequently taken out of context. It is from a controlled study where participants completed a specific coding task, not an observation of production velocity.

How to Cite

GitHub. Octoverse 2024: The state of open source. GitHub, Inc., 2024. Available: octoverse.github.com