The Canonical Rework Studies: NIST, IBM, Boehm, Capers Jones, DORA
Updated 17 April 2026
Every blog post on software rework cost cites IBM's 1-10-100 rule or Boehm's cost-of-change curve in isolation. Nobody has published all the major studies side by side with their specific figures, sample sizes, methodology notes, and caveats. This page does exactly that. It exists to be cited, linked to, and used as a reference -- by engineering leaders, consultants, and researchers who need a single source for the primary data.
Each study section follows the same structure: who, when, sample, methodology, headline figure, caveats, how it has been misquoted, and how to cite correctly.
NIST Planning Report 02-3 (2002)
The Economic Impacts of Inadequate Infrastructure for Software Testing
Who
Research Triangle Institute (RTI International), commissioned by NIST
When
2002
Sample
US software industry; mixed-method study combining survey data, industry interviews, and economic modeling across multiple sectors
Methodology
Estimated economic losses from software defects using two approaches: (1) survey of industry defect costs, (2) economic impact modeling of defect-related productivity losses. Unlike pure survey studies, the mixed-method approach partially corrects for self-reporting bias.
Headline Finding
$59.5 billion in annual economic losses attributable to inadequate software testing infrastructure in the US economy. 80% of development costs are spent on identifying and correcting defects. Prevention investment of $1 reduces failure costs by approximately $40 (the '1% finding').
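For readers who want to turn the headline ratios into a rough planning number, here is a minimal back-of-the-envelope sketch in Python. Only the 80% and $1-to-$40 figures come from the report; the team budget and the assumption that the ratios hold at the margin are hypothetical.

    # Back-of-the-envelope use of the NIST 02-3 headline figures.
    # Only the 80% and 40:1 figures come from the report; the budget is hypothetical.
    annual_dev_budget = 2_000_000   # hypothetical annual team budget (USD)
    defect_cost_share = 0.80        # report: share of dev cost spent finding/fixing defects
    prevention_return = 40          # report: $1 of prevention avoids ~$40 of failure cost

    defect_spend = annual_dev_budget * defect_cost_share
    breakeven_prevention = defect_spend / prevention_return
    print(f"Spend on finding/fixing defects: ${defect_spend:,.0f}")
    print(f"Prevention spend that would offset it at 40:1: ${breakeven_prevention:,.0f}")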
Caveats
2002 data. Software costs and testing infrastructure have changed substantially. Cloud infrastructure and CI/CD pipelines have reduced some categories of defect cost. The $59.5B figure, inflation-adjusted, would be substantially higher today. The 80% defect cost figure is the most debated -- it includes all identification and correction activity, not just rework in the strict definition.
How It Is Often Misquoted
Often cited as if $59.5B were the cost of software bugs in general. The report's claim is narrower: it is the cost of inadequate testing infrastructure specifically, which is what makes it the economic argument for investing in testing tooling and practice.
How to Cite
RTI International. NIST Planning Report 02-3: The Economic Impacts of Inadequate Infrastructure for Software Testing. National Institute of Standards and Technology, May 2002. Available: https://www.nist.gov/document/report02-3pdf
IBM Systems Sciences Institute (commonly cited as 'IBM SSI 1995')
Relative Costs of Fixing Defects by Phase
Who
IBM Systems Sciences Institute, internal research
When
Primary data circa 1981-1994; popularised in IBM SSI internal publications from the 1990s
Sample
IBM internal software projects. Sample size and precise methodology are not fully public. The study is often cited second-hand.
Methodology
Phase-by-phase defect cost analysis on IBM development projects. Measured the average cost to fix a defect depending on when it was discovered: design, coding, unit test, integration test, system test, field use.
Headline Finding
Relative defect fix cost by phase: Design ($1), Coding ($5), Testing ($10), Production ($100) -- the '1-10-100 rule' as commonly cited. The original IBM figures show a wider range: roughly 1-6.5x from design through integration, rising to as much as 100x for field defects.
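To see what the multipliers imply for an average defect, here is a minimal sketch. The multipliers are the commonly cited ones above; the share of defects caught in each phase is a made-up illustration, not IBM data.

    # Expected relative fix cost under the commonly cited phase multipliers.
    # The caught-per-phase shares are hypothetical, not IBM data.
    multiplier = {"design": 1, "coding": 5, "testing": 10, "production": 100}
    caught_share = {"design": 0.30, "coding": 0.30, "testing": 0.30, "production": 0.10}

    expected = sum(multiplier[p] * caught_share[p] for p in multiplier)
    print(f"Expected relative fix cost per defect: {expected:.1f}x")   # 14.8x

    # Catching 5% more defects in testing instead of production drops the average sharply.
    caught_share["production"] -= 0.05
    caught_share["testing"] += 0.05
    expected = sum(multiplier[p] * caught_share[p] for p in multiplier)
    print(f"After shifting 5% of catches earlier: {expected:.1f}x")    # 10.3x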
Caveats
The specific multipliers vary across citations. The most precise version of the data shows a range, not a single multiplier per phase. Modern codebases with strong CI/CD pipelines see smaller multipliers because deployment is faster and rollback is easier. The 100x figure for production defects is conservative for high-traffic consumer systems and understates the cost for safety-critical systems, where the consequences can be catastrophic. Barry Boehm's 1981 data (Software Engineering Economics) is the primary academic source that IBM SSI built on.
How It Is Often Misquoted
Widely cited as '100x' with the implication that all production bugs cost 100x vs. design-phase fixes. The accurate claim is that the range is 1-100x depending on the phase and the type of defect. Many bugs are caught in production quickly and cheaply; the 100x figure applies to defects that cause significant customer impact.
How to Cite
IBM Systems Sciences Institute. Relative Costs of Fixing Defects. IBM, 1995. (Internal document; widely cited in secondary literature including Boehm & Basili, 'Software Defect Reduction Top 10 List', IEEE Computer, 2001.)
Boehm (1981) -- Software Engineering Economics
The Cost-of-Change Curve
Who
Barry Boehm, TRW Systems; later USC
When
1981 (Software Engineering Economics); updated in numerous subsequent papers
Sample
Data collected from TRW software projects in the 1970s. Medium-to-large embedded systems and commercial software.
Methodology
Phase-by-phase analysis of defect detection and correction costs. Boehm plotted the cost of changing a software requirement as a function of lifecycle phase, generating the exponential cost-of-change curve.
Headline Finding
The cost of changing a software requirement rises steeply with each successive lifecycle phase: the earlier in development a change is made, the cheaper it is. For large systems the curve runs from roughly 1x in the requirements phase to around 200x after delivery.
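One way to sanity-check what the 1x-to-200x range implies per phase: if the growth were roughly geometric across six lifecycle phases, the per-phase factor would be about 2.9x rather than 10x. A minimal sketch; the six-phase breakdown and the geometric spread are illustrative assumptions, not Boehm's published table.

    # Implied per-phase growth factor if the 1x-200x range is spread geometrically
    # across six lifecycle phases (an illustrative assumption, not Boehm's table).
    phases = ["requirements", "design", "code", "unit test", "acceptance test", "operation"]
    per_phase_factor = 200 ** (1 / (len(phases) - 1))
    print(f"Implied per-phase growth factor: {per_phase_factor:.2f}x")  # ~2.9x

    cost = 1.0
    for phase in phases:
        print(f"{phase:>16}: ~{cost:.0f}x")
        cost *= per_phase_factor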
Caveats
Boehm's data was collected from large, complex embedded systems in the 1970s-80s. The multipliers are larger for more complex systems and smaller for simpler web applications. The original 200x figure for post-delivery changes is specific to large-scale embedded systems with long deployment cycles; web applications with CI/CD and feature flags have fundamentally different cost profiles. Boehm himself updated the numbers in the 2000 'Spiral Model' work.
How It Is Often Misquoted
Often conflated with IBM SSI's figures and presented as if they are the same study. Boehm's curve is about the cost of changing requirements; IBM SSI is about the cost of fixing defects. They support the same directional conclusion (fix earlier = cheaper) but measure different things.
How to Cite
Boehm, B.W. Software Engineering Economics. Prentice-Hall, 1981. ISBN 0-13-822122-7.
Capers Jones (2008, updated)
Applied Software Measurement, 3rd edition
Who
Capers Jones, Software Productivity Research (SPR)
When
Data collected across 1970s-2000s; Applied Software Measurement first published 1991; 3rd edition 2008
Sample
Data from thousands of software projects across hundreds of client organisations, primarily North American. Strong representation of enterprise software, defence, and financial services. Less representation of web/SaaS startups.
Methodology
Function-point-based analysis. Jones and SPR tracked function points (a size metric), defect rates, defect removal efficiency (DRE), and productivity across project types, industries, and development methodologies. The Defect Removal Efficiency metric is Jones' primary contribution to rework measurement.
Headline Finding
Typical Defect Removal Efficiency (DRE) for average software organisations: 85% (15% of defects reach production). Best-in-class: 95%+. Requirements defects account for approximately 45% of all defects by cost but only ~15% by count, making them the most expensive category per defect. Rework consumes 20-40% of total development effort in typical organisations.
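Defect Removal Efficiency is easy to compute for any team that records where each defect was found. A minimal sketch; the counts are made up, and the formula follows the standard definition Jones uses (defects removed before release divided by total defects, with field defects typically counted over a fixed post-release window such as 90 days).

    # Defect Removal Efficiency (DRE): share of all defects removed before release.
    # The counts are hypothetical; the formula follows the standard definition.
    def defect_removal_efficiency(found_pre_release: int, found_post_release: int) -> float:
        total = found_pre_release + found_post_release
        return found_pre_release / total if total else 1.0

    # Example: 340 defects caught in review and test, 60 escaped to production.
    dre = defect_removal_efficiency(found_pre_release=340, found_post_release=60)
    print(f"DRE = {dre:.0%}")   # 85% -- the 'typical organisation' level above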
Caveats
Sample skews toward enterprise. Web/SaaS companies with agile CI/CD practices are underrepresented in earlier editions. Function point methodology is unfamiliar to most development teams today. Some of Jones' industry averages have not been updated to reflect the shift to cloud, microservices, and DevOps practices. The enterprise skew means the 85% DRE figure is probably optimistic for smaller organisations.
How It Is Often Misquoted
The '40% of effort is rework' figure is sometimes cited without the distinction between industries or team maturity levels. Jones' data shows a range of 15-60% depending on organisation type, with 20-40% as the typical enterprise range.
How to Cite
Jones, C. Applied Software Measurement: Global Analysis of Productivity and Quality. 3rd ed. McGraw-Hill, 2008. ISBN 978-0-07-150244-3.
DORA State of DevOps Report 2024
Accelerate State of DevOps Report, Google DORA Research Program
Who
DORA (DevOps Research and Assessment), acquired by Google in 2018. Research led by Dr. Nicole Forsgren through 2019; now maintained by the Google Cloud DORA team.
When
Annual; 2024 edition. The research program began in 2013.
Sample
39,000+ technology professionals globally. Self-selected survey. Strong representation from software-first companies; weaker from traditional enterprises. All industries and company sizes, but weighted toward US and European respondents.
Methodology
Structural equation modelling linking practices (technical, cultural, organizational) to outcomes (software delivery performance, organizational performance, well-being). The four key metrics are: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service.
Headline Finding
Change Failure Rate (CFR) by tier: Elite <5%, High 5-10%, Medium 10-15%, Low >30%. Elite teams deploy multiple times per day; low performers deploy monthly or less. Rework-relevant finding: elite teams have MTTR below 1 hour; low performers above 1 day. Test automation is one of the strongest predictors of elite CFR.
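Change failure rate itself is a simple ratio: changes that caused a failure in production divided by total changes deployed. A minimal sketch that applies the tier thresholds quoted above; the deployment counts are made up, and the published bands leave a gap between 15% and 30% that the sketch glosses over.

    # Change Failure Rate (CFR) and the performance tiers quoted above.
    # Deployment counts are hypothetical; thresholds mirror the figures in this section.
    def change_failure_rate(failed_changes: int, total_changes: int) -> float:
        return failed_changes / total_changes if total_changes else 0.0

    def cfr_tier(cfr: float) -> str:
        if cfr < 0.05:
            return "Elite"
        if cfr <= 0.10:
            return "High"
        if cfr <= 0.15:
            return "Medium"
        return "Low"   # the published bands leave a 15-30% gap; treated as Low here

    cfr = change_failure_rate(failed_changes=9, total_changes=120)
    print(f"CFR = {cfr:.1%} -> {cfr_tier(cfr)}")   # 7.5% -> High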
Caveats
Self-selected survey means response bias toward engaged, performance-tracking organisations. Teams with the worst metrics are least likely to complete surveys. The elite/high/medium/low classification has changed over the years -- 2024 numbers are not directly comparable to 2019 numbers without adjusting for the revised classification methodology. The DORA metrics measure delivery performance, not rework specifically -- CFR is the closest proxy.
How It Is Often Misquoted
Change failure rate is sometimes presented as equivalent to rework rate. They are correlated but not the same: a team with a high CFR has high production rework yet may have little pre-production rework, while a team with a low CFR (CI/CD catching defects before release) can still carry a high sprint rework ratio from unclear requirements.
How to Cite
Google DORA. 2024 State of DevOps Report. Google LLC, 2024. Available: https://dora.dev/research/2024/dora-report/
Stripe Developer Coefficient (2018, updated 2020)
The Developer Coefficient: The Hidden Cost of Technical Debt
Who
Stripe, in partnership with Harris Poll
When
2018; methodology update 2020
Sample
500 CTOs and VPs of Engineering; 850 software developers across US, UK, France, Germany, Singapore, Japan. Enterprise and growth-stage companies.
Methodology
Structured survey measuring time allocation across work categories. Respondents reported how many hours per week they spent on 'bad code' (technical debt-related work, including rework), new features, maintenance, and meetings.
Headline Finding
17.3 hours per week lost per developer to bad code, including technical debt and rework. 88% of developers report their company's tech debt has reached an unacceptable level. $300 billion global annual cost estimate (extrapolated from the hourly figure and developer population).
Caveats
Self-reported time allocation is subject to recall bias. The 17.3 hours figure includes all 'bad code' work -- rework, tech debt navigation, inefficient processes -- not rework in the strict definition. The $300B extrapolation uses aggressive developer population estimates. The 2020 update showed similar figures, suggesting consistency, but both are surveys rather than observational data.
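To see why the population and cost assumptions dominate, here is the extrapolation arithmetic with deliberately hypothetical inputs. Only the 17.3 hours figure comes from the survey; everything else is a placeholder, not a Stripe parameter.

    # How an hours-per-week survey figure becomes a '$X billion' headline,
    # and how sensitive the result is to the inputs. Only the 17.3 hours
    # figure comes from the survey; everything else is a placeholder.
    hours_per_week = 17.3            # survey figure
    weeks_per_year = 46              # assumption
    loaded_cost_per_hour = 40        # assumption (USD, global average, fully loaded)

    annual_cost_per_dev = hours_per_week * weeks_per_year * loaded_cost_per_hour
    for developer_population in (10_000_000, 20_000_000, 30_000_000):
        total_billions = annual_cost_per_dev * developer_population / 1e9
        print(f"{developer_population / 1e6:.0f}M developers -> ${total_billions:,.0f}B/year")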
How It Is Often Misquoted
Often cited as if 17.3 hours per week is purely rework. The accurate claim is that 17.3 hours is all time spent on 'bad code' broadly -- a category that includes tech debt navigation, inefficient tooling, and undocumented systems as well as rework events.
How to Cite
Stripe. The Developer Coefficient. Stripe, Inc., 2018. Available via stripe.com/reports/developer-coefficient (may require archived access as of 2026).
McKinsey Developer Velocity Index 2023
Developer Velocity: How Software Excellence Fuels Business Performance
Who
McKinsey Digital
When
2023 (updated from 2020 original)
Sample
Hundreds of companies, primarily large enterprises ($500M+ revenue). Proprietary data from McKinsey client engagements plus survey data.
Methodology
Correlation analysis between developer velocity practices (tooling, culture, product management, security) and business outcomes (revenue growth, innovation rate, operating leverage). The Developer Velocity Index (DVI) is a composite of practice adoption scores.
Headline Finding
Top-quartile DVI companies report 4-5x more innovation (new products, services), 60% higher revenue growth, and significantly higher developer productivity than bottom-quartile peers. Rework reduction is identified as one of five key velocity drivers.
Caveats
McKinsey data is proprietary and the methodology is not fully public. The sample skews heavily toward large enterprises. The 4-5x figure is a composite outcome of multiple practices, not attributable to rework reduction alone. Correlation vs. causation is a persistent limitation: high-performing companies may adopt good practices and reduce rework as a consequence of being high-performing, rather than the practices causing the performance.
How It Is Often Misquoted
The 4-5x figure is sometimes cited as if it is attributable to rework reduction specifically. It is a composite of all five Developer Velocity dimensions, of which rework reduction is one component.
How to Cite
McKinsey & Company. Developer Velocity Index: How Software Excellence Fuels Business Performance. McKinsey Digital, 2023. Available: mckinsey.com/capabilities/mckinsey-digital.
GitHub Octoverse 2024
The state of open source and developer productivity
Who
GitHub (Microsoft subsidiary)
When
Annual; 2024 edition
Sample
GitHub platform data (100M+ developers, 420M+ repositories) plus developer survey. Survey sample: 11,000+ developers globally.
Methodology
Platform telemetry combined with developer survey. Measures activity patterns, tool adoption, AI coding tool usage, and self-reported productivity. Does not directly measure rework but tracks bug-fix commit rates, PR review patterns, and deployment frequency.
Headline Finding
AI coding tools (GitHub Copilot) show 55% faster task completion in controlled studies. PR review time is the largest single contributor to developer waiting time. Bug-fix commits account for a significant share of total commit volume (exact percentage varies by repository).
Caveats
Platform telemetry measures activity, not quality. Higher commit volume is not necessarily better. Bug-fix commit rate is a proxy for rework but undercounts rework done without a distinct commit (inline fixes during PR review). The AI productivity claims are from controlled lab studies, not production observational data.
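Teams that want this proxy for their own repositories can approximate it from commit messages. A minimal sketch; the keyword heuristic is an illustrative assumption, not GitHub's published methodology, and it inherits the undercounting caveat above.

    # Rough bug-fix commit share for a local git repository.
    # The keyword heuristic is an illustrative assumption, not GitHub's methodology.
    import re
    import subprocess

    def bug_fix_commit_share(repo_path: str = ".") -> float:
        subjects = subprocess.run(
            ["git", "-C", repo_path, "log", "--pretty=%s"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        pattern = re.compile(r"\b(fix|fixes|fixed|bug|hotfix|regression)\b", re.IGNORECASE)
        fixes = sum(1 for subject in subjects if pattern.search(subject))
        return fixes / len(subjects) if subjects else 0.0

    print(f"Bug-fix commit share: {bug_fix_commit_share():.1%}")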
How It Is Often Misquoted
The 55% faster task completion claim for AI tools is frequently taken out of context. It is from a controlled study where participants completed a specific coding task, not an observation of production velocity.
How to Cite
GitHub. Octoverse 2024: The state of open source. GitHub, Inc., 2024. Available: octoverse.github.com