Billion-Dollar Bugs: How Software Fails

Created by Shaunak Ghosh

Understand the technical mechanics behind famous software disasters: how numbers overflow, how race conditions emerge, and how deployments go wrong. By the end, you’ll be able to explain each incident as a traceable chain of causes and name concrete engineering defenses that reduce catastrophic risk.

Requirements

Comfort with basic arithmetic and the idea of numeric ranges
Willingness to read simple pseudocode-like examples (no prior coding required)
Basic cause-and-effect reasoning about systems (inputs → internal state → outputs)

What you'll learn

Explain how fixed-size number representations lead to overflow and dangerous conversions
Trace a failure as a chain of program state changes and system assumptions
Differentiate concurrency from parallelism and identify timing-dependent race conditions

Learning path

9 modules • Each builds on the previous one

How computers store data and numbers

Build the core model: bits/bytes, binary, signed vs unsigned integers, floating-point approximation, and why every type has a limited range. This is the foundation for understanding overflows and dangerous type conversions in real incidents.

1 video • 5 min
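
As a small illustration of the ideas in this module, the sketch below uses Python's standard `ctypes` and `struct` modules to mimic fixed-size integers (Python's own integers do not overflow, so the 8-bit types here are a stand-in). The variable names are chosen for this example only.

```python
import ctypes
import struct

# An 8-bit unsigned integer holds 0..255, so 255 + 1 wraps around to 0.
wrapped_unsigned = ctypes.c_uint8(255 + 1).value

# An 8-bit signed integer holds -128..127, so 127 + 1 wraps around to -128.
wrapped_signed = ctypes.c_int8(127 + 1).value

# The same eight bits (0xFF) mean different numbers depending on the type.
as_unsigned = struct.unpack("B", b"\xff")[0]   # read as unsigned: 255
as_signed = struct.unpack("b", b"\xff")[0]     # read as signed: -1

# 0.1 and 0.2 cannot be stored exactly in binary floating point,
# so their sum is very close to 0.3 but not equal to it.
float_sum = 0.1 + 0.2

print(wrapped_unsigned, wrapped_signed, as_unsigned, as_signed, float_sum == 0.3)
# prints: 0 -128 255 -1 False
```

The point is not the specific library calls but the pattern: every fixed-size type has a range, and a value outside it is silently reinterpreted unless something checks.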

Program flow: state, branches, loops

Learn how programs execute step-by-step: sequence, conditionals, loops, and how variables represent changing state. This lets you reason about “what happened first” in bugs, which matters for both races and deployments.

1 video • 5 min
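
To make "changing state" concrete, here is a minimal sketch that records every state a short loop passes through. The function name, the `fuel` variable, and the threshold of 2 are all invented for illustration.

```python
def trace_countdown(start):
    """Return the list of (fuel, status) states a simple countdown passes through."""
    history = []
    fuel = start
    while fuel > 0:              # loop: repeat while the condition holds
        if fuel <= 2:            # branch: behavior depends on the current state
            status = "low"
        else:
            status = "ok"
        history.append((fuel, status))
        fuel -= 1                # state change: the next iteration sees a new value
    return history


steps = trace_countdown(4)
print(steps)
# prints: [(4, 'ok'), (3, 'ok'), (2, 'low'), (1, 'low')]
```

Reading the recorded history top to bottom is the same skill used later to reconstruct what happened first in a real incident.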

Ariane 5: the overflow chain reaction

Analyze the Ariane 5 failure as a conversion-and-range problem: a 64-bit floating-point value was converted into a 16-bit signed integer that couldn’t represent it, triggering an unhandled exception that shut down the inertial reference system and cascaded into loss of control. You’ll connect representation limits to real system behavior and “bad default” failure handling.

1 video • 5 min
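
The conversion problem can be sketched in a few lines. This is an illustration in Python, not the flight software; the function names and the sample values are assumptions made for this example. It contrasts two policies for a value that does not fit: raise an error, or clamp to the nearest limit.

```python
INT16_MIN, INT16_MAX = -32768, 32767   # range of a 16-bit signed integer


def to_int16_checked(value: float) -> int:
    """Convert a finite float to a 16-bit signed integer, raising if it does not fit."""
    truncated = int(value)   # drops the fractional part
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Convert a finite float to a 16-bit signed integer, clamping to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_checked(1200.7))        # prints: 1200
print(to_int16_saturating(40000.0))    # prints: 32767
```

Calling `to_int16_checked(40000.0)` raises `OverflowError`. Which policy is safer depends on the system: an error that nothing handles can be worse than a clamped value, and a silently clamped value can be worse than a handled error.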

Designing for safety plus testing basics

Turn lessons into engineering practice: hazards vs failures, fail-safe defaults, defense-in-depth, and how testing/QA supports safety (boundary tests for overflow, stress tests for concurrency, reviews, static analysis). This frames safety as a lifecycle, not a single technique.

1 video • 6 min
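
Boundary testing, mentioned above, can be sketched as a helper that lists the inputs worth trying around each limit of a range. The helper name is hypothetical, and the 16-bit range is just a convenient example.

```python
def boundary_values(low: int, high: int) -> list[int]:
    """Return test inputs just outside, on, and just inside each limit of [low, high]."""
    candidates = [low - 1, low, low + 1, high - 1, high, high + 1]
    # Drop duplicates (possible for very narrow ranges) while keeping order.
    return list(dict.fromkeys(candidates))


cases = boundary_values(-32768, 32767)
print(cases)
# prints: [-32769, -32768, -32767, 32766, 32767, 32768]
```

The first and last values are the ones most likely to expose an overflow, because they sit one step outside the range the code was written for.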

Concurrency: shared state and race conditions

Learn why “two things happening near the same time” breaks naive reasoning: interleavings, shared resources, atomicity, and nondeterminism. Then define race conditions precisely and see the standard mitigation tools (locks, message passing, and designing to avoid shared mutable state).

2 videos • 9 min
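
Real races are timing-dependent and hard to reproduce, so the sketch below replays two fixed interleavings by hand instead of starting real threads. Two hypothetical workers, A and B, each try to add 1 to a shared counter using a separate read step and write step; the function name and the schedule format are assumptions made for this example.

```python
def run_interleaving(schedule):
    """Replay read/write steps of workers that each increment a shared counter.

    `schedule` is a sequence of (worker, step) pairs, where step is "read" or "write".
    """
    shared = {"counter": 0}
    local = {}   # each worker's private copy of the value it last read
    for worker, step in schedule:
        if step == "read":
            local[worker] = shared["counter"]
        else:  # "write": store the stale-or-fresh local value plus one
            shared["counter"] = local[worker] + 1
    return shared["counter"]


# One worker finishes before the other starts: both increments count.
serial = run_interleaving([("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")])

# Both workers read before either writes: one increment is lost.
racy = run_interleaving([("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")])

print(serial, racy)
# prints: 2 1
```

A lock removes the second schedule from the set of possible executions by making each worker's read and write a single indivisible step.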

Therac-25: race conditions in medical devices

Study how a software race condition, combined with confusing UI feedback and missing hardware interlocks, led to lethal radiation overdoses. You’ll map the real-world behavior to shared-state timing bugs and see why safety-critical systems need layered protections beyond “the code seems to work.”

1 video • 5 min

The software deployment process, step-by-step

Learn how code moves from a laptop to production: build artifacts, configuration, environment parity, rollouts (blue/green, canary), and rollback. Focus on how mismatched versions and partial rollouts create failure modes even when the code is “correct.”

1 video • 6 min
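
A canary rollout ends with a decision: promote the new version to everyone, or roll it back. The sketch below shows one very simplified form of that decision. The function name, the use of error rate as the only signal, and the default tolerance of 0.01 are all assumptions for illustration; real rollout tooling weighs many more signals.

```python
def canary_decision(baseline_error_rate, canary_error_rate, tolerance=0.01):
    """Return "rollback" if the canary is clearly worse than baseline, else "promote"."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "promote"


print(canary_decision(0.02, 0.025))   # prints: promote
print(canary_decision(0.02, 0.10))    # prints: rollback
```

The design choice worth noticing is that rollback is an explicit, automatic outcome rather than something a person has to remember to do under pressure.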

Knight Capital: $440M deployment mistake

Reconstruct the Knight Capital incident as a change-control failure: new code reached some servers while one kept running old, incompatible code, enabling a dormant path that flooded markets with erroneous orders. You’ll connect deployment mechanics to runaway real-world effects and prevention techniques.

1 video • 5 min
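
One prevention technique for this kind of partial rollout is a version-parity check: before switching on a new feature, confirm that every server runs the expected build. The sketch below is a hypothetical illustration; the server names, version strings, and function names are invented and do not describe Knight Capital's actual systems.

```python
def find_version_mismatches(deployed: dict[str, str], expected: str) -> list[str]:
    """Return the sorted names of servers not running the expected version."""
    return sorted(name for name, version in deployed.items() if version != expected)


def safe_to_enable_flag(deployed: dict[str, str], expected: str) -> bool:
    """Allow the new feature only when every server runs the expected version."""
    return not find_version_mismatches(deployed, expected)


# Hypothetical fleet in which one server was missed during the rollout.
fleet = {"srv-1": "2.0", "srv-2": "2.0", "srv-3": "1.9", "srv-4": "2.0"}

print(find_version_mismatches(fleet, "2.0"))   # prints: ['srv-3']
print(safe_to_enable_flag(fleet, "2.0"))       # prints: False
```

A check like this turns "we think the deploy finished" into a condition the system verifies before any new behavior is enabled.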

Blackout of 2025: cloud fragility

Explore how modern internet failures cascade: shared dependencies, control-plane vs data-plane failures, DNS/CDN fragility, rate limits, and amplification effects. Use AWS and Cloudflare outage patterns to learn how “half the internet” can fail from surprisingly small triggers, and what resilience design looks like.

1 video • 4 min
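
Amplification is easier to see with arithmetic. The sketch below uses a deliberately simplified model, which is an assumption for illustration and not a description of any specific outage: a request passes through several layers of services, and each layer independently makes a fixed number of attempts when the call beneath it fails.

```python
def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case requests reaching a failing dependency for one user request,
    when each of `layers` calling layers makes `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers


print(worst_case_amplification(3, 1))   # prints: 3
print(worst_case_amplification(3, 3))   # prints: 27
```

With three attempts at each of three layers, a single user request can become 27 requests against a service that is already struggling, which is one way a small trigger grows into a wide outage.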

Modules: 9

Duration: 45 min

Science-backed learning

In-video quizzes and scaffolded content to maximize retention.

Key concepts

Bits, Bytes, And Why Number Formats Have Limits
Range Assumptions And Overflow Risk In Real Systems
Basic Program Flow For Tracing Cause And Effect
