Created by Shaunak Ghosh
Understand the technical mechanics behind famous software disasters—how numbers overflow, how race conditions emerge, and how deployments go wrong. By the end, you’ll be able to explain each incident as a chain of computable causes and name concrete engineering defenses that reduce catastrophic risk.
9 modules • Each builds on the previous one
Build the core model: bits/bytes, binary, signed vs unsigned integers, floating-point approximation, and why every type has a limited range. This is the foundation for understanding overflows and dangerous type conversions in real incidents.
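As a small illustration of limited ranges, the sketch below computes the range of a fixed-width integer and emulates what happens when a value steps past it. Python integers are unbounded, so the wraparound is simulated with bit masks; the helper names (`int_range`, `wrap_unsigned`, `wrap_signed`) are made up for this example and are not part of the course materials.

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return (min, max) representable in a two's-complement or unsigned integer."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def wrap_unsigned(value: int, bits: int) -> int:
    """Keep only the low `bits` bits, as fixed-width unsigned arithmetic would."""
    return value & ((1 << bits) - 1)


def wrap_signed(value: int, bits: int) -> int:
    """Reinterpret the low `bits` bits as a two's-complement signed integer."""
    low = value & ((1 << bits) - 1)
    return low - (1 << bits) if low >= (1 << (bits - 1)) else low


if __name__ == "__main__":
    print(int_range(16, signed=True))    # range of a 16-bit signed integer
    print(wrap_unsigned(255 + 1, 8))     # one past the 8-bit unsigned maximum
    print(wrap_signed(32767 + 1, 16))    # one past the 16-bit signed maximum
    print(0.1 + 0.2 == 0.3)              # floating-point values are approximations
```

Whether a real language wraps, traps, or leaves the result undefined on overflow depends on the language and type; the masking here models only the wraparound case.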
Learn how programs execute step-by-step: sequence, conditionals, loops, and how variables represent changing state. This lets you reason about “what happened first” in bugs, which matters for both races and deployments.
Analyze the Ariane 5 failure as a conversion-and-range problem: a 64-bit floating-point value was converted into a 16-bit signed integer that couldn’t represent it, triggering an exception that cascaded into loss of control. You’ll connect representation limits to real system behavior and “bad default” failure handling.
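The conversion problem can be sketched in a few lines. This is an illustrative model only, not the flight software: the function names and the sample value are hypothetical, and finite inputs are assumed. It contrasts two handling policies for a floating-point value that does not fit in a 16-bit signed integer: raising an error, or clamping to the nearest representable value.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class ConversionOverflow(Exception):
    """Raised when a value cannot be represented as a 16-bit signed integer."""


def to_int16_checked(x: float) -> int:
    """Convert by truncation, refusing values outside the 16-bit signed range."""
    n = int(x)  # truncates toward zero; assumes x is finite
    if not INT16_MIN <= n <= INT16_MAX:
        raise ConversionOverflow(f"{x!r} does not fit in int16")
    return n


def to_int16_saturating(x: float) -> int:
    """Convert by truncation, clamping out-of-range values to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


if __name__ == "__main__":
    reading = 40000.0  # hypothetical sensor-derived value, larger than INT16_MAX
    print(to_int16_saturating(reading))
    try:
        to_int16_checked(reading)
    except ConversionOverflow as exc:
        # What the caller does here is the failure-handling policy:
        # shutting down, falling back, or continuing with a clamped value.
        print("conversion failed:", exc)
```

Neither policy is automatically safe. The point of the sketch is that the behavior on an out-of-range value is a design decision that has to be made deliberately for each conversion.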
Turn lessons into engineering practice: hazards vs failures, fail-safe defaults, defense-in-depth, and how testing/QA supports safety (boundary tests for overflow, stress tests for concurrency, reviews, static analysis). This frames safety as a lifecycle, not a single technique.
Learn why “two things happening near the same time” breaks naive reasoning: interleavings, shared resources, atomicity, and nondeterminism. Then define race conditions precisely and see the standard mitigation tools (locks, message passing, and designing to avoid shared mutable state).
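A minimal sketch of these ideas follows. Because real races are nondeterministic, the lost update is shown by writing out one bad interleaving by hand rather than by hoping threads collide; the locked counter then shows one of the standard mitigations. The class and function names are invented for this example.

```python
import threading


def lost_update_demo() -> int:
    """Replay one unlucky interleaving of two read-modify-write increments."""
    shared = 0
    a_read = shared       # "thread A" reads 0
    b_read = shared       # "thread B" reads 0 before A writes back
    shared = a_read + 1   # A writes 1
    shared = b_read + 1   # B also writes 1, overwriting A's update
    return shared         # two increments happened, but only one is visible


class LockedCounter:
    """Counter whose increment is made atomic with a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:  # only one thread at a time runs the read-modify-write
            self.value += 1


def run_locked(n_threads: int = 4, n_increments: int = 10_000) -> int:
    """Increment a shared LockedCounter from several threads and return the total."""
    counter = LockedCounter()

    def worker() -> None:
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(lost_update_demo())
    print(run_locked())
```

The lock removes the bad interleavings by construction, which is why the locked total is the same on every run even though the thread scheduling is not.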
Study how a software race condition in the Therac-25 radiation therapy machine, combined with confusing UI feedback and missing hardware interlocks, led to lethal radiation overdoses. You’ll map the real-world behavior to shared-state timing bugs and see why safety-critical systems need layered protections beyond “the code seems to work.”
Learn how code moves from a laptop to production: build artifacts, configuration, environment parity, rollouts (blue/green, canary), and rollback. Focus on how mismatched versions and partial rollouts create failure modes even when the code is “correct.”
Reconstruct the Knight Capital incident as a change-control failure: new code reached some servers while one kept running old, incompatible code, enabling a dormant path that flooded markets with erroneous orders. You’ll connect deployment mechanics to runaway real-world effects and prevention techniques.
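One prevention technique for this kind of partial rollout is a fleet-wide version check before new behavior is switched on. The sketch below is a hedged illustration, not a description of any real firm's tooling: the host names, version strings, and function names are all hypothetical.

```python
def stale_hosts(fleet_versions: dict[str, str], expected: str) -> list[str]:
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in fleet_versions.items() if version != expected)


def safe_to_enable(fleet_versions: dict[str, str], expected: str) -> bool:
    """Allow activation only when every host runs the expected version."""
    return not stale_hosts(fleet_versions, expected)


if __name__ == "__main__":
    # Hypothetical fleet in which one server was missed during the rollout.
    fleet = {
        "server-1": "2.0",
        "server-2": "2.0",
        "server-3": "1.9",
    }
    print(stale_hosts(fleet, expected="2.0"))
    print(safe_to_enable(fleet, expected="2.0"))
```

A check like this only covers version skew. It complements, rather than replaces, staged rollouts and a tested rollback path.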
Explore how modern internet failures cascade: shared dependencies, control-plane vs data-plane failures, DNS/CDN fragility, rate limits, and amplification effects. Use AWS and Cloudflare outage patterns to learn how “half the internet” can fail from surprisingly small triggers—and what resilience design looks like.
Begin your learning journey
In-video quizzes and scaffolded content to maximize retention.