The Mars Pathfinder mission landed on Mars on July 4, 1997, and quickly became a public sensation, returning images and deploying the small Sojourner rover. A few days into surface operations the lander began experiencing total system resets, losing data each time and interrupting the flow of science. The behavior was intermittent and alarming: the spacecraft was otherwise healthy, yet it kept rebooting itself. The definitive firsthand account is “What really happened on Mars?”, written by Glenn E. Reeves, who led the software team for the Pathfinder spacecraft at the Jet Propulsion Laboratory.
As Reeves explains, the resets were a case of priority inversion. Pathfinder ran the VxWorks real-time operating system, and several tasks shared an information bus protected by a mutex semaphore. A high-priority bus-management task (bc_dist) needed that mutex. A low-priority meteorological data task (ASI/MET) sometimes held the same mutex. When the low-priority task held the lock and was then preempted by medium-priority tasks that ran for a long time, the high-priority task was left blocked, waiting on a lock that the low-priority task could not release because it never got to run. A watchdog timer noticed that the high-priority task had missed its deadline, concluded something was badly wrong, and reset the system.
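The mechanism can be made concrete with a toy model. The following Python sketch is not Pathfinder's flight software and does not use the VxWorks API; it is a minimal tick-based simulation, assuming one CPU, strict fixed-priority preemption, and a single mutex with no priority inheritance. The task names low, medium, and high, their run times, and the HIGH_DEADLINE value are hypothetical stand-ins for the ASI/MET task, the medium-priority tasks, and bc_dist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Task:
    name: str
    priority: int                      # larger number = more urgent
    release: int                       # tick at which the task becomes ready
    remaining: int                     # ticks of CPU time still needed
    needs_lock: bool                   # True if the work is done while holding the shared mutex
    finished_at: Optional[int] = None  # tick at which the task completed


def simulate(tasks: List[Task], max_ticks: int = 50) -> List[str]:
    """Fixed-priority preemptive scheduling on one CPU, one mutex, no inheritance."""
    owner: Optional[Task] = None       # task currently holding the mutex
    trace: List[str] = []              # name of the task that ran in each tick
    for tick in range(max_ticks):
        if all(t.remaining == 0 for t in tasks):
            break
        ready = [t for t in tasks if t.release <= tick and t.remaining > 0]
        # A task that needs the mutex cannot run while another task owns it.
        runnable = [t for t in ready
                    if not t.needs_lock or owner is None or owner is t]
        if not runnable:
            trace.append("idle")
            continue
        current = max(runnable, key=lambda t: t.priority)
        if current.needs_lock:
            owner = current            # acquire (or keep holding) the mutex
        current.remaining -= 1
        trace.append(current.name)
        if current.remaining == 0:
            current.finished_at = tick + 1
            if owner is current:
                owner = None           # release the mutex on completion
    return trace


# Hypothetical workload: "low" takes the lock first, then "medium" and "high" arrive.
low = Task("low", priority=1, release=0, remaining=3, needs_lock=True)
medium = Task("medium", priority=2, release=1, remaining=6, needs_lock=False)
high = Task("high", priority=3, release=1, remaining=2, needs_lock=True)

trace = simulate([low, medium, high])

HIGH_DEADLINE = 6  # hypothetical: the tick by which "high" must have finished
watchdog_reset = high.finished_at is None or high.finished_at > HIGH_DEADLINE

print(" ".join(trace))
print("high finished at tick", high.finished_at, "reset:", watchdog_reset)
```

Under these assumed numbers the high-priority task needs only two ticks of work but does not finish until tick 11, well past its deadline, because the medium-priority task monopolizes the CPU while the low-priority task still holds the lock.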
What made the diagnosis remarkable was the discipline behind it. The same scenario had been seen, rarely, during pre-launch testing on the ground but had not been root-caused before flight. After the resets began on Mars, the team reproduced the failure on an exact replica of the spacecraft computer in the laboratory by running the system under representative loads with extensive tracing enabled. The trace data showed the priority inversion directly, removing any guesswork about what was happening.
Once the team understood the problem, the fix was small and precise. VxWorks mutexes could be created with a flag enabling priority inheritance, in which a low-priority task that holds a lock temporarily inherits the priority of the highest-priority task waiting on that lock, so it runs, finishes, and releases the lock promptly. The relevant mutexes had been created without that flag. The correction was to change the creation parameters to turn priority inheritance on. As Reeves notes, this was done by patching the running system from Earth, modifying the semaphore initialization in place so the inversion could no longer occur.
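The effect of that flag can be illustrated with a small simulation. This is a hedged sketch, not the actual patch: it models one CPU, strict fixed-priority preemption, and a single mutex, and the inherit parameter stands in for the priority-inheritance option (in the VxWorks API this behavior is requested with the SEM_INVERSION_SAFE option to semMCreate, which the sketch does not call). The task names, priorities, and run times are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(eq=False)
class Task:
    name: str
    priority: int     # base priority; larger number = more urgent
    release: int      # tick at which the task becomes ready
    remaining: int    # ticks of CPU time still needed
    needs_lock: bool  # True if the work is done while holding the shared mutex


def finish_times(inherit: bool, max_ticks: int = 50) -> Dict[str, int]:
    """Return the tick at which each task completes, with or without inheritance."""
    tasks: List[Task] = [
        Task("low", priority=1, release=0, remaining=3, needs_lock=True),
        Task("medium", priority=2, release=1, remaining=6, needs_lock=False),
        Task("high", priority=3, release=1, remaining=2, needs_lock=True),
    ]
    owner: Optional[Task] = None
    finished: Dict[str, int] = {}
    for tick in range(max_ticks):
        if len(finished) == len(tasks):
            break
        ready = [t for t in tasks if t.release <= tick and t.remaining > 0]
        # Tasks waiting for a mutex that another task currently owns.
        blocked = [t for t in ready
                   if t.needs_lock and owner is not None and owner is not t]
        runnable = [t for t in ready if t not in blocked]
        if not runnable:
            continue

        def effective_priority(t: Task) -> int:
            # With inheritance, the lock owner runs at the priority of its
            # most urgent waiter; otherwise every task keeps its base priority.
            if inherit and t is owner and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        current = max(runnable, key=effective_priority)
        if current.needs_lock:
            owner = current          # acquire (or keep holding) the mutex
        current.remaining -= 1
        if current.remaining == 0:
            finished[current.name] = tick + 1
            if owner is current:
                owner = None         # release the mutex on completion
    return finished


without_inheritance = finish_times(inherit=False)
with_inheritance = finish_times(inherit=True)
print("high finishes at tick", without_inheritance["high"], "without inheritance")
print("high finishes at tick", with_inheritance["high"], "with inheritance")
```

With the assumed workload, the high-priority task's completion moves from tick 11 to tick 5. The medium-priority task finishes later as a result, which is the intended trade: the lock holder is briefly boosted so that the most urgent waiter is not starved.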
The episode became one of the most cited real-world examples of priority inversion (see priority-inversion), precisely because it was both consequential and cleanly resolved. It illustrated a recurring lesson about commercial off-the-shelf real-time software: a default configuration choice made for performance, in this case leaving priority inheritance off, can have severe consequences in a concurrent system. Reeves distilled the takeaway bluntly: when flying off-the-shelf components, you must make sure you know how they actually work.
The Pathfinder story is also a model of root-cause analysis under pressure: a reproducible test bed, comprehensive tracing, and the refusal to ship a guess to a spacecraft millions of miles away. The bug had been latent and hard to catch because it depended on rare timing among independent tasks, a hallmark it shares with race conditions (see race-condition), yet once observed it pointed straight to a well-understood remedy from the real-time systems literature.