This was caused by a software failure called priority inversion on Pathfinder. Pathfinder ran VxWorks, a real-time embedded operating system that schedules all of the tasks and manages system resources such as memory. Tasks are scheduled so that no single task runs continuously; instead, the execution of the tasks is divided into time slots and interleaved. Each task is assigned a priority according to its importance. If the system is running task1 in time slot n and a task2 with higher priority than task1 is ready at time slot n+1, task2 runs in that slot. Otherwise task1 continues.
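The slot-by-slot rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not VxWorks code; the function name `schedule` and the tuple layout (priority, arrival slot, slots needed) are made up for the example.

```python
def schedule(tasks, slots):
    """Pick, for every time slot, the ready task with the highest priority.

    tasks: dict mapping name -> (priority, arrival_slot, slots_needed).
    A larger priority number means a more important task (an assumption
    of this sketch, not a VxWorks convention).
    """
    remaining = {name: need for name, (_, _, need) in tasks.items()}
    timeline = []
    for n in range(slots):
        # A task is ready once it has arrived and still has work left.
        ready = [name for name, (_, arrival, _) in tasks.items()
                 if arrival <= n and remaining[name] > 0]
        if not ready:
            timeline.append(None)  # nothing to run in this slot
            continue
        current = max(ready, key=lambda name: tasks[name][0])
        remaining[current] -= 1
        timeline.append(current)
    return timeline


# task1 starts first; task2 arrives at slot 1 with a higher priority.
print(schedule({"task1": (1, 0, 3), "task2": (2, 1, 2)}, 5))
# ['task1', 'task2', 'task2', 'task1', 'task1']
```

task1 is interrupted as soon as the more important task2 becomes ready, and it continues only after task2 has finished.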
The same system uses a shared resource, the “information bus”, which works much like a piece of shared memory that tasks read from and write to. As explained above, the tasks are interleaved in time. Without a protection mechanism, data that one task stores in the shared memory can be overwritten or misread by tasks that are scheduled later. The mechanism Pathfinder used to keep the data consistent is simple: whenever a task finds the shared memory in use by another task, it waits for that task to finish with it, regardless of priority.
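A minimal sketch of this "wait until the other task is done" rule, using Python threads and a lock. The names `bus`, `bus_lock`, `task` and `run` are invented for the illustration and say nothing about how the flight software was actually written.

```python
import threading

bus = {"count": 0}            # stand-in for the shared "information bus"
bus_lock = threading.Lock()   # only one task may use the bus at a time


def task(updates):
    for _ in range(updates):
        with bus_lock:                # wait here while another task holds the bus
            value = bus["count"]      # read ...
            bus["count"] = value + 1  # ... and write back as one protected step


def run(num_tasks=4, updates=10_000):
    """Start several tasks that all update the bus, then return the final count."""
    bus["count"] = 0
    threads = [threading.Thread(target=task, args=(updates,))
               for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bus["count"]


print(run())  # 40000: no update is lost because each one holds the lock
```

Note that the lock does not look at priorities at all: whoever holds it keeps it until done, which is exactly what makes the failure described next possible.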
Three tasks were involved in the failure: the meteorological data gathering task (low priority), the data transmission task (medium priority) and the information management task (high priority). Just before the failure, the low priority meteorological data gathering task was running and holding the shared memory. Before it could release the shared memory, it was interrupted because the high priority information management task needed to run and was given the next time slot. However, the information management task needed the shared memory, which was still held by the meteorological data gathering task, so it stopped and waited for the shared memory to be released. The meteorological data gathering task then ran again, but before it released the shared memory, the medium priority data transmission task was scheduled. This task took a very long time to finish. During that time the high priority task could not run, because it was waiting for the shared memory to be released, and the low priority task could not release the shared memory, because its low priority gave it no chance to run ahead of the medium priority task. Both the high priority and the low priority tasks were stuck. When the high priority task had waited too long, the system concluded that something had gone wrong and reset itself completely, much like pressing the reset button on your desktop computer when the mouse cursor has stopped responding for a long time.
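The sequence of events can be replayed with a small tick-by-tick simulation. This is a rough sketch under stated assumptions: the task names, arrival times, amounts of work and the watchdog limit of 5 ticks are invented for illustration and are not taken from the mission. The optional `inherit` flag sketches priority inheritance, the usual remedy for priority inversion, in which the task holding the resource temporarily runs at the priority of the highest task waiting for it.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int     # larger number = more important (assumption of this sketch)
    arrival: int      # tick at which the task becomes ready
    work: int         # ticks of CPU time the task needs
    needs_bus: bool   # True if it holds the shared memory while it works


def simulate(tasks, ticks, inherit=False):
    """Return the name of the task that runs in each tick ('idle' if none)."""
    remaining = {t.name: t.work for t in tasks}
    holder = None     # task currently holding the shared memory
    timeline = []
    for now in range(ticks):
        ready = [t for t in tasks if t.arrival <= now and remaining[t.name] > 0]
        eff = {t.name: t.priority for t in ready}  # effective priorities
        if inherit and holder is not None:
            waiting = [t.priority for t in ready
                       if t.needs_bus and t is not holder]
            if waiting:  # holder borrows the priority of its highest waiter
                eff[holder.name] = max(eff[holder.name], max(waiting))
        # A task that needs the shared memory is blocked while another holds it.
        runnable = [t for t in ready
                    if not (t.needs_bus and holder is not None and holder is not t)]
        if not runnable:
            timeline.append("idle")
            continue
        current = max(runnable, key=lambda t: eff[t.name])
        if current.needs_bus:
            holder = current
        remaining[current.name] -= 1
        if remaining[current.name] == 0 and holder is current:
            holder = None  # finished: release the shared memory
        timeline.append(current.name)
    return timeline


def pathfinder_tasks():
    # Hypothetical numbers, chosen only to reproduce the pattern described above.
    return [Task("low", 1, arrival=0, work=3, needs_bus=True),
            Task("high", 3, arrival=1, work=2, needs_bus=True),
            Task("medium", 2, arrival=2, work=5, needs_bus=False)]


def wait_of(timeline, name, arrival):
    """Ticks between a task becoming ready and its first tick on the CPU."""
    return timeline.index(name) - arrival


WATCHDOG_LIMIT = 5  # hypothetical: reset if the high priority task waits longer

plain = simulate(pathfinder_tasks(), 10)
print(plain)
# ['low', 'low', 'medium', 'medium', 'medium', 'medium', 'medium', 'low', 'high', 'high']
print(wait_of(plain, "high", 1) > WATCHDOG_LIMIT)    # True: waited 7 ticks, reset

fixed = simulate(pathfinder_tasks(), 10, inherit=True)
print(fixed)
# ['low', 'low', 'low', 'high', 'high', 'medium', 'medium', 'medium', 'medium', 'medium']
print(wait_of(fixed, "high", 1) > WATCHDOG_LIMIT)    # False: waited 2 ticks
```

In the first run the medium priority task, which never touches the shared memory, still delays the high priority task by occupying the CPU that the low priority task needs in order to finish and release it.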
Posted by Jin Yunye, U037842W
Mike Jones, "What Really Happened on Mars Rover Pathfinder" http://www.cs.berkeley.edu/~brewer/cs262/PriorityInversion.html
Wikipedia, Priority Inversion http://en.wikipedia.org/wiki/Priority_inversion
Wikipedia, Mars Pathfinder http://en.wikipedia.org/wiki/Mars_Pathfinder