Tuesday, April 03, 2007

Space Exploration: Software Failure in the Mars Pathfinder

On December 4, 1996, NASA launched its space exploration robot, Mars Pathfinder, as a demonstration of a “faster, better and cheaper” spacecraft. It landed successfully on Mars on July 4, 1997 and started collecting soil samples, conducting chemical analysis and transmitting image data back to Earth. However, a few days after the landing, the robot started to experience frequent total system resets. These interrupted normal operation and caused serious data loss.

The resets were caused by a software failure known as priority inversion. Pathfinder ran a real-time embedded operating system called VxWorks, which is in charge of scheduling all the tasks and managing system resources such as memory. Tasks are not each run continuously to completion. Instead, the execution time of each task is divided into pieces, and the pieces of different tasks are interleaved. Each task is assigned a priority according to its importance. If the system is performing task1 in time slot n and a task2 with higher priority is ready in time slot n+1, task2 is performed. Otherwise task1 continues.
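The slot-by-slot rule above can be sketched in a few lines of Python. This is only an illustration, not VxWorks code: the `Task` fields, the `schedule` function, the arrival times and the work amounts are all made up for the example, and a larger number is assumed to mean a higher priority.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int   # assumption: larger number = more important
    arrival: int    # time slot at which the task becomes ready
    remaining: int  # time slots of work still to do


def schedule(tasks):
    """Return the name of the task that runs in each time slot."""
    timeline = []
    slot = 0
    while any(t.remaining > 0 for t in tasks):
        ready = [t for t in tasks if t.arrival <= slot and t.remaining > 0]
        if ready:
            # In every slot the ready task with the highest priority runs.
            current = max(ready, key=lambda t: t.priority)
            current.remaining -= 1
            timeline.append(current.name)
        else:
            timeline.append("idle")
        slot += 1
    return timeline


# task1 starts first; task2 has higher priority and becomes ready in slot 1.
print(schedule([Task("task1", 1, 0, 3), Task("task2", 2, 1, 2)]))
# ['task1', 'task2', 'task2', 'task1', 'task1']
```

task1 is interrupted as soon as task2 is ready, and only continues once task2 has finished.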

The same system uses a shared resource, the “information bus”, which works much like a shared piece of memory that tasks read from and write to. As explained above, the tasks are interleaved in time. Without a protection mechanism, data stored in the shared memory by one task could be overwritten or misread by tasks that are scheduled later. The mechanism Pathfinder used to ensure data consistency is simple: whenever a task sees that the shared memory is in use by another task, regardless of priority, it waits for the other task to finish using it.
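A rough sketch of this wait-until-free rule, using an ordinary lock from Python's `threading` module. The `SharedBus` class, `run_writers` and the task names are hypothetical stand-ins for illustration, not the Pathfinder software.

```python
import threading


class SharedBus:
    """Hypothetical stand-in for the shared memory (not the Pathfinder code)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.records = []

    def write(self, task_name, values):
        # A task that finds the mutex taken blocks here until the holder
        # releases it, whatever the priorities of the two tasks are.
        with self._mutex:
            for value in values:
                self.records.append((task_name, value))


def run_writers(bus, writers):
    """Start one thread per (task_name, values) pair and wait for all of them."""
    threads = [threading.Thread(target=bus.write, args=(name, values))
               for name, values in writers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


bus = SharedBus()
run_writers(bus, [("weather", range(1000)), ("comms", range(1000))])
print(len(bus.records))  # 2000
```

Because each task holds the mutex for its whole write, the records of one task are never mixed with those of the other. The price is that a waiting task makes no progress until the holder is done.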

Three tasks were involved in the failure: the meteorological data gathering task with low priority, the data transmission task with medium priority and the information management task with high priority. Just before the failure occurred, the low priority meteorological data gathering task was running and holding the shared memory. Before it could release the shared memory, it was stopped because the high priority information management task needed to run and was given the next time slot. However, the information management task needed access to the shared memory, which was still held by the low priority task, so it stopped and waited for the shared memory to be released. The meteorological data gathering task got to run again.

However, before it released the shared memory, the medium priority data transmission task was scheduled. This task took an extremely long time to finish. During this time, the high priority task could not run because it had to wait for the shared memory to be released, and the low priority task could not release the shared memory because its low priority gave it no chance to be scheduled. Both the high priority and the low priority tasks were stuck. When a high priority task had waited too long, the system concluded that something was wrong and reset itself completely, much like pressing the reset button on your desktop when the mouse cursor has not responded for a long time.
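The sequence of events can be replayed with a small slot-by-slot simulation. This is a simplified sketch under stated assumptions, not the flight software: the names `SimTask` and `simulate`, the arrival times, the work amounts and the watchdog limit of 5 slots are all invented for illustration. A task that needs the shared memory is assumed to hold it from its first slot until it finishes.

```python
from dataclasses import dataclass


@dataclass
class SimTask:
    name: str
    priority: int    # assumption: larger number = more important
    arrival: int     # slot at which the task becomes ready
    work: int        # slots of CPU time the task needs
    needs_bus: bool  # True if the task holds the shared memory while it works
    done: int = 0


def simulate(tasks, watchdog_limit):
    """Return what ran in each slot; ends with "RESET" if the watchdog fires."""
    bus_owner = None
    starved = 0  # consecutive slots the top-priority task was ready but did not run
    top = max(tasks, key=lambda t: t.priority)
    timeline = []
    slot = 0
    while any(t.done < t.work for t in tasks):
        ready = [t for t in tasks if t.arrival <= slot and t.done < t.work]
        # A task that needs the bus cannot run while another task holds it.
        runnable = [t for t in ready
                    if not t.needs_bus or bus_owner is None or bus_owner is t]
        current = max(runnable, key=lambda t: t.priority) if runnable else None
        if current is not None:
            if current.needs_bus:
                bus_owner = current
            current.done += 1
            if current.done == current.work and bus_owner is current:
                bus_owner = None  # finished: release the shared memory
        timeline.append(current.name if current else "idle")
        top_ready = top.arrival <= slot and top.done < top.work
        if top_ready and current is not top:
            starved += 1
            if starved >= watchdog_limit:
                timeline.append("RESET")
                return timeline
        else:
            starved = 0
        slot += 1
    return timeline


def pathfinder_tasks(with_medium):
    tasks = [SimTask("meteo", 1, 0, 3, True),      # low priority, holds the bus
             SimTask("info_mgmt", 3, 1, 2, True)]  # high priority, needs the bus
    if with_medium:
        tasks.append(SimTask("transmit", 2, 2, 10, False))  # medium, long running
    return tasks


print(simulate(pathfinder_tasks(with_medium=True), watchdog_limit=5))
# ['meteo', 'meteo', 'transmit', 'transmit', 'transmit', 'transmit', 'RESET']
print(simulate(pathfinder_tasks(with_medium=False), watchdog_limit=5))
# ['meteo', 'meteo', 'meteo', 'info_mgmt', 'info_mgmt']
```

Without the medium priority task, the low priority task finishes quickly and the high priority task only waits briefly. With it, the low priority task never gets a slot to release the shared memory, and the high priority task starves until the reset.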

Posted by Jin Yunye, U037842W

Mike Jones, "What Really Happened on Mars Rover Pathfinder" http://www.cs.berkeley.edu/~brewer/cs262/PriorityInversion.html

Wikipedia, Priority Inversion http://en.wikipedia.org/wiki/Priority_inversion
Wikipedia, Mars Pathfinder http://en.wikipedia.org/wiki/Mars_Pathfinder

4 comments:

Industry said...

I am wondering if a robot can land on the sun in the future.

Xue Chao
U037176U

Anonymous said...

So far there is no material that could stay in a solid state on the sun, but it's a good sign for humans reaching other planets. Maybe in the future we could stay on Mars or another planet.

Yuzhenyu U037786A

Anonymous said...

Yeah, that is what they call deadlock. Obviously the scheduling algorithm they were using did not handle this properly, which caused all the tasks to be blocked. This is a common issue in real-time scheduling when limited resources are shared by different tasks. I guess most robot systems will use real-time scheduling to handle real-life situations, so serious consideration should be given to deadlock management and resource allocation.

Li Chao U037130N

Anonymous said...

Yes, I also think that deadlock management is a serious issue in real-time systems processing sensor fusion, as in the case of autonomous robots. In this specific case, the robot was quite high profile with a lot of money invested in it, and I guess it is a case of very careless planning.
-Nitin Batra
U048708Y