Famous number computing errors

One of the most expensive Computer Software Errors ever!

 

Everything a computer does happens with numbers. A character in the alphabet is represented by a number in computer memory, colors are numbers, and every decimal number has its own specific representation in memory. The numbers a computer uses are binary numbers [See: Why do computers use binary numbers?], and for every type of data they represent there is a definition of how the binary number is laid out. An integer, for example, is commonly used to handle whole numbers; one possible definition is that the integer has 32 bits and can only represent numbers from 0 to its maximum of 4,294,967,295 (2^32 - 1). A definition for a floating point number could be that it has 64 bits, where 1 bit is the sign, 11 bits are the exponent and 52 bits are the mantissa [See: Binary numbers].
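As a rough illustration of these two definitions, the following Python sketch prints the largest value of a 32 bit integer that starts at 0 and splits a 64 bit floating point number into its three fields. The helper name double_fields is made up for this example, and the layout assumed is the common IEEE 754 double format.

```python
import struct

# Largest value of a 32 bit integer that can only hold numbers from 0 upwards.
UINT32_MAX = 2**32 - 1

def double_fields(x):
    """Split a 64 bit float into (sign, exponent, mantissa) bit fields.

    Hypothetical helper; assumes the IEEE 754 double layout (1 + 11 + 52 bits).
    """
    # Reinterpret the 8 bytes of the float as one unsigned 64 bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # top bit
    exponent = (bits >> 52) & 0x7FF      # next 11 bits (stored with a bias)
    mantissa = bits & ((1 << 52) - 1)    # lowest 52 bits
    return sign, exponent, mantissa

print(UINT32_MAX)
print(double_fields(1.0))
print(double_fields(-0.1))
```

The exponent field is stored with an offset (bias), so the number 1.0 does not have an exponent field of 0.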

These definitions restrict the use and the possible values of a number, and this is where a programmer has to pay very close attention in certain situations. Being aware of the boundaries of a number type is very important, because ignoring them can cause complications when different number types are used in the same program (number conversion). Furthermore, there are more complex number types, like floating point numbers, which have a more complicated format and come with restrictions of their own (floating point errors, for example rounding and cancellation).

Number computing errors can be very hard to catch in testing because software often runs without any problems for years. Then suddenly the software fails because a certain calculation causes a number to be out of range or to overflow, or because a rounding error has accumulated over time, or because two number types no longer match up in a calculation and parts of a number are cut off. Such software failures have caused many disasters all around the world, from millions of dollars lost to the death of people.

 

Number computing disasters around the world

Number computing errors have caused exploding rockets, lost money, collapsed building structures and probably many more small and big malfunctions. There are also many known software bugs which are caused by floating point errors.

 

Patriot Missile Failed to Intercept


The radar of a Patriot missile system is designed so that it has to detect an incoming missile twice. Once an incoming missile is detected, the system calculates where that missile is expected to be after a certain time. If the incoming missile is detected at that position once the time has elapsed, it is confirmed that the target is actually a missile. The Patriot missile is only launched to intercept after the incoming missile has been detected a second time. This mechanism avoids false alarms and avoids shooting down other flying targets (e.g. airplanes).

The problem with the Patriot system was that time was counted in tenths of a second, and the value 1/10 used to convert that count to seconds was stored in a 24 bit number. Converting 1/10 to binary results in 0.0001100110011001100110011001100… with an infinite number of bits. When cut off to fit the 24 bit number in memory, the stored value is 0.00011001100110011001100, leaving an error of 0.0000000000000000000000011001100… in binary, which is about 0.000000095 in decimal. This means that every second, the time was off by 10*0.000000095, which adds up to about a third of a second after 100 hours of system operation. Since the speed of a Scud missile is over 1500 m/s, it can travel over half a kilometer within a third of a second. This error in the time calculation caused the Patriot system to expect the incoming missile at a wrong location for the second detection, so it treated the first detection as a false alarm. The incoming Scud missile was not intercepted and hit a barracks in Dhahran, Saudi Arabia, in February 1991, killing 28 soldiers.
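The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes that 23 binary places after the point are kept (which reproduces the error of about 0.000000095 quoted above), and it uses the 1500 m/s lower bound from the text as the Scud speed. All variable names are made up for this example.

```python
from fractions import Fraction

# Assumption: 23 binary places are kept after the binary point.
FRACTION_BITS = 23

tenth = Fraction(1, 10)

# Chop (truncate, do not round) 1/10 to FRACTION_BITS binary places.
stored = Fraction(int(tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = tenth - stored          # error for each tenth of a second

ticks = 100 * 60 * 60 * 10               # tenths of a second in 100 hours
drift_seconds = float(error_per_tick * ticks)

SCUD_SPEED_M_PER_S = 1500                # lower bound taken from the text
miss_distance_m = drift_seconds * SCUD_SPEED_M_PER_S

print(f"error per tick:  {float(error_per_tick):.3e} s")
print(f"drift after 100h: {drift_seconds:.4f} s")
print(f"distance travelled in that time: {miss_distance_m:.0f} m")
```

Exact fractions are used here so that the sketch itself does not add a floating point error of its own to the result.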

 

Pentium Floating Point Division Unit Error

A faulty Intel Pentium processor chip caused miscalculations in certain floating point divisions and had to be replaced, at a cost to Intel of several hundred million dollars.

Floating point calculations require many steps in a processor, which takes time and slows the processor down. With the Pentium processor, Intel introduced a new floating point unit (FPU) which made the processor much faster for floating point calculations. Its division algorithm needed a table (division table) with values that were used to perform the calculation steps. This table contained incorrect entries and was shipped with millions of chips in 1994. However, the error only affected the results of certain calculations within a specific range and might never show up for many users. At first, Intel played down the possible negative outcomes of erroneous calculations and only offered to replace processors for users who were able to prove that they needed the accuracy affected by the flaw. Eventually Intel decided to replace the faulty processors for anyone who asked.
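A simple check that circulated widely at the time divides two specific numbers and multiplies the result back. The sketch below reproduces it in Python; the two operands are the commonly cited pair and do not come from this article, and the function name fdiv_residual is made up for the example.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y for the commonly cited test operands.

    On a processor that divides correctly this is 0.0. Affected Pentium
    chips were widely reported to return 256 for these operands.
    """
    return x - (x / y) * y

print(fdiv_residual())
```

Because only particular operand pairs hit the faulty table entries, a check like this shows the flaw for one known case but cannot prove that a chip is free of errors in general.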

 

Ariane 5 Rocket Explosion

The $500 million Ariane 5 rocket exploded about 40 seconds after its launch in 1996. A small software failure had a big impact when it caused the attitude and guidance information to be lost.

Shortly after the launch of the rocket, the inertial guidance system produced a number which was interpreted by the rocket's on-board computer as a course change. The on-board computer then reacted as designed and tried to get back on the right course based on that number. However, even though the number from the guidance system looked like a course change, it was not. The guidance system had actually shut down because of a number conversion error which was not handled correctly. The shutdown was caused when the software attempted to convert a 64 bit floating point velocity number into a 16 bit signed integer. The value was larger than 32,767, the largest number a 16 bit signed integer can hold, which caused a number overflow error that was not handled safely.

The software had been tested many times and had previously been used successfully in the Ariane 4 rocket. However, the programmers had decided that the velocity value would never reach a level which could cause problems and therefore did not implement code to recover from such an error. That was true for Ariane 4, but Ariane 5 reached a much higher horizontal velocity early in its flight. This assumption had disastrous consequences which could have been prevented by a few lines of code.
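To give an idea of what such a few lines might look like, here is a sketch in Python of a conversion from a floating point value to a 16 bit signed integer with two possible safeguards. It is not the flight software; the function names and the sample values are invented for this example.

```python
INT16_MIN = -2**15       # -32768
INT16_MAX = 2**15 - 1    #  32767

def to_int16_checked(value):
    """Convert to a 16 bit signed integer, raising an error if it does not fit."""
    n = int(value)  # drops the fractional part (truncates toward zero)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit into 16 bits")
    return n

def to_int16_saturating(value):
    """Convert to a 16 bit signed integer, clamping to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))

# Hypothetical sample values, not real flight data.
print(to_int16_checked(1234.9))        # fits
print(to_int16_saturating(40000.0))    # too large, clamped to the maximum
try:
    to_int16_checked(40000.0)
except OverflowError as err:
    print("conversion rejected:", err)
```

Which safeguard is appropriate depends on the system: clamping keeps the program running with a limited value, while raising an error forces the caller to decide what to do.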

 

Sources: