Rocket Science & Model Risk Management

On April 25, 2023, the Hakuto-R Mission 1 lander from the Japanese company ispace, inc. crashed on the Moon. At a press conference held on May 26, ispace’s CTO, Ryo Ujiie, explained the factors that contributed to this catastrophic outcome. Mr Ujiie pointed to a software glitch as the cause of the crash.

In this post, I would like to explain why this is actually a model risk incident.

The Incident: What Happened?

The lunar lander is equipped with sensors such as accelerometers, gyroscopes, and radar, and uses their readouts to infer its position. Because sensor data is noisy, engineers use noise filters as well as algorithms that disable a sensor's input should a malfunction be detected.

Unfortunately, during its descent, the lunar lander passed over a cliff, triggering a sudden large change in the radar altimeter’s readout. The data control algorithm misinterpreted this as a malfunction, resulting in the deactivation of the altimeter’s signal. Without this crucial input, the lander miscalculated its position, eventually running out of fuel and crashing.
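To make the failure mode concrete, here is a minimal sketch of a plausibility check of this kind. It is a toy illustration, not ispace's flight software: the function name `track_altitude`, the threshold `max_jump`, the constant descent rate, and all numbers are assumptions chosen for the example.

```python
def track_altitude(radar_readings, descent_per_step, max_jump):
    """Toy altitude tracker with a plausibility check on the radar altimeter.

    Returns the altitude estimate at every step and a flag saying whether
    the radar is still trusted at the end. All logic here is hypothetical.
    """
    estimate = radar_readings[0]
    estimates = [estimate]
    radar_enabled = True
    for reading in radar_readings[1:]:
        # Dead-reckoning prediction from the assumed constant descent rate.
        predicted = estimate - descent_per_step
        # A reading far from the prediction is treated as a sensor fault,
        # and the radar is ignored for the rest of the descent.
        if radar_enabled and abs(reading - predicted) > max_jump:
            radar_enabled = False
        estimate = reading if radar_enabled else predicted
        estimates.append(estimate)
    return estimates, radar_enabled


# Illustrative radar altitudes in metres: the ground drops away by 3000 m
# at the fourth sample, as it would when passing over a cliff.
readings = [5000.0, 4900.0, 4800.0, 7700.0, 7600.0, 7500.0]
estimates, radar_enabled = track_altitude(
    readings, descent_per_step=100.0, max_jump=500.0
)
error = readings[-1] - estimates[-1]  # gap between radar and internal estimate
```

In this toy run the correct reading at the cliff edge is rejected as a fault, the radar stays disabled, and the internal estimate ends 3000 m below what the radar reports. The check behaves exactly as designed; the problem is that it was designed for terrain without such a step.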

The Cause: Why Did It Happen?

Of course, the engineers at ispace had extensively tested the landing algorithm under a variety of conditions. However, after this testing had been concluded, the decision was made to change the location of the landing site. As a result, the software had not been tested as exhaustively with data simulating the new flight path, which involved flying over a cliff during approach, and this led to a critical oversight.

The Solution: Model Risk Management

Model Risk Management (MRM), with its blend of qualitative and quantitative aspects, can significantly mitigate such risks. 

A cornerstone of MRM is to ensure that the model lifecycle process is well-designed and appropriately implemented. A typical lifecycle process is shown in Fig. 1, with steps such as the ideation/prototyping phase, an independent validation phase after development, deployment, production and retirement. Each of these phases in turn consists of various subprocesses.

Fig. 1: A typical model lifecycle process

One aspect of the lifecycle process is to identify clear conditions for when a new independent review of the algorithm is necessary. In the case of the lunar lander, a significant change in the flight path during approach qualifies as a material change warranting a new independent review. 

Determining whether a change in the data is material is a question that is frequently encountered in modelling. Depending on the field, it is called a data representativity, a data drift, or a covariate shift issue.

To illustrate this in the case of the Hakuto-R Mission, the task at hand is to decide whether the time series generated by the sensors during the approach to the new landing site would deviate significantly from the time series corresponding to the previous landing site. One way to analyse this is to compute a statistical fingerprint of each simulated approach time series and to cluster these fingerprints to identify which types of paths have been tested. If the new landing site produces time series significantly different from the earlier ones, the clustering algorithm will assign the new data to a novel cluster.

This is illustrated in Fig. 2, where we have built a toy model that was tested with two types of trajectories: landing on a flat surface (Cluster 0) and landing over a region with a shallow well (Cluster 1). The new trajectory (shown as Cluster 2) passes over a cliff and is clearly different from the others.

Fig. 2: Studying various landing trajectories
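The fingerprint-and-cluster idea can be sketched in a few lines of Python. This is not the code behind the figure: the fingerprint (largest deviation of a step from the median step, plus the standard deviation of the steps), the simple threshold-based "leader" clustering, the `radius` value, and the synthetic altitude profiles are all assumptions made to keep the example short.

```python
import math
from statistics import median, pstdev


def fingerprint(altitudes):
    """Two summary statistics of the step-to-step altitude changes."""
    steps = [b - a for a, b in zip(altitudes, altitudes[1:])]
    typical = median(steps)
    largest_deviation = max(abs(s - typical) for s in steps)
    return (largest_deviation, pstdev(steps))


def leader_cluster(points, radius):
    """Assign each point to the nearest existing cluster leader within
    `radius`; otherwise the point becomes the leader of a new cluster."""
    leaders, labels = [], []
    for p in points:
        dists = [math.dist(p, c) for c in leaders]
        if dists and min(dists) <= radius:
            labels.append(dists.index(min(dists)))
        else:
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return labels


def profile(extra_clearance, rate=100.0):
    """Synthetic radar altitude: steady descent plus a terrain feature."""
    return [5000.0 - rate * i + extra_clearance(i) for i in range(50)]


tested = [
    profile(lambda i: 0.0),                                   # flat
    profile(lambda i: 0.0, rate=120.0),                       # flat, faster
    profile(lambda i: max(0.0, 200.0 - 40.0 * abs(i - 25))),  # shallow well
    profile(lambda i: max(0.0, 220.0 - 44.0 * abs(i - 25))),  # deeper well
]
new_site = profile(lambda i: 3000.0 if i >= 25 else 0.0)      # cliff

labels = leader_cluster([fingerprint(t) for t in tested + [new_site]], radius=20.0)
is_novel = labels[-1] not in labels[:-1]
# labels -> [0, 0, 1, 1, 2]; the cliff profile opens a cluster of its own
```

A production analysis would likely use richer features and a more robust clustering or novelty-detection method, but the decision rule stays the same: a new flight path that lands in a cluster containing no tested paths is evidence of a material change.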

The Role of Technology

MRM software solutions are instrumental in supporting these risk management efforts. A comprehensive MRM solution encompasses a model inventory, a workflow engine, and compute infrastructure.

The workflow engine enables users to configure and enforce model lifecycle processes, while the current state of the model is maintained in the model inventory. This status can be externally queried to implement quality gates (e.g., rejecting a new landing site if the model hasn’t been re-validated).
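As a rough sketch of such a quality gate, the snippet below queries a stand-in model inventory before approving a landing site. Everything in it is hypothetical: the `ModelRecord` fields, the `Status` values, the in-memory dictionary used as an inventory, and the site names do not correspond to any particular MRM product or to ispace's systems.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IN_DEVELOPMENT = "in_development"
    VALIDATED = "validated"
    REVALIDATION_REQUIRED = "revalidation_required"


@dataclass(frozen=True)
class ModelRecord:
    name: str
    status: Status
    # Landing sites covered by the last independent validation (assumed field).
    validated_sites: frozenset


def approve_landing_site(inventory, model_name, site):
    """Quality gate: approve only if the model is validated for this site."""
    record = inventory.get(model_name)
    if record is None:
        return False  # unknown models never pass the gate
    return record.status is Status.VALIDATED and site in record.validated_sites


# A dictionary stands in for the model inventory in this sketch.
inventory = {
    "landing_guidance": ModelRecord(
        name="landing_guidance",
        status=Status.VALIDATED,
        validated_sites=frozenset({"site_A"}),
    )
}

original_site_ok = approve_landing_site(inventory, "landing_guidance", "site_A")
new_site_ok = approve_landing_site(inventory, "landing_guidance", "site_B")
```

Here the original site passes the gate while the new one is rejected until the validation scope in the inventory has been extended. The design choice worth noting is that the gate reads its answer from the inventory rather than from the team requesting the change.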

Finally, the compute infrastructure facilitates the automation of various MRM analyses. In our example, it could have been used to check whether a material change had occurred in the model’s input data.

Conclusion

Model risk incidents can have a massive financial and reputational impact on an organisation. Model risk management is an effective safeguard against such incidents. As we continue to deploy complex algorithms in high-risk settings, sound model risk management will help us to do so in a trustworthy manner.