How to Monitor The Quality of Your Data?

Having a large amount of data available is good. However, having unreliable data is worse than not having data at all. In fact, making decisions based on incorrect data is a risk not worth taking.

Over the last few years, data has overtaken oil as the most valuable resource in the world. When properly collected and refined, data can be leveraged by companies to gain valuable insights and improve their decision-making.

Examples of data-driven industries are banking and insurance, where data is consumed, among other things, for fraud prevention and detection, for a better understanding of customers and their engagement, and for enhanced risk management.

Other industries that typically rely on data on a daily basis are media and entertainment, healthcare, education, manufacturing, retail and wholesale trade, transportation, and energy and utilities.

For those industries that benefit greatly from data science and analytics, a primary use case is visualizing data to discover obvious patterns. The goal is to gain insight into the data and facilitate data-driven decision-making. For example, in the context of trading, it may be important to recognize regime changes in the volatility of an exchange rate, or seasonality in the price of a commodity.

To uncover these patterns, you can use Business Intelligence (BI) tools or plotting libraries; Python packages such as Matplotlib and Seaborn, for instance, make it easy to visualize patterns graphically.
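As a minimal illustration (using synthetic data rather than a real commodity price series), the snippet below plots a noisy seasonal series with Matplotlib and Seaborn, so that the seasonal pattern becomes visible once the series is smoothed:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic daily "commodity price": trend + yearly seasonality + noise
rng = np.random.default_rng(42)
days = np.arange(3 * 365)
price = 100 + 0.01 * days + 5 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, days.size)

sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(days, price, lw=0.8, label="daily price")

# A 30-day rolling mean makes the seasonal cycle stand out from the noise
rolling = np.convolve(price, np.ones(30) / 30, mode="valid")
ax.plot(days[29:], rolling, lw=2, label="30-day rolling mean")

ax.set(xlabel="day", ylabel="price", title="Seasonality becomes visible once the series is smoothed")
ax.legend()
plt.tight_layout()
plt.show()
```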

Another use case is to infer hidden patterns from data by using different types of algorithms. AlphaZero, a machine learning system built by the company DeepMind, learned to play Go – an abstract strategy board game for two players in which the aim is to acquire more territory than the opponent – at a superhuman level. The same approach was then applied to chess, leading to the discovery of new patterns in the game. Even though chess is more than 500 years old, the algorithm was still capable of surprising its grandmasters.

The connection between machine learning algorithms and data

To learn chess, the system has to play a very large number of different games, which are then translated into meaningful data. AlphaZero's algorithms learned from this data to discover hidden patterns.

This is, however, an idealized type of application, because the data being generated is essentially perfect: every virtually generated chess game used to build up the dataset is encoded correctly, so the data contains no mistakes.

Many other algorithms are applied to data that is incorrect or partly missing. Even then, interesting results can be obtained; for example, the algorithm can be used to adjust or complete the data itself.

Examples of incomplete data

A simple and straightforward example of incomplete data was found in a tweet:

“I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors.

It learned to drive backwards because there are no bumpers on the back.” By @Smingleigh.

This tweet shows that the Roomba did learn to navigate without bumping into things – though not by using its sensors, as would be the conventional way. Instead, the Roomba got creative and drove backwards: an unusual pattern that emerged because of data that was missing, or not considered by the programmer. In other words, incomplete data.

This becomes an issue when you are using data to learn specific patterns: you simply haven't seen enough evidence to make a sound inference.

The credit crisis of 2008 is another good example of incomplete data. At the time, large parts of the financial system were failing, and the impact on the world economy was severe. One of the underlying causes of the crisis was the use of overly simplified models to estimate the risk of complex mortgage derivatives.

Because financial institutions were using very simple models that did not take enough data into account – joint defaults, for example – they were unable to see the actual risks they were taking.
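To see why ignoring joint defaults matters, here is a small, hypothetical Monte Carlo sketch (not a reconstruction of the actual 2008 models): it compares the tail losses of a loan portfolio when defaults are treated as independent versus when they are correlated through a common market factor.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_loans, n_sims = 100, 50_000
p_default = 0.02   # assumed marginal default probability per loan
rho = 0.3          # assumed asset correlation through a common factor

# One-factor Gaussian copula: z is the common market factor, eps is idiosyncratic
z = rng.standard_normal((n_sims, 1))
eps = rng.standard_normal((n_sims, n_loans))
asset_values = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps
threshold = norm.ppf(p_default)

correlated_losses = (asset_values < threshold).sum(axis=1)
independent_losses = (rng.standard_normal((n_sims, n_loans)) < threshold).sum(axis=1)

# Both portfolios have roughly the same expected number of defaults,
# but ignoring the correlation drastically understates the tail risk.
for name, losses in [("independent", independent_losses), ("correlated", correlated_losses)]:
    print(f"{name:12s} mean={losses.mean():5.2f}  99th percentile={np.percentile(losses, 99):6.1f}")
```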

Data anomalies

Data anomalies are inconsistencies in the data stored in a database that arise from operations such as updates, insertions, or deletions.
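As a small, hypothetical illustration, the sketch below stores a customer's city redundantly in a denormalized orders table; updating it in only one of the rows introduces exactly this kind of inconsistency:

```python
import pandas as pd

# Denormalized orders table: the customer's city is repeated on every order row
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Alice", "Alice", "Bob"],
    "city": ["Brussels", "Brussels", "Ghent"],
})

# Update anomaly: the city is changed for order 1 only, not for order 2
orders.loc[orders["order_id"] == 1, "city"] = "Antwerp"

# The same customer now has two different cities – an inconsistency
# that a data quality check should flag
cities_per_customer = orders.groupby("customer")["city"].nunique()
print(cities_per_customer[cities_per_customer > 1])
```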

The image below shows an example of such an anomaly, which was only discovered once the data was already being used in the model. The magnificent building in the image is also known as the Melbourne Monolith.

The monolith is not a real building, but one that was rendered in a recent version of Microsoft Flight Simulator. The platform recently started using machine learning to generate 3D maps from today's 2D plans. Sometimes, however, these 3D maps are not the best representation of the actual environment: the height of this particular building in Melbourne was entered incorrectly, so when flying over Melbourne in the simulator you could see this massive tower.

Moreover, a very recent example showed that when measuring the blood oxygen levels of people with different skin colours, the devices report different results. The reason is that the sensors generate more noise in the data for some skin colours, which affects the interpolation used to compute the oxygen content.

Many more examples of data inconsistencies

In this extensive list, you can find many more examples of issues with data. It is a spreadsheet compiled by a researcher at DeepMind and includes multiple examples from non-mathematical journal papers.

One example of data inconsistency in the list is the entry "Data order patterns". That research concluded that assumptions cited in the papers were wrong, because the machine learning algorithms had not been used correctly.

One specific case from this file describes research in which a neural network was trained to classify mushrooms as edible or poisonous. To feed the neural network with data, the researchers presented it with both poisonous and edible mushrooms.

The surprise came when the algorithm reached maximum accuracy: instead of distinguishing the mushrooms, the neural network had discovered that the examples were always presented in alternating order (poisonous, edible, poisonous, edible, …) and never learned any relevant features of the mushrooms themselves. Here the actual issue was not in the data, but in how the data was used.
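A simple way to guard against this kind of issue is to check – and break – any suspicious ordering before training. The sketch below uses made-up labels, not the original mushroom dataset:

```python
import numpy as np

# Synthetic stand-in for the mushroom labels: perfectly alternating classes
y = np.tile([0, 1], 500)   # 0 = poisonous, 1 = edible

# Sanity check: how often does the label change between consecutive samples?
# ~0.5 is expected for shuffled data; values near 0 or 1 suggest the
# presentation order itself carries information the model could exploit.
alternation_rate = np.mean(y[1:] != y[:-1])
print(f"label alternation rate: {alternation_rate:.2f}")   # 1.00 here -> suspicious

# The usual remedy is to shuffle the data before training
rng = np.random.default_rng(0)
y_shuffled = y[rng.permutation(y.size)]
print(f"after shuffling: {np.mean(y_shuffled[1:] != y_shuffled[:-1]):.2f}")  # ~0.50
```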

Monitoring data quality

There is a close link between data sets and data quality when building models that create something meaningful from the data. A prerequisite for using big data, and for guaranteeing the value of data, is to ensure and sustain its quality. At all times, we should be cautious and take the potential risks into consideration.

To manage the model risk of ML/AI efficiently, it is recommended to adapt your existing processes and move towards technology-assisted validation. With AI, datasets become much larger, which requires powerful tools to assess, for instance, data quality and data stability.
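As a rough sketch of what such technology-assisted checks might look like (illustrative only; the column names, expected range and thresholds are hypothetical), the snippet below reports missing values, out-of-range entries, and distributional drift between a reference dataset and a new batch of data:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def data_quality_report(reference: pd.DataFrame, current: pd.DataFrame,
                        value_range=(0.0, 1.0), drift_alpha=0.01) -> dict:
    """Basic data-quality and stability checks on numeric columns."""
    report = {}
    for col in reference.select_dtypes("number").columns:
        cur = current[col]
        # Completeness: share of missing values
        missing = cur.isna().mean()
        # Validity: share of values outside the expected range
        lo, hi = value_range
        out_of_range = ((cur < lo) | (cur > hi)).mean()
        # Stability: two-sample Kolmogorov-Smirnov test against the reference data
        stat, p_value = ks_2samp(reference[col].dropna(), cur.dropna())
        report[col] = {
            "missing_pct": round(100 * missing, 2),
            "out_of_range_pct": round(100 * out_of_range, 2),
            "drift_detected": p_value < drift_alpha,
        }
    return report

# Hypothetical usage with a reference dataset and a new batch of model inputs
ref = pd.DataFrame({"score": np.random.default_rng(1).uniform(0, 1, 5000)})
new = pd.DataFrame({"score": np.random.default_rng(2).uniform(0.2, 1.2, 5000)})
print(data_quality_report(ref, new))
```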

In addition, machine learning models pose specific challenges – such as bias, explainability issues, and adversarial attacks – which can only be addressed with other algorithmic techniques and a scalable computation infrastructure.

To tackle these problems, Yields.io has created a data-centric model risk management platform called Chiron, which allows you to:

  • keep track of the linkage between data, analytics and reports, leading to maximal reproducibility.
  • leverage the modern big data technology stack to scale the computations to arbitrarily large datasets.
  • use scenario generation and stress testing, allowing you to evaluate models on realistic data.
  • use state-of-the-art data quality analytics to detect and explain outliers (a generic sketch of such a check follows below).
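As a purely generic illustration of what an outlier check can look like – this is not Chiron's actual implementation – the sketch below flags observations that fall outside the interquartile-range fences of a numeric feature:

```python
import numpy as np
import pandas as pd

def flag_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical feature with a few anomalous observations mixed in
rng = np.random.default_rng(3)
values = pd.Series(np.concatenate([rng.normal(50, 5, 995), [250, -80, 400, 310, -120]]))
mask = flag_outliers_iqr(values)
print(values[mask])   # the flagged observations, to be reviewed and explained
```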

Conclusion

In this blog post, we have discussed why data is important to many businesses across many different industries for informing decision-making.

Moreover, we have shown some examples of what can go wrong when data is misused or relied upon while incomplete. This highlights the importance of monitoring the quality of your data before applying it.

Finally, we have presented some of the ways in which we, at Yields.io, monitor the quality of your data in Chiron, together with the practical benefits this brings.

Scale your model risk management

Learn how technology can transform how you and your team manage model risk.
