The word "model" is an oft-used term in data science. We build, train, fit, tune, benchmark, cross-validate, test, score, validate, bootstrap, calibrate, retrain, recalibrate, and monitor them (among other things). But what exactly is a model? It's a computer program, a string of ones and zeros, a machine's approximation to a human's decision-making process, yes. More abstractly, it is a tool that solves a problem in a certain way, and thus consists of two high-level parts: What problem are you trying to solve, and how are you trying to solve it? Here at Yields.io, we call the "what" the interface and the "how" the algorithm.
• Interface: What problem are you trying to solve? Classification? Regression? Dimensionality Reduction?
• Algorithm: How are you trying to solve it? Neural Network? Random Forest? Gradient Boosting?
By answering these two questions, we have a kind of vocabulary for talking about what models are, how they work, how they are expected to behave, and so on. On a more technical note, we also modularize the problem, an important step in the development process. Here is a small sampling of the stock models included in the "marketplace" section of the Yields.io platform, Chiron. (Users also have the ability to create their own custom models using this paradigm.)
• Binary Classification by Neural Network
• Multi-Classification by XGBoost
• Regression by Random Forest
• Clustering by K-Means
• Clustering by DBSCAN
• Outlier Detection by DBSCAN
• Outlier Detection by Autoencoder
• Dimensionality Reduction by Autoencoder
• Data Cleaning by Autoencoder
Some algorithms provide the "how" for more than one "what", highlighting the significance of modularization. Autoencoders, for example, are useful creatures for all sorts of tasks, such as Outlier Detection, Dimensionality Reduction, and Data Cleaning. The underlying algorithm is the same, but it implements a different interface depending on its purpose. Moreover, an autoencoder fine-tuned and trained to perform task A might not be well suited to perform task B, even though the underlying algorithm is the same. It is therefore natural to create separate models, with separate choices of hyperparameters, to tackle different tasks.
• Model #1 = Outlier Detection by Autoencoder
• Model #2 = Dimensionality Reduction by Autoencoder
• Model #3 = Data Cleaning by Autoencoder
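The idea can be sketched in a few lines of Python. This is a toy illustration, not the actual Chiron API: the class names, hyperparameters, and constructor signatures are all hypothetical, chosen only to show one algorithm class serving two interfaces with independently tuned hyperparameters.

```python
class Autoencoder:
    """Stand-in for a real autoencoder; only its hyperparameters are shown."""
    def __init__(self, bottleneck, epochs):
        self.bottleneck = bottleneck
        self.epochs = epochs


class OutlierDetection:
    """Interface: flag points whose reconstruction error exceeds a threshold."""
    def __init__(self, algorithm, threshold):
        self.algorithm = algorithm
        self.threshold = threshold


class DimensionalityReduction:
    """Interface: expose the algorithm's compressed representation."""
    def __init__(self, algorithm, n_components):
        self.algorithm = algorithm
        self.n_components = n_components


# Model #1: Outlier Detection by Autoencoder -- a wider bottleneck and a
# threshold tuned for reconstruction-error scoring.
model_1 = OutlierDetection(Autoencoder(bottleneck=16, epochs=200), threshold=0.05)

# Model #2: Dimensionality Reduction by Autoencoder -- a narrow bottleneck,
# since here the compressed representation is the output itself.
model_2 = DimensionalityReduction(Autoencoder(bottleneck=2, epochs=500), n_components=2)

# Same "how" (algorithm class), different "what" (interface) and tuning.
print(type(model_1.algorithm) is type(model_2.algorithm))  # → True
```

The two models share an algorithm class but nothing else: each interface carries its own configuration, which is exactly why tuning Model #1 leaves Model #2 untouched.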
Technical readers will recognize what we are suggesting (and how we power our analytics under the hood): defining a model as the implementation of an abstract interface with an algorithm property. The user interacts with the model via its interface methods, which internally dispatch tasks to the algorithm to do the heavy lifting. Beyond our technical success with this paradigm, we believe this vocabulary introduces a powerful new way of thinking about what a model is from a design perspective. We hope that by proving its intuitive design in our platform, we can encourage broader use of these terms and concepts.
• Model = Interface by Algorithm
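For the curious, the dispatch pattern described above might look like the following in Python. This is a minimal sketch under assumed names — `Algorithm`, `OutlierDetector`, `ThresholdDetector`, and `MeanDeviation` are all illustrative inventions, not Chiron classes — using a trivial mean-distance scorer in place of a real algorithm.

```python
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """The 'how': the component that does the heavy lifting."""
    @abstractmethod
    def fit(self, data): ...
    @abstractmethod
    def score(self, data): ...


class OutlierDetector(ABC):
    """The 'what': an abstract interface holding an algorithm property."""
    def __init__(self, algorithm):
        self.algorithm = algorithm

    def train(self, data):
        self.algorithm.fit(data)  # dispatch training to the algorithm
        return self

    @abstractmethod
    def detect(self, data): ...


class ThresholdDetector(OutlierDetector):
    """Model = Outlier Detection (interface) by whatever algorithm is injected."""
    def __init__(self, algorithm, threshold):
        super().__init__(algorithm)
        self.threshold = threshold

    def detect(self, data):
        # The interface defines the decision rule; the algorithm scores.
        return [s > self.threshold for s in self.algorithm.score(data)]


class MeanDeviation(Algorithm):
    """Toy stand-in for an autoencoder: score = distance from the training mean."""
    def fit(self, data):
        self.mean = sum(data) / len(data)

    def score(self, data):
        return [abs(x - self.mean) for x in data]


model = ThresholdDetector(MeanDeviation(), threshold=5.0).train([0, 1, 2, 1, 0])
print(model.detect([1, 50]))  # → [False, True]
```

Swapping `MeanDeviation` for any other `Algorithm` subclass leaves the interface (and the user's code) unchanged, which is the modularity the vocabulary is meant to capture.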
Victor Davis, February 2020