In Aporia, a model is any system that can make predictions and can be improved through the use of data.
We use this broad definition in order to support a large number of use cases:
- A simple PyTorch model is a valid Aporia model.
- An ensemble of 15 XGBoost models, 37 LightGBM models and a few deterministic algorithms is also a valid Aporia model.
- An evolutionary algorithm can also be a valid Aporia model - though we have focused less on this kind of use case.
Aporia models usually serve specific business use cases: Fraud Detection, Credit Risk, Patient Diagnosis, Churn Prediction, LTV, etc.
Model Version / Schema
Each model has a schema - a definition of the model type, the various features it uses, and the outputs (predictions) it produces.
The model schema is versioned, and is often referred to in Aporia as a model version or model version schema.
Models can solve different kinds of problems:
- Binary classification models predict a binary outcome (one of two possible classes)
- "Is this email spam or not spam?"
- Multiclass classification models generate predictions for one of more than two classes
- "Is this product a book, movie, or clothing?"
- Regression models predict a numeric value
- "What will the temperature be in NYC tomorrow?"
We call these model types. Aporia currently supports the three model types listed above: binary classification, multiclass classification, and regression.
Training and Test Sets
Before training a model, the available data is often split into several distinct datasets:
- A Training set which will be used to train/fit the model
- A Test set which will be used to provide an unbiased evaluation of a final model fit on the training set
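The split described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not part of Aporia; the helper name train_test_split, the 20% test fraction and the fixed seed are all assumptions chosen for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (training set, test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))                        # stand-in for 100 data rows
train_set, test_set = train_test_split(rows, test_fraction=0.2)
print(len(train_set), len(test_set))           # 80 20
```

The two sets never share a row, which is what keeps the evaluation on the test set unbiased.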
The direct inputs to models are called features. For example, a patient diagnosis model might have features such as age, gender, a one-hot encoding of state, etc.
Features are usually numbers, because that is what most ML libraries know how to work with.
The outputs of your model, and any other values you may produce from them, are referred to as predictions made by the model.
In general, if there is a value generated directly by your model (e.g. by calling model.predict), or by some transformation performed later, whose behavior you wish to monitor, that value is a prediction.
Features don't appear out of thin air - they are usually constructed by applying various transformations to raw data that was either part of a dataset or received at runtime from a user.
We refer to that raw data as the raw_inputs of a model.
For example, your dataset might contain a column with state names - you will then have some preprocessing code convert that raw_input to a feature using one-hot encoding.
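As a rough sketch of that preprocessing step, the snippet below turns a state name (a raw input) into one-hot features. The category list, the state_ prefix and the function name are hypothetical choices made for the example; real pipelines often use a library encoder instead.

```python
def one_hot_encode(value, categories, prefix="state"):
    """Return one 0/1 feature per known category."""
    return {f"{prefix}_{category}": int(value == category) for category in categories}


STATES = ["CA", "NY", "TX"]                    # hypothetical list of known states

raw_inputs = {"age": 42, "state": "NY"}        # what the dataset or user provides
features = {
    "age": raw_inputs["age"],                  # already numeric, passed through
    **one_hot_encode(raw_inputs["state"], STATES),
}
print(features)
# {'age': 42, 'state_CA': 0, 'state_NY': 1, 'state_TX': 0}
```

Here state is the raw input, while state_CA, state_NY and state_TX are the features the model actually receives.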
In many systems, a model will try to predict an outcome which can later be verified - for example, whether or not a user will buy an insurance policy.
In such cases, we refer to the real-world outcome as the actual value of the prediction.
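To make the relationship between predictions and actual values concrete, here is a small sketch that matches the two by an identifier and measures how often the model was right. The identifiers, the dictionaries and the accuracy helper are assumptions for illustration only; they do not reflect how Aporia stores or matches this data.

```python
# Predictions made at serving time, keyed by a hypothetical prediction id.
predictions = {"p1": True, "p2": False, "p3": True}

# Actual values usually arrive later, and not necessarily for every prediction.
actuals = {"p1": True, "p3": False}            # outcome of p2 is not yet known


def accuracy(predictions, actuals):
    """Fraction of predictions with a known actual value that were correct."""
    matched = [pid for pid in predictions if pid in actuals]
    if not matched:
        return None                            # nothing to evaluate yet
    correct = sum(predictions[pid] == actuals[pid] for pid in matched)
    return correct / len(matched)


print(accuracy(predictions, actuals))          # 0.5
```

Only predictions that already have an actual value take part in the calculation, which mirrors the delay between a prediction and its real-world outcome.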
A data segment is a subgroup of your data, according to a filter on one (or more) features.
For example, let's consider a group of people in which 20% enjoy eating chips. If we look at a sub-group (data segment) of the entire group to which the filter "people who enjoy ice cream" applies, we might discover that 60% of the people in that sub-group enjoy chips.
This example shows us that examining your data using various data segments might change your perspective, and lead to new insights.
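The chips example can be reproduced with a toy dataset. The 25 rows below are made up so that the numbers match the text (20% overall, 60% in the segment); the field names and the rate helper are hypothetical.

```python
# 25 made-up people: 5 enjoy ice cream (3 of them enjoy chips), 20 do not (2 enjoy chips).
people = (
    [{"ice_cream": True, "chips": True}] * 3
    + [{"ice_cream": True, "chips": False}] * 2
    + [{"ice_cream": False, "chips": True}] * 2
    + [{"ice_cream": False, "chips": False}] * 18
)


def rate(rows, field):
    """Share of rows for which the boolean field is True."""
    return sum(row[field] for row in rows) / len(rows)


# A data segment is the subset of rows that passes a filter on a field.
ice_cream_segment = [row for row in people if row["ice_cream"]]

print(f"all people:        {rate(people, 'chips'):.0%}")             # 20%
print(f"ice cream segment: {rate(ice_cream_segment, 'chips'):.0%}")  # 60%
```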
Models are deployed to an environment, where they are used to make predictions.
The most commonly used environments are staging and production, but users can specify any environment they desire.
Each field (feature, prediction value, raw input or actual value) has a field type that must be explicitly defined during model version creation.
The available field types are:
- categorical - the categories must be numbers
- string - a categorical field with string values
- datetime - this can contain either Python datetime objects or ISO-8601 timestamp strings
Please note that tensors/arrays are not currently supported.
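Because a datetime field may hold either a Python datetime object or an ISO-8601 string, client code sometimes normalizes both forms before comparing values. The sketch below shows one way to do that with the standard library; the to_datetime helper is hypothetical and is not an Aporia function. Note that datetime.fromisoformat accepts only a subset of ISO-8601 on Python versions before 3.11.

```python
from datetime import datetime


def to_datetime(value):
    """Accept a datetime object or an ISO-8601 string and return a datetime."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        return datetime.fromisoformat(value)   # raises ValueError if malformed
    raise TypeError(f"unsupported datetime value: {value!r}")


from_string = to_datetime("2021-06-01T12:30:00")
from_object = to_datetime(datetime(2021, 6, 1, 12, 30))
print(from_string == from_object)              # True
```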
In Aporia, a field refers to any single feature, prediction, raw input, or actual value.
A Data Segment Group is a group of different data segments that are defined using different filters on the same field.
For example, for an age field we might define a data segment group that contains the following data segments:
- age < 10
- 10 <= age < 50
- age >= 50
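A data segment group like this one can be modelled as an ordered list of named filters on the same field. The sketch below is only an illustration and does not use any Aporia API; the names AGE_SEGMENTS and segment_of are hypothetical, and the last bucket is written as age >= 50 (an assumption) so that an age of exactly 50 also falls into a segment.

```python
from collections import Counter

# Each data segment is a (name, filter) pair defined on the same field: age.
AGE_SEGMENTS = [
    ("age < 10", lambda age: age < 10),
    ("10 <= age < 50", lambda age: 10 <= age < 50),
    ("age >= 50", lambda age: age >= 50),
]


def segment_of(age):
    """Return the name of the first segment whose filter matches, else None."""
    for name, matches in AGE_SEGMENTS:
        if matches(age):
            return name
    return None


ages = [3, 10, 27, 50, 64]
counts = Counter(segment_of(age) for age in ages)
print(dict(counts))
# {'age < 10': 1, '10 <= age < 50': 2, 'age >= 50': 2}
```

Because the three filters do not overlap and cover every age, each row lands in exactly one segment of the group.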