It is crucial that we focus on the data-driven nature of this approach. A machine learning model depends on the data that it has been trained on. To be able to predict labels on unseen data it is necessary that our model has high out-of-sample accuracy.
We require large datasets to be able to build accurate model that can perform well. What if we don't take large datasets into account? Well, for instance - We gather a dataset and we divide it into training and testing part. We overlook the fact that occasionally all files having .bat extension are malware and not benign (which is not necessarily true for real world files).
While training, the model will exploit this property of the dataset, and will learn that any file having extension as .bat is malware. It will use this property for detection. When this model is tested on real world data, it will produce many false positives. To prevent this outcome, we needed to add benign files with .bat extension to the training set. Then, the model will not rely on an erroneous data set property.
Hence, we must have large datasets that actually maps with real world data so that our model can generate accurate results.
Malware Distribution in The Dataset
We are given a set of objects.
Each object is represented with feature set X.
Each object is mapped to the right answer or labeled as Y.
This training information is utilized during the training phase, when we search for the best model that will produce the correct label Y for previously unseen objects given the feature set X.
In our case, X can be any feature of the file or the behavior of the file content for instance, how it interacts with system at low level or any specific list of functions. Y can be an anomaly or clean file, or it can be as precise as a malware definition, trojan, virus or adware.
In the training phase, we need to select a family of models, for example, neural networks or decision trees. Usually, each model in a family is determined by its parameters. Training means that we search for the model from the selected family with a particular set of parameters that gives the most accurate answers for the trained model over the set of reference objects according to a particular metric. In other words, we ’learn’ the optimal parameters that define valid mapping from X to Y.
After we have trained a model and verified its quality, we are ready for the next phase – applying the model to new objects. In this phase, the type of the model and its parameters do not change. The model only produces predictions.