After you fit a machine-learning model, the next step is to assess the accuracy of that model. Before you can put a model to use, you need to know how well it's expected to predict on new data. If you determine that the predictive performance is quite good, you can be comfortable deploying that model in production to analyze new data. Likewise, if you assess that the predictive performance isn't good enough for the task at hand, you can revisit your data and model to try to improve and optimize their accuracy. Properly assessing the predictive performance of an ML model is a nontrivial task, and it is where we begin. From there, we dive into the assessment of ML classification models, focusing on the typical evaluation metrics and graphical tools used by machine-learning practitioners. Then we introduce analogous evaluation tools for regression models. Finally, we describe a simple way to optimize the predictive performance of a model through parameter tuning.
Model generalization: assessing predictive accuracy for new data

The primary goal of supervised machine learning is accurate prediction. You want your ML model to be as accurate as possible when predicting on new data (for which the target variable is unknown). Said differently, you want your model, which has been built from training data, to generalize well to new data. That way, when you deploy the model in production, you can be assured that the predictions generated are of high quality. Therefore, when you evaluate the performance of a model, you want to determine how well that model will perform on new data. This seemingly simple task is wrought with complications and pitfalls that can befuddle even the most experienced ML users.
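The standard way to estimate performance on new data is to hold out a portion of the available data during training and evaluate only on that unseen portion. A minimal sketch, assuming scikit-learn and a synthetic dataset (both are illustrative choices, not from the text):

```python
# Sketch of estimating generalization with a hold-out set.
# The dataset here is synthetic and purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

# Keep 30% of the data completely out of model fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
# test_acc estimates accuracy on new data; train_acc is typically optimistic
```

The accuracy on the held-out test set is the quantity that approximates performance in production; the training accuracy alone is an optimistic estimate, for reasons the next section makes concrete.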
The problem: overfitting and model optimism

To describe the challenges associated with estimating the predictive accuracy of a model, it's easiest to start with an example. Imagine that you want to predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm's planting area that was treated with a new pesticide. You have training data for 100 farms for this regression problem. As you plot the target (bushels of corn per acre) versus the feature (percent of the farm treated), it's clear that an increasing, nonlinear relationship exists, and that the data also has random fluctuations. Now, suppose you want to use a simple nonparametric ML regression technique to build a predictive model for corn production as a function of the proportion of land treated. One of the simplest ML regression models is kernel smoothing. Kernel smoothing operates by taking local averages: for each new data point, the value of the target variable is modeled as the average of the target variable for only the training data whose feature value is close to the feature value of the new data point. A single parameter, called the bandwidth parameter, controls the size of the window for the local averaging. For large values of the bandwidth, almost all of the training data is averaged together to predict the target at each value of the input feature. This causes the model to be flat and to underfit the obvious trend in the training data. Likewise, for small values of the bandwidth, only one or two training instances are used to determine the model output at each feature value. Therefore, the model effectively traces every bump and wiggle in the data. This tendency to model the intrinsic noise in the data instead of the true signal is called overfitting.
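The mechanics above can be sketched directly. Here is a minimal Nadaraya-Watson kernel smoother in NumPy, applied to synthetic data mimicking the corn example (the Gaussian kernel, the data-generating function, and all parameter values are assumptions for illustration):

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth):
    """Kernel smoother with a Gaussian kernel: each prediction is a
    weighted average of y_train, with weights decaying as training
    points get farther from the query point."""
    dists = x_new[:, None] - x_train[None, :]        # pairwise distances
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)
    return (weights * y_train).sum(axis=1) / weights.sum(axis=1)

# Synthetic data: 100 farms, nonlinear increasing trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)                           # proportion of land treated
y = 50 + 40 * np.sqrt(x) + rng.normal(0, 5, 100)     # bushels per acre

x_grid = np.linspace(0, 1, 5)
flat = kernel_smooth(x, y, x_grid, bandwidth=10.0)    # huge window: underfits
wiggly = kernel_smooth(x, y, x_grid, bandwidth=0.01)  # tiny window: overfits
```

With a very large bandwidth the weights are nearly uniform, so every prediction collapses toward the global mean and the fitted curve is flat; with a very small bandwidth only the one or two nearest farms contribute, so the curve chases the noise.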