The next step in the machine-learning workflow is to use that data to begin exploring and uncovering the relationships that exist between the input features and the target. In machine learning, this process is done by building statistical models based on the data. In contrast to most machine-learning textbooks, we spend little time discussing the various approaches to ML modeling, instead focusing attention on the big-picture concepts. This will help you gain a broad understanding of machine-learning model building and quickly get up to speed on building your own models to solve real-world problems. For those seeking more information about specific ML modeling techniques, please see the appendix.
Basic machine-learning modeling

The objective of machine learning is to discover patterns and relationships in data and to put those discoveries to use. This process of discovery is achieved through modeling techniques that have been developed over the past 30 years in statistics, computer science, and applied mathematics. These approaches range from simple to tremendously complex, but all share a common goal: to estimate the functional relationship between the input features and the target variable.
Classification: predicting into buckets

In machine learning, classification describes the prediction of new data into buckets (classes) by using a classifier built by the machine-learning algorithm. Spam detectors put email into Spam and No Spam buckets, and handwritten digit recognizers put images into buckets from 0 through 9, for example. Typically the best way to start an ML project is to get a feel for the data by visualizing it.
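As a minimal sketch of this idea, the following builds a classifier for the handwritten-digit task mentioned above. It assumes scikit-learn is installed; the choice of logistic regression is illustrative, and any classification algorithm could stand in its place.

```python
# Illustrative sketch: a digit classifier, assuming scikit-learn is available.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 8x8 grayscale images of handwritten digits; the target buckets are 0 through 9.
X, y = load_digits(return_X_y=True)

# Hold out a portion of the data to check how well predictions generalize.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the classifier on the labeled training examples.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# score() reports the fraction of test images placed in the correct bucket.
accuracy = clf.score(X_test, y_test)
```

Once fitted, `clf.predict(new_images)` assigns each new record to one of the ten buckets.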
Regression: predicting numerical values

Not every machine-learning problem is about putting records into classes. Sometimes the target variable takes on numerical values – for example, when predicting dollar values in a financial model. We call the act of predicting numerical values regression, and the model itself a regressor.
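A short sketch makes the distinction concrete. Here the target is a continuous dollar value rather than a class label; the data is synthetic (a hypothetical price of roughly 3x + 10 with noise), and scikit-learn is an assumed dependency.

```python
# Illustrative sketch: fitting a regressor to a numerical target,
# assuming scikit-learn and NumPy are available.
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for a dollar-valued target:
# price is roughly 3*x + 10 plus noise (hypothetical relationship).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 10.0 + rng.normal(0.0, 1.0, size=200)

# The regressor estimates the numerical relationship rather than a class.
reg = LinearRegression().fit(X, y)

# Predict a dollar value for a new input, x = 5.
pred = reg.predict([[5.0]])[0]
```

The output of `predict` is a number on the target's scale, not a bucket assignment – that is the defining difference between a regressor and a classifier.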
Performing regression on complex, nonlinear data

In some datasets, the relationship between the features and the target can't be fitted by a linear model, and algorithms such as linear regression may not be appropriate if accurate predictions are required. Other requirements, such as scalability, may make lower accuracy a necessary trade-off. Also, there's no guarantee that a nonlinear algorithm will be more accurate, as you risk overfitting to the data. As an example of a nonlinear regression model, we introduce the random forest algorithm. Random forest is a popular method for highly nonlinear problems for which accuracy is important.
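The contrast can be sketched on synthetic data with a deliberately nonlinear (sinusoidal) feature-target relationship, where a linear model underfits and a random forest does not. Again scikit-learn is an assumed dependency, and the dataset is invented for illustration.

```python
# Illustrative sketch: linear regression vs. random forest on nonlinear data,
# assuming scikit-learn and NumPy are available.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data with a nonlinear relationship: y = 5*sin(x) plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = 5.0 * np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=400)

# A straight line cannot follow the sine curve.
linear = LinearRegression().fit(X, y)

# A random forest averages many decision trees and can fit the curve.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# R^2 scores on the training data: 1.0 is a perfect fit.
linear_r2 = linear.score(X, y)
forest_r2 = forest.score(X, y)
```

Note that the forest's advantage here comes from the data being genuinely nonlinear; as the text cautions, the extra flexibility can just as easily overfit, so a held-out test set should always arbitrate in practice.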