Real-world data

In supervised machine learning, you use data to teach automated systems how to make accurate decisions. ML algorithms are designed to discover patterns and associations in historical training data; they learn from that data and encode that learning into a model that can accurately predict an attribute of interest for new data. Training data, therefore, is fundamental to machine learning. With high-quality data, subtle nuances and correlations can be captured and high-fidelity predictive systems can be built. But if the training data is of poor quality, the efforts of even the best ML algorithms may be rendered useless.

To get started with machine learning, the first step is to ask a question that’s suited for an ML approach. Although ML has many flavors, most real-world problems in machine learning deal with predicting a target variable (or variables) of interest. Questions that are well suited for a supervised ML approach include the following:

  • Which of my customers will churn this month?
  • Will this user click my advertisement?
  • Is this user account fraudulent?
  • Is the sentiment of this tweet negative, positive, or neutral?
  • What will demand for my product be next month?

You will notice a few commonalities in these questions. First, they all require making assessments on one or several instances of interest. These instances can be people (as in the churn question), events (as in the tweet sentiment question), or even periods of time (as in the product demand question).

Second, each of these problems has a well-defined target of interest. In some cases the target is binary (churn versus not churn, fraud versus not fraud); in some it takes on multiple classes (negative versus positive versus neutral), or even hundreds or thousands of classes (picking a song out of a large library); and in others it takes on numerical values (product demand). Note that in statistics and computer science, the target is also commonly referred to as the response or dependent variable. These terms may be used interchangeably.

Third, each of these problems can have sets of historical data in which the target is known. For instance, over weeks or months of data collection, you can determine which of your subscribers churned and which people clicked your ads. With some manual effort, you can assess the sentiment of different tweets. In addition to known target values, your historical data files will contain information about each instance that’s knowable at the time of prediction. These are the input features (also commonly referred to as the explanatory or independent variables). For example, the product usage history of each customer, along with the customer’s demographics and account information, would be appropriate input features for churn prediction. The input features, together with the known values of the target variable, compose the training set.

Finally, each of these questions comes with an implied action if the target were knowable. For example, if you knew that a user would click your ad, you would bid on that user and serve the user an ad. Likewise, if you knew your product demand for the upcoming month precisely, you would position your supply chain to match that demand.

The role of the ML algorithm is to use the training set to determine how the set of input features can most accurately predict the target variable. The result of this “learning” is encoded in a machine-learning model. When new instances (with an unknown target) are observed, their features are fed into the ML model, which generates predictions for those instances. Ultimately, those predictions enable the end user to take smarter (and faster) actions.
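
To make the flow from training set to model to prediction concrete, here is a minimal sketch in plain Python for the churn question. Everything in it is an assumption made for illustration: the two input features (logins per week and support tickets last month), the made-up values, the helper names fit and predict, and the choice of a 1-nearest-neighbor rule as a stand-in for whatever ML algorithm you would actually use.

```python
import math

# Hypothetical training set: one row per customer (the instance).
# Input features: (logins_per_week, support_tickets_last_month).
# Target: 1 if the customer churned, 0 otherwise. All values are made up.
training_features = [
    (12.0, 0.0),
    (9.0, 1.0),
    (1.0, 4.0),
    (2.0, 3.0),
]
training_targets = [0, 0, 1, 1]


def fit(features, targets):
    """'Learn' by storing the training set (a 1-nearest-neighbor model)."""
    return list(zip(features, targets))


def predict(model, new_features):
    """Return the known target of the closest training instance."""
    _, nearest_target = min(
        model, key=lambda pair: math.dist(pair[0], new_features)
    )
    return nearest_target


# The result of "learning" is encoded in the model.
model = fit(training_features, training_targets)

# New instances: the features are known, the target is not.
new_customers = [(10.0, 0.0), (1.5, 5.0)]
predictions = [predict(model, customer) for customer in new_customers]
print(predictions)  # [0, 1]
```

In this sketch the second new customer is predicted to churn, which is where the implied action comes in: that customer could, for example, be routed to a retention offer. A real system would use far more data and a more capable algorithm, but the roles of features, target, training set, and model stay the same.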
