Two Types of Tasks in Machine Learning
Karen Tao, Researcher
May 19, 2021
I introduced accuracy measures in my last few posts about classification algorithms. Classification algorithms predict a label or category, for example, whether a gift is liked or whether a graduate participates in the workforce. What if our task is to predict a continuous quantity? Those are regression problems. It is important to distinguish the two types of tasks so that we select the proper algorithm to implement and the proper accuracy measures to evaluate the model.
Classification models predict labels or categories
We learned in a previous post that machine learning is about developing a model using historical data to predict new data. This process is sometimes called function approximation, where the model learns a mapping function from inputs to outputs.
In other words, this is a mathematical problem of approximating a mapping function f from input variables X to output variables y so that y = f(X). Our goal is to approximate the mapping function f as accurately as possible so that when the model is given a new data point X, the predicted output y is as close to the ground truth as possible.
We predict labels or categories for the classification task. At the UDRC, we have predicted workforce retention of Utah’s postsecondary graduates. In this example, the output variable y is a binary yes/no that represents whether a graduate participates in the workforce. These are two discrete classes or labels. Common algorithms for classification tasks include logistic regression, decision trees, support vector machines, and Naïve Bayes.
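To make the idea of predicting a discrete label concrete, here is a minimal sketch of logistic regression with a single input, trained by plain gradient descent. The data are made-up toy values (not UDRC data), the feature is assumed to be already standardized, and the function names, learning rate, and number of epochs are illustrative choices rather than a recommended setup.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function: maps any real number into (0, 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, lr=0.5, epochs=1000):
    """Fit a weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # predicted probability minus true label
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_label(x, w, b, threshold=0.5):
    # Turn the predicted probability into a discrete class: 1 (yes) or 0 (no).
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Toy data (hypothetical): one standardized input and a yes/no label.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
labels = [predict_label(x, w, b) for x in xs]
print(labels)
```

The key point is the last step: the model produces a probability, and a threshold converts it into one of two classes. In practice a library implementation would normally be used instead of hand-written gradient descent.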
Our last few posts describe the accuracy measures for classification tasks. We can evaluate the true-positive and false-positive rates using the confusion matrix and visualize the model performance with the ROC curve, which is summarized by the area under the curve (AUC).
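As a quick refresher, the sketch below counts the four cells of a confusion matrix for a binary classifier and derives the true-positive and false-positive rates from them. The label lists are made up for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn

def tpr_fpr(tp, fp, tn, fn):
    # True-positive rate: share of actual positives predicted as positive.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # False-positive rate: share of actual negatives predicted as positive.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr, fpr = tpr_fpr(tp, fp, tn, fn)
print(tp, fp, tn, fn)  # 3 1 5 1
print(tpr)             # 3 / (3 + 1) = 0.75
```

Each point on an ROC curve is one such (false-positive rate, true-positive rate) pair, computed at a different decision threshold.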
Regression models predict a continuous quantity
A regression model predicts a continuous quantity. One classic example is predicting the sale price of a house, where the output variable is a real number. At the UDRC, we have used regression techniques to model wage with experience and education as inputs. As another example, DoorDash may be using a regression model to predict the estimated time of your yummy delivery. Linear regression is the most basic algorithm for regression tasks, with the ridge and lasso variations being quite popular.
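For comparison with classification, here is a minimal sketch of simple linear regression with one input, fitted by ordinary least squares. The numbers are invented toy values (imagine years of experience and an hourly wage); they are not UDRC data, and the function name is hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (hypothetical): input x and a continuous output y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear(xs, ys)

# The prediction for a new input is a real number, not a class label.
new_x = 6.0
prediction = slope * new_x + intercept
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

On this toy data the fitted slope is roughly 1.97 and the intercept roughly 1.09. Ridge and lasso keep the same linear form but add a penalty on the size of the coefficients.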
For regression models, a confusion matrix would not be a good way to evaluate the model. Because the output is a continuous value rather than a class, we have no way of counting which predictions are true positives or false negatives. Common evaluation metrics for regression models include the root mean squared error (RMSE) and the mean absolute error (MAE). We will review these two metrics in a future blog post.
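As a brief preview, both metrics can be computed in a few lines. This is a minimal sketch using made-up actual and predicted values; the function names are illustrative.

```python
import math

def mean_absolute_error(y_true, y_pred):
    # Average size of the errors, ignoring their direction.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    # Square the errors before averaging, then take the square root,
    # so large misses count more heavily than small ones.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Toy values (hypothetical): actual outcomes and model predictions.
y_true = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.5, 5.0, 8.0, 12.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae)             # (0.5 + 0 + 0.5 + 2) / 4 = 0.75
print(round(rmse, 3))  # sqrt(4.5 / 4), about 1.061
```

Both metrics are in the same units as the output variable. Here the RMSE is larger than the MAE because the single miss of 2.0 is penalized more heavily once it is squared.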
The data scientist is key
Finally, the data scientist is key in selecting the algorithms and evaluation metrics and plays the most critical role in building a model. For example, predicting the sales price of a house is a classic introductory machine learning homework assignment.
However, the same house can sell for a much higher amount in our economy today than it did three years ago. If we neglect to update our model and continue to use only the conventional variables such as square footage, year built, zip code, and the number of bedrooms, the model's predictions would miss actual sale prices by a wide margin.
Ultimately, the data scientist selects which variables to include when building a model and must weigh the potential impacts of missing crucial variables or introducing collinearity into the model.
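One simple way to spot possible collinearity is to check how strongly two candidate input variables are correlated with each other. The sketch below computes the Pearson correlation between two made-up housing variables; the values and the function name are hypothetical, and a high pairwise correlation is only a rough warning sign, not a complete diagnostic.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data (hypothetical): two input variables for the same five houses.
square_footage = [850, 1200, 1500, 2000, 2600]
bedrooms = [2, 3, 3, 4, 5]

r = pearson_correlation(square_footage, bedrooms)
print(round(r, 2))
```

On this toy data the correlation is close to 1, which suggests the two variables carry largely overlapping information. Whether to keep both, drop one, or combine them remains a judgment call for the data scientist.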