Here, it is assumed that one has already decided to scale out the ML workflow and has chosen a big-data processing system. Next, we look at how some popular linear and nonlinear ML algorithms scale in the face of big data.

**Scaling learning algorithms** During the learning phase, the fundamental scalability challenge is fitting very large training sets in memory. One way around this problem is to look for implementations of ML algorithms that either (a) use a smaller memory footprint than competing implementations of the same algorithm, or (b) can train on a distributed system in which each node holds only a subset of the full dataset. Countless implementations of the most common ML learning algorithms exist in the wild. From scikit-learn to mlpack, these implementations keep pushing the frontiers of memory efficiency (and thus the dataset size that can be trained on a single computer with a fixed amount of RAM). Yet data volumes are still outpacing the gains in ML software and computer hardware, and for some training sets the only option is horizontally scalable machine learning. The most commonly distributed learning algorithm is linear (and logistic) regression. The *Vowpal Wabbit* (VW) library popularized this approach and has been a mainstay of scalable linear learning across multiple machines. Distributed linear regression works roughly as follows: first, subsets of the training data (split by dataset rows) are sent to the various machines in the cluster. Then, in an iterative fashion, each machine solves an optimization problem on its local subset of the data and sends the result back to a central node, where the partial results are combined into the best overall solution. After a small number of iterations of this procedure, the final model is guaranteed to be close to the model that would have been obtained by fitting all the data at once. Hence, linear models can be fit in a distributed way to terabytes or more of data! Linear algorithms, however, aren't always expressive enough to model the nuances of the data with sufficient accuracy.
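The combine-partial-solutions idea above can be illustrated with a minimal, single-round sketch. This is an illustrative simulation in plain NumPy, not VW's actual protocol: three "machines" each solve ordinary least squares on their own row shard, and a "central node" averages the local coefficient vectors. All variable names here are hypothetical.

```python
import numpy as np

# Simulated dataset: 1,200 rows, 3 features, known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1200)

# Split the rows across three simulated worker machines.
shards = np.array_split(np.arange(1200), 3)

# Each worker solves a least-squares problem on its shard only.
local_solutions = [
    np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in shards
]

# The central node combines the local results; here, by simple averaging.
w_avg = np.mean(local_solutions, axis=0)

# Compare with the model fit on all the data at once.
w_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_avg, w_full, atol=0.05))  # the two models nearly agree
```

Real systems repeat this exchange for several rounds and use more careful combination rules than a plain average, but the shape of the computation (local optimization, central aggregation) is the same.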
In these cases, it can be helpful to turn to nonlinear models. Nonlinear models usually require more computational resources, and horizontal scalability isn't always possible with them. This can be understood loosely by noting that nonlinear models also consider complex interactions between features and thus require a larger portion of the dataset at any given node. In many cases, it's more feasible to upgrade the hardware or to find more efficient algorithms or implementations of the algorithms one has chosen. But in other situations, scaling out a nonlinear model is necessary.

**Polynomial features** One of the most widely used tricks for modeling nonlinear feature interactions is to create new features that are combinations of the existing features and then train a linear model that includes those nonlinear features. A common way to combine features is to multiply them in various combinations, such as *feature 1 times feature 2*, *feature 2 squared*, or *feature 1 times feature 2 times feature 5*. Say a dataset consists of two features, f1 = 4 and f2 = 15. In addition to using f1 and f2 in the model, one can generate the new features f1 × f2 = 60, f1^2 = 16, and f2^2 = 225. Datasets usually contain far more than two features, so this technique can generate a huge number of new features. Because these new features are nonlinear combinations of existing features, we call them *polynomial features*. The following listing shows how this can be achieved with the scikit-learn Python library; running the code shows the accuracy gained when adding polynomial features to a standard Iris flower classification model. Another machine-learning toolkit with integrated polynomial feature extraction is the Vowpal Wabbit library. VW can build models on large datasets on a single machine because all computation is done iteratively and *out of core*, meaning that only the data used in the current iteration needs to be kept in memory. VW uses stochastic gradient descent and feature hashing to deal with unstructured and sparse data in a scalable fashion. VW can generate nonlinear models via the `-q` and `--cubic` flags, which generate quadratic or cubic features, corresponding to polynomial features in which all pairs or all triplets of features have been multiplied together.
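A sketch of such a listing follows, using scikit-learn's `PolynomialFeatures` transformer on the built-in Iris dataset. This is a reconstruction under stated assumptions (a logistic-regression baseline and a 70/30 train/test split), not the book's exact listing, and on a dataset this small the accuracy gain from polynomial features may be modest.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Baseline: a linear model trained on the raw features.
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Same model, but trained on degree-2 polynomial features:
# all pairwise products and squares of the original columns.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
).fit(X_train, y_train)

print("linear accuracy:    ", linear.score(X_test, y_test))
print("polynomial accuracy:", poly.score(X_test, y_test))
```

With four original features, degree-2 expansion yields 14 features (4 original + 4 squares + 6 pairwise products), illustrating how quickly the feature count grows.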

**Data and algorithm approximations** The polynomial feature approach can increase the accuracy of the model significantly, but it also increases the number of features polynomially, which may not be feasible for a large number of input features. Here are a few nonlinear algorithms that have well-known approximations useful for scalable implementations; other algorithms may have their own approximations. A widely used nonlinear learning algorithm is the random forest. A random forest model consists of numerous decision trees, and at first sight it may look trivial to scale random forests to many machines by building only a subset of the trees on each node. But if the data subsamples available at each node aren't sufficiently similar, the accuracy of the model can suffer; building more trees or splitting the data more intelligently can mitigate the loss in accuracy. Another approximation that can be used to scale random forests and other algorithms is the *histogram approximation*: each column in the dataset is replaced with a histogram of that column, which usually decreases the number of distinct values in the column significantly. If the number of bins in the histogram is too small, a lot of nuance may be lost and model performance suffers. Another algorithm with natural approximations is k-nearest neighbors; special approximate tree structures can be used to increase the scalability of the model. Support vector machines have seen multiple approximation methods for making their nonlinear versions more scalable, including Budgeted Stochastic Gradient Descent (BSGD) and Adaptive Multi-hyperplane Machines (AMM).
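The histogram approximation can be sketched in a few lines of NumPy: a column with many distinct values is binned, and every value is replaced by the center of its bin, so the learner only ever sees a handful of distinct values. The bin count of 32 below is an arbitrary illustrative choice; real systems tune it.

```python
import numpy as np

# A numeric column with ~100,000 distinct floating-point values.
rng = np.random.default_rng(1)
column = rng.normal(loc=50, scale=10, size=100_000)

# Build a fixed-size histogram over the column.
n_bins = 32
counts, edges = np.histogram(column, bins=n_bins)
centers = (edges[:-1] + edges[1:]) / 2

# Replace each value with the center of the bin it falls into.
bin_idx = np.clip(np.digitize(column, edges) - 1, 0, n_bins - 1)
approx_column = centers[bin_idx]

# The approximated column has at most n_bins distinct values.
print(len(np.unique(column)), "->", len(np.unique(approx_column)))
```

Because only the bin edges and counts need to be shared between machines, this representation is far cheaper to communicate and split on than the raw column, which is why tree learners in particular benefit from it.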

**Deep neural nets** *Deep learning* refers to a family of algorithms that extends the traditional neural network. Commonly, these models include many hidden layers in the neural network, or many single-layer networks combined. One disadvantage of deep neural nets is that even on GPU hardware, building and optimizing models can take a long time; in practice, one may get just as good performance from other algorithms, such as random forests, using far less time and fewer resources, depending on the dataset and problem at hand. Another disadvantage is that it can be difficult to understand what's going on under the hood of these neural net models. Some refer to them as *black-box models*, because one has to trust the results of the statistical analysis of the models without inspecting their internals. This, again, depends on the use case: when working with images, for example, the neurons can take on intuitive representations of the visual patterns that lead to specific predictions. Many deep-learning methods have been shown to scale to large datasets, sometimes by using modern graphics cards (GPUs) for certain computations.
