Machine Learning Reference
I often need to look up random bits of ML-related information. This post is a current, work-in-progress attempt to collect common machine learning terms and formulas into one central place. I plan to update it as I come across further useful pieces of information. This reference is not intended to be exhaustive; quite the opposite: it is intended only to be a concise, opinionated collection of the most relevant bits of ML knowledge for quick lookup.
- Google’s ML Crash Course.
- Neural Networks and Deep Learning by Michael Nielsen.
- Deep Learning by Goodfellow et al.
Machine learning problems come in 3 primary categories, depending on how well the labels (ground truth values) are known:

| Category | Description | Examples |
| --- | --- | --- |
| Supervised | The labels are always known. | Classification, regression |
| Semi-supervised | Some labels are known, some are not. | Speech analysis, protein sequencing |
| Unsupervised | There are no labels. | Clustering |
Below are the broad types of problems you can tackle with ML.

| Problem type | Description | Example |
| --- | --- | --- |
| Classification | Sort inputs into a discrete set of output classes. | Spam detection |
| Regression | Given an input, output a prediction in a continuous space. | Housing price estimation |
| Clustering | Group inputs into a set of discrete groups. | Fraud detection |
A.k.a. kNN, is a supervised learning algorithm for classification and regression: an input is assigned the majority label (or average value) of its $k$ nearest labeled training examples. Despite the similar name to k-means, kNN is not a clustering algorithm, since it requires labels.
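A minimal sketch of kNN classification in numpy; the function name and toy dataset are my own illustrations, not from any particular library:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k closest neighbors
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote over those labels
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Toy dataset: two well-separated labeled clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
```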
Is an unsupervised clustering algorithm. You pick the number of clusters you want, $k$, and the algorithm partitions the data into $k$ clusters, assigning each point to the cluster whose centroid is nearest.
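A sketch of Lloyd's algorithm, the standard iteration behind k-means, in numpy (illustrative only; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (this sketch assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids
```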
An algorithm to generalize a binary logistic regression model to multiple classes without having to train separate binary classifiers for each class. Suitable for multinomial regression.
The softmax function is also called the normalized exponential. It is used to highlight the largest values and suppress any values that are significantly smaller than the largest.
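The function itself is $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. A small numpy sketch (subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Normalized exponential: exponentiate, then normalize to sum to 1."""
    e = np.exp(z - np.max(z))  # shift by max for numerical stability
    return e / e.sum()
```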
The cost function, also known as a loss or objective function, determines, given a prediction from a model, how far that prediction is from perfect, where 0 is a perfect prediction.
| Loss | Formula |
| --- | --- |
| Mean Squared Error | $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Mean Absolute Error | $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ |
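Both losses are one-liners in numpy; a sketch for intuition:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))
```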
Commonly used to speed up training.
Introduced by Claude Shannon in 1948, information theory is the study of quantifying information in bits of entropy. The amount of “information” inherent in something is a function of how many yes/no questions you have to answer in order to specify it. For instance, whether it’s night or day can be answered in one yes/no question, meaning its information can be represented in 1 bit.
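As a sketch, self-information and Shannon entropy in a few lines of Python (the function names are my own):

```python
import math

def self_information(p):
    """Bits needed to encode an event that occurs with probability p."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy in bits: expected self-information of a distribution."""
    return sum(p * -math.log2(p) for p in probs if p > 0)
```

The night/day example: a 50/50 event carries exactly one bit, and a fair four-way choice carries two bits (two yes/no questions).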
Regularization is a term added to the cost function in order to prevent overfitting. The parameter $\lambda$ controls the strength of regularization, with a higher value meaning more regularization.
Also called Lasso Regression. It adds the sum of the absolute values of all weights to the cost function. Lasso encourages sparsity by driving irrelevant weights to exactly zero, so it works well when there are a large number of features and only some of them matter.
L2 Regularization for regression is also called Ridge Regression. It adds the sum of the squared magnitudes of all weights to the loss function.
It might be worth using both L1 and L2 regularization in the same model, a combination known as Elastic Net. “This gives you both the nuance of L2 and the sparsity encouraged by L1,” as explained here.
$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$

Where $\theta_t$ is the vector of parameters to try at step $t$, $\nabla J(\theta_t)$ is the gradient at step $t$, and $\eta$ is the learning rate.
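A sketch of the update rule in numpy, minimizing a toy one-dimensional objective (the function names and example objective are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """Repeatedly apply theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0])
```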
| Optimizer | Reference |
| --- | --- |
| Stochastic Gradient Descent (SGD) | |
| Momentum | Sutskever et al., 2013 |
| RMSProp | Hinton et al., 2012 |

Momentum accumulates a velocity term, $v_{t+1} = \gamma v_t + \eta \nabla J(\theta_t)$, applied as $\theta_{t+1} = \theta_t - v_{t+1}$, where $\gamma$ is the discounting factor.
Below is the backpropagation algorithm including the four fundamental equations of backpropagation. These are from Michael Nielsen’s book Neural Networks and Deep Learning.
| Step | Equation |
| --- | --- |
| Given a neural network with $L$ layers | |
| Set the activations of the input layer to the inputs | $a^1 = x$ |
| For all layers $l = 2, 3, \ldots, L$ after the input layer: | |
| Compute the pre-activation value | $z^l = w^l a^{l-1} + b^l$ |
| Apply the non-linearity | $a^l = \sigma(z^l)$ |
| Compute the error for the pass | $\delta^L = \nabla_a C \odot \sigma'(z^L)$ |
| Backpropagate the error, for all layers $l = L-1, L-2, \ldots, 2$ going backwards starting from the penultimate layer | $\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$ |
| Output the gradient, for all weights and biases | $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j, \quad \frac{\partial C}{\partial b^l_j} = \delta^l_j$ |
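The algorithm above can be sketched in numpy for a tiny fully-connected network with quadratic cost; the layer sizes, helper names, and the gradient check are my own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients of C = 0.5 * ||a^L - y||^2 for a single training example."""
    # Forward pass: record every pre-activation z^l and activation a^l
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # BP1: error at the output layer
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_w = [np.zeros_like(w) for w in weights]
    grads_b = [np.zeros_like(b) for b in biases]
    grads_b[-1] = delta                             # BP3
    grads_w[-1] = np.outer(delta, activations[-2])  # BP4
    # BP2: backpropagate from the penultimate layer down
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_b[-l] = delta
        grads_w[-l] = np.outer(delta, activations[-l - 1])
    return grads_w, grads_b

# Tiny 2-3-1 network with fixed random parameters
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
x, y = np.array([0.5, -0.2]), np.array([1.0])
grads_w, grads_b = backprop(weights, biases, x, y)

def cost(weights, biases):
    """Quadratic cost of the network on (x, y), for checking gradients."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return 0.5 * np.sum((a - y) ** 2)
```

A quick sanity check is to compare one analytic gradient against a finite-difference estimate of the cost, which is what the equations guarantee should match.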
Recurrent Neural Networks
Or RNNs, are networks that contain a cycle. In practice, this cycle is unrolled only a finite number of times; otherwise the network would never finish executing.
One big downside to RNNs is that their training cannot be parallelized across timesteps. RNNs also require a huge amount of memory to train, because backpropagation through time must keep the state of all variables at each of the $t$ unrolled timesteps in memory.
Long short-term memory or LSTM models are a class of RNN models that reduce the vanishing gradient problem and the exploding gradient problem. They do this by introducing “forget gates” into the RNN, which control how much information is passed back through the recurrence.
Vanishing gradient problem
Early layers in a neural network learn an order of magnitude slower than later layers. This is especially a problem in RNNs because their unrolled recurrence makes them effectively very deep.
| | Actual = True | Actual = False |
| --- | --- | --- |
| Prediction = True | True Positive | False Positive |
| Prediction = False | False Negative | True Negative |
False positives are also called Type I errors. The false positive rate, or Type I error rate, denoted $\alpha$, is the number of false positives over all actual negative values: $\mathrm{FPR} = \frac{FP}{FP + TN}$. It is equal to one minus the specificity.
False negatives are also called Type II errors. The false negative rate, or Type II error rate, denoted $\beta$, is the number of false negatives over all actual positive values: $\mathrm{FNR} = \frac{FN}{FN + TP}$. It is equal to one minus the sensitivity (the recall).
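The two rates in plain Python, given confusion-matrix counts (the function name is my own):

```python
def error_rates(tp, fp, fn, tn):
    """Type I and Type II error rates from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # false positive rate: FP over all actual negatives
    fnr = fn / (fn + tp)  # false negative rate: FN over all actual positives
    return fpr, fnr
```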
Precision and Recall
Precision and Recall are a tradeoff between having more false positives or more false negatives.

In the context of classification, precision refers to the number of correct positive guesses over the total number of positive guesses made: $\frac{TP}{TP + FP}$. Recall is the fraction of correct positive guesses over the total number of actual positives: $\frac{TP}{TP + FN}$. A classifier that returned True for every input would have 100% recall (no false negatives) but low precision. A classifier that returned True only on the inputs it was most confident about could have 100% precision (no false positives) but low recall.
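A sketch in plain Python; the counts below illustrate the always-True classifier on an imbalanced dataset (5 positives, 95 negatives):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of all positive guesses, the fraction correct
    recall = tp / (tp + fn)     # of all actual positives, the fraction found
    return precision, recall

# An always-True classifier finds every positive (recall 1.0),
# but almost every one of its guesses is wrong (precision 0.05).
p, r = precision_recall(tp=5, fp=95, fn=0)
```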
A ROC Curve is a graph of the false positive rate (x-axis) against the true positive rate (y-axis). The area under the ROC curve (ROC-AUC) is a useful way of comparing the relative performance of different models.
A PR Curve is a graph of the precision (y-axis) and the recall (x-axis) of the model. The PR curve is useful when the categories are imbalanced, for instance for a spam filter where most of the examples are not spam (a.k.a. ham) and relatively few examples are spam. A ROC curve will not accurately capture the performance of this model because changes to the number of false positives, while relatively significant, will not change the false positive rate by much due to the overwhelming number of true negatives.
Moments measure the shape of a function. The $n$-th moment of a random variable is the expected value of the variable raised to the $n$-th power.
| Moment | Formula |
| --- | --- |
| First raw moment, a.k.a. the mean | $\mu = \mathrm{E}[X]$ |
| Second central moment, a.k.a. the variance | $\sigma^2 = \mathrm{E}[(X - \mu)^2]$ |
| The $n$-th moment about $c$ | $\mathrm{E}[(X - c)^n]$ |
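A sample-based sketch of moments in numpy (the function name is my own; it computes sample moments over data rather than moments of a known distribution):

```python
import numpy as np

def moment(x, n, about=0.0):
    """Sample estimate of the n-th moment about a point: E[(X - about)^n]."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - about) ** n)

data = [1.0, 2.0, 3.0, 4.0]
mean = moment(data, 1)                  # first raw moment
variance = moment(data, 2, about=mean)  # second central moment
```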
Norms measure the size of a vector.
The $L^\infty$ norm, a.k.a. the max norm, is defined as the maximum of the absolute values of a vector's components: $\lVert x \rVert_\infty = \max_i \lvert x_i \rvert$.
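As a sketch, the common vector norms in numpy (the $L^1$ and $L^2$ norms are included alongside the max norm for comparison):

```python
import numpy as np

def l1_norm(v):
    """L1 norm: sum of absolute values of the components."""
    return np.sum(np.abs(v))

def l2_norm(v):
    """L2 (Euclidean) norm: square root of the sum of squared components."""
    return np.sqrt(np.sum(np.asarray(v) ** 2))

def linf_norm(v):
    """L-infinity (max) norm: largest absolute component."""
    return np.max(np.abs(v))

v = np.array([3.0, -4.0])
```

`numpy.linalg.norm` computes all three directly via its `ord` parameter; the explicit definitions above are just for clarity.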
| Symbol | Meaning |
| --- | --- |
| $\sigma$ | sigmoid function, or standard deviation |