Machine Learning Reference
I often need to look up random bits of ML-related information. This post is a work-in-progress attempt to collect common machine learning terms and formulas in one central place. I plan to update it as I come across further useful pieces of information.
A type of ML model that attempts to sort an input into a discrete set of output classes. Logistic regression is a common algorithm for performing classification, though classification is the task and logistic regression is one way to do it, not a synonym for it.
Binary classification is a classification task where the model chooses between 2 classes. The model generally outputs a value between 0.0 and 1.0, and you have to define a threshold yourself: predictions above the threshold fall into one class and predictions below it into the other.
As a concrete example, in spam classification with a threshold of 0.95, whenever the model outputs a confidence above 0.95 you classify the input as spam; otherwise it's ham.
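A minimal sketch of that thresholding step (the function name and the 0.95 default are just for illustration):

```python
# Hypothetical sketch: turning a model's confidence score into a
# binary spam/ham label with a hand-picked threshold.
def classify(confidence, threshold=0.95):
    """Return "spam" if the model's confidence meets the threshold."""
    return "spam" if confidence >= threshold else "ham"
```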
A.k.a. kNN, is a supervised learning algorithm for classification and regression (it is often confused with clustering, but it requires labeled data). An input is labeled by a vote among the k closest points in the training set.
Is an unsupervised clustering algorithm. You pick the number of clusters you want, $k$, and the algorithm partitions your data points into $k$ clusters, assigning each point to the cluster whose center it is closest to.
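A tiny 1-D sketch of Lloyd's algorithm for k-means, assuming points are plain numbers (real implementations work on vectors and handle convergence checks):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k points as initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)
```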
An algorithm to generalize a binary logistic regression model to multiple classes without having to train separate models for each class. Suitable for multinomial regression.
The softmax function is also called the normalized exponential. It is used to highlight the largest values and suppress any values that are significantly smaller than the largest.
The cost function, also known as a loss or objective function, measures, given a prediction from a model, how far that prediction is from the true value, where 0 is a perfect prediction.
Log loss (cross-entropy) is a common cost function for classification. Compared to a quadratic cost, its gradients do not shrink when the output neuron saturates, so learning stays fast even when predictions are badly wrong.
| Cost function | Formula |
|---|---|
| Mean Squared Error | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Mean Absolute Error | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$ |
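Both costs are one-liners in practice, sketched here for lists of true and predicted values:

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

MSE punishes large errors much harder than MAE because of the squaring, which is the usual reason to pick one over the other.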
Introduced by Claude Shannon in 1948, information theory is the study of quantifying information in terms of bits of entropy. The amount of "information" inherent in something is a function of how many yes/no questions you have to answer in order to specify it. For instance, whether it's night or day can be answered in one yes/no question, meaning its information can be represented in 1 bit.
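The yes/no-question intuition above is just a base-2 logarithm, sketched here for equally likely outcomes:

```python
import math

def bits_of_information(n_outcomes):
    """Number of yes/no questions needed to pin down one of
    n equally likely outcomes: log2(n)."""
    return math.log2(n_outcomes)
```

So night-vs-day (2 outcomes) is 1 bit, and picking one of 8 equally likely options takes 3 yes/no questions.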
L1 and L2 Regularization
Regularization is a penalty term added to the cost function to discourage large weights and reduce overfitting.
L1 Regularization for regression is also called Lasso Regression. It adds the sum of the absolute values of all weights to the cost function. Lasso tends to drive many weights to exactly zero, so it works well when there are a large number of features and you expect only some of them to matter.
L2 Regularization for regression is also called Ridge Regression. It adds the sum of the squared magnitudes of all weights to the cost function.
It might be worth using both L1 and L2 regularization in the same model (a combination known as Elastic Net), as described here. "This gives you both the nuance of L2 and the sparsity encouraged by L1."
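A sketch of what the penalty terms look like when tacked onto a base cost; the `l1` and `l2` parameters are the regularization strengths (the lambdas), and the function name is just for illustration:

```python
def regularized_cost(base_cost, weights, l1=0.0, l2=0.0):
    """Add lasso (L1) and/or ridge (L2) penalty terms to a cost.
    Using both together is the elastic-net combination."""
    l1_term = l1 * sum(abs(w) for w in weights)   # lasso: sum of |w|
    l2_term = l2 * sum(w * w for w in weights)    # ridge: sum of w^2
    return base_cost + l1_term + l2_term
```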
Below is the backpropagation algorithm including the four fundamental equations of backpropagation. These are from Michael Nielsen’s book Neural Networks and Deep Learning.
- $L$ = number of layers
1. Input: set the activation $a^1$ of the input layer to the input $x$.
2. Feedforward: for each $l = 2, 3, \ldots, L$, compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Compute the error: $\delta^L = \nabla_a C \odot \sigma'(z^L)$
4. Backpropagate the error: for each $l = L-1, L-2, \ldots, 2$, compute $\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$
5. Output the gradient for all weights and biases: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$
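To make the four equations concrete without matrix bookkeeping, here is a sketch for the degenerate case of a chain of single-neuron layers (every quantity is a scalar), using the quadratic cost $C = \frac{1}{2}(a^L - y)^2$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Backpropagation through a chain of single-neuron layers with
    quadratic cost C = (a_L - y)^2 / 2. Returns the gradient of C
    with respect to every weight and bias."""
    # Feedforward: store all weighted inputs z and activations a.
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # BP1: output error (here grad_a C = a_L - y).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                    # BP3
        grad_w[l] = activations[l] * delta   # BP4
        if l > 0:
            # BP2: propagate the error back one layer.
            delta = weights[l] * delta * sigmoid_prime(zs[l - 1])
    return grad_w, grad_b
```

A useful sanity check on any backprop implementation is comparing its gradients against finite differences of the cost.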
Recurrent Neural Networks
Or RNNs, are networks that contain a cycle. In practice, the cycle is unrolled for a finite number of time steps; otherwise the network would never finish executing.
One big downside to RNNs is that their training cannot be parallelized across time steps. RNNs also require a huge amount of memory to train, because backpropagation through time must keep the state of every variable at each of the $t$ time steps in memory.
Long short-term memory or LSTM models are a class of RNN models that reduce the vanishing gradient problem and the exploding gradient problem. They do this by introducing “forget gates” into the RNN, which control how much information is passed back through the recurrence.
Vanishing gradient problem
Early layers in a neural network learn an order of magnitude slower than later layers, because the gradient shrinks as it is propagated back through each layer. This is especially a problem in RNNs, because unrolling the recurrence makes them effectively very deep.
| | Actual = True | Actual = False |
|---|---|---|
| Prediction = True | True Positive | False Positive |
| Prediction = False | False Negative | True Negative |
False positives are also called Type I errors. The false positive rate, or Type I error rate ($\alpha$), is the number of false positives over all actual negatives: $FPR = \frac{FP}{FP + TN}$. It equals $1 - \text{specificity}$.
False negatives are also called Type II errors. The false negative rate, or Type II error rate ($\beta$), is the number of false negatives over all actual positives: $FNR = \frac{FN}{FN + TP}$. It equals $1 - \text{sensitivity}$ (that is, $1 - \text{recall}$).
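Both rates fall straight out of the confusion-matrix counts; a sketch:

```python
def error_rates(tp, fp, fn, tn):
    """Type I and Type II error rates from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # Type I: false positives over all actual negatives
    fnr = fn / (fn + tp)  # Type II: false negatives over all actual positives
    return fpr, fnr
```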
Precision and Recall
Precision and recall trade off against each other: tuning a classifier to produce fewer false positives generally produces more false negatives, and vice versa.
In the context of classification, precision is the number of correct positive guesses over the total number of positive guesses made: $P = \frac{TP}{TP + FP}$. Recall is the number of correct positive guesses over the total number of positives that could have been found: $R = \frac{TP}{TP + FN}$. A classifier that returned True for every input would have 100% recall (no false negatives) but low precision, while a classifier that returned True only when it was completely certain could have 100% precision (no false positives) but low recall.
A PR Curve is a graph of the precision (y-axis) and the recall (x-axis) of the model. The PR curve is useful when the categories are imbalanced, for instance for a spam filter where most of the examples are not spam (a.k.a. ham) and relatively few examples are spam. A ROC curve will not accurately capture the performance of this model, because changes to the number of false positives, while relatively significant, will not change the false positive rate by much due to the overwhelming number of true negatives.
Moments measure the shape of a function.
- A moment around 0 is a raw moment.
- The first raw moment is the mean.
- A central moment is a moment around the mean.
- The second central moment is the variance.
The $n$-th moment of a function $f(x)$ about a point $c$ is defined as:

$$\mu_n = \int_{-\infty}^{\infty} (x - c)^n f(x)\,dx$$
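For a finite sample the integral becomes an average; a sketch of the raw and central sample moments:

```python
def moment(xs, n, c=0.0):
    """n-th sample moment of xs about the point c (c=0 gives the
    raw moment; the first raw moment is the mean)."""
    return sum((x - c) ** n for x in xs) / len(xs)

def central_moment(xs, n):
    """n-th moment about the mean (n=2 gives the variance)."""
    return moment(xs, n, c=moment(xs, 1))
```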
Norms measure the size of a vector.
| Infinity Norm | $\lVert x \rVert_\infty = \max_i \lvert x_i \rvert$, the maximum of the absolute values of its components. |
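A sketch of the general $L_p$ norm, with the infinity norm as the limiting case:

```python
def norm(v, p=2):
    """L-p norm of a vector: (sum |x_i|^p)^(1/p).
    p=float("inf") gives the infinity norm, max |x_i|."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1 / p)
```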
| $\sigma$ | sigmoid function, or standard deviation |