Machine Learning Reference

I often need to look up random bits of ML-related information. This post is a work-in-progress attempt to collect common machine learning terms and formulas in one central place, and I plan to update it as I come across further useful pieces of information. This reference is not intended to be exhaustive; quite the opposite, it is meant to be a concise, opinionated collection of the most relevant bits of ML knowledge for quick lookup.

Problem Framing

Machine learning problems come in three primary categories, depending on how well the labels (ground truth values) are known:

| Type | Description | Examples |
|------|-------------|----------|
| Supervised | The labels are always known. | Classification, Regression |
| Semi-supervised | Some labels are known, some are not. | Speech analysis, Protein sequencing |
| Unsupervised | There are no labels. | Clustering problems |

Below are the broad types of problems you can tackle with ML.

| Problem | Description | Examples |
|---------|-------------|----------|
| Classification | Sort inputs into a discrete set of output classes. | Spam detection |
| Regression | Given an input, output a prediction in a continuous space. | Housing price estimation |
| Clustering | Group inputs into a set of discrete groups. | Fraud detection |



k-Nearest Neighbors

A.k.a. kNN, a classification (and regression) algorithm for supervised learning. An input is labeled by a majority vote among the $k$ labeled training points closest to it.
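A minimal sketch of kNN classification in NumPy (the function name and data layout are illustrative, not from any particular library):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k closest points
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]
```

Note that kNN has no training step at all: the "model" is simply the stored training data.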

k-means Clustering

An unsupervised clustering algorithm. You pick the number of clusters you want, $k$, and the algorithm partitions the data into $k$ clusters, assigning each point to the cluster whose centroid it is closest to.
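Lloyd's algorithm, the standard way to fit k-means, alternates between assigning points and updating centroids. A toy sketch (not production code; it skips the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Because the initialization is random, results can vary between runs; real implementations typically restart several times and keep the best clustering.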


Softmax

An approach that generalizes logistic regression to multiple classes without having to train a separate binary classifier for each class. Suitable for multinomial regression.

The softmax function is also called the normalized exponential: $\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$. It highlights the largest values and suppresses any values that are significantly smaller than the largest.
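A sketch of the normalized exponential in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick and does not change the output:

```python
import numpy as np

def softmax(z):
    """Normalized exponential: exponentiate, then normalize to sum to 1."""
    # Subtract the max for numerical stability (cancels in the ratio)
    e = np.exp(z - np.max(z))
    return e / e.sum()
```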

Cost function

The cost function, also known as a loss or objective function, determines, given a prediction from a model, how close to perfect that prediction is, where 0 is a perfect prediction.

| Loss function | Formula |
|---------------|---------|
| Mean Squared Error | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Mean Absolute Error | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ |
| Log loss | $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$ |

Log loss is commonly used for classification because its gradient stays large on confidently wrong predictions, which speeds up learning.
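The three loss functions above can be written directly in NumPy (a sketch; real libraries add reduction modes and sample weighting):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy; y_hat holds predicted probabilities."""
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```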

Activation functions

| Activation function | Formula |
|---------------------|---------|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ |
| Tanh | $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ |
| ReLU | $\max(0, z)$ |
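A few common activation functions, written as NumPy one-liners (a sketch):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into (-1, 1)
    return np.tanh(z)

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)
```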

Information theory

Introduced by Claude Shannon in 1948, information theory is the study of bits of entropy. The amount of “information” inherent in something is a function of how many yes/no questions you have to answer in order to specify it. For instance, whether it’s night or day can be answered in one yes/no question, meaning its information can be represented in 1 bit.
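The yes/no-question intuition matches Shannon's entropy formula, $H = -\sum_i p_i \log_2 p_i$. A small sketch:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p)
    p = p[p > 0]  # terms with zero probability contribute nothing
    return -np.sum(p * np.log2(p))
```

A fair night-or-day coin flip ($p = [0.5, 0.5]$) gives exactly 1 bit; four equally likely outcomes give 2 bits, i.e. two yes/no questions.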


Regularization

Regularization is a term added to the cost function in order to prevent overfitting. The parameter $\lambda$ controls the strength of regularization, with a higher value meaning more regularization.

| Regularization Type | Formula |
|---------------------|---------|
| L1 Regularization | $\lambda \sum_i \lvert w_i \rvert$ |
| L2 Regularization | $\lambda \sum_i w_i^2$ |

L1 regularization for regression is also called Lasso Regression. It adds the sum of the absolute values of all weights to the cost function. Lasso tends to work better when there are a large number of features, since it drives irrelevant weights to exactly zero.

L2 regularization for regression is also called Ridge Regression. It adds the sum of the squared magnitudes of all weights to the loss function.

It might be worth using both L1 and L2 regularization in the same model, a combination known as Elastic Net. “This gives you both the nuance of L2 and the sparsity encouraged by L1.”
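The penalty terms can be sketched as small helpers (illustrative names; `lam` is the regularization strength $\lambda$):

```python
import numpy as np

def l1_penalty(w, lam):
    """Lasso term: lambda times the sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """Ridge term: lambda times the sum of squared weights."""
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam1, lam2):
    """Elastic Net: both penalties added to the cost together."""
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)
```

Either penalty is simply added to the data loss before taking gradients; only the weights, not the biases, are typically penalized.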


Optimization Algorithms

Where $\theta_t$ is the vector of parameters at step $t$, $g_t$ is the gradient at step $t$, and $\eta$ is the learning rate.

| Learning algorithm | Update rule |
|--------------------|-------------|
| Stochastic Gradient Descent (SGD) | $\theta_{t+1} = \theta_t - \eta g_t$ |
| Momentum (Sutskever et al., 2013) | $v_{t+1} = \mu v_t - \eta g_t$, $\quad \theta_{t+1} = \theta_t + v_{t+1}$ |
| RMSProp (Hinton et al., 2012) | $E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$, $\quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t$ |

Where $\gamma$ is the discounting factor and $\mu$ is the momentum coefficient.
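The update rules can be sketched as small step functions (the hyperparameter defaults here are common choices, not canonical values):

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Plain SGD: step against the gradient."""
    return theta - lr * grad

def momentum_step(theta, v, grad, lr, mu=0.9):
    """Classical momentum: accumulate a velocity, then step along it."""
    v = mu * v - lr * grad
    return theta + v, v

def rmsprop_step(theta, avg_sq, grad, lr, gamma=0.9, eps=1e-8):
    """RMSProp: scale the step by a running average of squared gradients."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2
    return theta - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq
```

For example, minimizing $f(\theta) = \theta^2$ (gradient $2\theta$) with any of these drives $\theta$ toward zero.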

Backpropagation steps

Below is the backpropagation algorithm including the four fundamental equations of backpropagation. These are from Michael Nielsen’s book Neural Networks and Deep Learning.

| Step | Formula |
|------|---------|
| Given a neural network with $L$ layers, set the activations of the input layer to the inputs | $a^1 = x$ |
| For all layers after the input layer, $l = 2, 3, \ldots, L$: compute the pre-activation value | $z^l = w^l a^{l-1} + b^l$ |
| Apply the non-linearity | $a^l = \sigma(z^l)$ |
| Compute the error for the output layer (BP1) | $\delta^L = \nabla_a C \odot \sigma'(z^L)$ |
| Backpropagate the error: for all layers going backwards starting from the penultimate layer, $l = L-1, \ldots, 2$ (BP2) | $\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$ |
| Output the gradient for all weights (BP4) | $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ |
| And biases (BP3) | $\frac{\partial C}{\partial b^l_j} = \delta^l_j$ |
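The four equations can be sketched for a two-layer sigmoid network with cost $C = \frac{1}{2}\lVert a^2 - y \rVert^2$ (a toy illustration; the layer sizes and variable names are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward and backward pass of a 2-layer sigmoid network
    with cost C = 0.5 * ||a2 - y||^2."""
    # Forward pass: pre-activations z, activations a
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    # BP1: error at the output layer; sigma'(z) = a * (1 - a)
    delta2 = (a2 - y) * a2 * (1 - a2)
    # BP2: backpropagate the error to the hidden layer
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    # BP3 and BP4: gradients with respect to biases and weights
    grads = {
        "W2": np.outer(delta2, a1), "b2": delta2,
        "W1": np.outer(delta1, x), "b1": delta1,
    }
    cost = 0.5 * np.sum((a2 - y) ** 2)
    return cost, grads
```

A good sanity check for any backprop implementation is to compare each analytic gradient against a finite-difference estimate of the cost.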

Recurrent Neural Networks

Or RNNs, are networks that contain a cycle. In practice, this cycle recurs only a finite number of times, otherwise the network would never finish execution.

One big downside to RNNs is that their training cannot be parallelized across time steps. RNNs also require a huge amount of memory to train because backpropagation through time must keep the state of all variables at every time step in memory.


Long Short-Term Memory

Long short-term memory, or LSTM, models are a class of RNN models that reduce the vanishing gradient problem and the exploding gradient problem. They do this by introducing “forget gates” into the RNN, which control how much information is passed back through the recurrence.

Vanishing gradient problem

Early layers in a neural network can learn orders of magnitude slower than later layers, because the gradient shrinks each time it is propagated back through a layer. This is especially a problem in RNNs because unrolling the recurrence makes them effectively very deep.


Model Evaluation

True/False Positives/Negatives

|   | Actual = True | Actual = False |
|---|---------------|----------------|
| Prediction = True | True Positive | False Positive |
| Prediction = False | False Negative | True Negative |

False positives are also called Type I errors. The false positive rate, or Type I error rate, $\alpha$, is the number of false positives over all actual negative values: $FPR = \frac{FP}{FP + TN}$. It equals $1 - \text{specificity}$.

False negatives are also called Type II errors. The false negative rate, or Type II error rate, $\beta$, is the number of false negatives over all actual positive values: $FNR = \frac{FN}{FN + TP}$. It equals $1 - \text{sensitivity}$, i.e. one minus the recall.

Precision and Recall

Precision and recall represent a tradeoff between having more false positives or more false negatives.

In the context of classification, precision refers to the number of correct positive guesses over the total number of positive guesses made: $\text{precision} = \frac{TP}{TP + FP}$. Recall is the fraction of actual positives the model found: $\text{recall} = \frac{TP}{TP + FN}$. A classifier that returned True for every input would have 100% recall (no false negatives) but low precision; a classifier that returned True only for its single most confident input would likely have 100% precision (no false positives) but low recall.

The F1 score is the harmonic mean of the precision and the recall, defined as: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
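Computing all three from confusion-matrix counts (a sketch that ignores the zero-division edge cases):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```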

A ROC Curve is a graph of the false positive rate (x-axis) against the true positive rate (y-axis). The area under the ROC curve (ROC-AUC) is a useful way of comparing the relative performance of different models.

A PR Curve is a graph of the precision (y-axis) and the recall (x-axis) of the model. The PR curve is useful when the categories are imbalanced, for instance for a spam filter where most of the examples are not spam (a.k.a. ham) and relatively few examples are spam. A ROC curve will not accurately capture the performance of this model because changes to the number of false positives, while relatively significant, will not change the false positive rate by much due to the overwhelming number of true negatives.



Moments

Moments measure the shape of a function. The $n$-th moment of a random variable is the expected value of the (possibly centered) variable raised to the $n$-th power.

| Moment | Formula |
|--------|---------|
| First raw moment, a.k.a. the mean | $\mu = E[X]$ |
| Second central moment, a.k.a. the variance | $\sigma^2 = E[(X - \mu)^2]$ |
| The $n$-th moment about $c$ | $E[(X - c)^n]$ |
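Sample versions of these moments in NumPy (a sketch; `moment_about` is an illustrative helper, not a library function):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

mean = np.mean(x)                    # first raw moment
variance = np.mean((x - mean) ** 2)  # second central moment

def moment_about(x, c, n):
    """n-th sample moment about the point c."""
    return np.mean((x - c) ** n)
```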

Vector Norms

Norms measure the size of a vector.

| Norm | Formula |
|------|---------|
| L1 Norm | $\lVert x \rVert_1 = \sum_i \lvert x_i \rvert$ |
| L2 Norm | $\lVert x \rVert_2 = \sqrt{\sum_i x_i^2}$ |
| Infinity Norm | $\lVert x \rVert_\infty = \max_i \lvert x_i \rvert$ |

The infinity norm is defined as the maximum of the absolute values of the vector’s components.
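These can be checked in NumPy (equivalently via `np.linalg.norm` with `ord=1`, `ord=2`, and `ord=np.inf`):

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.sum(np.abs(v))        # |3| + |-4| = 7
l2 = np.sqrt(np.sum(v ** 2))  # sqrt(9 + 16) = 5
linf = np.max(np.abs(v))      # largest absolute component = 4
```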



Notation

| Symbol | Description |
|--------|-------------|
| $C$ | cost function |
| $\eta$ | learning rate |
| $\lambda$ | regularization parameter |
| $\sigma$ | sigmoid function, or standard deviation |
| $g$ | activation function |
| $y$ | outputs (true) |
| $\hat{y}$ | outputs (predicted) |
