Machine Learning Reference

I often need to look up random bits of ML-related information. This post is a work-in-progress attempt to collect common machine learning terms and formulas in one central place. I plan to update it as I come across further useful pieces of information.

Learning Resources

Problem Framing


Classification

A type of ML model that sorts an input into one of a discrete set of output classes. Logistic regression is a common classification algorithm, despite the word “regression” in its name.

Binary classification is a classification task where the model has 2 choices. The model generally outputs a value between 0.0 and 1.0, and you have to define a threshold yourself: predictions above the threshold are assigned to group 1 and predictions below it to group 2.

As a concrete example, in spam classification, if you define your threshold as 0.95 and the model outputs a confidence above 0.95, you classify the input as spam; otherwise it’s ham.
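The thresholding step can be sketched in a few lines of Python. The function name `classify` and the example scores are made up for illustration; only the 0.95 threshold comes from the example above.

```python
def classify(score: float, threshold: float = 0.95) -> str:
    """Map a model score in [0.0, 1.0] to a label using a fixed threshold."""
    # Scores strictly above the threshold count as spam, everything else as ham.
    return "spam" if score > threshold else "ham"


scores = [0.99, 0.95, 0.40]  # hypothetical model outputs
labels = [classify(s) for s in scores]
```

Raising the threshold trades false positives for false negatives, which is why it is usually tuned on held-out data rather than fixed up front.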



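k-Nearest Neighbors

A.k.a. kNN, is a supervised learning algorithm for classification and regression. A new point is labeled by a majority vote (or, for regression, an average) of its $k$ nearest labeled training points.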
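A minimal sketch of kNN used as a classifier, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy data are illustrative, not taken from any library.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points. `train` is a list of (features, label) pairs."""
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: two well-separated groups.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
```

Note that there is no training step: all the work happens at prediction time, which is why kNN gets slow on large datasets without an index structure.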

k-means Clustering

Is an unsupervised clustering algorithm. You pick the number of clusters you want, $k$, and the algorithm partitions your data points into $k$ clusters so that each point belongs to the cluster whose centroid (mean) is nearest to it.
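One common way to compute this is Lloyd's algorithm, sketched below in plain Python. It assumes Euclidean distance and takes the initial centroids as an argument so the result is deterministic; real implementations usually pick them randomly. The name `kmeans` and the sample points are illustrative.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```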


Softmax

Softmax regression generalizes logistic regression to more than two classes, so that a single model replaces a separate binary classifier trained for each class. Suitable for multinomial classification.

The softmax function is also called the normalized exponential. It is used to highlight the largest values and suppress any values that are significantly smaller than the largest.
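As a small sketch, the softmax of a vector $z$ is $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The version below subtracts the maximum before exponentiating, a standard trick that avoids overflow without changing the result. The input values are arbitrary.

```python
import math


def softmax(logits):
    """Normalized exponential: exp(z_i) / sum_j exp(z_j)."""
    shift = max(logits)  # subtracting the max avoids overflow; the output is unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 5.0])  # the largest input dominates the output
```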

Cost function

The cost function, also known as a loss or objective function, determines, given a prediction from a model, how close to perfect that prediction is, where 0 is a perfect prediction.

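Log loss (cross-entropy) is a common cost function for classifiers. Paired with a sigmoid or softmax output, it avoids the learning slowdown that squared error suffers when the output neurons saturate.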

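Loss function         Formula
Mean Squared Error    $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Mean Absolute Error   $\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Log loss              $-\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$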
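The three losses above can be sketched directly in Python, using their standard definitions. The log loss here is the binary version and assumes labels of 0 or 1 and predicted probabilities; the `eps` clipping is an added safeguard against `log(0)`, not part of the definition.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss; y_true holds 0/1 labels, y_pred holds probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

All three return 0 (or very nearly 0, for the clipped log loss) for a perfect prediction.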

Activation functions

Activation function Formula

Information theory

Introduced by Claude Shannon in 1948, information theory studies how to quantify information, measured in bits of entropy. The amount of “information” inherent in something is a function of how many yes/no questions you have to answer in order to specify it. For instance, whether it’s night or day can be answered in one yes/no question, meaning its information can be represented in 1 bit.
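To make the “yes/no questions” intuition concrete, here is a small sketch of Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, for a discrete distribution. The night/day case is modeled as two equally likely outcomes, which is an assumption made for the example.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


night_or_day = entropy_bits([0.5, 0.5])   # one yes/no question
one_of_eight = entropy_bits([1 / 8] * 8)  # three yes/no questions
```

A skewed distribution such as `[0.9, 0.1]` has less than 1 bit of entropy, because the answer is usually predictable.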

L1 and L2 Regularization

Regularization is a penalty term added to the cost function that discourages large weights, which reduces overfitting.

L1 Regularization for regression is also called Lasso Regression. It adds the sum of the absolute values of all weights, scaled by a coefficient $\lambda$, to the cost function. Because it drives some weights to exactly zero, Lasso is useful when there are a large number of features, many of which are irrelevant.

L2 Regularization for regression is also called Ridge Regression. It adds the sum of the squares of all weights, scaled by $\lambda$, to the cost function.

It might be worth using both L1 and L2 regularization in the same model, an approach known as Elastic Net regularization: “This gives you both the nuance of L2 and the sparsity encouraged by L1.”
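A rough sketch of how the penalties combine with a data loss. It assumes the penalties are applied to the model's weights and scaled by coefficients `lam1` and `lam2`; these names and the example numbers are illustrative.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_loss, weights, lam1=0.0, lam2=0.0):
    """Data loss plus L1 and L2 penalties; using both is the combined form."""
    return data_loss + l1_penalty(weights, lam1) + l2_penalty(weights, lam2)


weights = [1.0, -2.0]
cost = regularized_cost(1.0, weights, lam1=0.5, lam2=0.5)
```

Setting `lam2=0` recovers Lasso and `lam1=0` recovers Ridge.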



Backpropagation

Below is the backpropagation algorithm including the four fundamental equations of backpropagation. These are from Michael Nielsen’s book Neural Networks and Deep Learning.

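1. Input: set the activations of the input layer, $a^1$, to the inputs $x$.

2. Feedforward: for each layer $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.

3. Compute the error: $\delta^L = \nabla_a C \odot \sigma'(z^L)$

4. Backpropagate the error: for each layer $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$

5. Output the gradient for all weights and biases: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$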
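The five steps above can be sketched in plain Python for a small fully connected network. This is a sketch under stated assumptions, not Nielsen's code: it assumes sigmoid activations in every layer and the quadratic cost $C = \frac{1}{2} \|a^L - y\|^2$, stores each weight matrix as a list of rows, and uses made-up weights and inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def feedforward(weights, biases, x):
    """Steps 1-2: return weighted inputs z^l and activations a^l per layer."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, activations[-1])) + b_j
             for row, b_j in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations


def cost(weights, biases, x, y):
    """Quadratic cost C = 0.5 * ||a^L - y||^2 (an assumption of this sketch)."""
    _, activations = feedforward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))


def backprop(weights, biases, x, y):
    """Steps 3-5: gradients of the cost for every weight and bias."""
    zs, activations = feedforward(weights, biases, x)
    # Step 3, output error: delta^L = (a^L - y) * sigma'(z^L)
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        # Step 5: dC/db_j = delta_j and dC/dw_jk = a_k (previous layer) * delta_j
        grad_b[l] = delta
        grad_w[l] = [[d_j * a_k for a_k in activations[l]] for d_j in delta]
        if l > 0:
            # Step 4: push the error back one layer through the transposed weights.
            delta = [
                sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                * sigmoid_prime(zs[l - 1][k])
                for k in range(len(zs[l - 1]))
            ]
    return grad_w, grad_b


# Hypothetical 2-2-1 network and a single training example.
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
biases = [[0.0, 0.1], [-0.1]]
x, y = [1.0, 0.5], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)
```

A useful sanity check for any backpropagation code is to compare one entry of the gradient with a finite-difference estimate of the cost.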

Recurrent Neural Networks

Or RNNs, are networks that contain a cycle. In practice, this cycle only recurs a finite number of times, otherwise the network would never finish execution.

One big downside to RNNs is that training cannot be parallelized across time steps, because each step depends on the state produced by the previous one. RNNs also require a large amount of memory to train because the state at every time step must be kept in memory for backpropagation.
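A toy sketch of the recurrence, assuming a scalar cell of the form $h_t = \tanh(w_x x_t + w_h h_{t-1} + b)$. The cell form, parameter names and input values are illustrative; real RNNs use vectors and matrices.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Unroll a scalar recurrent cell over a finite input sequence."""
    states = [h0]
    for x_t in inputs:
        # Each step needs the previous state, so the steps must run in order.
        states.append(math.tanh(w_x * x_t + w_h * states[-1] + b))
    # Every intermediate state is returned because backpropagation
    # through time needs all of them.
    return states


states = rnn_forward([1.0, 0.5, -1.0], w_x=0.5, w_h=0.8, b=0.0)
```

The loop makes both downsides visible: it cannot be turned into a parallel map over time steps, and the `states` list grows with the sequence length.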


Long Short-Term Memory

Long short-term memory or LSTM models are a class of RNN models that reduce the vanishing gradient problem (exploding gradients can still occur and are usually handled separately, for example with gradient clipping). They do this by introducing gates into the RNN, including “forget gates”, which control how much information is passed through the recurrence.

Vanishing gradient problem

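Early layers in a deep neural network can learn orders of magnitude slower than later layers, because the gradient shrinks as it is multiplied back through each layer. This is especially a problem in RNNs, which are effectively very deep once unrolled over many time steps.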
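A back-of-the-envelope sketch of why this happens, under a strong simplification: a chain of single-neuron sigmoid layers that all share the same weight and weighted input. The gradient reaching the first layer is then scaled by a product of $w \cdot \sigma'(z)$ factors, and $\sigma'(z)$ is at most 0.25.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(depth, weight=1.0, z=0.0):
    """Factor by which the gradient is scaled after passing back through
    `depth` identical single-neuron sigmoid layers (a simplifying assumption)."""
    return (weight * sigmoid_prime(z)) ** depth


shallow = gradient_scale(1)   # sigma'(0) = 0.25, the largest possible value
deep = gradient_scale(10)     # 0.25 ** 10, already below one millionth
```

With weights large enough that each factor exceeds 1, the same product grows instead of shrinking, which is the exploding gradient problem.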


Model Evaluation

True/False Positives/Negatives

                    Actual = True     Actual = False
Prediction = True   True Positive     False Positive
Prediction = False  False Negative    True Negative

False positives are also called Type I errors. The false positive rate, or Type I error rate, a.k.a. fall-out or $\alpha$, is the number of false positives over all actual negatives: $FPR = \frac{FP}{FP + TN}$. It equals 1 minus the specificity.

False negatives are also called Type II errors. The false negative rate, or Type II error rate, a.k.a. miss rate or $\beta$, is the number of false negatives over all actual positives: $FNR = \frac{FN}{FN + TP}$. It equals 1 minus the sensitivity.
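A small sketch that counts the four outcomes and derives both rates, using the standard definitions $FPR = FP / (FP + TN)$ and $FNR = FN / (FN + TP)$. The labels and predictions are made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fp, fn, tn


def false_positive_rate(fp, tn):
    return fp / (fp + tn)  # false positives over all actual negatives


def false_negative_rate(fn, tp):
    return fn / (fn + tp)  # false negatives over all actual positives


y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
fpr = false_positive_rate(fp, tn)
fnr = false_negative_rate(fn, tp)
```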

Precision and Recall

Precision and Recall are a tradeoff between having more false positives or more false negatives.

In the context of classification, precision is the number of correct positive guesses over the total number of positive guesses made, defined as $\frac{TP}{TP + FP}$. Recall is the number of correct positive guesses over the total number of actual positives, defined as $\frac{TP}{TP + FN}$. A classifier that returned True for every input would have 100% recall (no false negatives) but low precision. A classifier that returned True only for the few inputs it is most certain about would tend to have high precision (few false positives) but low recall.

The F1 score is the harmonic mean of the precision and the recall, defined as: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
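In code, with precision $TP / (TP + FP)$ and recall $TP / (TP + FN)$, the F1 score is $2PR / (P + R)$. A minimal sketch with made-up counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
f1 = f1_score(p, r)
```

Because it is a harmonic mean, F1 sits closer to the smaller of the two values, so a model cannot score well by maximizing only one of them.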

A ROC Curve is a graph of the false positive rate (x-axis) and the true positive rate (y-axis) as the classification threshold varies. The area under the ROC curve (ROC-AUC) is a useful way of comparing the relative performance of different models.
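A sketch of how the curve and its area can be computed by sweeping the threshold from the highest score to the lowest. It assumes 0/1 labels, distinct scores, and at least one example of each class; ties would need extra handling. The data is made up.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping the threshold over the scores,
    highest first. Assumes distinct scores and both classes present."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
area = auc(roc_points(y_true, scores))
```

The area equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, here 8 of 9.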

A PR Curve is a graph of the precision (y-axis) and the recall (x-axis) of the model. The PR curve is useful when the categories are imbalanced, for instance for a spam filter where most of the examples are not spam (a.k.a. ham) and relatively few examples are spam. A ROC curve will not accurately capture the performance of this model because changes to the number of false positives, while relatively significant, will not change the false positive rate by much due to the overwhelming number of true negatives.



Moments

Moments measure the shape of a function.

The $n$-th moment of a function $f(x)$ about a value $c$ is defined as: $\mu_n = \int_{-\infty}^{\infty} (x - c)^n f(x) \, dx$
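For a finite sample, the analogous quantity is the average of $(x - c)^n$ over the data. The sketch below uses that discrete form, which is an assumption on my part rather than the definition above; the function name and data are illustrative.

```python
def sample_moment(values, n, c=0.0):
    """n-th sample moment about c: the mean of (x - c) ** n."""
    return sum((x - c) ** n for x in values) / len(values)


data = [1.0, 2.0, 3.0, 4.0]
mean = sample_moment(data, 1)               # first moment about 0
variance = sample_moment(data, 2, c=mean)   # second moment about the mean
```

The third and fourth moments about the mean, suitably normalized, give skewness and kurtosis.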

Vector Norms

Norms measure the size of a vector.

Norm            Formula
L1 Norm         $\|x\|_1 = \sum_i |x_i|$
L2 Norm         $\|x\|_2 = \sqrt{\sum_i x_i^2}$
Infinity Norm   $\|x\|_\infty = \max_i |x_i|$, the maximum of the absolute values of its components.
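The three norms are one-liners in plain Python. A small sketch, with an arbitrary example vector:

```python
import math


def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """Euclidean length."""
    return math.sqrt(sum(v * v for v in x))


def inf_norm(x):
    """Largest absolute component."""
    return max(abs(v) for v in x)


x = [3.0, -4.0]
norms = (l1_norm(x), l2_norm(x), inf_norm(x))
```

For any vector the three are ordered: infinity norm ≤ L2 norm ≤ L1 norm.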



Symbol      Description
$C$         cost function
$\eta$      learning rate
$\sigma$    sigmoid function, or standard deviation
$a$         activation function
$y$         outputs (true)
$\hat{y}$   outputs (predicted)

