Machine Learning Reference
I often need to look up random bits of ML-related information. This post is a work-in-progress attempt to collect common machine learning terms and formulas into one central place. I plan on updating this post as I come across further useful pieces of information.
Learning Resources
 Google’s ML Crash Course.
 Neural Networks and Deep Learning by Michael Nielsen.
Problem Framing
Classification
A type of ML model that attempts to sort an input into a discrete set of output classes. Logistic regression is a common way of building a classifier.
Binary classification is a classification task where the model chooses between 2 classes. The model generally outputs a value between 0.0 and 1.0, and you have to define a threshold yourself: above the threshold you consider the prediction to be in one class, and below it the other.
As a concrete example, in spam classification, if you define your threshold as 0.95 and the model outputs a confidence above 0.95, you classify the input as spam; otherwise it’s ham.
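The spam example above can be sketched in a few lines of Python (the threshold value and label names are just this post’s example; the function name is my own):

```python
SPAM_THRESHOLD = 0.95  # threshold from the example above; an assumption, not a universal value

def classify(score):
    """Map a model's confidence score in [0.0, 1.0] to a label."""
    return "spam" if score > SPAM_THRESHOLD else "ham"

print(classify(0.99))  # → spam
print(classify(0.20))  # → ham
```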
Modeling
k-Nearest Neighbors
A.k.a. k-NN, a supervised learning algorithm that labels an input based on the labels of the $k$ training points closest to it.
k-means Clustering
An unsupervised clustering algorithm. You pick the number of clusters you want, $k$, and the algorithm groups all your data points into $k$ clusters, assigning each point to the cluster whose center it is closest to.
Softmax
An algorithm that generalizes logistic regression to multiple classes without having to train a separate binary classifier for each class. Suitable for multinomial regression.
The softmax function is also called the normalized exponential. It highlights the largest values and suppresses values that are significantly smaller than the largest.
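The normalized exponential can be sketched in pure Python (the max subtraction is a standard numerical-stability trick, an addition of mine rather than part of the definition above):

```python
import math

def softmax(logits):
    """Normalized exponential: exponentiate each value, then divide by the total."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # the largest logit dominates; the outputs sum to 1
```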
Cost function
The cost function, also known as a loss or objective function, determines, given a prediction from a model, how far from perfect that prediction is, where 0 is a perfect prediction.
Log loss is a common cost function for classification; compared with squared error, it speeds up learning by avoiding the slowdown caused by small sigmoid gradients.
| Loss function | Formula |
| --- | --- |
| Mean Squared Error | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Mean Absolute Error | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$ |
| Log loss | $-(y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}))$ |
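The three loss functions above as a minimal pure-Python sketch (function names are my own; log loss here takes a single true label and predicted probability, matching the formula):

```python
import math

def mse(y, y_hat):
    """Mean Squared Error over paired lists of true and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean Absolute Error over paired lists of true and predicted values."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def log_loss(y, y_hat):
    """Log loss for one example: y is 0 or 1, y_hat is a probability in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(mse([0.0, 1.0], [0.0, 0.0]))  # → 0.5
print(mae([0.0, 1.0], [0.0, 0.0]))  # → 0.5
print(log_loss(1, 0.9))             # small, since the prediction is confident and correct
```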
Activation functions
| Activation function | Formula |
| --- | --- |
| ReLU | $\max(0, x)$ |
| logistic | $\frac{1}{1 + e^{-x}}$ |
| tanh | $\frac{2}{1 + e^{-2x}} - 1$ |
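The table’s formulas translated directly to pure Python (a minimal sketch; `math.tanh` already exists in the standard library, but the version below follows the table’s formula):

```python
import math

def relu(x):
    return max(0.0, x)

def logistic(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    # Same as math.tanh, written to match the formula in the table.
    return 2 / (1 + math.exp(-2 * x)) - 1

print(relu(-3.0))    # → 0.0
print(logistic(0.0)) # → 0.5
print(tanh(0.0))     # → 0.0
```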
Information theory
Introduced by Claude Shannon in 1948, information theory is the study of quantifying information in bits of entropy. The amount of “information” inherent in something is a function of how many yes/no questions you have to answer in order to specify it. For instance, whether it’s night or day can be answered in one yes/no question, meaning its information can be represented in 1 bit.
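The yes/no-question framing above corresponds to a base-2 logarithm: $n$ equally likely outcomes need $\log_2 n$ bits. A minimal sketch (the function name is my own):

```python
import math

def bits_needed(n_outcomes):
    """Yes/no questions needed to pin down one of n equally likely outcomes."""
    return math.log2(n_outcomes)

print(bits_needed(2))  # → 1.0  (night or day)
print(bits_needed(8))  # → 3.0  (e.g. one of 8 equally likely choices)
```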
L1 and L2 Regularization
Regularization is a penalty term added to the cost function to discourage large model coefficients.
L1 Regularization for regression is also called Lasso Regression. It adds the sum of the absolute values of all coefficients to the cost function. Lasso works better when there are a large number of features, since it tends to drive many coefficients to exactly zero.
$(\text{loss function}) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$
L2 Regularization for regression is also called Ridge Regression. It adds the sum of the squared magnitudes of all coefficients to the loss function.
$(\text{loss function}) + \lambda \sum_{j=1}^{p} \beta_j^2$
It might be worth using both L1 and L2 regularization in the same model, as described here. “This gives you both the nuance of L2 and the sparsity encouraged by L1.”
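The two penalty terms can be sketched in a few lines (function names and the combined form are my own; setting both $\lambda$s nonzero mixes L1 and L2 in the spirit of the quote above):

```python
def l1_penalty(coeffs, lam):
    """Lasso penalty: lambda times the sum of absolute coefficient values."""
    return lam * sum(abs(b) for b in coeffs)

def l2_penalty(coeffs, lam):
    """Ridge penalty: lambda times the sum of squared coefficient values."""
    return lam * sum(b ** 2 for b in coeffs)

def regularized_loss(base_loss, coeffs, lam1=0.0, lam2=0.0):
    """Add one or both penalties to an already-computed loss value."""
    return base_loss + l1_penalty(coeffs, lam1) + l2_penalty(coeffs, lam2)

print(l1_penalty([1.0, -2.0], 0.5))  # → 1.5
print(l2_penalty([1.0, -2.0], 0.5))  # → 2.5
```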
Learning
Backpropagation
Below is the backpropagation algorithm including the four fundamental equations of backpropagation. These are from Michael Nielsen’s book Neural Networks and Deep Learning.
 $L$ = number of layers
1. Input: set the activations of the input layer to the inputs:
$a^1 = x$
2. Feedforward: for each $l = 2, 3, \ldots, L$:
$z^l = w^l a^{l-1} + b^l,$
$a^l = \sigma(z^l)$
3. Compute the output error:
$\delta^L = \nabla_a C \odot \sigma'(z^L)$
4. Backpropagate the error: for each $l = L-1, L-2, \ldots, 2$:
$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
5. Output the gradient for all weights and biases:
$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j, \quad \frac{\partial C}{\partial b^l_j} = \delta^l_j$
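The five steps can be sketched for the simplest possible case: a network with one neuron per layer, so every quantity is a scalar and the $\odot$ products become ordinary multiplication. The quadratic cost $C = \frac{1}{2}(a^L - y)^2$ and all parameter values below are assumptions for this sketch, not from the book:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# One neuron per layer, L = 3 layers, quadratic cost C = (a^L - y)^2 / 2.
weights = {2: 0.5, 3: -0.3}  # w^l for layers 2..L (illustrative values)
biases = {2: 0.1, 3: 0.2}    # b^l for layers 2..L
x, y = 1.0, 0.0

# Steps 1-2: input and feedforward.
a = {1: x}
z = {}
for l in (2, 3):
    z[l] = weights[l] * a[l - 1] + biases[l]
    a[l] = sigmoid(z[l])

# Step 3: output error. For quadratic cost, dC/da^L = (a^L - y).
delta = {3: (a[3] - y) * sigmoid_prime(z[3])}

# Step 4: backpropagate the error one layer.
delta[2] = weights[3] * delta[3] * sigmoid_prime(z[2])

# Step 5: gradients for every weight and bias.
grad_w = {l: a[l - 1] * delta[l] for l in (2, 3)}
grad_b = {l: delta[l] for l in (2, 3)}
print(grad_w, grad_b)
```

The gradients can be checked against finite differences of the cost, which is a useful sanity check when implementing backpropagation by hand.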
Recurrent Neural Networks
Recurrent neural networks, or RNNs, are networks that contain a cycle. In practice, this cycle only recurs a finite number of times; otherwise the network would never finish execution.
One big downside to RNNs is that their training cannot be parallelized across timesteps. RNNs also require a huge amount of memory to train, because backpropagation must keep the state of all variables in memory for each of the $k$ recurrence steps.
LSTM
Long short-term memory, or LSTM, models are a class of RNN models that reduce the vanishing gradient problem and the exploding gradient problem. They do this by introducing “forget gates” into the RNN, which control how much information is passed back through the recurrence.
Vanishing gradient problem
Early layers in a neural network learn an order of magnitude slower than later layers. This is especially a problem in RNNs because of their cyclical depth.
Model Evaluation
True/False Positives/Negatives
|  | Actual = True | Actual = False |
| --- | --- | --- |
| Prediction = True | True Positive | False Positive |
| Prediction = False | False Negative | True Negative |
False positives are also called Type I errors. The false positive rate, or Type I error rate, denoted $\alpha$, is the number of false positives over all actual negatives: $\alpha = \frac{FP}{FP + TN}$.
False negatives are also called Type II errors. The false negative rate, or Type II error rate, denoted $\beta$, is the number of false negatives over all actual positives: $\beta = \frac{FN}{TP + FN}$.
Precision and Recall
Precision and recall capture a tradeoff between having more false positives or more false negatives.
$\text{precision} = \frac{TP}{TP + FP}$
$\text{recall} = \frac{TP}{TP + FN}$
In the context of classification, precision refers to the number of correct positive guesses over the total number of positive guesses made, defined as $\frac{TP}{TP + FP}$. Recall is the fraction of correct guesses over the total number of possible correct guesses, defined as $\frac{TP}{TP + FN}$. A classifier that returned True for every input would have 100% recall (no false negatives) but low precision (many false positives); a classifier that returned True only for its single most confident input would likely have 100% precision (no false positives) but low recall.
The F1 score is the harmonic mean of the precision and the recall, defined as:
$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
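These metrics can be sketched directly from raw confusion-matrix counts (function names and the example counts are my own):

```python
def precision(tp, fp):
    """Correct positive guesses over all positive guesses made."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correct positive guesses over all actual positives."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Example: 8 true positives, 2 false positives, 8 false negatives.
p = precision(8, 2)  # → 0.8
r = recall(8, 8)     # → 0.5
print(f1_score(p, r))
```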
A ROC Curve is a graph of the false positive rate (x-axis) against the true positive rate (y-axis). The area under the ROC curve (ROC AUC) is a useful way of comparing the relative performance of different models.
A PR Curve is a graph of the precision (y-axis) and the recall (x-axis) of the model. The PR curve is useful when the categories are imbalanced, for instance for a spam filter where most of the examples are not spam (a.k.a. ham) and relatively few examples are spam. A ROC curve will not accurately capture the performance of this model because changes to the number of false positives, while relatively significant, will not change the false positive rate by much due to the overwhelming number of true negatives.
Math
Moments
Moments measure the shape of a function.
 A moment around 0 is a raw moment.
 The first raw moment is the mean.
 A central moment is a moment around the mean.
 The second central moment is the variance.
The $n$ th moment about $c$ is defined as:
$\int_{-\infty}^{\infty} (x - c)^n f(x)\,dx$
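For a discrete sample, the integral becomes an average over the data points. A minimal sketch (the function name is my own) showing the first raw moment as the mean and the second central moment as the variance:

```python
def moment(xs, n, c=0.0):
    """Discrete n-th moment of a sample about c: the mean of (x - c)^n."""
    return sum((x - c) ** n for x in xs) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
mean = moment(xs, 1)            # first raw moment (about 0)
variance = moment(xs, 2, mean)  # second central moment (about the mean)
print(mean, variance)  # → 2.5 1.25
```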
Vector Norms
Norms measure the size of a vector.
| Norm | Formula |
| --- | --- |
| L1 Norm | $\sum_{i=1}^{n} \lvert v_i \rvert$ |
| L2 Norm | $\sqrt{\sum_{i=1}^{n} v_i^2}$ |
| Infinity Norm | $\max(\lvert v_i \rvert : i = 1, 2, \ldots, n)$, the maximum of the absolute values of its components |
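The three norms in the table as a minimal pure-Python sketch (function names are my own; NumPy’s `numpy.linalg.norm` covers all three in practice):

```python
import math

def l1_norm(v):
    return sum(abs(x) for x in v)

def l2_norm(v):
    return math.sqrt(sum(x ** 2 for x in v))

def inf_norm(v):
    return max(abs(x) for x in v)

v = [3.0, -4.0]
print(l1_norm(v), l2_norm(v), inf_norm(v))  # → 7.0 5.0 4.0
```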
Symbols
| Symbol | Description |
| --- | --- |
| $\alpha$ | weight |
| $\beta$ | bias |
| $C$ | cost function |
| $\eta$ | learning rate |
| $\mu$ | mean |
| $\sigma$ | sigmoid function, or standard deviation |
| $\phi$ | activation function |
| $x$ | inputs |
| $y$ | outputs (true) |
| $\hat{y}$ | outputs (predicted) |
History
 Updated 5/25/2020 to add Backpropagation and Symbols sections.