Brief: Machine Learning

Briefs are where I try to condense my notes on a subject into a concise list of explanations (Feynman Technique-style), each fewer than 250 words.

🚨 This brief is a work in progress. 🚨

Overview

Learning Resources


Binary classification

A classification task where the model has to choose between 2 classes. The model generally outputs a value between 0.0 and 1.0, and you have to define a threshold yourself: predictions above the threshold fall into one class, and predictions below it fall into the other.

As a concrete example, in spam classification with a threshold of 0.95, any input for which the model outputs a confidence above 0.95 is classified as spam; otherwise it’s ham.
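A minimal sketch of applying a threshold, assuming the model’s confidences are already in hand as a NumPy array (the scores and the 0.95 threshold are just illustrative):

```python
import numpy as np

# Hypothetical confidence scores from a binary spam classifier.
confidences = np.array([0.10, 0.97, 0.50, 0.96, 0.80])

threshold = 0.95

# Everything above the threshold is labeled spam (1), everything else ham (0).
predictions = (confidences > threshold).astype(int)
print(predictions)  # [0 1 0 1 0]
```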

BERT

Is a transformer-based language model that produces contextual representations of text, commonly used to preprocess text for downstream NLP tasks.
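As a sketch of what that looks like in practice, here is pulling contextual embeddings out of a pretrained BERT with the Hugging Face transformers library (the model name and example sentence are just placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained BERT and its matching tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the model.
inputs = tokenizer("Briefs are condensed notes.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual embedding per token.
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```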

Classification

A type of ML model that attempts to sort an input into a discrete set of output classes. When the number of classes is 2, it’s binary classification.

Logistic regression is one common algorithm for classification, despite the word “regression” in its name.

Cost function

The cost function measures, given a prediction from a model, how far that prediction is from perfect, where 0 is a perfect prediction.
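For instance, mean squared error is a common cost function for regression; a minimal sketch (the numbers are made up):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and true values; 0 means perfect."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 (perfect)
print(mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # ~0.417
```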

F1 score

Is the harmonic mean of the precision and the recall of a binary classifier.

F_1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

False negative

In binary classification, when your model outputs false but the answer is actually true.

\hat{y} = 0, y = 1

False positive

In binary classification, when your model outputs true but the answer is actually false.

\hat{y} = 1, y = 0

Information theory

Introduced by Claude Shannon in 1948, information theory is the study of quantifying information in terms of bits of entropy. The amount of “information” inherent in something is a function of how many yes/no questions you have to answer in order to specify it. For instance, whether it’s night or day can be answered in one yes/no question, meaning its information can be represented in 1 bit.
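A small sketch of computing entropy in bits from a probability distribution (the distributions are just examples):

```python
import numpy as np

def entropy_bits(probabilities):
    """Shannon entropy in bits: the average number of yes/no questions needed."""
    p = np.asarray(probabilities)
    p = p[p > 0]  # zero-probability outcomes contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))    # 1.0 bit, e.g. "is it night or day?"
print(entropy_bits([0.25] * 4))    # 2.0 bits, four equally likely outcomes
print(entropy_bits([0.99, 0.01]))  # ~0.08 bits, the outcome is nearly certain
```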

k-means Clustering

Is an unsupervised clustering algorithm. You pick the number of clusters you want, k, and the algorithm assigns every data point to one of the k clusters, iteratively moving each cluster’s centroid so that every point ends up in the cluster whose centroid is closest to it.
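A minimal sketch using scikit-learn (the data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.1, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two centroids
```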

kNN

kNN or k-Nearest-Neighbors is a supervised learning algorithm: to classify a new point, it looks at the labels of the k nearest points in the training data and takes a majority vote.
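A minimal sketch with scikit-learn (toy data):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 1-D points with binary labels.
X_train = [[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# A new point is labeled by a majority vote of its 3 nearest neighbors.
print(knn.predict([[1.7], [8.8]]))  # [0 1]
```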

L1 and L2 Regularization

Both add a penalty term to the cost function to discourage large coefficients and reduce overfitting.

L1 Regularization for regression is also called Lasso Regression. It adds the sum of the absolute values of all coefficients to the cost function. Because this tends to drive some coefficients all the way to zero, Lasso works better when there are a large number of features and only some of them matter.

\text{(loss function)} + \lambda \sum\limits_{j=1}^{p} |\beta_j|

L2 Regularization for regression is also called Ridge Regression. It adds the sum of the squared magnitudes of all coefficients to the loss function.

\text{(loss function)} + \lambda \sum\limits_{j=1}^{p} \beta^2_j

It might be worth using both L1 and L2 regularization in the same model, as described here. “This gives you both the nuance of L2 and the sparsity encouraged by L1.”
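A sketch of all three in scikit-learn, where alpha plays the role of λ (the values are arbitrary); ElasticNet is the L1-plus-L2 combination mentioned above:

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# alpha corresponds to the regularization strength lambda in the formulas above.
lasso = Lasso(alpha=0.1)                       # L1 penalty: encourages sparse coefficients
ridge = Ridge(alpha=0.1)                       # L2 penalty: shrinks coefficients smoothly
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # a mix of both penalties

# Toy data: y is roughly 2 * x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 2.0, 3.9, 6.1]

for model in (lasso, ridge, elastic):
    model.fit(X, y)
    print(type(model).__name__, model.coef_)
```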

LSTM

Long short-term memory or LSTM models are a class of RNN models that reduce the vanishing gradient problem and the exploding gradient problem. They do this by introducing gates (including a “forget gate”) into the RNN, which control how much information from earlier time steps is kept or discarded as it flows through the recurrence.
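A minimal Keras sketch of an LSTM for binary text classification (the vocabulary size, dimensions, and the task itself are just assumptions for illustration):

```python
import tensorflow as tf

# Token IDs -> embeddings -> LSTM -> single sigmoid output (e.g. spam vs. ham).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```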

PR Curve

For a binary classifier, the PR curve is a graph of the precision (y-axis) and the recall (x-axis) of the model. The PR curve is useful when the categories are imbalanced, for instance for a spam filter where most of the examples are ham, and there are relatively few examples of spam.
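A sketch of computing the curve’s points with scikit-learn (labels and scores are made up):

```python
from sklearn.metrics import precision_recall_curve

# True labels (1 = spam) and the model's confidence scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Each (recall, precision) pair is one point on the PR curve.
for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")
```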

Precision

In the context of classification, precision refers to the number of correct positive guesses over the total number of positive guesses made. Compare to recall. A classifier that only returned True when it was extremely confident could have high precision (few false positives) but low recall.

\frac{TP}{TP + FP}

Recall

In the context of classification, recall is the fraction of all actual positives that the model correctly identifies. A classifier that returned True for every input would have 100% recall (no false negatives) but low precision.

\frac{TP}{TP + FN}
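A sketch computing precision, recall, and the F1 score defined above directly from raw counts (the counts themselves are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# E.g. 40 true positives, 10 false positives, 20 false negatives.
print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```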

Recurrent Neural Network

Or RNNs, are DNNs that contain a cycle. In practice, this cycle only “recurs” k times: the network is unrolled over a fixed number of time steps.

One big downside to RNNs is that their training cannot be parallelized across time steps. RNNs also require a huge amount of memory to train because backpropagation through time must keep the state of all variables for each of the k steps in memory.

ROC Curve

For a binary classifier, a graph of the false positive rate (x-axis) against the true positive rate (y-axis). The area under the ROC curve (ROC-AUC) is a useful way of comparing the relative performance of different models.
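A sketch with scikit-learn (labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # points on the ROC curve
print(roc_auc_score(y_true, y_score))  # area under the curve; 1.0 is perfect
```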

Sensitivity

Is also known as the True Positive Rate. It’s the number of true positives over the number of true positives plus false negatives.

\frac{TP}{TP + FN}

Softmax

A function that generalizes logistic regression to multiple classes, so you don’t have to train a separate binary classifier for each class. Suitable for multinomial regression.

The softmax function is also called the normalized exponential. It is used to highlight the largest values and suppress any values that are significantly smaller than the largest.
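A minimal, numerically stable sketch in NumPy (the input scores are arbitrary):

```python
import numpy as np

def softmax(logits):
    """Normalized exponential: maps arbitrary scores to probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax([2.0, 1.0, 0.1]))  # ~[0.66, 0.24, 0.10]; the largest score dominates
```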

TensorFlow

A general platform for distributed graphical computation in disguise as a machine learning library.

TFX

A set of technologies for setting up training and deployment pipelines for ML models.

Type I and Type II Errors

Type I errors are false positives.
Type II errors are false negatives.

The type I error rate, or false positive rate, is denoted by the letter α.

\alpha = \frac{FP}{FP + TN}

The type II error rate is denoted by the letter β.

\beta = \frac{FN}{TP + FN}
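A sketch computing both rates from confusion-matrix counts (the counts are made up):

```python
def error_rates(tp, fp, tn, fn):
    """Type I rate (alpha) and type II rate (beta) from confusion-matrix counts."""
    alpha = fp / (fp + tn)  # false positive rate
    beta = fn / (tp + fn)   # false negative rate
    return alpha, beta

print(error_rates(tp=40, fp=10, tn=90, fn=20))  # (0.1, 0.333...)
```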

Vanishing gradient problem

Early layers in a neural network learn much more slowly (often orders of magnitude slower) than later layers, because the gradient shrinks each time it is propagated backward through a layer, so very little learning signal reaches the earliest layers.