How are Words Represented in Machine Learning? Part 1

Machine learning on human languages is a super exciting space right now. Applications are exploding—just think of how many natural language ML models it takes to run a smart assistant, from transforming spoken audio to text, to finding the exact part of a web page that answers your question, to choosing the correct words with the correct grammar to reply to you.

At work I recently had the opportunity to build an NLP system. Without much previous NLP experience, I had no idea where to begin. The initial roadblock I ran into was: okay, I have a mountain of text… but how do I do machine learning on it???

Machine learning algorithms deal with numbers, not language. The fundamental question I want to focus on in this series is how, exactly, do you turn this:

“the quick brown fox jumps over the lazy dog”\text{``the quick brown fox jumps over the lazy dog''}

into this:

[0.4,0.1,0.0,0.1,0.3][0.4, 0.1, 0.0, 0.1, 0.3]

Part 1 covers the basics: Bag of Words, the Hashing Trick, and TF-IDF.

Part 2 will cover more advanced approaches such as neural word embeddings, like Word2Vec and GloVe. Part 3 will cover the state of the art of human language modeling in 2019, like transformer models and BERT. All example code is in Python 3 and available on Colab.

Bag of Words

What’s the simplest way you could turn a string of text into a list of numbers?

You could sum up all occurences of each word in a document. Here’s a simple implementation. First, build a vocabulary of all words you’ll need:

def build_vocabulary(corpus):
  vocabulary = set()
  for document in corpus:
    for word in document.split(' '):
      vocabulary.add(word)
  return list(vocabulary)

vocabulary = build_vocabulary([
  'the quick brown fox jumps over the lazy dog',
])

print(vocabulary)

Then you can transform a piece of text using the vocabulary:

def bow_transform(vocabulary, document):
  # Initialize a list of zeros.
  output = [0] * len(vocabulary)

  for word in document.split(' '):
    try:
      index = vocabulary.index(word)
      output[index] += 1
    except ValueError:
      continue

  # Normalize output vector so all values sum to 1.
  output = [num / sum(output) for num in output]
  return output

transformed_vector = bow_transform(
    vocabulary, 'the quick brown fox jumps over the lazy dog')

for i in range(len(vocabulary)):
  print(vocabulary[i], f'{transformed_vector[i]:.2f}')

Running bow_transform for the example sentence outputs:

lazy 0.11
jumps 0.11
quick 0.11
brown 0.11
dog 0.11
the 0.22
over 0.11
fox 0.11

Notice how the word “the” becomes 0.22 since it’s repeated twice in the input.

You could stop here. This is a perfectly valid approach and is used by many production ML systems. However, it has 3 big downsides:

  1. Each piece of text will be a different length, so your ML algorithm must choose a fixed dimensionality, then pad shorter documents and cut off longer documents.
  2. The model can’t handle words that it hasn’t seen before in training—it must throw them out.
  3. When you handle any non-trivial amount of text, your vocabulary becomes huge, which causes your input to become really sparse. And sparse input is hard to learn. This is called The Curse of Dimensionality.

The Hashing Trick

To address all three problems above, you can use the hashing trick.

Pick a number kk , say 5. That number becomes the fixed dimensionality of your input vectors. To transform a piece of text into a numerical vector, you apply a kernel function to each word to map it into the fixed vector. For instance, you could hash the word to a number between 0 and kk , and add 1 to each resulting index in the output vector—i.e. output[hash(word) % k] += 1.

This approach has the benefit of not requiring you to build a vocabulary and keep it around when you’re doing both training and inference, so only the transform function is needed. Here’s a simple example implementation in Python:

def kernel_trick_transform(k, string):
  output = [0] * k
  for word in string.split(' '):
    output[hash(word) % k] += 1

  # Normalize output vector so all values sum to 1.
  output = [num / sum(output) for num in output]
  return output

print(kernel_trick_transform(5, 'the quick brown fox jumps over the lazy dog'))

The original string, transformed, becomes:

[0.4444444444444444, 0.1111111111111111, 0.0, 0.1111111111111111, 0.3333333333333333]

One thing to watch out for is picking a kk that’s big enough such that hash collisions don’t happen too often. In practice, I’ve found that kk between 500 to 1000 works fine, but your use case and data may vary.

The hashing trick is great. However, it still has one big downside: it treats all words equally. Common words like “the” have the same weight as more relevant words. There is, fortunately, a way to weight rarer words higher.

TF-IDF

To weight uncommon words with more importance, you can use TF-IDF, which stands for Text Frequency * Inverse Document Frequency.

Using TF-IDF, the output value of each word ii in document jj becomes:

wi,j=times word i appears in document jtimes word i appears in all documentsw_{i,j} = \frac{\text{times word \textit{i} appears in document \textit{j}}}{\text{times word \textit{i} appears in all documents}}

To implement TF-IDF, first build a vocabulary dictionary that maps words to number of appearances in all documents. I’m adding more examples to the corpus below to illustrate the effects of TF-IDF:

def build_tfidf_vocabulary(corpus):
  vocabulary = dict()
  for document in corpus:
    for word in document.split(' '):
      if word not in vocabulary:
        vocabulary[word] = 0
      vocabulary[word] += 1
  return vocabulary

vocabulary = build_tfidf_vocabulary([
  'the quick brown fox jumps over the lazy dog',
  'waxy and quivering jocks fumble the pizza',
  'few quips galvanized the mock jury box',
  'the quick onyx goblin jumps over the lazy dwarf',
])

print(vocabulary)
{'the': 6, 'quick': 2, 'brown': 1, 'fox': 1, 'jumps': 2, 'over': 2, 'lazy': 2, 'dog': 1, 'waxy': 1, 'and': 1, 'quivering': 1, 'jocks': 1, 'fumble': 1, 'pizza': 1, 'few': 1, 'quips': 1, 'galvanized': 1, 'mock': 1, 'jury': 1, 'box': 1, 'onyx': 1, 'goblin': 1, 'dwarf': 1}

Then to transform input documents, we can re-use the vocabulary builder to get the counts just for the document in question:

def tfidf_transform(vocabulary, document):
  # Initialize a vector of zeros.
  output = [0] * len(vocabulary)

  document_vocab = build_tfidf_vocabulary([document])

  for i, word in enumerate(vocabulary.keys()):
    if word in document_vocab:
      output[i] = document_vocab[word] / vocabulary[word]

  # Normalize output vector so all values sum to 1.
  output = [num / sum(output) for num in output]
  return output

transformed_vector = tfidf_transform(
    vocabulary, 'the quick brown fox jumps over the lazy dog')

for i in range(len(vocabulary)):
  print(list(vocabulary.keys())[i], f'{transformed_vector[i]:.2f}')

Outputs:

the 0.06
quick 0.09
brown 0.19
fox 0.19
jumps 0.09
over 0.09
lazy 0.09
dog 0.19
waxy 0.00
and 0.00
quivering 0.00
jocks 0.00
fumble 0.00
pizza 0.00
few 0.00
quips 0.00
galvanized 0.00
mock 0.00
jury 0.00
box 0.00
onyx 0.00
goblin 0.00
dwarf 0.00

Notice how brown, fox, and dog are rewarded for being unique only to the input document, and jumps and lazy are lower, since they appear in more documents.

A Final Note on Tokenization

One part that’s been oversimplified in this post is text tokenization, or breaking words apart. Using input.split(' ') in this post is an oversimplification. In practice, you might use something like nltk.

There’s also token normalization, where you stem or lemmatize words into their roots—e.g. “running” becomes “run”. You can read more about that here.


That’s it for Part 1. If you’d like to know when future parts come out, subscribe to my newsletter! And again, all code in this post is available on Colab.