Build a Markov Chain Sentence Generator in 20 lines of Python

A bot who can write a long letter with ease, cannot write ill.

—Jane Austen, Pride and Prejudice

This post walks you through how to write a Markov Chain from scratch with Python in order to generate completely new sentences that resemble English.

The text we’ll be using to build the Markov Chain is Pride and Prejudice by Jane Austen. You can follow along here or grab a runnable notebook version of this post on Colab.

Setup

First, download the full text of Pride and Prejudice and trim off the Project Gutenberg header.

# Download Pride and Prejudice and cut off header.
!curl https://www.gutenberg.org/files/1342/1342-0.txt | tail -n+32 > /content/pride-and-prejudice.txt
  
# Preview file.
!head -n 10 /content/pride-and-prejudice.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  707k  100  707k    0     0  1132k      0 --:--:-- --:--:-- --:--:-- 1130k
PRIDE AND PREJUDICE

By Jane Austen



Chapter 1


It is a truth universally acknowledged, that a single man in possession

Add some necessary imports.

import numpy as np
import random
import re
from collections import defaultdict

Building the Markov Chain

Read the file into a string and split the words into a list. Then, we can use Python’s handy defaultdict to create the Markov Chain.

To build the chain, take each word in the text and insert it into the dictionary keyed on the word before it, incrementing the count for that pairing each time. The result is a dictionary where each key points to every word that has ever followed it, along with how many times it did.
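To make the structure concrete, here is the same idea on a toy sentence first (a minimal sketch using the defaultdict imported above; the sentence and variable names are made up for illustration):

toy_words = 'the cat saw the dog'.split()
toy_graph = defaultdict(lambda: defaultdict(int))
for prev, curr in zip(toy_words, toy_words[1:]):
  toy_graph[prev][curr] += 1
# 'the' is followed once each by 'cat' and 'dog'.
print(dict(toy_graph['the']))  # {'cat': 1, 'dog': 1}

Now the same loop over the full novel: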

# Read text from file and tokenize.
path = '/content/pride-and-prejudice.txt'
with open(path) as f:
  text = f.read()
tokenized_text = [
    word
    for word in re.split(r'\W+', text)
    if word != ''
]
 
# Create graph.
markov_graph = defaultdict(lambda: defaultdict(int))

last_word = tokenized_text[0].lower()
for word in tokenized_text[1:]:
  word = word.lower()
  markov_graph[last_word][word] += 1
  last_word = word

# Preview graph.
limit = 3
for first_word in ('the', 'by', 'who'):
  next_words = list(markov_graph[first_word].keys())[:limit]
  for next_word in next_words:
    print(first_word, next_word)
the feelings
the minds
the surrounding
by jane
by a
by the
who has
who waited
who came
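The inner dictionaries also carry counts, which become the sampling weights in the next section. As a quick sanity check, we can peek at the most common follow-words for 'mr' (the exact numbers depend on the Gutenberg snapshot you downloaded):

top_3 = sorted(markov_graph['mr'].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_3)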

Generating Sentences

Now for the fun part. Define a function to help us walk the chain. It starts at a random word, and at each step it makes a weighted random choice among the possible next words using np.random.choice.
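Before the full function, here is the weighted pick in isolation (a minimal sketch with made-up counts):

counts = {'cat': 3, 'dog': 1}
weights = np.array(list(counts.values()), dtype=np.float64)
weights /= weights.sum()  # [0.75, 0.25]
print(np.random.choice(list(counts.keys()), p=weights))  # 'cat' about 75% of the time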

def walk_graph(graph, distance=5, start_node=None):
  """Returns a list of words from a randomly weighted walk."""
  if distance <= 0:
    return []

  # If not given, pick a start node at random.
  if not start_node:
    start_node = random.choice(list(graph.keys()))

  # Stop early if this word never appears before another word.
  if not graph[start_node]:
    return []

  weights = np.array(
      list(graph[start_node].values()),
      dtype=np.float64)
  # Normalize word counts to sum to 1.
  weights /= weights.sum()

  # Pick a destination using weighted distribution.
  choices = list(graph[start_node].keys())
  chosen_word = np.random.choice(choices, None, p=weights)

  return [chosen_word] + walk_graph(
      graph, distance=distance-1,
      start_node=chosen_word)
  
for i in range(10):
  print(' '.join(walk_graph(
      markov_graph, distance=12)), '\n')
was with each other of communication it kitty and such a doubt 

when the country ensued made for she cried miss elizabeth that had 

it would have taken a valuable neighbour lady s steady friendship replied 

on these recollections that he considered as well is but her companions 

and laugh that i only headstrong and what lydia s mr darcy 

till supper his it a part us yesterday se nnight elizabeth had 

on that he that whatever she thus addressed them that he might 

countenance of both joy jane when it which mr darcy was suddenly 

woods to me you know him at five years longer be adapted 

unless charlotte s letter though she did before they must give her 

And that’s it for a basic Markov Chain! There are a lot of enhancements that could be made from here, but hopefully this demonstrates that you can implement a Markov Chain text generator with just a couple dozen lines of Python.
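One common enhancement, for example, is to key the chain on pairs of words instead of single words, which tends to make the output noticeably more coherent. A minimal sketch of that variant, reusing tokenized_text from above (the names here are illustrative, not from the code above):

# Key the graph on the previous two words instead of one.
bigram_graph = defaultdict(lambda: defaultdict(int))
words = [w.lower() for w in tokenized_text]
for w1, w2, w3 in zip(words, words[1:], words[2:]):
  bigram_graph[(w1, w2)][w3] += 1

# Walk it the same way, sliding the two-word state forward.
state = random.choice(list(bigram_graph.keys()))
sentence = list(state)
for _ in range(10):
  followers = bigram_graph[state]
  if not followers:  # Reached a dead end.
    break
  weights = np.array(list(followers.values()), dtype=np.float64)
  next_word = np.random.choice(list(followers.keys()), p=weights / weights.sum())
  sentence.append(next_word)
  state = (state[1], next_word)
print(' '.join(sentence))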
