Build a Markov Chain Sentence Generator in 20 lines of Python
A bot who can write a long letter with ease, cannot write ill.
—Jane Austen, Pride and Prejudice
This post walks you through how to write a Markov Chain from scratch with Python in order to generate completely new sentences that resemble English.
First download the full text of Pride and Prejudice.
# Download Pride and Prejudice and cut off header. !curl https://www.gutenberg.org/files/1342/1342-0.txt | tail -n+32 > /content/pride-and-prejudice.txt # Preview file. !head -n 10 /content/pride-and-prejudice.txt
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 707k 100 707k 0 0 1132k 0 --:--:-- --:--:-- --:--:-- 1130k PRIDE AND PREJUDICE By Jane Austen Chapter 1 It is a truth universally acknowledged, that a single man in possession
Add some necessary imports.
import numpy as np import random import re from collections import defaultdict
Building the Markov Chain
Read the file into a string and split the words into a list. Then, we can use Python’s handy defaultdict to create the Markov Chain.
To build the chain, take every word in the text and insert it into the dictionary where the key is the previous word, incrementing the counter for that word in the inner dictionary each time. This generates a dictionary where each key points to all the words that have ever followed it, along with the number of instances.
# Read text from file and tokenize. path = '/content/pride-and-prejudice.txt' with open(path) as f: text = f.read() tokenized_text = [ word for word in re.split('\W+', text) if word != '' ] # Create graph. markov_graph = defaultdict(lambda: defaultdict(int)) last_word = tokenized_text.lower() for word in tokenized_text[1:]: word = word.lower() markov_graph[last_word][word] += 1 last_word = word # Preview graph. limit = 3 for first_word in ('the', 'by', 'who'): next_words = list(markov_graph[first_word].keys())[:limit] for next_word in next_words: print(first_word, next_word)
the feelings the minds the surrounding by jane by a by the who has who waited who came
Now for the fun part. Define a function to help us walk the chain. It starts at a random word and of the possible choices for the next word, it takes a weighted random choice using np.random.choice.
def walk_graph(graph, distance=5, start_node=None): """Returns a list of words from a randomly weighted walk.""" if distance <= 0: return  # If not given, pick a start node at random. if not start_node: start_node = random.choice(list(graph.keys())) weights = np.array( list(markov_graph[start_node].values()), dtype=np.float64) # Normalize word counts to sum to 1. weights /= weights.sum() # Pick a destination using weighted distribution. choices = list(markov_graph[start_node].keys()) chosen_word = np.random.choice(choices, None, p=weights) return [chosen_word] + walk_graph( graph, distance=distance-1, start_node=chosen_word) for i in range(10): print(' '.join(walk_graph( markov_graph, distance=12)), '\n')
was with each other of communication it kitty and such a doubt when the country ensued made for she cried miss elizabeth that had it would have taken a valuable neighbour lady s steady friendship replied on these recollections that he considered as well is but her companions and laugh that i only headstrong and what lydia s mr darcy till supper his it a part us yesterday se nnight elizabeth had on that he that whatever she thus addressed them that he might countenance of both joy jane when it which mr darcy was suddenly woods to me you know him at five years longer be adapted unless charlotte s letter though she did before they must give her
And that’s it for a basic Markov Chain! There are a lot of enhancements that could be made from here, but hopefully this demonstrates that you can implement a Markov Chain text generator with just a couple dozen lines of Python.