Build a Markov Chain Sentence Generator in 20 Lines of Python

A bot who can write a long letter with ease, cannot write ill.

—Jane Austen, Pride and Prejudice

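This post walks you step by step through writing a Markov chain from scratch in Python, in order to generate brand-new English sentences that read as if a real person wrote them. Pride and Prejudice by Jane Austen is the text we use to build the Markov chain. A runnable notebook version is available on Colab.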


Setup

First, download the full text of Pride and Prejudice.

# Download Pride and Prejudice and cut off the header.
!curl https://www.gutenberg.org/files/1342/1342-0.txt | tail -n+32 > /content/pride-and-prejudice.txt

# Preview the file.
!head -n 10 /content/pride-and-prejudice.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  707k  100  707k    0     0  1132k      0 --:--:-- --:--:-- --:--:-- 1130k
PRIDE AND PREJUDICE

By Jane Austen



Chapter 1


It is a truth universally acknowledged, that a single man in possession

Add the necessary imports.

import collections
import random
import re

import numpy as np

Build the Markov Chain

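Read the file into a string and split it into a list of words. Then we can use Python's handy defaultdict to create the Markov chain. To build the chain, take each word in the text and insert it into a dictionary keyed by the previous word, incrementing that word's counter in the inner dictionary each time. The result is a dictionary in which each key points to every word that follows it, along with the number of times each one occurs.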

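# Read the text from the file and tokenize it.
path = '/content/pride-and-prejudice.txt'
with open(path, encoding='utf-8') as f:
  text = f.read()
# Split on runs of non-word characters; the raw string keeps '\W' from
# being treated as an invalid string escape.
tokenized_text = [
    word
    for word in re.split(r'\W+', text)
    if word != ''
]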

# Create the graph: previous word -> Counter of the words that follow it.
markov_graph = collections.defaultdict(collections.Counter)

last_word = tokenized_text[0].lower()
for word in tokenized_text[1:]:
  word = word.lower()
  markov_graph[last_word].update([word])
  last_word = word

# Preview the graph.
limit = 3
for first_word in ('the', 'by', 'who'):
  next_words = list(markov_graph[first_word].keys())[:limit]
  for next_word in next_words:
    print(first_word, next_word)
the feelings
the minds
the surrounding
by jane
by a
by the
who has
who waited
who came
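
To make the structure easier to see, here is a minimal sketch that builds the same kind of graph from a single made-up sentence. The helper name `build_graph` and the sample sentence are assumptions for illustration only; they are not part of the original notebook. Pairing each word with its successor via `zip` should be equivalent to the `last_word` loop above.

```python
import collections
import re


def build_graph(text):
  """Builds a first-order word graph: previous word -> Counter of next words."""
  words = [w.lower() for w in re.split(r'\W+', text) if w]
  graph = collections.defaultdict(collections.Counter)
  # Pair every word with the word that immediately follows it.
  for prev_word, next_word in zip(words, words[1:]):
    graph[prev_word][next_word] += 1
  return graph


# A made-up sample sentence, not taken from the novel.
sample = "The cat saw the dog, and the cat ran."
toy_graph = build_graph(sample)

print(dict(toy_graph['the']))  # {'cat': 2, 'dog': 1}
print(dict(toy_graph['cat']))  # {'saw': 1, 'ran': 1}
```

Here "the" is followed by "cat" twice and by "dog" once, so a weighted walk starting at "the" would be expected to pick "cat" about two thirds of the time. Note that "ran" never becomes a key, because nothing follows it in the sample.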

Generate Sentences

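Now for the fun part. Define a function to help us walk the chain. It starts at a random word and then, from the words that can follow it, makes a weighted random choice using np.random.choice.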

def walk_graph(graph, distance=5, start_node=None):
  """Returns a list of words from a weighted random walk."""
  if distance <= 0:
    return []

  # If no start node is given, pick one at random.
  if not start_node:
    start_node = random.choice(list(graph.keys()))

  # Stop early at a dead end (a word with no recorded successor).
  successors = graph.get(start_node)
  if not successors:
    return []

  weights = np.array(
      list(successors.values()),
      dtype=np.float64)
  # Normalize the word counts so they sum to 1.
  weights /= weights.sum()

  # Pick the destination using the weighted distribution.
  choices = list(successors.keys())
  chosen_word = np.random.choice(choices, None, p=weights)

  return [chosen_word] + walk_graph(
      graph, distance=distance-1,
      start_node=chosen_word)

for i in range(10):
  print(' '.join(walk_graph(
      markov_graph, distance=12)), '\n')
was with each other of communication it kitty and such a doubt 

when the country ensued made for she cried miss elizabeth that had 

it would have taken a valuable neighbour lady s steady friendship replied 

on these recollections that he considered as well is but her companions 

and laugh that i only headstrong and what lydia s mr darcy 

till supper his it a part us yesterday se nnight elizabeth had 

on that he that whatever she thus addressed them that he might 

countenance of both joy jane when it which mr darcy was suddenly 

woods to me you know him at five years longer be adapted 

unless charlotte s letter though she did before they must give her 
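
The weighted choice at the heart of the walk can be looked at in isolation. The sketch below uses hypothetical follower counts (made up for illustration, not taken from the novel) and swaps in the standard library's `random.choices` in place of `np.random.choice`. Unlike the NumPy call, `random.choices` accepts relative weights, so the raw counts can be passed without normalizing them first.

```python
import collections
import random

# Hypothetical follower counts for a single word (made up for illustration).
followers = collections.Counter({'darcy': 6, 'bennet': 3, 'collins': 1})

words = list(followers.keys())
counts = list(followers.values())

# The probability of each follower is its count divided by the total count.
total = sum(counts)
probabilities = [count / total for count in counts]
print(dict(zip(words, probabilities)))

# random.choices takes relative weights, so the raw counts work directly.
rng = random.Random(0)  # Seeded only to make the sketch repeatable.
samples = rng.choices(words, weights=counts, k=10000)
observed = collections.Counter(samples)
print({word: observed[word] / len(samples) for word in words})
```

With enough samples, the observed frequencies should land close to the computed probabilities, which is the behavior the walk relies on when it favors common word pairs.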

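And that's a basic Markov chain! There are many enhancements you could make from here, but hopefully this shows that you can implement a Markov chain text generator in just a few dozen lines of Python.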
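
As one possible enhancement (an assumption on our part, not something the original notebook includes): walk_graph is recursive, so very long walks can run into Python's recursion limit, which defaults to 1000 in CPython. A loop-based variant avoids that. The name `walk_graph_iterative`, its `rng` parameter, and the toy graph below are hypothetical, and the sketch uses `random.choices` in place of `np.random.choice`.

```python
import collections
import random


def walk_graph_iterative(graph, distance=5, start_node=None, rng=random):
  """Returns up to `distance` words from a weighted random walk, using a loop."""
  if not graph:
    return []
  current = start_node or rng.choice(list(graph.keys()))
  words = []
  for _ in range(distance):
    followers = graph.get(current)
    if not followers:
      break  # Dead end: no word was ever seen after `current`.
    # Counts act as relative weights; choices() returns a one-item list.
    current = rng.choices(
        list(followers.keys()), weights=list(followers.values()))[0]
    words.append(current)
  return words


# Tiny made-up graph; in the post this would be markov_graph.
toy_graph = {
    'a': collections.Counter({'b': 1}),
    'b': collections.Counter({'a': 1}),
}
walk = walk_graph_iterative(toy_graph, distance=5000, start_node='a')
print(len(walk), ' '.join(walk[:6]))  # 5000 b a b a b a
```

Because each word in the toy graph has exactly one follower, the walk is deterministic here; on the real graph the output would vary from run to run.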
