The Most Commonly Used Chinese Words

I think the most effective way to learn a language is to prioritize learning the day-to-day most frequently used words. Picking words to study in order of frequency is the optimal way to maximally increase your marginal understanding of the language for each successive word you learn.

To that end, I took a dataset of Weibo (China’s Twitter) posts and ranked all words that appeared in the dataset order of frequency.

The CSV contains a “back” column so you can also load it into Anki and use it as flashcards. The dataset is thanks to Dr. Minlie Huang at Tsinghua University. You can find the dataset here.

Let me know if this is useful!

Code to Generate Top Word List

If you’re interested in how this was generated and want to remix the code, it’s here on Colab, as well as written out below.

import pandas as pd
!pip install chinese
!pip install pinyin
Collecting chinese
[?25l  Downloading https://files.pythonhosted.org/packages/15/fe/35c1cd7792f0c899fbeae66d35491721cae6be6d8a128d4f77e6e3479b3a/chinese-0.2.1-py3-none-any.whl (12.6MB)
     |████████████████████████████████| 12.6MB 2.8MB/s 
[?25hCollecting pynlpir
[?25l  Downloading https://files.pythonhosted.org/packages/7c/66/79d353119143f92fdf80aea0e8b5b8289baf60708a3202fc7a4d3a530d0e/PyNLPIR-0.6.0-py2.py3-none-any.whl (13.1MB)
     |████████████████████████████████| 13.1MB 1.3MB/s 
[?25hRequirement already satisfied: jieba in /usr/local/lib/python3.6/dist-packages (from chinese) (0.42.1)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from pynlpir->chinese) (7.0)
Installing collected packages: pynlpir, chinese
Successfully installed chinese-0.2.1 pynlpir-0.6.0
Requirement already satisfied: pinyin in /usr/local/lib/python3.6/dist-packages (0.4.0)
from chinese import ChineseAnalyzer
analyzer = ChineseAnalyzer()
result = analyzer.parse('就')
print(result.tokens())
print(result.pinyin())

import pinyin
import pinyin.cedict
['就']
jiù
!curl -O http://coai.cs.tsinghua.edu.cn/media/files/ecm_train_data.zip
!unzip ecm_train_data.zip
!du -csh *
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.7M  100 44.7M    0     0   294k      0  0:02:35  0:02:35 --:--:--  218k
Archive:  ecm_train_data.zip
  inflating: train_data.json         
45M	ecm_train_data.zip
55M	sample_data
126M	train_data.json
225M	total
import json
with open('train_data.json') as f:
  weibo_posts = json.load(f)
print('Num posts:', len(weibo_posts))
Num posts: 1119207
# Preview Weibo posts.
for post in weibo_posts[:10]:
  print(post)
[['希望 九 哥 日日 开心 同 我 打 羽毛 波\n', 1], ['加 埋 哦 ! 句 需要 运动 !\n', 0]]
[['哈哈 、 生日 快乐 。 我 地 居然 同一 日 生日\n', 5], ['哈哈 … 生日 快乐 。\n', 5]]
[['有人 问 , 何时 去 北京 演出 呢 ? 正好 你 回答 一下 。\n', 0], ['北京 ? 争取 2011 年底 。 回答 完毕\n', 0]]
[['这 美 ?\n', 1], ['哈哈 ~ ~\n', 5]]
[['一定 要 支持 ~\n', 1], ['谢谢 支持 祝 你 好运 哦\n', 1]]
[['肿 么 呢 ? 青奈 滴 ? 吃 点 退烧 药\n', 3], ['这 都 不是 重点 , 马 春然 完美 演出 才 是 正经\n', 1]]
[['粗粗 吧 , 我 觉得 还 好 啊 。 你 回来 没 啊 上报 哥 !\n', 2], ['怎么 叫 上报 哥 啊 看 我 八 月 的 排班 怎样 有 时间 就 回去\n', 4]]
[['我 听 在 块\n', 0], ['今晚 你们 全都 要 来 老 公司 么 ?\n', 4]]
[['我 想 我 会 一直 孤单 , 过 着 孤单 的 生活\n', 2], ['原来 都 是 或 曾经 孤单 , 或 正在 孤单 的 主儿 。\n', 2]]
[['混 得 好 我 就 不 回来 了 , 你们 在 天朝 要 坚强 。 到 时候 一起 接 你们 过去 。\n', 0], ['要 奸 墙 , 不要 被 强奸 。\n', 4]]
import re

weibo_word_frequency = {}

for question, response in weibo_posts:
  words = question[0].split(' ') + response[0].split(' ')
  for word in words:
    word = word.strip()
    # Only include words that consist of Chinese characters.
    if re.match(r'[\u4e00-\u9fff]+', word):
      if word not in weibo_word_frequency:
        weibo_word_frequency[word] = 0

      weibo_word_frequency[word] += 1
      
weibo_word_frequency['北京']
8044
df = pd.DataFrame(data=weibo_word_frequency.items(), 
                   columns=['word', 'occurrences'])
df.head(10)
# Add frequency percentiles
df['percentile'] = df.rank(pct=True, numeric_only=True)

# What are the most frequently used words?
df.sort_values('percentile', ascending=False, inplace=True)
df.head(10)
# What are the least frequent words?
df.tail(5)
# Limit list to 2000 words
df = df[:2000]
print(df.describe())
df.head()
         occurrences   percentile
count    2000.000000  2000.000000
mean     7750.468500     0.989464
std     33741.542328     0.006088
min       880.000000     0.978922
25%      1297.000000     0.984193
50%      2100.000000     0.989464
75%      4473.250000     0.994732
max    772887.000000     1.000000
def word_to_pinyin(word):
  if not word:
    return ''
  pin = pinyin.get(word)
  return pin

word_to_pinyin('你')
'nǐ'
def word_to_definition(word):
  if not word:
    return ''

  definition = pinyin.cedict.translate_word(word)
  if not definition:
    return ''
  return '<br/>'.join(list(definition))

word_to_definition('是')
'variant of 是[shi4]<br/>(used in given names)'
df['pinyin'] = df['word'].apply(word_to_pinyin)
df['definition'] = df['word'].apply(word_to_definition)
df['back'] = df['word'].apply(lambda word: '<b>'+word_to_pinyin(word)+'</b><br/>'+word_to_definition(word))
df[['word', 'pinyin', 'percentile', 'definition']].head(50)
df.to_csv('most_common_5k_chinese_words_v6.csv')
!du -sh /content/most_common_5k_chinese_words_v6.csv
372K	/content/most_common_5k_chinese_words_v6.csv

Contents (top)

Comments