Measuring My Chinese Progress

Last summer I started learning Mandarin Chinese. I began by taking classes at a Chinese language school in SF. For more practice, I started an Instagram account, @jeffcarp_zh, and tried writing a couple of blog posts.

Almost a year later, I’m still going to Chinese class on a semi-regular basis (1 hour a week, except when I’m taking a break), and I keep up a daily spaced-repetition flashcard habit using the Pleco Chinese dictionary app (usually on the train into work).

Since I’m not in college, I don’t have a semester or quarter system to add structure to my learning. So recently I’ve been looking for ways to reliably measure my Chinese learning progress to celebrate wins and keep me motivated to continue studying.

Flashcard Count

To start my investigation, I downloaded my Pleco backup file, which contains all my flashcards. I currently have 1,862 flashcards.


The standardized Chinese proficiency test is the HSK, or Hànyǔ Shuǐpíng Kǎoshì (汉语水平考试; over-literal translation: Chinese language water level test 😄). There are 6 HSK levels; level 1 is the easiest. Getting a job in China seems to require around level 4 or 5. The number of vocabulary words increases drastically as the levels go up: the required vocabulary roughly doubles at each level, from 150 words for level 1 to 5,000 for level 6.

Current HSK Level

The HSK is offered twice a year in SF and costs between $20 and $70, so taking it frequently isn’t a good option for regularly measuring my progress. However, given the HSK vocab lists and my flashcards, I can quickly check how my vocabulary stacks up.

I also worked out the percentage of each HSK level’s vocabulary that I’ve covered. Now that I’ve added all the HSK 1-3 vocabulary to my flashcards, I’m about 20% of the way through HSK 4.
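The per-level check can be sketched in a few lines of Python. This is only an illustration, not the script I actually ran: the function name `hsk_coverage` and the tiny word lists are made up for the example, and it assumes the flashcards and HSK lists have already been loaded as plain lists of words.

```python
def hsk_coverage(known_words, hsk_levels):
    """Return, per HSK level, the percent of that level's words in known_words."""
    known = set(known_words)
    coverage = {}
    for level, words in hsk_levels.items():
        level_words = set(words)
        if not level_words:
            # Avoid dividing by zero for an empty list.
            coverage[level] = 0.0
            continue
        coverage[level] = 100.0 * len(level_words & known) / len(level_words)
    return coverage


# Toy data for illustration only; these are not the real HSK lists.
hsk_levels = {
    1: ["我", "你", "好", "是"],
    2: ["因为", "所以", "但是", "已经"],
}
flashcards = ["我", "你", "好", "是", "因为", "哈哈"]

coverage = hsk_coverage(flashcards, hsk_levels)
for level in sorted(coverage):
    print(f"HSK {level}: {coverage[level]:.0f}%")
```

Using sets means duplicate flashcards and words outside the HSK lists (like 哈哈 here) don't affect the result.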

How Does This Translate to Real-World Chinese Progress?

I’m not learning Chinese to pass the HSK. I’m learning Chinese to talk with people who speak Chinese. So the overarching question I want to answer is: given my current vocabulary, how much of day-to-day spoken Chinese can I understand?

I thought a great approximation for this could be Weibo, the Chinese version of Twitter. I downloaded a dataset with 1.1 million “tweets” to see how my vocabulary stacks up. The top 10 most common words in the dataset were those you might expect:

Word | Pinyin | Percentile | Meaning
的 | de | 100 | (possessive particle)
我 | wǒ | 99.99 | I
你 | nǐ | 99.9979 | you
了 | le | 99.9968 | (completed action marker)
是 | shì | 99.9958 | to be
啊 | ā | 99.9947 | interjection (phonetic)
不 | bù | 99.9937 | no
哈哈 | hāhā | 99.9926 | haha
好 | hǎo | 99.9916 | good
有 | yǒu | 99.9905 | to have
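A ranking like this can be sketched with Python's `collections.Counter`. The sketch makes two assumptions that are not spelled out above: that each post has already been segmented into words (Chinese has no spaces, so a separate segmentation step is needed first), and that "percentile" means a word's rank among all distinct words, with the most common word at 100. The function name `rank_words` and the sample posts are invented for the example.

```python
from collections import Counter


def rank_words(segmented_posts):
    """Rank distinct words by frequency.

    Returns a list of (word, count, percentile) tuples, most common first.
    The percentile is rank-based: the top word gets 100, and each step down
    the ranking lowers it by 100 / (number of distinct words).
    """
    counts = Counter(word for post in segmented_posts for word in post)
    ordered = counts.most_common()  # sorted by count, descending
    total = len(ordered)
    return [
        (word, count, 100.0 * (total - rank) / total)
        for rank, (word, count) in enumerate(ordered)
    ]


# Toy posts, already split into words (an assumption of this sketch).
posts = [
    ["我", "的", "猫"],
    ["你", "的", "狗"],
    ["我", "的", "书"],
]

ranked = rank_words(posts)
for word, count, percentile in ranked[:3]:
    print(word, count, round(percentile, 2))
```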

Weighting every word in the dataset by its relative frequency, I found that my current vocabulary covers about 63.58% of the words that appear on Weibo.
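Roughly, the calculation adds up the occurrences of every word I know and divides by the total number of word occurrences in the dataset. Here is a minimal sketch of that idea; the function name `weighted_coverage` and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def weighted_coverage(word_counts, known_words):
    """Percent of all word occurrences that belong to known words."""
    known = set(known_words)
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in word_counts.items() if word in known)
    return 100.0 * covered / total


# Made-up counts for illustration; not taken from the Weibo dataset.
word_counts = Counter({"的": 50, "我": 30, "哈哈": 15, "区块链": 5})
my_vocabulary = {"的", "我"}

pct = weighted_coverage(word_counts, my_vocabulary)
print(f"{pct:.2f}% of word occurrences are known")
```

Because common words are counted once per occurrence, knowing a handful of very frequent words moves this number far more than knowing many rare ones.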

Using Weibo (and specifically this dataset) isn’t perfect—even if I can read all the words in a sentence, it doesn’t mean I’ll understand its meaning—so I’m making a lot of assumptions.

But I still think there is something here. Standardized language tests came from a time when we didn’t have access to huge datasets of current speech from millions of speakers for large-scale language analysis. The order of the HSK vocabulary words and the order of the most common Chinese words are completely different.

In a future blog post I’d really like to explore in further depth how well the HSK prepares you to read and write the most common day-to-day Chinese words, and ideas for learning based on word frequency.
