TL;DR: Mandarin pronunciation has been hard for me, so I took ~300 hours of transcribed speech and trained a small CTC model to grade my pronunciation. You can try it here.

In my previous post about Langseed, I introduced a platform for defining words using only vocabulary I had already mastered. My vocabulary has grown since then, but unfortunately, people still struggle to understand what I'm saying.

Part of the problem is tones. They're fairly foreign to me, and I'm bad at hearing my own mistakes, which is deeply frustrating when you don't have a teacher.

First attempt: pitch visualisation

My initial plan was to build a pitch visualiser: split incoming audio into small chunks, run an FFT, extract the dominant pitch over time, and map it using an energy-based heuristic, loosely inspired by Praat.

But this approach quickly became brittle. There were endless special cases: background noise, coarticulation, speaker variation, voicing transitions, and so on.

And if there's one thing we've learned over the last decade, it's the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

So instead, I decided to build a deep learning-based Computer-Assisted Pronunciation Training (CAPT) system that could run entirely on-device. There are already commercial APIs that do this, but hey, where's the fun in that?

Architecture

I treated this as a specialised Automatic Speech Recognition (ASR) task. Instead of just transcribing text, the model needs to be pedantic about how something was said.

I settled on a Conformer encoder trained with CTC (Connectionist Temporal Classification) loss.

Why Conformer?

Speech is weird: you need to catch both local and global patterns.

Local interactions: The difference between a retroflex zh and an alveolar z happens in a split second. CNNs are excellent at capturing these short-range spectral features.
Global interactions: Mandarin tones ar...
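
For readers who have not met CTC before, here is a minimal sketch of how a CTC-trained encoder's per-frame outputs turn into a token sequence. It shows greedy decoding only: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The token names (`zh`, `ong1`), the `<blank>` label, and the function name `ctc_greedy_collapse` are made up for illustration. The post does not say which token inventory or decoding strategy the actual system uses.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame best labels into a token sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank symbols.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Made-up per-frame argmax labels for one spoken syllable (assumed tokens).
frames = [BLANK, "zh", "zh", BLANK, "ong1", "ong1", "ong1", BLANK]
print(ctc_greedy_collapse(frames))  # ['zh', 'ong1']

# A blank between two identical labels keeps them as separate tokens.
print(ctc_greedy_collapse(["a", "a", BLANK, "a"]))  # ['a', 'a']
```

Because the blank symbol absorbs the frames where nothing new is emitted, the model never needs frame-level alignments during training, only the target token sequence.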