Show HN: I trained a 9M speech model to fix my Mandarin tones

https://news.ycombinator.com/rss Hits: 14

Summary

TL;DR: Mandarin pronunciation has been hard for me, so I took ~300 hours of transcribed speech and trained a small CTC model to grade my pronunciation. You can try it here.

In my previous post about Langseed, I introduced a platform for defining words using only vocabulary I had already mastered. My vocabulary has grown since then, but unfortunately, people still struggle to understand what I'm saying.

Part of the problem is tones. They're fairly foreign to me, and I'm bad at hearing my own mistakes, which is deeply frustrating when you don’t have a teacher.

First attempt: pitch visualisation

My initial plan was to build a pitch visualiser: split incoming audio into small chunks, run an FFT, extract the dominant pitch over time, and map it using an energy-based heuristic, loosely inspired by Praat.

But this approach quickly became brittle. There were endless special cases: background noise, coarticulation, speaker variation, voicing transitions, and so on.

And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

So instead, I decided to build a deep learning–based Computer-Assisted Pronunciation Training (CAPT) system that could run entirely on-device. There are already commercial APIs that do this, but hey, where’s the fun in that?

Architecture

I treated this as a specialised Automatic Speech Recognition (ASR) task. Instead of just transcribing text, the model needs to be pedantic about how something was said.

I settled on a Conformer encoder trained with CTC (Connectionist Temporal Classification) loss.

Why Conformer?

Speech is weird: you need to catch both local and global patterns.

Local interactions

The difference between a retroflex zh and an alveolar z happens in a split second. CNNs are excellent at capturing these short-range spectral features.

Global interactions

Mandarin tones ar...
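
For readers curious what the abandoned first attempt might have looked like, here is a minimal sketch of the pipeline described above: split the audio into frames, run an FFT on each, and take the strongest frequency bin as the pitch, skipping frames whose energy is too low. This is not the author's code. The function names (`fft`, `dominant_pitch`, `pitch_track`), the frame size, the 60 to 500 Hz search band, and the energy threshold are all assumptions chosen for illustration, and it uses only the Python standard library.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_pitch(frame, sample_rate, fmin=60.0, fmax=500.0, energy_floor=1e-4):
    """Return the strongest frequency (Hz) in [fmin, fmax], or None if the frame is too quiet."""
    n = len(frame)
    # Energy gate (assumed heuristic): treat quiet frames as silence.
    energy = sum(s * s for s in frame) / n
    if energy < energy_floor:
        return None
    # Hann window to reduce spectral leakage at the frame edges.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    spectrum = fft(windowed)
    # Only search bins that fall inside the assumed pitch band.
    lo = max(1, math.ceil(fmin * n / sample_rate))
    hi = min(n // 2, math.floor(fmax * n / sample_rate))
    peak = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
    return peak * sample_rate / n


def pitch_track(samples, sample_rate, frame_size=1024, hop=512):
    """Slide a window over the signal and estimate one pitch value per frame."""
    return [
        dominant_pitch(samples[i:i + frame_size], sample_rate)
        for i in range(0, len(samples) - frame_size + 1, hop)
    ]


if __name__ == "__main__":
    sr = 16000
    # Synthetic 220 Hz tone standing in for recorded speech.
    tone = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(4096)]
    print(pitch_track(tone, sr))
```

On a clean synthetic tone the estimate lands within one FFT bin of the true frequency (about 15.6 Hz at these settings). Real speech is far messier: the strongest bin is often a harmonic rather than the fundamental, and a fixed energy threshold mishandles noise and voicing transitions, which fits the brittleness the post describes.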

First seen: 2026-01-31 01:39

Last seen: 2026-01-31 14:41

