
Did you notice that Siri sounds a little more sprightly today? Apple’s ubiquitous virtual assistant has had a little virtual work done on her virtual vocal cords, and her newly dulcet-ized tones went live today as part of iOS 11[1]. (Check out a few more lesser-known iOS 11 features here[2].)

It turns out a lot of work went into this little upgrade. The old methods of creating speech from text produced the stilted voices we've all grown used to over the last decade or two. Basically, you took a big library of voice sounds — "ah," "ess," and so on — and stuck them together to make words.
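For the curious, here's roughly what that old-school concatenation looks like in code. This is a minimal sketch, assuming a hypothetical library of pre-recorded unit waveforms; the names and placeholder audio are illustrative, not anything from Apple:

```python
import numpy as np

# Hypothetical library of pre-recorded unit waveforms, keyed by phone symbol.
# Real systems store actual recorded audio; zeros stand in here.
unit_library = {
    "HH": np.zeros(800),
    "AH": np.zeros(1600),
    "L":  np.zeros(900),
    "OW": np.zeros(2000),
}

def synthesize(phones):
    """Glue the stored unit for each phone end to end.
    No smoothing at the joins, which is a big part of why the result
    sounds stilted."""
    return np.concatenate([unit_library[p] for p in phones])

audio = synthesize(["HH", "AH", "L", "OW"])  # roughly, "hello"
```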

The new way, like everything else these days, involves machine learning. Apple detailed the technique[3] earlier in the year (published[4], even), but it's worth recounting here. First, Apple recorded more than 20 hours of a "new voice talent" performing tons of scripted speech: books, jokes, answers to questions.

A sentence being broken down into pieces.

That speech was then segmented into tiny pieces called half-phones; phones are the smallest sounds that make up speech, but of course they can be said in different ways — rising, falling, quicker, slower, with more or less aspiration, that kind of thing. Half-phones… well, obviously, they’re half a phone.
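To make that concrete, here's one way those segmented half-phones might be represented. The field names are hypothetical, not from Apple's paper: each recorded phone contributes a left half and a right half, along with the prosodic context (duration, pitch, energy) that distinguishes one rendition from another:

```python
from dataclasses import dataclass

@dataclass
class HalfPhone:
    phone: str         # e.g. "AH"
    half: str          # "left" or "right"
    duration_ms: float # how long this rendition lasts
    pitch_hz: float    # fundamental frequency, captures rising/falling
    energy: float      # loudness, a rough proxy for aspiration
    samples: tuple     # the actual audio slice cut from the recordings

# One recorded "AH" yields two half-phones, each with its own context.
left = HalfPhone("AH", "left", 45.0, 180.0, 0.62, samples=())
right = HalfPhone("AH", "right", 50.0, 172.0, 0.58, samples=())
```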

All these tiny sound pieces were run through a machine learning model that figures out more or less which piece makes sense in which situation. This type of “er” sound when starting a sentence, that type when ending a sentence — that kind of thing. (Google’s WaveNet did something like this[5] by reconstructing voice sample by sample, which Apple’s researchers acknowledge,...
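Under the hood, that kind of selection is usually framed as a search: a model scores how well each candidate half-phone fits the target context (a target cost) and how smoothly it joins its neighbor (a join cost), and the cheapest path through the candidates wins. Here's a minimal, hypothetical sketch of that search using dynamic programming; the toy cost functions at the bottom stand in for whatever the learned model actually predicts:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one recorded unit per target position so the summed target
    and join costs are as low as possible (a Viterbi-style search).

    targets:    list of desired contexts, one per position
    candidates: list of lists of recorded units, one inner list per position
    """
    # best[i][j] = (lowest total cost ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)

    # Trace the cheapest path back from the final position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chosen = []
    for i in range(len(best) - 1, -1, -1):
        chosen.append(candidates[i][j])
        if i > 0:
            j = best[i][j][1]
    return list(reversed(chosen))

# Toy usage: two positions, costs are just absolute pitch differences.
targets = [{"pitch": 180.0}, {"pitch": 170.0}]
candidates = [[{"pitch": 175.0}, {"pitch": 190.0}],
              [{"pitch": 168.0}, {"pitch": 200.0}]]
picks = select_units(
    targets, candidates,
    target_cost=lambda t, u: abs(t["pitch"] - u["pitch"]),
    join_cost=lambda a, b: abs(a["pitch"] - b["pitch"]),
)
```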
