How Apple Finally Made Siri Sound More Human – WIRED
The first time Alex Acero saw Her, he watched it like a normal person. The second time, he didn’t watch the movie at all. Acero, the Apple executive in charge of the tech behind Siri, sat there with his eyes closed, listening to how Scarlett Johansson voiced her artificially intelligent character Samantha. He paid attention to how she talked to Theodore Twombly, played by Joaquin Phoenix, and how Twombly talked back. Acero was trying to discern what about Samantha could make someone fall in love without ever seeing her.
When I ask Acero what he learned about why the voice worked so well, he laughs because the answer is so obvious. “It is natural!” he says. “It was not robotic!” This hardly counts as a revelation for Acero. Mostly, it confirmed that his team at Apple has spent the last few years on the right project: making Siri sound more human.
This fall, when iOS 11 hits millions of iPhones and iPads around the world, the new software will give Siri a new voice. It doesn’t include many new features or tell better jokes, but you’ll notice the difference. Siri now pauses more within sentences, elongates syllables right before a pause, and lilts up and down as it speaks. The words sound more fluid and Siri speaks more languages, too. It’s nicer to listen to, and to talk to.
Apple spent years re-architecting the technology behind Siri, transforming it from a virtual assistant into the catch-all term for all the artificial intelligence powering your phone. It has relentlessly expanded into new countries and languages (for all its faults, Siri’s by far the most worldly assistant on the market). And slowly at first but more quickly now, Apple has worked to make Siri available anywhere and everywhere. Siri now falls under the control of Craig Federighi, Apple’s head of software, indicating that Siri’s now as important to Apple as iOS.
It’ll still be a while before the tech’s good enough to make you fall in love with your virtual assistant. But Acero and his team think they’ve taken a giant leap forward. And they believe firmly that if they can make Siri sound less like a robot and more like someone you know and trust, they can make Siri great even when it fails. And that, in these early days of AI and voice technology, might be the best-case scenario.
Siri Grows Up
If you want a good example of why Apple likes to control everything about its products, just look at Siri. Six years after its launch, Siri has by most accounts fallen behind in the virtual assistant race. Amazon’s Alexa has more developer support; Google Assistant knows more stuff; both are available in many kinds of devices from many different companies.
Apple says it’s not its fault. When Siri first launched, another company provided the back-end technology for voice recognition. All signs point to Nuance as that company, though neither Apple nor Nuance ever confirmed a partnership. Whoever it was, Apple happily blames them for Siri’s early issues. “It was like running a race and, you know, somebody else was holding us back,” says Greg Joswiak, Apple’s VP of product marketing. Joswiak says Apple always had big plans for Siri, “this idea of an assistant you could talk to on your phone, and have it do these things for you in a more easy way,” but the tech just wasn’t good enough. “You know, garbage in, garbage out,” he says.
A few years ago, the team at Apple, led by Acero, took control of Siri’s back-end and revamped the experience. It’s now based on deep learning and AI, and has improved vastly as a result. Siri’s raw voice recognition rivals all its competitors, correctly identifying 95 percent of users’ speech. The AI works in two distinct and critical parts of the system: speech-to-text, in which Siri tries to figure out what you said; and text-to-speech, in which Siri speaks back.
One of Siri’s most important jobs is distinguishing your voice from everyone else’s, especially as these systems become more personalized. The more data Siri has, and the better Apple’s models become, the more it can discern between people and understand even heavy accents. It’s also a security concern: researchers recently found they could send commands to Siri at frequencies too high for humans to hear, rendering the hack inaudible. Siri needs to learn to separate human speech from machine speech, and your speech from everyone else’s.
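One simple defense against that kind of attack is to check where a signal’s energy lives before treating it as speech. The sketch below is a toy illustration of that idea, not Apple’s actual mitigation: the sample rate, cutoff, and test tone are invented, and the crude pure-Python DFT stands in for the proper signal processing a real system would use.

```python
import math

def band_energy(signal, sample_rate, f_lo, f_hi):
    """Crude DFT energy in [f_lo, f_hi) Hz -- pure Python, for illustration only."""
    n = len(signal)
    total = 0.0
    for k in range(n // 2):
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            total += re * re + im * im
    return total

def looks_ultrasonic(signal, sample_rate=96_000, cutoff=20_000):
    """Flag input whose energy sits mostly above the human-audible band."""
    audible = band_energy(signal, sample_rate, 50, cutoff)
    ultra = band_energy(signal, sample_rate, cutoff, sample_rate / 2)
    return ultra > audible

# A 25 kHz tone: inaudible to people, but well within a microphone's range.
tone = [math.sin(2 * math.pi * 25_000 * i / 96_000) for i in range(480)]
print(looks_ultrasonic(tone))  # True: reject it before it reaches the assistant
```

A speech recognizer guarded this way would simply refuse input that a human standing next to the phone could never have produced.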
Learn to Talk
One helpful way to understand how these systems work is through Apple’s process of teaching Siri a new language. When bringing Siri into a new market—say, Shanghai—the team first finds pre-existing databases of local speech. They supplement that by hiring local voice talent, and having them read books, newspapers, web articles, and more.
Apple’s team transcribes those recordings, matching words to sounds—and more importantly, identifying phonemes, the individual sounds that make up all speech. (In English, “fourteen” is a word; the toothy “e” sound in the middle is a phoneme.) They try to capture these phonemes spoken in every imaginable way: trailing off at the end of the word, harder at the beginning, longer before a pause, rising in a question. Each utterance has a slightly different sound wave, which Apple’s algorithms analyze to find the best fit for any given sentence. Every sentence Siri speaks contains dozens or hundreds of these phonemes, assembled like magazine cut-outs in a ransom note. It’s likely that none of the words you hear Siri say were actually recorded the way they’re spoken.
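In speech synthesis this cut-and-paste approach is called unit selection. The toy sketch below shows the core idea under heavy simplification: every phoneme, variant, and pitch value is invented, and a single pitch-matching cost stands in for the rich acoustic scoring a real system runs over an enormous database.

```python
# Each phoneme has several recorded variants, summarized here as
# (pitch at left edge, pitch at right edge) in Hz -- invented numbers.
UNITS = {
    "w":  [(110, 120), (130, 125)],
    "aa": [(118, 112), (126, 140)],
    "ch": [(114, 100), (138, 150)],
}

def join_cost(prev_unit, unit):
    """Penalize a pitch mismatch at the seam between consecutive units."""
    return abs(prev_unit[1] - unit[0])

def select_units(phonemes):
    """Dynamic programming over variants: minimize total seam mismatch."""
    # best maps a variant of the current phoneme -> (cumulative cost, path so far)
    best = {u: (0.0, [u]) for u in UNITS[phonemes[0]]}
    for ph in phonemes[1:]:
        nxt = {}
        for u in UNITS[ph]:
            cost, path = min(
                (c + join_cost(p, u), path) for p, (c, path) in best.items()
            )
            nxt[u] = (cost, path + [u])
        best = nxt
    return min(best.values())  # (total cost, chosen unit sequence)

cost, seq = select_units(["w", "aa", "ch"])  # a rough "watch"
print(cost, seq)  # 3.0 [(130, 125), (126, 140), (138, 150)]
```

The point of the dynamic program is exactly the wallpaper problem Acero describes later: it hunts for the sequence of recorded snippets whose seams line up best.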
Acero offers an example: “You want to watch this?” versus “I like your watch.” In the first case, Acero’s voice naturally ticks upward as he says “watch,” but moves down in the latter. “It’s the same word, but it sounds completely different,” Acero says. He couldn’t use the same recording of the word “watch,” or even the same individual phonemes, in both sentences. Systems that reuse them anyway sound like your old GPS navigating to “one Siiiix NINE fourteenth STREET PhilaDELphia.” It’s hard to listen to, especially for more than a few words at a time.
Even a few years ago, computers and servers didn’t offer enough processing power to pore over a vast database to find the perfect combination of sounds for every call and response. Now that they do, Acero and his team want as much data as possible. So once they’ve built an initial model, they roll out Siri in what they call “dictation-only mode.” You can’t talk to Siri, but you can tap the microphone button and dictate a text message or web search. This gives Apple’s machines inputs from many accents, different-quality microphones, and a variety of situations, all of which make Siri work better for more people. Apple collects (anonymously, it says) and transcribes that data, improving the algorithms and training the networks. They supplement with location-specific data and spoken customs—you’d say the score is three-zero in the US, but three-nil in the UK—and continue to refine the system until Siri has a near-perfect understanding both of what Shanghainese words are, and how people say them.
At the same time, Apple launches an epic search for the right voice talent. They begin with hundreds of people, all brought in to record a sampling of things Siri might say. Acero then works with Apple’s designers and user-interface team to decide which voices they like best. This part skews more art than science—they’re listening for some ineffable sense of helpfulness and camaraderie, spunky without being sharp, happy without being cartoonish.
The next part is all science. “There are many voice talents that sound good,” Acero says, “but it doesn’t mean they’d be a good text-to-speech voice.” They run speech through the models they’ve built looking for what’s called phoneme variability—essentially, the sound-wave difference between the left and right side of each tiny utterance. More variability within a phoneme makes it hard to stitch a lot of them together in a natural-sounding way, but you’d never hear the problems listening to them speak. Only the computer sees the difference. “It’s almost like when you’re doing wallpaper on a wall, and you have to look at the seams to make sure they line up,” Acero says.
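That screening step can be sketched in miniature. Below, a single per-take pitch number stands in for the full sound-wave comparison the article describes, and both candidates and their measurements are invented; the point is only the shape of the test—consistent phonemes splice together more smoothly.

```python
import statistics

# candidate voice -> phoneme -> pitch measurements (Hz) across many takes
takes = {
    "talent_a": {"aa": [120, 122, 119, 121], "s": [95, 96, 94, 95]},
    "talent_b": {"aa": [118, 140, 105, 131], "s": [90, 112, 87, 104]},
}

def variability(phoneme_takes):
    """Average per-phoneme standard deviation: lower = easier to stitch."""
    return statistics.mean(
        statistics.stdev(vals) for vals in phoneme_takes.values()
    )

scores = {name: variability(ph) for name, ph in takes.items()}
best = min(scores, key=scores.get)
print(best)  # talent_a -- the steadier voice, even if both sound fine to an ear
```

Both voices might sound equally pleasant to the design team; only the measurement reveals which one the stitching algorithms can work with.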
When they find the person who sounds right to both human and computer, Apple records them for weeks at a time, and that becomes the voice of Siri. This has been the process for each of Siri’s 21 supported languages, localized for 36 countries—more than all its major competitors combined. In all, 375 million people use Siri every month. That’s a big number, especially for a much-panned voice assistant with a long list of serious flaws.
Still, 375 million people pales next to the billion-plus Apple devices in use around the world. Nearly everything Apple sells includes Siri, from iPhone to Apple Watch to MacBook to Apple TV. At some point soon, analysts estimate more than a billion iPhones alone will be active simultaneously. Siri’s a popular and important feature, but it’s not quite ubiquitous. And for most people, it’s definitely not essential; you don’t need Siri to function the way you need your phone. Now that Apple has an assistant it trusts, it has to teach people how to use it.
Ask Me Anything
All you need to know about Apple’s intentions for Siri can be gleaned from one commercial. The spot follows Dwayne Johnson through a day in his life with his sidekick Siri. Johnson uses Siri to check his calendar while working out and zen-gardening; he checks his reminders; he summons a Lyft, which of course he drives; he checks the weather while speeding recklessly; he checks his email while painting the Sistine Chapel; he does centiliter conversions with his hands full; he FaceTimes and takes selfies from space. Siri calls him “Mr. Big, Bald, and Beautiful,” in a way that hopefully will feel slightly less uncomfortable in iOS 11.
From the beginning, Joswiak says, Apple wanted Siri to be a get-shit-done machine. It drives him crazy that people compare virtual assistants by asking trivia questions, which always makes Siri look bad. “We didn’t engineer this thing to be Trivial Pursuit!” he says.
Instead, Joswiak is still focused on helping people do more with the help of an automated friend. He points to Siri’s ability to do complicated file-search on the Mac, or the upcoming HomePod‘s deep knowledge of music. Another example came a few days after our meeting, when Siri won a technical Emmy for its voice search and controls. There really is something wonderful about saying, “Hey Siri, rewind two minutes,” and watching it happen.
Siri can’t do everything, or even most things. It’s most useful for saving you a few taps and types, not solving complicated trivia or debating whether we’re living in a simulation. Yet because Siri shows no bounds—you can ask it anything—users will try everything. “It is not trivial for users to know what they can say,” Acero says. Part of his job entails helping Siri communicate its skills better, and fail gracefully when it must. “We’re trying to endow Siri with these kind of capabilities, where it may know what it doesn’t know,” he says. “But that’s a tough problem.” Apple’s website, and even its commercials, are designed to help people better understand what Siri can and can’t do.
Another challenge is just getting people to remember Siri exists. “People have their habits of doing something,” Acero says. “If they’re used to typing, all of a sudden changing that, it takes a while.” So Apple’s trying to nudge users in the right direction. In iOS 11, Siri becomes a lot more present and a lot more proactive. It’ll watch you browse the web and then suggest Apple News stories for you to read, or help you add a calendar event for the massage you just booked through Groupon. The new Siri is a shape-shifter, syncing your settings between devices so no matter what gadget you’re using, Siri knows you as well as always.
Over the years, Apple’s been slow to let developers integrate with Siri. While Alexa and to a lesser extent Google Assistant have encouraged others to build apps for and including their assistants, Siri’s walls have stayed closed. All those things The Rock can do, he can only do in Apple’s own apps. It refuses to acknowledge the existence of Google Maps or Outlook on your phone, and certainly won’t turn on any light bulbs made without HomeKit. Last year, the company cautiously let more developers in, allowing users to use Siri to make calls with WhatsApp, summon a ride from Uber, or send money with Venmo. The doors creak wider in iOS 11, but only slightly.
That slow pace has cost Apple its lead in many people’s eyes, as Amazon and Google hoover up developer support and race ahead in features. Joswiak at least projects patience. The question, he says, is not how many things Siri could do. “It’s ‘how do you do it right?’ Because what we didn’t want to do is become prescriptive.” He bristles at Amazon’s and Google’s demanding syntax, which requires you to say things like, “Alexa, ask Daily Horoscopes about Taurus” or “OK Google, let me talk to Todoist.” He’d rather wait until you just say what you want, however you want, and have it happen. Apple, as always, prefers doing nothing to doing something halfway.
The syntax problem ultimately comes back to the same thing Acero heard listening to Samantha and Theodore Twombly fall in love on-screen. The best computers—even the science-fiction ones—sound human. “It has the right pauses, the right intonations, smooth voice,” he says. “And just a little bit metallic in the sound.” He wants to build something that good, and give it to everyone. Anytime you want to check the progress, just check in with Siri.
UPDATE: This story now spells Greg Joswiak’s name correctly.