
Prototyping for Voice: Design Considerations

Key findings from our VUI design research

Published: 24 August 2017
Henry Cooke, Senior Producer & Creative Technologist

In BBC R&D, we've been running a project called Talking with Machines, which aims to understand how to design and build software and experiences for voice-driven devices - things like Amazon Echo, Google Home and so on. The project has two main strands: a practical strand, which builds working software in order to understand the platforms, and a design research strand, which aims to devise a user experience language, a set of design patterns and a general approach to creating voice interfaces (VUI), independent of any particular platform or device.

This is the second of two posts about our design research work. In the first post, we discussed a prototyping methodology we've been developing for VUI. In this one, we'll outline some of our findings from doing the work - key considerations we've found useful to bear in mind while designing VUI prototypes.

Above: the team in a prototyping role-play session.

Tone of voice

When your only method of communication with your users is through voice, then the tone of voice your application uses is super-important - as important as the choices made about colour, typeface and layout in a visual application.

Think about the vocabulary and writing style used in your voice prompts and responses. Is your application mostly concerned with responding to direct requests for specific information? Then you probably want to be pithy and concise with your responses. Is your application designed for people in a more relaxed, open frame of mind? Then you can afford to be more discursive and chatty.

If you're using recorded talent (although this could apply to synthesized voices, to a lesser degree), think about timbre, intonation and delivery style - everything you'd think about when recruiting a voiceover artist.

Use of system synthesized voice vs. recorded talent voice

Let's be honest - while the system voices on Alexa or Google Home are good, they're optimised for short, pithy answers to requests. They're not really suitable for reading out long chunks of text, especially if that reading requires natural intonation.

Using recorded human voice allows for much more natural speech, and allows for a reading with a particular tone: excitable / serious / etc. The downside is that you add a production overhead to your app, and once the voice is recorded, it can't be changed. A voice app using recorded talent speech will never be as adaptable as one which generates speech on the fly.
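As a sketch of how the two approaches can coexist in one application: SSML's <audio> element (supported by both Alexa and Google's platform) plays a pre-recorded clip, while plain text falls back to the system voice. The helper name and URL below are illustrative, not from any real skill.

```python
# Sketch: one response builder that prefers recorded talent where a clip
# exists, and falls back to the synthesized system voice otherwise.
def speech_ssml(text: str, recording_url: str | None = None) -> str:
    """Wrap a response in SSML, preferring a recorded clip if one exists."""
    if recording_url:
        # Recorded talent: natural intonation, but fixed once recorded.
        return f'<speak><audio src="{recording_url}"/></speak>'
    # System voice: less natural, but can be generated on the fly.
    return f"<speak>{text}</speak>"

# A scripted narrative beat uses the recording; a dynamic answer uses TTS.
intro = speech_ssml("Welcome.", recording_url="https://example.com/audio/intro.mp3")
answer = speech_ssml("The next episode airs on Friday at 8pm.")
```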

It's all about the data

This one applies more to data-driven applications than narrative experiences. One thing we've found when designing conversational user interfaces (CUIs) - text or speech - is that while it seems intuitive to be able to ask questions against a large dataset (say, the news, or a list of programmes), these types of application can only be built if there's an extensive, well-tagged and searchable data source to query. In these cases, the interface itself - parsing the user's intent - is a relatively straightforward problem to solve; the really hard problems arise from collating, sifting and re-presenting the data required to answer the user's questions. These kinds of application are a lot more about juggling data than they are about natural language.
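To illustrate, once the user's intent has been parsed ("find me 90s comedies", say), the interface side can be a few lines of filtering - provided the catalogue behind it exists and is consistently tagged. A minimal sketch against a hypothetical programme catalogue:

```python
from dataclasses import dataclass

@dataclass
class Programme:
    title: str
    genre: str
    year: int
    tags: list[str]

# The hard part: building and maintaining this, consistently tagged.
CATALOGUE = [
    Programme("Example Show", "comedy", 1994, ["sitcom", "family"]),
    # ...thousands more entries...
]

def find_programmes(genre: str, decade: int) -> list[Programme]:
    """The 'interface' side of the query is relatively trivial."""
    return [p for p in CATALOGUE
            if p.genre == genre and decade <= p.year < decade + 10]

results = find_programmes("comedy", 1990)
```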

The expectation of 'smart'

Towards the end of 2016, we did some prototyping and user testing around CUI - mostly in text messaging channels, but with some VUI. One of the most striking things we found from the testing was the expectation users had about the intelligence of the entity they were talking to. Essentially, since people were communicating with something that appeared to be smart enough to respond to natural language, and which pretended to have a personality, they assumed it was also smart enough to answer the kinds of question they'd ask another person.

This is an important thing to bear in mind when designing systems with which someone will converse, because it's very rare that you will be able to deal with spontaneous, open speech. Most applications will have a limited domain of knowledge: a story about witches, say, or the programme catalogue of a large broadcaster. How are you going to communicate to people the limits of your system without driving them away?

Letting the user know what they can say / dealing with a limited vocabulary

Most VUI systems don't allow completely free, spontaneous speech as input; as a developer, you have to register upfront the collection of phrases you expect a user to say in order to interact with your application, and keep it updated as unexpected variations creep in.
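To make this concrete, here's a minimal sketch of what registering those phrases might look like, loosely modelled on the Alexa Skills Kit interaction model (which is normally authored as JSON). The intent, slot and sample names are illustrative:

```python
# Every phrase a user might say has to be anticipated and registered.
INTERACTION_MODEL = {
    "intents": [
        {
            "name": "NavigateIntent",
            "slots": [{"name": "direction", "type": "DIRECTION"}],
            "samples": [
                "go {direction}",
                "move {direction}",
                "{direction}",  # added after testing: people say just the word
            ],
        },
    ],
    "types": [
        {"name": "DIRECTION", "values": ["forward", "backward", "stop"]},
    ],
}
```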

Given that this limitation exists, you have the problem of communicating to people what they can say to navigate your application. Some developers choose to do this upfront, listing possible commands when an application starts for the first time. However, this can sound clunky, acts as a speed bump for people wanting to get to the meat of your application, and requires them to remember what you told them at the point where an interaction becomes available.

Another way to do this is to wait until an interaction is about to happen, and then tell a person what they can say: "you can say forward, backward or stop." However, this can seem a little mechanical, and interrupts the flow of a longer conversation or fictional piece.

Things to try:

  • In a fictional piece, you could set up a choice as an argument between two characters.
  • You could use a set of choices that is naturally limited, e.g. numbers from 1 to 10, or star signs (see the sketch below).
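As a sketch of the second idea: with a naturally limited set like the numbers one to ten, a handler can validate the response and re-prompt without ever having to list the options. The wording below is illustrative:

```python
# A choice set users already know, so it never needs spelling out.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def handle_choice(utterance: str) -> tuple[str, int | None]:
    """Return (spoken response, parsed choice or None to re-prompt)."""
    word = utterance.strip().lower()
    if word in NUMBER_WORDS:
        return f"You picked {word}.", NUMBER_WORDS[word]
    # The re-prompt restates the natural limit rather than listing options.
    return "Pick a number between one and ten.", None
```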

Modes of address / managing the user, narrator and other characters

When a person interacts with a voice application, they're always interacting with at least one voice. For simple applications, one voice is often enough - although Google Home's model of handing off to other voices for different functions - "OK Google, talk to Uber about a ride" - is interesting, and helps someone understand when they're shifting contexts. For more complex, narrative-driven applications, it's likely there will be more than one character talking over the course of the experience. In these applications, managing how the characters talk to one another and how the user is addressed becomes a challenge with some subtleties.

In this case, there are a few questions you need to ask yourself:

  • Is the user present in the piece, or an unnoticed observer?
  • Is the user directly participating in the narrative with their voice, or distanced from the narrative and using their voice to navigate at a level of remove (the difference between speaking to characters directly, and using voice to choose branches in a storyline)?
  • Can all the characters in the piece address the user, or just one? Using a narrator / mediator to communicate with the user can simplify things, but it's still important to consider how the user will know when a character is addressing them directly and when characters are talking between themselves (the 'turning to the user' problem).

Turn-taking / Letting the user know when they can speak

The 'ding'

This seems like the most straightforward way to let someone know they can speak - "after the ding, say your choice". However, there's subtlety here: do you say "ding" or play the sound itself when referring to it? Is this confusing to the user? They have to understand the difference between a referential ding and a real one. If you say the word "ding", do people understand that this means a "ding" sound when it's played?
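If you go with a real sound rather than the word, one approach is to append an earcon to the end of each prompt, so the ding always means "the microphone is open". A minimal sketch using SSML's <audio> element; the sound URL is illustrative:

```python
DING_URL = "https://example.com/audio/ding.mp3"

def prompt_with_ding(question: str) -> str:
    # The question is spoken, the ding plays, then the device listens.
    return f'<speak>{question} <audio src="{DING_URL}"/></speak>'

ssml = prompt_with_ding("Which door do you open?")
```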

Audio mix

A more subtle way of letting the user know that they can speak in fictional, radio-like pieces is by using the audio mix. If you're using music or sound beds for the action, you can drop these out at the time a character or narrator is addressing the user, signifying that our focus has moved away from the fiction and the user is alone with the narrator. Closer mic placement for a recorded narrator voice can also indicate closeness to the listener / user.

System voice

While we've identified some problems with using the system voice on VUI devices, it can be useful to let the user know they're being asked a direct question, since that's the voice they're used to interacting with on a device. If you're making a piece that includes many voices, consider using the system voice as a 'bridge' or 'mediator' between the user and the fiction world.
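As a sketch of that 'mediator' pattern: the scene plays as recorded audio via SSML's <audio> element, and the direct question is left as plain text so the familiar system voice asks it. The helper name and URL below are illustrative:

```python
def scene_then_question(scene_url: str, question: str) -> str:
    return (
        "<speak>"
        f'<audio src="{scene_url}"/>'   # recorded character voices
        '<break time="500ms"/>'         # a beat before the address
        f"{question}"                   # system voice asks the user
        "</speak>"
    )

ssml = scene_then_question("https://example.com/audio/scene3.mp3",
                           "Do you follow the witch, or stay where you are?")
```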

Maximum talk time

At the time of writing, you're limited to 90 seconds of audio playback on Alexa (120 seconds on Google Home) between each user request. This means you cannot write a large chunk of dialogue to be played back as an audio file without having the user respond regularly. This is a constraint to bear in mind when designing your dialogue - how can you cause an interaction every minute or two without making it seem forced?
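A simple pre-flight check against the limit can flag over-long chunks before they ever reach a device. A sketch, with illustrative file names and durations:

```python
ALEXA_LIMIT_S = 90  # 120 on Google Home, at the time of writing

scenes = [
    ("scene_1.mp3", 72.0),   # (file, duration in seconds)
    ("scene_2.mp3", 95.5),   # too long: needs an interaction inside it
    ("scene_3.mp3", 88.0),
]

for name, duration in scenes:
    if duration > ALEXA_LIMIT_S:
        print(f"{name}: {duration}s exceeds {ALEXA_LIMIT_S}s - "
              "split this around a user interaction")
```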

Thanks for reading!

These considerations are all things we've run across over and over again while doing our design research work on Talking with Machines, and they have always proved useful to bear in mind while thinking about VUI design. We hope they're useful to you in your own VUI design work - we've also put together some practical VUI prototyping tips.

Thanks to the whole Talking with Machines project team for their work on the design research which led to these posts: Andrew Wood, Joanne Moore, Anthony Onumonu and Tom Howe. Thanks also to our colleagues in BBC Children's who worked with us on a live prototyping project: Lisa Vigar, Liz Leakey, Suzanne Moore and Mark O'Hanlon.
