Racism and Vocal Tone in Voice User Interfaces

An exploration of Siri Voice through the lens of sociomateriality and critical race theory

Xuan Song
10 min read · Dec 31, 2020

Introduction

One nonverbal cue that can be expressed through Voice User Interfaces (VUIs) is vocal tone, and tone can have a measurable impact on user perceptions. As Nass and Brave note in their book Wired for Speech (2005), voice can elicit powerful responses: the human brain evolved to facilitate human-human conversation, which relies heavily on vocal cues. This research raises a crucial question: “How will a voice-activated brain that associates voice with social relationships react when confronted with technologies that talk or listen?” [1] To begin, I discuss key concepts of sociomaterial theory and critical race theory that come into play in Apple’s Siri voice feature. Then I unpack an example of that feature, in which users can find a limited set of vocal tones in Siri’s settings. Next, I examine how the combined lenses of these theories extend each other to give us a more detailed understanding of Siri and vocal tone. Finally, I discuss the implications of these theories for other technology design and HCI research problems.

Critical race theory

Race should be recognized as an important topic in HCI. There are many biases in technology and design: autonomous cars that misidentify darker-skinned pedestrians, commercial facial recognition systems that are less accurate for people of color, chatbots that learn hate speech, and so on. Although there has been some forward movement on racial diversity in the HCI field, the outcomes of those efforts have been lacking. Technological racism is not limited to face recognition and biased algorithms. A personal story in Critical Race Theory for HCI examines the issue of editing away voices of color: “Now I have to question and evaluate others in HCI to judge whether they will limit my opportunities or unfairly judge my ability if I do not consistently use the language of elite Whites.” [2] To respond to racism and social inequity, I take up two tenets of critical race theory that relate to Siri’s voice technology. The first tenet is that racism is ordinary, not aberrational. [2] Racism exists in all aspects of our lives, and those who are not impacted by it struggle to recognize this; to think seriously about diversity and inclusion in technology, we must first recognize this phenomenon. The second tenet is that “there is a uniqueness to the voice of color, and storytelling is a means for it to be heard.” [2] As Ogbonnaya-Ogburu et al. argue, voices of color are distinctive, and their stories deserve to be heard on their own terms.

Sociomaterial theory

Sociomateriality is constitutive: it shapes the possibilities of everyday activities, and materiality is in turn constituted in everyday life. We should not see the social and the material as independent of each other; social and material practices are intertwined. Humans rely on materials for social functions, and materials are produced through human practices. ‘Constitutive entanglement’ captures this mutual dependence: the social and the material are inseparably interwoven in everyday practice. Humans and technology are constitutively entangled in a form of mutual reciprocation; neither humans nor technology comes first. [3] Understanding social practice is not enough; we also need an understanding of material practices, and vice versa. Agency, the capacity to act, emerges through interactions with material forms such as technology. [3] In what follows, I examine Siri’s voice through a combined theoretical perspective that draws from both critical race theory and sociomateriality.

Background: A New Lens for Siri Voice

The original Siri voice comes from Susan Bennett, a white American woman. She has described the process of concatenation: she read hundreds of sentences and phrases created specifically so that technicians could extract individual sounds and recombine them into new phrases and sentences, which ended up in the Apple devices we use now. [4] Years later, Siri’s default voice is still that of a white American woman, and the tone and manner in which Siri communicates remain centered on the language of upper-class white Americans.
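To make the mechanics concrete, here is a toy sketch of concatenation in Python: pre-recorded units from a single voice talent are loaded and stitched together into a new utterance. This is a drastic simplification of Apple’s actual pipeline, and the file names, unit inventory, and sample rate are all hypothetical.

```python
# Toy illustration of concatenative synthesis: pre-recorded units
# (here, whole words) are stitched together to form new utterances.
# This simplifies Apple's actual pipeline; the file names and sample
# rate are hypothetical.
import wave

import numpy as np

SAMPLE_RATE = 22050  # Hz; assumed for all recorded units

def load_unit(path: str) -> np.ndarray:
    """Load one recorded unit (a word or phrase) as a mono 16-bit PCM array."""
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    return np.frombuffer(frames, dtype=np.int16)

def concatenate_units(paths: list[str], gap_ms: int = 40) -> np.ndarray:
    """Join recorded units, inserting a short silence between them."""
    silence = np.zeros(int(SAMPLE_RATE * gap_ms / 1000), dtype=np.int16)
    pieces = []
    for path in paths:
        pieces.append(load_unit(path))
        pieces.append(silence)
    return np.concatenate(pieces[:-1])  # drop the trailing silence

# Hypothetical unit inventory recorded by a single voice talent:
utterance = concatenate_units(["units/what.wav", "units/time.wav",
                               "units/is.wav", "units/it.wav"])
```

Because every unit comes from one talent’s recordings, every phrase the system can ever say inherits that one voice, which is exactly why the choice of who records the units matters.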

The combination of critical race theory and sociomaterial theory, theoretical perspectives from two different waves of HCI theory, offers both specific and holistic points of view for analyzing VUIs. It allows us to zoom out to understand the bigger social picture and zoom in to examine specific language patterns. Critical race theory explains specific social phenomena, while sociomaterial theory provides a more holistic point of view that can encompass multiple critical theories, including critical race theory. Sociomateriality emphasizes the feedback loop between human activities and materials and technology; race, as a social category, shapes the possibilities of technology.

Analysis: The Power of Diversity in Siri Voice

Critical race theory provides an avenue to discuss ethics and power, but it does not show the full picture of Siri across social categories. Meanwhile, from a sociomateriality lens, society and materials constantly shape each other, but this theory alone does not explain why we should care about Siri’s voice in particular. Bringing the two together gives us a combined lens that explains both why diversity in AI voices matters and how diverse human activities and Siri’s voices shape each other.

First, these theories help us understand the importance of diversity in AI voices across different social contexts. Take English, for instance: Marianna Pascal notes in her TED talk that if we could listen to every conversation happening in English on the planet right now, only 4% of those conversations would be between native speakers. [5] Additionally, Miriam Sweeney observes that most AI assistants’ voices suggest a form of “‘default whiteness’ that is assumed of technologies (and users) unless otherwise indicated.” [6] Yet based on my personal experience and research, the option to change Siri’s voice and tone is hard to find and poorly designed.

Here are a few experiments with the current Siri:

User: Can you change your tone to an Asian woman?

Siri: Sorry, I can’t do that.

User: Hey Siri what kinds of accents can you speak to?

Siri: I am not sure I understand.

Siri: I can’t change my voice, but you can do it yourself in Settings → Siri Voice Settings.

These experiments reveal some obvious problems: first, it is not obvious how to switch between different tones in Siri; second, Siri’s voices are not diverse enough; finally, Siri’s knowledge bases are dominated by the language of white Americans, while other groups remain underserved.

Fig. 1: Author’s conversations with Siri
Fig. 2: Siri voice feature in Mac Siri setting

Moreover, people with different racial and language backgrounds can improve the diversity of voice products, and more diverse voice products can in turn benefit racial minorities. This reflects the idea of ‘constitutive entanglement’ as well as the first tenet of critical race theory: racism is ordinary, not aberrational, and race and racism are socially constructed. Currently, Siri provides 12 different vocal tones, and users are able to choose among them in Siri’s settings (see fig. 2). Compared to the original Siri voice, this is a definite improvement in voice diversity. Still, these voices are hard to discover in the first place if users never open Siri’s settings.
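One way to see what voice diversity a platform actually exposes is to enumerate it. The sketch below uses the built-in macOS `say` command (`say -v ?` lists installed text-to-speech voices) to count voices per locale. These are system TTS voices rather than Siri’s own voice set, and the line parsing is approximate, so treat this as an audit sketch rather than a definitive inventory.

```python
# Count installed macOS text-to-speech voices per locale using the
# built-in `say` command. These are system TTS voices, not Siri's
# voice set, but they hint at how voice diversity is exposed to users.
import re
import subprocess
from collections import Counter

listing = subprocess.run(["say", "-v", "?"], capture_output=True,
                         text=True, check=True).stdout

voices_per_locale = Counter()
for line in listing.splitlines():
    # Lines look like: "Samantha  en_US  # Hello, my name is Samantha."
    # Locale parsing is approximate; some entries use other formats.
    match = re.search(r"\b[a-z]{2,3}[_-][A-Za-z]+\b", line)
    if match:
        voices_per_locale[match.group(0)] += 1

for locale, count in voices_per_locale.most_common():
    print(f"{locale}: {count} voice(s)")
```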

The lack of research and testing makes other Siri voices sound less accurate than the default American female voice, and many other tones are still not included. If we looked at all of the English conversations in the world, we would see that the overwhelming majority are held between non-native speakers; following Pascal’s 4% figure, roughly 96% of English conversations involve at least one non-native speaker. [5] Because of the complexity of language and of how we train speech models, new problems and concerns will keep arising in the HCI field. We need to consider various vocal tones and speech models through continued testing with people of color and non-English speakers. I also suggest conducting the initial voice recordings for speech models with members of minority groups, not only with white women or native speakers. Moving forward, we should consider how people talk in different social contexts and how those conversations can contribute to the improvement of Siri’s voice; the sketch below illustrates the kind of audit this implies.
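As a minimal sketch of such an audit, the snippet below computes word error rate (WER) separately for each speaker group, so a recognizer that performs worse for some voices shows up in the numbers. The groups, transcripts, and model outputs here are hypothetical; a real audit would run the speech model over large, demographically labeled recording sets.

```python
# Minimal sketch of a demographic speech-recognition audit: compute
# word error rate (WER) per speaker group. The samples below are
# hypothetical; in practice, the hypotheses would come from running
# the speech model on each group's recordings.
from collections import defaultdict

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# (group, reference transcript, model output) - hypothetical samples
samples = [
    ("native US English", "set a timer for ten minutes",
     "set a timer for ten minutes"),
    ("non-native English", "set a timer for ten minutes",
     "set a time for tin minutes"),
]

errors = defaultdict(list)
for group, ref, hyp in samples:
    errors[group].append(word_error_rate(ref, hyp))

for group, rates in errors.items():
    print(f"{group}: mean WER = {sum(rates) / len(rates):.2f}")
```

A persistent gap in mean WER between groups is direct, quantitative evidence that the model was trained and tested on too narrow a range of voices.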

Discussion: Influence on other design and research areas

It is striking to notice that building AI bots and other bot frameworks can also be discussed through critical race and sociomaterial lenses. A personal example illustrates the point. When I built an AI bot at Microsoft, I learned technical tools such as Bot Framework Composer, Azure Bot Service, and the Azure QnA Maker Cognitive Service. Those tools provided a medium for discussion with peers who have different language and cultural backgrounds, and my peers constructed sample dialogues and conversations in manners different from mine. To build a bot that works for everyone, I took different personas’ requirements into consideration, which involved understanding users’ racial backgrounds, their working behaviors, and the ways they interact with Microsoft tools. All of these considerations helped my team and me build a more inclusive AI bot. Given the complexity of language and race, it is challenging to build a bot that works for everyone; yet designing Siri and other VUI technologies to serve a diversity of people is critical to making products more inclusive. If I had the chance to improve these AI tools, I would provide more guidance and personas to make the product more inclusive. I believe the feedback loop between diverse human activities and materials can help other researchers and designers improve their AI bots in the future.
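One lightweight way to bake such considerations into practice is an inclusivity regression test: the same intent, phrased the way different personas might actually say it, should all resolve correctly. The sketch below is generic Python rather than actual Bot Framework Composer code, and `recognize_intent` is a hypothetical stand-in for the bot’s real language-understanding call.

```python
# Sketch of a simple inclusivity regression test for a bot: the same
# intent, phrased the way different personas might actually say it,
# must all resolve correctly. `recognize_intent` is a hypothetical
# stand-in for the bot's real language-understanding service.
def recognize_intent(utterance: str) -> str:
    # Placeholder keyword matcher; a real bot would call its NLU service.
    return "CheckLeave" if "leave" in utterance.lower() else "Unknown"

# One intent, phrased by personas with different language backgrounds:
persona_utterances = {
    "CheckLeave": [
        "How many vacation days do I have left?",
        "How much annual leave remains for me?",
        "Can you check my remaining leave balance?",
    ],
}

for expected, utterances in persona_utterances.items():
    for utterance in utterances:
        got = recognize_intent(utterance)
        status = "ok" if got == expected else "MISS"
        print(f"[{status}] {utterance!r} -> {got}")
```

Notably, the naive keyword matcher misses the first phrasing (“vacation days” never mentions “leave”), which is exactly the kind of gap this style of test is meant to surface before real users hit it.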

Expanding beyond VUIs, the combined theoretical perspective also provides value for designing search engines, which face social and racial issues similar to those of VUI products like Siri. In Algorithms of Oppression, Safiya Umoja Noble challenges the idea that search engines like Google offer an equal playing field for all forms of ideas, identities, and activities; data discrimination is a real social problem. [7] For example, run a Google search for “black girls”: “Big Booty” and other sexually explicit terms were likely to come up as top search terms, while typing “white girls” produced radically different results. [7] Because internet search engines hold near-monopolies and can promote certain sites, their algorithms can privilege whiteness and discriminate against people of color. As more and more information is uploaded to the internet, the algorithms could bias search results based on the racial background of the uploader, directly changing the ways people access and interpret information. There are, however, limitations these combined theories cannot address: for example, how search engines change over time, the biases of current search engines in LGBTQ queries, and the limitations of current technology.

Conclusion

Language and voice are complex systems that include different accents, dialects, idioms, and more. More than 100 million people in the U.S. and Europe have speech patterns that may not work with today’s VUIs, perhaps because they have a stutter, or because a stroke has made their speech less intelligible. Nonverbal behavior such as gestures, eye contact, and facial expressions is impossible to recreate through voice-only interactions, but one nonverbal cue that can be expressed through VUIs is vocal tone, and it can have a measurable impact on user perceptions. All voices project a persona, whether intentional or not, which is why it is important to consider the emotional needs of users when choosing a tone. What does that mean at a larger scale? What social norms might change because of Siri? AI voice assistants activate an ambivalent relationship with users, giving them the illusion of control in their interactions with the assistant while withdrawing them from actual control over the computing systems that lie behind the interface. [8] The boundaries between human activities and technology are blended and unclear.

In short, the critical race theory and sociomateriality lenses help us understand the importance of diversity in AI voices and the relationship between diverse human activities and Siri’s voices at scale. It is essential that issues of race and racism in VUIs be raised more frequently in the community, in a way that is heard by everyone. There is still a long way to go, but for people who work in related design and research, I hope this paper offers some inspiration on how to apply these HCI theories to your own field.

Works cited

[1] Clifford Nass and Scott Brave. 2005. Wired for speech: How voice activates and advances the human-computer relationship. MIT press.

[2] Ihudiya Finda Ogbonnaya-Ogburu, Angela D.R. Smith, Alexandra To, and Kentaro Toyama. 2020. Critical Race Theory for HCI. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ‘20). Association for Computing Machinery, New York, NY, USA, 1–16. DOI:https://doi.org/10.1145/3313831.3376392

[3] Wanda J. Orlikowski. 2007. Sociomaterial Practices: Exploring Technology at Work. Organization Studies 28, 9. SAGE Publications.

[4] Meet the Real Voice of Siri. OPRAH.COM. http://www.oprah.com/wherearetheynow/meet-susan-bennett-the-real-voice-of-siri-video

[5] Marianna Pascal. 2017. “Learning a language? Speak it like you’re playing a video game.” TEDxPenangRoad, 11 May 2017. https://www.youtube.com/watch?v=Ge7c7otG2mk

[6] Sweeney, M. E. (2020). Digital Assistants. In N. B. Thylstrup, D. Agostinho, A. Ring, C. D’Ignazio, & K. Veel (Eds.), Uncertain Archives: Critical Keywords for Big Data. Cambridge, Mass.: MIT Press.

[7] Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.

[8] Simone Natale. 2020. To Believe in Siri: A Critical Analysis of AI Voice Assistants. Communicative Figurations Working Papers 32 (March 2020), 1–17.


Xuan Song

UX Designer @ Microsoft | MS in Human-Computer Design & Engineering at University of Washington