James Parker (00:00:46) - Thanks so much, Beth. I was wondering, maybe would you like to start off just by introducing yourself however seems right to you?
Beth Semel (00:00:55) - Yeah, sure. So my name is Beth Semel. I'm an assistant professor in the anthropology department at Princeton. And I like to say that I'm an anthropologist of science, technology, and language. So my training is in science and technology studies and linguistic anthropology, also dabbling in medical anthropology. But the sciences and technologies that I study, as the first part of that label, are communication engineering, speech signal processing, this thing that we could call machine listening, and then also the science and doing of psychiatry, mental health care, and the various technologies that are involved in that practice. And the language bit is me thinking about, you know, not just modes of speech, not just the production of language, but also modes of interpreting and receiving language, which again is where, when I think about machine listening, I think about it always in that kind of relational interplay between both speaking and listening. So, I guess that's me.
James Parker (00:02:13) - Amazing. I mean, yeah, it sounds awesome. Do you want to say anything about how you arrived at that set of concerns? I don't know if you've got a kind of story; for some people that we've spoken to, there's sort of biographical information that really sheds light on their set of concerns. I don't know if that's the case for you.
Beth Semel (00:02:36) - Yeah. You know, all anthropologists kind of have their practiced origin story, but I'll maybe veer off of that a little bit. Thinking about the personal side of things, my mom is a retired speech therapist. And so, growing up, I was kind of aware of this idea that there is a need for bringing people up to a kind of standard threshold of intelligibility. And that standard is something that can be kind of fluid, right? It can be set between the client and the therapist, but there are also broader systems that set that standard in place, right? Standard, acceptable, quote-unquote professional or whatever ways of being intelligible, of sounding intelligible. And just the idea that that has to be put into practice and made, I think, is something that I realize more and more really plays a big role in the critical approach I take to thinking about this weird, strange field of vocal biomarker research, which I sometimes call machine listening in mental health care.
But I mean, the more standard origin story is that I was at MIT because I was pursuing a PhD in science and technology studies and anthropology, because I was really interested in mental health care as a thing that's both technical but also interactional, right? The primary tools are talk and interpretation and interaction. And at the same time, there's this whole idea of evidence-based therapies, evidence-based treatments. So there's this quantifying matrix that practitioners are trying to push patients' talk through. And it just so happens that I entered into the PhD and was kind of chugging along at this moment of change: American mental health care, and the kind of primary funding bodies of American mental health care, rejecting the standard technology of mental health, which is the Diagnostic and Statistical Manual of Mental Disorders, and saying instead, well, we should be hanging out and collaborating with engineers and using data-driven methods instead of ones that are more based on clinical wisdom, clinical know-how. Being at MIT, there were people there doing this data-driven research about, 'OK, can we use functional magnetic resonance imaging to predict which patients will respond well to cognitive behavioral therapy?' And so this is a type of research that falls under the umbrella of digital psychiatry or precision medicine. And in kind of hanging out with those people, getting to know them, talking with them, they mentioned, oh yeah, you know, it sounds like you might be interested in this one lab that does vocal biomarker research. And I said, excuse me? What is that? What is a vocal biomarker? That doesn't really make any sense. How can the voice be biological? And this person said, well, you know, your brain controls everything, including your voice. So if you analyze the sound that the voice makes at, you know, the level of the physical waveform of speech, then you can find out about the source that made it, the brain. And so from there, I just went on this kind of wild goose chase to find people who were doing this work, because it kind of miraculously allowed me to unite an interest I had in thinking about interpretation, mental health care, but also technology. And again, that impetus to push talk and interaction through, in this case, not just a quantified matrix, but a specifically computational one, which is almost like hyper-quantified.
James Parker - Right. And when does that origin story take place? Like, what year?
Beth Semel - Yeah. Around 2015. So that was the time that the director of the National Institute of Mental Health, Thomas Insel, who was the director at the time, put out this funding call that really upset a lot of people, which said: we're not going to be supporting any research that uses the DSM. Which was kind of earth-shattering, because people use the diagnostic categories and the diagnostic criteria in the DSM not just in a clinical context but in a research context. So they use it, let's say your lab wants to study bipolar disorder, you need patients who have been diagnosed or presumably exist within that diagnostic category. So in the past, people would typically use the DSM as a way to create a research cohort. And Insel basically said, no, we won't fund that research anymore.
James Parker (00:08:09) - But this was like early days, right? So it sounds like you're saying that this is a big driver, or something that was kind of a sleeping giant for a while. Because I didn't hear about the kind of vocal biomarkers research and sort of computational vocal diagnostics until much, much more recently, like really only the last couple of years. And then there's been some kind of huge US national announcement about pouring money into this project, or related projects. So it seems like you were studying a field, I don't know, at its birth, or... I don't know. How would you describe it? And what's changed, I suppose? I mean, we should get into the details. But just in terms of laying out the landscape, it seems like you would have had to explain yourself to absolutely everybody when you said what research you were doing when you started. And now maybe you don't need to so much.
Beth Semel (00:09:21) - Yeah, it's really been remarkable. Indeed, when I first started this research and I would tell people that this exists, or that people have a desire for this or think it's a good idea, they would say: this is absurd. How could anyone believe in this as a concept or want it? And as you say, there's this giant, I think it's, I don't know, three, four million dollar NIH grant for voice, it's called 'voice as a biomarker for health,' and there are different arenas of health that they want to study, one of which is mental health care, the others are more like... I mean, I think, you know, the question of what has changed, or what has led to the coalescing of these two kind of paradigms, two kind of orientations towards language, and specifically language and mental health care, that's been a question that I'm trying to think through now through archival research. So in the archival material, I found papers that are doing what we would today call vocal biomarker research, that are trying to say, can we, you know, wring something of the neurobiological, the psychopathological, biologically speaking, from the voice using computational methods. I found papers that are doing that, like, in the 1930s, trying to do that, proposing that as a concept. A bunch of dissertations in Germany, which is slightly concerning. Why were people interested in this in the 1930s in Germany, PhD students in particular?
But even in the US as well, I found engineering papers kind of parroting this concept, this idea that you could get something meaningfully biological about the psyche through the voice, in the 1960s. I think, maybe to conjecture a little bit and pull the camera lens outward, there is in many ways a broader acceptance, or really a kind of capitalist, market hunger for computational things. There's a kind of inertia that I see happening. I think voice stuff is just one way that people are trying to do precision psychiatry, but I think it's because there is this imaginary about the voice as being easeful, right, as being immaterial, as being kind of a public object, freely available, freely floating. I've heard startup people talk about the voice being a very cheap signal, right? The face is so multifaceted, the visual world is so multi-dimensional, but the voice is flat, it's a one-dimensional thing, it's just one waveform, and that makes it not just easy to analyze, but inexpensive. So I think, again, there's
James Parker (00:13:05) - Microphones are cheap too, right? So it's sort of cheap across every dimension, isn't it? I mean, just to give one very stark example of the drivers, or the kind of capitalist orientation or the market for this kind of stuff: an Australian-based company, we're based in Australia, just got bought by Pfizer for $100 million on the basis of a claim that it can do COVID cough diagnostics. So it's not exactly what you're talking about in terms of computational psychiatry specifically, but the kind of voice-body nexus, the idea about the way in which the body speaks through the voice even when it's not speaking, if you know what I mean, is very clear there. The cough is kind of the perfect example of something on the fringe of the voice, the voice that is sort of not laden by speech. And they're not the only company to have done that. I mean, it's interesting that Pfizer did that in the wake of COVID, but there are a number of other companies, Sonder Health in India and a few others, that have been pushing this. It feels like the pandemic context, and the sort of riskiness of the voice and the relationship between breath and contagion and stuff, has really been a bit of a trigger for an explosion of interest in voice diagnostics. It could just be coincidence, but it does certainly feel that way, that the pandemic is a kind of driver into this field.
Beth Semel (00:15:03) - Yeah. And I think, you know, even before the pandemic, a lot of these emerging vocal biomarker companies... in that initial starting period, there were quite a few vocal biomarker companies cropping up in this 2015, 2016 time that ultimately ended up pivoting to a different offering, like, 'oh, we'll give you personalized recommendations for a therapist,' or 'we're a therapy app now.' I know a few companies have done that. But I think this impetus or this desire to be tethered to the patient, even in the absence of any kind of physical connection, and this imaginary of easefulness, and easefulness particularly of capture and of knowing the patient, that's, to bring it back to your question about why this all happened at the time that it did, where I think we can start to see the connective tissue with a really very biologizing impulse in psychiatry and mental health care that's driven by capitalism, but also really kind of about capture, right? Like pinning down something essential about not just the person, but trying to funnel down mental illness, psychiatric suffering, mental suffering, as a definitive object that can be held onto and known and done something to. Which, you know, from a disability studies perspective, sort of rhymes a whole lot with eugenics and other kinds of modes of social control that are pretty oppressive.
So I think there is, you know, because the aim or the impetus or the goal here is health, right? Health benefits. Of course we want to do whatever we can to mitigate the spread of COVID. Of course, you know, asterisk, in the absence of actual state infrastructure to help mitigate COVID, we need something of a band-aid that might do the best that it can to help. But I think lots of really smart people in critical code studies, race-critical code studies, like Ruha Benjamin and Safiya Noble, have done a lot of great work to show how that kind of beneficent intention isn't enough, and sometimes can ultimately draw attention away from looking at the kind of unintentionally harmful side of things. Yeah.
James Parker (00:17:48) - I'd love to get into some of the nitty gritty of your... I mean, you've written a lot about this topic, or these topics. And you have a wonderful thesis, which has all of these amazing ethnographic case studies. And I'd love to get into the nitty gritty of at least some of them, and talk through your own experiences navigating this strange new and emergent world. But I just feel like it's worth pointing out, as a segue into that, that those examples are all of research labs, really in the kind of prototyping phase. Am I right that none of the projects that you investigated have gone to market? And I just wondered if it's worth setting up a little bit what the contemporary landscape is, other than kind of hype and flooding investment and so on. Are there any extant companies that are really already doing vocal biomarker stuff? Are they all pivoting out, because it doesn't really work, into being other kinds of apps? How do you understand the lay of the land as far as this field goes in the contemporary moment, before we dive back into those case studies?
Beth Semel (00:19:16) - Yeah. Yeah, it's a complicated question. You know, I think, on the one hand, the field of real-time sentiment analysis, which is a branch of affective computing, that stuff exists, and we could call that quote-unquote vocal biomarker research, or kind of machine listening for sentiment analysis research, right? I'm not sure that people would connect the two, but to me, I think they are connected. And those technologies are... I mean, I want to be a little careful naming companies, just because I don't want to be slapped with a libel lawsuit. One of the other benefits of working at a university is you don't have to deal with that scary corporate boogeyman. But there are companies who use this real-time sentiment analysis stuff for call center workers, right? Not only to help the call center worker kind of manage the affect of the person that they're on the phone with, but also as a way for managers to surveil the workers, right? And it creates a kind of quota-making system. And it does work under a similar premise, which is that emotion, affect, sentiment exist as a kind of physical feature of the voice that you can track and, you know, hold onto, in a sense.
But in terms of specifically vocal biomarker stuff, you know, there are a lot of companies. There are some research labs that are pivoting into or kind of spinning off into companies. I'm assuming, too, that there are people who are sharing or selling AI models or data sets to companies. But as to whether or not it works, that's a question that I used to get a lot, especially in the earlier times when this wasn't a regular headline in the US news. Does it work?
But is it real? I would get that a lot, and: can I hear it? Can you give me an example that you can play for me? You know, I used to say, no, of course it doesn't work. It doesn't work according to, you know, a linguistic anthropological framework, which says you can try to disentangle language from social context, from history, but doing so requires ignoring a lot of really important, essential things about how people make meaning or have meaning imposed onto them through language and talk. But does it work in the terms that it's trying to work, which is, again, to capture something essential about the mentally ill speaking subject? It sort of doesn't really matter, because the paradigm there is one of capture, right? One of essentializing. So, you know, those technologies are being developed. They're being developed by being tested on people in the same capacity that I studied in my fieldwork. And, you know, while my understanding is that the labs that I studied kind of closed up shop or shelved their companies, I can neither confirm nor deny that they didn't hand over their data sets, that they didn't integrate their data sets with other people's. And it's really hard to prevent people from building models from voice data sets that do something slightly different, a little bit different, or very different from what the data set was originally gathered for.
So it's hard to say what the field is doing right now. There has been this cycle of kind of hype and then no one really producing great, statistically sound results, but somehow it feels like this NIH grant is somewhat of a turning point. And, I don't know if it's too much detail to go into, several other things are happening in the US, like legal cases and revelations that McDonald's is collecting voice biometric data through the drive-through kiosk, you know, stuff like that just makes my initial kind of naive anthropologist response of 'no, of course it doesn't work' feel like, well, I can't deny that it's working for some end, right? And that it exists. People are investing lots of time and money and, you know, investing voices, right, their own voices, in it. So it's enrolling people into systems of surveillance whether or not it quote-unquote works. So it's doing a kind of work in the world, and it warrants study as a result. Yeah, and if I could just add onto that a little bit: Shoshana Magnet's concept of biometric failure, I think, is really helpful, because she essentially says, you know, with these biometric technologies, the failure is not like, OK, that's it, it's done. Things are constantly failing by virtue of working. So they work by making this kind of misalignment, or doing that kind of decontextualizing work. That's how they work. That's what they're designed to do, is to decontextualize. So they are working, but from another perspective, they're failing.
James Parker (00:25:01) - Right. Should we do some of the case studies?
Beth Semel (00:25:09) - Sure.
James Parker (00:25:10) - I can't help but begin with your chapter on depression. And there's this...
Beth Semel (00:25:19) - What an opener.
James Parker (00:25:20) - Right. Well, it's not because of the depression bit. It's because of the scene that you describe, and I use the word scene deliberately. I mean, you should introduce it yourself, but it's of patients, or in one case you, I think, being inside an MRI and then being asked to recite or perform, you know, a script. And I couldn't help but have in my mind this idea of a kind of MRI theater, where the idea is to study the speaker's brain through the medium of their voice via the MRI. So there's this incredible performance dynamic you describe, of the researchers sort of conducting all the direction of the speaker inside the MRI, in order to solicit speech that would yield insight about their brain. It's just such an amazing scene, and I wondered if there's a way of telling us a little bit about, you know, vocal biomarkers through the medium of this case study. It's just so sort of amazing.
Beth Semel (00:26:43) - Thank you. Yeah, I mean, I'll try to do it justice. So, you know, biomarker, right? Some people, some startup people, say that's an incomplete metaphor. It's a metaphor; there isn't necessarily something concretely biological there. And a lot of vocal biomarker people are not actually looking at the brain. They're kind of doing the engineering cheat-sheet thing where they're like, "Well, it doesn't matter. We don't really need to see the thing that's causing it. It can be made knowable through our techniques without us having to, you know, go in there." But in this particular lab, they really were trying to do basic science work to say, "OK, well, what is happening at the brain level when people are producing speech, and when they are under the diagnostic category of depression or not?"
And so in order to do that, you can't just have people talking, speaking, right? You have to have them speak in a particular way, a particularly regimented way. And so there's this really wild parallel story about these standardized vocal tasks that are used in speech therapy, but also used in things like Parkinson's brain studies, ALS brain studies. The tasks are designed explicitly for English speakers, right? So the task can't necessarily travel globally. And the tasks are designed to make you use as many of the articulators as possible, to kind of maximize your articulatory action in one go, so that you can get as much brain data as possible. But they're very bizarre sounding, like Dadaist poems, I call them. Like, "Pah-tah-kah" is one of them.
There's this one passage called the Grandfather Passage that sort of lights up all of the articulators; you use the full range of phonetic features in English in order to say it out loud. But, you know, the catch is that it's not even enough to just say these highly stylized poems, right? You have to say them in a particular way, like a correct way. So there's just this kind of narrowing of precision that's really like a horizon point, because people are not standardized. Their vocal apparatus is not standardized. Even the way that the researchers wrote directions about how to say these tasks was a constant source of frustration: OK, this person isn't saying it right. And the setup of the way this was done in the fMRI machine is that there'd be the research subject and the machine in one room and the researchers in another room, and every now and then, because the subject's mic'd up, they would intercom them from the control room, as it's called, and would say, OK, this guy's not saying it loud enough. Or, like, you're supposed to say these vowel sounds at different pitches, but this one guy, he's like a manly man.
So he's only saying it low, because he doesn't, you know, want to compromise his masculinity by making this quote-unquote girly high-pitched sound. So again, things like gendered expectations about vocal performance kind of bleed into what's supposed to be this highly controlled setup and constantly destabilize it, right? And so I think it's another example, too, of the tension between, again, the precision that this whole thing is supposed to produce, right? OK, once we have the data that we need, we'll just be able to capture these signals from your voice without even touching you, with our special machine learning magic. But it's actually quite haphazard and really kind of full of weird noises, and noises both in the sense of non-language sound, but also like error, glitches, fuzziness, things that can't be captured because they don't fit quite neatly into the boxes that researchers are requiring.
James Parker (00:31:33) - Could you say a little bit more about the specific line of critique? I mean, I know you're just doing description at the moment, but there's a couple of different things going on in your writing about this. One is to draw attention to the obviously and overtly non-machinic in the production of the data set. So you talk a lot about all of the care work, and often feminized care work, that is co-opted into these scientific systems and then immediately excised out in the name of objectivity. So on one hand it's re-inscribing the human and the careful and the feminized and so on into the system.
So that's one kind of critique, and that's a care that involves a certain kind of listening, always, right? Because it's not just the machines that are listening, it's the quite highly tuned, careful listening on the part of the researchers. But then there's also a line of critique that says, well, what is the status of the data that's being produced, since the idea is that you can capture an authentic depressed voice, but the performance of depression is so highly stage-managed that it's hard to understand it as anything other than a performance of depression? And then, on that point, there's this kind of amazing moment in the thesis where you describe a system whereby, on ethical grounds, or financial and ethical grounds combined, the researchers routinely turn away people who are too depressed, because they don't actually have the facility and the risk is too high.
You know, they can't manage somebody who... you know, they're not doctors, they're not clinicians, right? So the subject that is producing the data is this kind of weird subject that's depressed enough to have met a DSM threshold, but not so depressed that they're actually in crisis, sort of Goldilocks depressed, and who is also heavily directed in their vocal performance. And I think you don't get to the point of saying this means that the data set is complete nonsense, but you could. So I was just wondering if you could tell us a little bit about how you think through those different dimensions of the critique, and I've probably missed some other dimensions of the critique that you draw out. Like, what do we make of this strange theater that is, you know, producing all of this data that then goes out into the world and has all of these strange and potentially harmful afterlives?
Beth Semel (00:34:48) - Yeah, I mean, that's a good question. I think it is important in and of itself to really emphasize that the kind of concreteness of depression as a thing is not just fabricated, but also fabricated in a somewhat arbitrary way, so that any kind of claim to the bio part of vocal biomarker should always, again, be met with skepticism. It's not a representational relation, it is a performative relation, right? They're being made to be in connection to each other, rather than it actually being, OK, this is what it is, this is depression. Then, you know, I think it's really important to emphasize that that's sort of how mental health care works in general. I mean, we can do this highly elaborate thing: put someone in an expensive machine, ask them to stay very still, ask them to speak in a precise way, get fancy, beautiful brain pictures and do fancy, high-powered stats on them. But at the end of the day, just like the experience of moving through the mental health care system, it involves a kind of reductiveness, a kind of treating the depressed person as if there is something inherently, stably, definitively pathological about them, even though that object is always kind of slippery.
And, you know, at the end of the day, too, its slipperiness... how much does it really matter, given the stakes of the situation? I mean, one thing that I don't think I really got into, something that was sort of haunting both the fieldwork that I did with these labs and also the dissertation, and something that I've really been wrestling with, is just the whiteness of all of this. Not just that the data sets were primarily comprised of white people, but also the kind of imagined user, the person who would benefit from this type of technology or benefit from this mode of intervention, or even would be there to receive it, is, I think, a kind of normatively white subject, right?
So thinking specifically about the context of the US, and maybe this is taking your question in a different direction, but, you know, who gets a kind of nice mental health care interaction, who begrudgingly seeks care or who has care kind of imposed on them, versus who doesn't even have a choice as to what kind of intervention they receive, right? So thinking about not just things like non-consensual police intervention, like when someone is in a crisis and they call a crisis line, or somebody calls a crisis unit to check in on them, but also the way that, you know, I mean, mental illness, or trauma and anger, psychosis, is a kind of very reasonable response to an inherently anti-Black world, right? It makes sense in many ways. So the mental health care system is really built in a way that is always kind of not naming race, or acknowledging race, or, in the case of vocal biomarker research, really trying to push race out of the picture. And this is really the hidden asterisk that people use all the time. They'll say, oh, vocal biomarkers, they're language agnostic, right? It doesn't matter.
James Parker - That is bananas, isn't it?
Beth Semel - That is bananas, but it aligns with a very kind of liberal democratic way of saying, this is a non-discriminatory form of medicine, right? And if we look at something like the pulse oximeter, a very banal healthcare technology, right? In an emergency medical context, a lot of medical decisions depend on the pulse oximeter. Lo and behold, it turns out the pulse oximeter is calibrated towards skin with less melanin, so it's calibrated to work best with lighter skin tones rather than darker skin tones. So the kind of white supremacy, if I can be so bold, of not just the mental health care system, but the medical system in the US, is just so much in the background. In many ways, it's like Anthony Ryan Hatch, a sociologist at Wesleyan, he has this distinction that he talks about between liberal science and liberatory science. Liberal science says, OK, this is a band-aid, we need a quick fix, we need a patch, but it doesn't really disrupt anything or radically alter the conditions of things; liberatory science does do that altering. And I think, yeah, in many ways vocal biomarker research is kind of doing the same normative work as the mental health care system. I mean, that was a very runaway response to your question, but I think, you know, again, getting back to the 'is it real, does it work' question: how does it work? What are the stakes of it working in that particular way? For whom is it working, right?
James Parker (00:40:59) - I mean, one thing that occurs to me is that there's like a double universalization going on. On the one hand, the voice is universal, the biological voice is universal, across languages, across... I mean, we haven't talked about disability, or sort of unstable voices. Obviously there are people trying to measure, you know, Parkinson's in the voice or whatever. But what do you do if you have Parkinson's and depression? How do these things intersect with each other? That's a huge problem. And then there's the universalism of the idea that there's something called depression that's located in the body, which you obviously mentioned.
And one of the things you do in the thesis, and maybe this is a good time to bring it out, is to decenter the sort of big data contemporary context, something we've done a little bit in some of our work on machine listening too: a lot of these imaginaries are in place before we get the massive, kind of brute-force machine learning that arrives, you know, precisely around the time that you're talking about. One of the things you do in the thesis is say, well, look, computational psychiatry kind of arrives before computational psychiatry, with a certain kind of empiricism of the body and the placing of the mental health diagnosis in the body.
I mean, I know that you've sort of gestured at it before, but it's quite an important move, right? Because it's to be able to say, yes, I'm doing work on this flashy new thing, but the flashy new thing is really largely this old thing, and it's, you know, white supremacy, and it's biologism and sort of hubristic science, or however you want to put it. I mean, I feel like science and technology studies is really good for that kind of move. But anyway, it came out strongly from your work. I don't know if you have any reflections on that aspect of it.
Beth Semel (00:43:18) - Yeah, the kind of, is this just, what is it, old wine in new bottles or something? I mean, I do think there is a particularity here to the fact that the object of analysis is both mental illness and the voice. And, you know, while on the one hand, in the work of mine that you've read, especially, you know, as I was writing the dissertation fresh out of fieldwork, it is just kind of a big giant extraction machine, a big 'let's continue to do this hegemonic way of doing mental health care' machine. On the other hand, I do think there's something interesting going on with the interactions that researchers are having with research subjects, which is, I think, a scale of observation that's really lost in a lot of the kind of top-down accounts of vocal biomarker research. So actually looking into, OK, what does this work involve? What are the kinds of relations that people are put into in doing this work? What kind of relations do they make with each other in producing these voice data sets? There is the example you talked about, the fMRI theater. There are also these, I think, really powerful instances across the three field sites I was working at. One was the neuroscience lab. Another one was on the top floor of a psychiatric hospital; they were trying to find vocal biomarkers of bipolar disorder.
Another one was looking at PTSD and depression, also in a kind of setting away from research subjects altogether, away from any kind of hospital. But, you know, research subjects, in order to make the data happen, have to talk, right? So researchers have to figure out ways to cajole them into speaking. And sometimes in those interactions, and in talking with each other, people are having these really cathartic moments, really actually transformative moments in their interactions with each other, where, you know, it's not capital-H Healing or capital-C Care that would happen in an actual, official mental health care context, but there's something: the person leaves that interaction, that encounter, transformed in some way. That doesn't, I think, fit neatly into, again, a hegemonic model of what cure looks or sounds like, right? And a lot of the time I would hear research subjects say, it actually just feels good to be listened to, which, you know, might not be much, and maybe speaks to how sad it is that people don't feel like they have someone available in their lives to just do that kind of being with them, right? But I don't think that can be discounted; something is happening there. And yeah, there are these moments of access that are produced within the big extraction, vocal-biomarker sausage-making machine that I'm currently very, very interested in.
James Parker (00:47:21) - But there's a real irony there, isn't there? Because it sounds like the examples that you're giving are examples where the subject has been listened to by a person. Whereas the reason that they're being listened to is precisely in order, for reasons of efficiency and insurance companies and la la la, to prevent them, or people like them, from having that kind of listening as care in the future. That comes through really strongly in the chapter on bipolar, because the endgame there, in contrast to the MRI one, is much more explicitly about producing a surveillance architecture. So it's about your cell phone: your cell phone will be listening to you constantly for, you know, the possibility of a flare-up or whatever, I don't know the language, in your bipolar disorder, and that will trigger some kind of system which will get you the care you need. In other words, wouldn't it be amazing to be listened to machinically almost all of the time, as a kind of substitute for occasional human listening?
And then in that chapter, you also talk about this idea of listening like a computer, and the way in which the annotation and marking up of the data set requires a certain kind of inattention. What people mean when they say listening like a computer is sort of not really listening. And so you've got this kind of weird dynamic where you're saying there's something real and therapeutic, in a certain kind of way, coming out of this process, but the whole aim of this process is to leverage a non-listening as a form of listening, an inattentive listening at scale, within a surveillance architecture that, by the way, is heavily corporatized via, like, APIs from Apple and Google and yada yada yada.
I found that chapter extremely rich. I'd love to talk about the annotation process. It sounds like some of the examples you're giving come from this, you know, listening to these people, the people providing the data, and you've been told not to listen to their stories. But you talk a lot about how, well, you can't help but listen to the stories. And there's a kind of irony there. Could you tell some of that story?
Beth Semel (00:50:19) - Yeah. Yeah, you know, I will say, I think it's in the chapter a little bit, and also kind of vaguely in the research article that came from that chapter, but that was a very hard piece of fieldwork for me to do. I mean, ironically, I became very depressed while doing that work, because most of what I did during the day was listen to these... The team had been doing this longitudinal study of bipolar disorder, and they kind of hooked this voice data gathering stuff onto that study. So research subjects would agree, for a six to twelve month period, to have this souped-up phone that the study provided, like a nice smartphone, which many of the subjects otherwise didn't have, or just one person in their family had a smartphone that they shared. So they got their own smartphone. But the smartphone had this app that was recording all of their phone conversations, including this conversation they would have once a week with a social worker on the research team. And so I was floating back and forth between sitting right behind the social worker while they were doing these phone calls, not listening to what the person on the call was saying, just listening to how the social worker kind of coaxed these answers out of them. And then I would literally walk down the hall and go sit down and begin annotating voice data that had been gathered before I got there, of these same phone calls. So listening to these chopped-up segments of the calls, but listening but also not listening, as you say, right? So trying to listen and assign a label to the sound of the person's voice, but trying not to have that label correspond at all with the content of what they were saying.
James Parker (00:52:25) - Could you say a little bit more about that annotation process and method? Because, I mean, there's that phrase from Foucault, the microphysics of power, and it just seems like that labeling process... What are the labels? Where do they come from? What's the process of you sitting there trying to assign a label? What happens to the labeled data afterwards? It just seems like there's a whole world of political decision-making going into this practice of labeling voice audio. I've never spoken to anybody before who's gone through that process. I know that lots of Amazon Turk workers and so on have to do this, but I've never spoken to anybody who has, especially not an ethnographer, especially not one who works on STS, critical voice studies and so on. So I would just love to know the incredibly finely grained detail of that process, and what you were able to learn and understand about it, because it just seems fascinating to me.
Beth Semel (00:53:38) - It's so funny that you ask that, because as part of my fieldwork, since, as the anthropologist, you are, you know, pressured into doing all the stuff that no one wants to do, what happened was they said: OK, we need to make a training video to train other people to do this annotation task. We're all going to collaboratively write a script, but Beth, we're going to have you read the script and be the narrative voice of the video. So there's this video of me actually walking through it step by step, as if I was telling another annotator how to do this. It made so much sense and was very clear to me then. Listening to it now, I was like, this is so bizarre, and I can't believe that at the time I was like, OK, yeah, I see how this makes sense within what the study is trying to produce. It makes sense that they would chop it up in this way, that there would be these, not strict, but fabricated criteria for when to exclude a segment from annotation and to say that, OK, this is fundamentally unannotatable, you couldn't understand the person.
That would be when, in the span of the segment, so the segments would be five to ten seconds long, the person was just saying "um" or "yes" or "no" and dead air. Or if they were laughing, or if they were coughing or sneezing, or if they were talking to someone else, or if they were talking on speakerphone and that distorted the audio quality, or, as was the case with one research subject, they had many pet birds and the birds made it hard to hear the person's voice, right? Or if they had their child sitting in their lap and you could hear the child's voice on the audio. So you would have to say, this has sensitive information, we have to remove it from the corpus. So, you know, already there, again like the MRI theater, there is this kind of annotation theater of, OK, let's dig into this thing, let's all agree upon these rules, which otherwise seem quite arbitrary. And the rules about laughing, the rules about crying, about what would be the tipping point of a segment being unannotatable or not, are ones that we kind of made up.
But, you know, I have to be a bit of a downer and say that I explained the setup of the experiment in great detail in the article in Science, Technology, & Human Values. I feel like to really explain it all would be very, very boring. Sort of why they're doing it the way that they're doing it has to do with these theories about the relationship between mood and emotion in studies of bipolar disorder, and why they chose the scales for the annotating tasks, like activation and valence, right? So you're supposed to rate the segment based on how the segment feels. Does it feel really negative? Does it feel really positive? Is the speech really, really energized, or is it lacking in energy, is it dull? Those come from the dimensional model of emotion. So there are all these theories jam-packed into this one task. And despite that apparatus, it was, again, very emotionally intense work. And also, I remember when I first started doing it, I was just stressed out. Like, how am I going to do this correctly? How do these other annotators, who are literally undergraduate students, seem to have more confidence than I do in doing this work of just ignoring the content and only trying to feel the feelings of this voice, right? And, you know, I got better at it the longer that I did it, because we all just kind of developed this tacit knowledge of, for instance, on the scale of one to nine, there's always the five, like five activation, five valence, which is the neutral speech. So we would just develop this collective sense of how neutral speech feels, which I don't know I could replicate or explain to you, because we'd be listening, and the annotators, we all annotated the same subjects multiple times over. And, you know, we were supposed to annotate on a person-to-person basis.
So instead of invoking some kind of generalized notion of what neutral speech sounds like, we had to figure out, OK, what is neutral for this one person who has a lot of pet birds? What does their neutral speech sound like? How does their five-five speech feel? And, you know, that involves listening to their segments over and over and over again before annotating them, really getting to know how this person sounds, in which, yeah, you just so happen, without meaning to, to learn their life story, right? This is where they work, they're having a fight with their sister and this is why, you know what I mean? It's a year's worth of phone calls with a social worker, right? But over time we really started to develop these very assured feelings of, like, yeah, this person has a lot of neutral segments; this person has a lot of segments quote-unquote good for depression, right? So they have a lot of segments that are low activation, low valence, meaning negatively charged speech, very flat, like, "I don't know, I just don't know how I feel." That would be: OK, that person has lots of depressive segments, great, we're really excited about those very clear data points, right? But then, you know, I don't think I could go back and do the same work that I did then, right?
And I think that's because I have these strong feelings about what this whole apparatus is supposed to be producing, but also because it was something that was collaboratively made with these specific engineering students, junior and senior computer science students, right? I'm working alongside them, the grown-up squeezed into a little desk with them.
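To make the annotation scheme Semel describes concrete, here is a minimal illustrative sketch in Python. It is hypothetical code, not the lab's: the rough segment lengths, the exclusion reasons, and the 1-to-9 activation and valence scales (with 5 as neutral) are taken from the conversation, while every name, field, and structure in the code is an assumption introduced only for illustration.

```python
# Hypothetical sketch of the annotation scheme described above; names and
# structure are illustrative assumptions, not the study's actual tooling.
from dataclasses import dataclass
from typing import Optional

# Reasons a 5-10 second segment might be excluded as "unannotatable"
EXCLUSION_REASONS = {
    "dead_air_or_filler",      # just "um", "yes", "no", or silence
    "laughing",
    "coughing_or_sneezing",
    "cross_talk",              # talking to someone else / speakerphone distortion
    "background_noise",        # e.g. pet birds obscuring the voice
    "sensitive_content",       # e.g. a child's voice audible on the recording
}

@dataclass
class Segment:
    subject_id: str
    start_s: float
    end_s: float               # segments were roughly five to ten seconds long

@dataclass
class Annotation:
    segment: Segment
    excluded_for: Optional[str] = None
    activation: Optional[int] = None   # 1 (dull) .. 9 (energized), 5 = neutral
    valence: Optional[int] = None      # 1 (negative) .. 9 (positive), 5 = neutral

def annotate(segment: Segment,
             excluded_for: Optional[str] = None,
             activation: Optional[int] = None,
             valence: Optional[int] = None) -> Annotation:
    """Record one annotator's judgement of one voice segment."""
    if excluded_for is not None:
        if excluded_for not in EXCLUSION_REASONS:
            raise ValueError(f"unknown exclusion reason: {excluded_for}")
        return Annotation(segment, excluded_for=excluded_for)
    if not (1 <= activation <= 9 and 1 <= valence <= 9):
        raise ValueError("activation and valence must be on the 1-9 scale")
    return Annotation(segment, activation=activation, valence=valence)

# A segment "good for depression" in the lab's shorthand: low activation, low valence.
example = annotate(Segment("subject_with_pet_birds", 120.0, 128.5),
                   activation=2, valence=2)
```

As Semel notes, annotators rated each subject against that subject's own baseline, so in practice the "neutral" five-five point was recalibrated person by person rather than against any general standard.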
James Parker - That is absolutely wild. I've got so many things I'd like to ask and sort of unpack a little bit, but I hadn't come across the phrase heteromation before, as a kind of antidote or alternative to automation. And it just seems like that's the perfect example. As far as I understand it, the term heteromation is trying to get at the non-automated in the production of automated systems, and everything you've described there is so profoundly social. Like, the kind of negotiations that you're doing with your colleagues over coffee on how to understand a thing, that's where so much of this is really happening. And, you know, in the designation of the people as undergrads, who bring a particular economic and sort of attitudinal context to it. Anyway, it's just amazing.
Beth Semel - But they would always say, like the PIs would always say: Beth, we're so glad that you're here, because we bet you're really good at this, because you're an anthropologist. We're engineers, what do we know about other people's feelings? But you talk to people for your job, so you're probably really good at noticing how they're feeling and stuff like that. Hence, they were like, you should be in charge here, we want to know what you think. Which I, oftentimes, very honestly, would respond to with: what? Are you guys serious?
James Parker - Right, because we don't know anything about this thing that we are doing and rolling out, you know, potentially at scale.
Beth Semel (01:02:15) - I mean, that site, just one more thing, that site in particular is really interesting because, you know, I really don't want to reveal the identities of these people, but someone who was working very closely in that context said to me, like, look, I'm doing this because I don't think it can be done. I don't necessarily want to challenge my PI and say that it can't be done, but because of my own life experiences, as someone who grew up... this person grew up outside of the US.
They had come to the US fairly recently, and they didn't grow up speaking English. And they said, I can't tell: I oftentimes think my supervisor is mad at me. I later learned that they're not, but I always think that they're mad at me. So what does that say about this kind of innate ability? So yeah, they said, I'll do it, kind of to show that it can't be done, in a somewhat passive-aggressive way that protected their position. So the idea that it can be done is sort of rooted, on some level, in this idea of the dimensional model of emotion.
James Parker - I'm sure that's got an incredible history; you could probably write a whole book about it. But could you give us two lines on where this comes from? It sounds like something that comes maybe out of a certain branch of psychology that's not necessarily computational. Like, is it something people accept? Is it some kind of crackpot fringe theory that just happens to have been taken up by this particular field? Could you just say a couple of lines about that, because it's doing so much work in this study.
Beth Semel - Yeah, oddly enough, my understanding is that in the kind of psych world, it actually is considered a more capacious model than some earlier models of emotion. So rather than there being a set list of emotions that people could possibly experience, like anger or sadness or fear, big subcategories that smaller categories of emotions fit into, the dimensional model of emotion says: OK, let's make this four-quadrant graph and plot valence, high valence at one end and low valence at the other, on, let's say, the vertical axis, and then on the horizontal axis we have low activation and high activation. And so rather than adding these labels, which are, you know, so non-universal, what if we instead just plot these emotions in this much more capacious two-dimensional space? Got it. And so it is, I think, seen as something that is more capacious and allows for more granularity and more kind of difference. And it's got numbers!
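As a quick illustration of the valence and activation idea Semel is describing, here is a small hypothetical Python sketch. The two axes come from the conversation; the specific emotion labels, coordinates, and function names are assumptions made up for the example, not values from her fieldwork or from any published model.

```python
# Illustrative sketch of the dimensional (valence/activation) model described
# above: each state is a point on a two-axis plane rather than a named category.
# All coordinates below are made-up assumptions for demonstration only.

# (valence, activation), each on an arbitrary -1.0 .. 1.0 scale
EXAMPLE_POINTS = {
    "excited":   ( 0.7,  0.8),   # positive valence, high activation
    "content":   ( 0.6, -0.4),   # positive valence, low activation
    "angry":     (-0.7,  0.8),   # negative valence, high activation
    "depressed": (-0.6, -0.7),   # negative valence, low activation
    "neutral":   ( 0.0,  0.0),
}

def quadrant(valence: float, activation: float) -> str:
    """Name the region of the valence/activation plane a point falls in."""
    if abs(valence) < 0.1 and abs(activation) < 0.1:
        return "roughly neutral"
    v = "positive-valence" if valence >= 0 else "negative-valence"
    a = "high-activation" if activation >= 0 else "low-activation"
    return f"{v} / {a}"

for label, (v, a) in EXAMPLE_POINTS.items():
    print(f"{label:>9}: valence={v:+.1f}, activation={a:+.1f} -> {quadrant(v, a)}")
```

The appeal Semel points to is visible here: the named labels drop out and only two numeric coordinates remain.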
James Parker - You know, numbers are very... But I mean, I'm serious, because we were looking recently at, for example, this Toronto Emotional Speech Set, where they have actors perform the phrase "Say the word...": say the word sheep, say the word fish. And then they've got these seven sort of core emotions, which obviously, like, who came up with those, whatever, but also they're literally being performed by actors, and, oh, but they've got musical training, it says in the paper, like, oh, OK, well, that's fine then. So that's a totally different model. That's a data set that's being used to train systems based on a sort of linguistically grounded idea of emotion, or a sort of socio-cultural idea of emotion, whereas this is... I mean, I'm sure it doesn't achieve anything particularly different, but it's trying to somehow circumvent that problem of, what do you mean by fear? It's like, oh, it's not fear, it's a four and a two in the kind of spatial mapping of, you know, la la la. So it's sort of, yeah.
Beth Semel - I think it's another way to corral things into this very kind of biological, thermodynamic model, right? So activation and valence are like energetic states, right, or charges. Valence is this very, I don't know, electrical engineering kind of word or something. It's more sciency: oh, OK, yeah, no, we're not talking about feelings, we're talking about energies, we're talking about physical things, right, that can be measured with math. So I think, on the one hand, that is doing this work of, OK, we're outside of the hand of language and culture, bringing it back down to the body. But again, you see that same gesture to the body as being someplace that's outside of history, outside of white supremacy, right? Outside of all these other kinds of differences, somehow it's a protected space.
James Parker (01:07:43) - Can I ask: you've got all these university undergrads doing this annotation. Is there any attention to the biographies of those listeners? On one level, this person you're mentioning wants to sort of prove it wrong. But if you get enough, for the sake of argument, twenty-year-old white boys who have a particular socio-economic background, you're probably going to get a fairly normative listening emerge that will look sort of stable. So the histories of the annotators are obviously crucial. Is there any attention to that at all? Or is this just an example where, if you produce anything, you'll reproduce and re-inscribe and then automate and roll out whatever the demographic of the listeners is that are enrolled to do the annotation?
Beth Semel - Yeah, I mean, I think in many ways that's what a lot of them said. Where they were like, I guess we're kind of just making, as one of them said, an American culture machine; this is a machine for making American culture. Which is amazing. I mean, those are, you know, the words of an engineer, not mine; I just happened to be in the room. And afterwards we would have loads of conversations about this thing and why it was so weird. But, you know, I think the point that you're making is one that I believe really bears repeating, in response to this real, I don't know, fortification of vocal biomarker stuff, and it's something that Nina Sun Eidsheim talks about in "The Race of Sound," right? The vocal biomarker stuff says the voice in and of itself can be meaningful, whereas Eidsheim is saying, no, voices are made meaningful through techniques of listening, right? And I actually think that that consideration, like you're saying, of who these annotators are and their biographies...
James Parker (01:10:14) - I think there is some importance in saying, hey, actually, look, that listening is doing something here that is not just identifying or calling out features that are inherent to the speech signal. Because that is what the vocal biomarker is promising, right? Precisely by papering over or downplaying or, I don't know, misplacing the role of actual listeners, as opposed to, you know, the mathematical modeling of waveforms, which is what machine listening is in this sense, quote unquote machine listening, right? The listening of the machine is fundamentally different from that of the actual human listeners, who are trying to approximate this idea of the detached way that listening is supposed to take place. But again, I don't know the best way to say it, but this thing produces its own critiques, right? So we see, even from within the context of making this stuff, that machine listening as an object is one that depends on human listening, and that even those categories are not neutral things with some inherent quality that makes them human or machine-like, right? In the same way that, as you were pointing out, isn't it ironic that the way people actually feel something reparative or transformative in these encounters is through, you know, a person listening to them talk, even though the reason they're there is basically to discourage people from listening to other people. Like, yes, precisely, the critique, the blueprints of its own undoing, is right there in how to build this thing.
Beth Semel - Exactly. Yeah. Exactly.
James Parker - So, you know, that's the part where maybe making that critique is a little bit annoying to the people who are making these things, and perhaps too much like "you played yourself." But it's all tantalizingly there, right? I'm conscious that we've had a lot of your time already. I'm wondering if it's worth talking about where you end up with this research. I don't mean, can you give us a list of five empirical conclusions or regulatory reforms or whatever, but sort of where you end up, or maybe it's easier to talk about what you're doing next and what seems like fertile territory to be thinking with now? I'm just wondering where you exit, or where you've begun to move towards after doing this research.
Beth Semel - Yeah, I mean, you know, I have a second project that is very exciting to me, but it's on the shelf for now as I focus on turning the dissertation into a manuscript. But just to give a preview: in doing this vocal biomarker research, I found this really less-than-savory branch in the history of the field that involves voice-based lie detection, voice stress analysis. It's oftentimes treated as occupying a very separate branch in the family history of this type of technology, but the second project is looking at how they're actually quite intertwined, through the history of one particular technology. The field of voice stress analysis, like vocal biomarker research, is one where so many people have talked about how spurious it is, and yet it keeps churning on. So there's this question of, well, if this stuff is just the same old extractive, harmful apparatus, then why is it still going? I think in the vocal biomarker case, it has a lot to do with the kind of liberal versus liberational science, right? People are genuinely trying to help. I don't want to discount that people really want to do something to help people. It's just that the grammar of that care is one that really rhymes with control, right? And I think this other piece, what I'm calling a carceral prehistory of algorithmic clinical listening, is the double whammy there: to say that there's something about this illusion of care and control that begs closer attention and greater scrutiny of the mental health care system in the US and its close relationship with carcerality and policing. So, I mean, I don't want to say that's where the first project leaves off.
With the first project, what I'm hoping is to really get people who are working in these systems to think critically about them, and, you know, to perhaps question whether or not this is the best use of funding, right? That's the horizon. But also, to come back to the band-aid approach: the reason why it's just a band-aid is because there are fundamental issues with the way that mental health care is done and arranged in this country. So in many ways, the fact that there's some hope in Amazon promising it can be your therapist or your nurse, well, yes, Amazon is gross and evil, but also, what can people do? People who are experiencing mental illness, people who don't want to go to the doctor and don't trust the mental health care system, again, for good reasons, right? So I think asking people to sit with that tension, and to say that it comes in part not just from the technical apparatus but also from the broader system it's being built to hook into, is what I'm hoping is the takeaway. The crucial difference, too, with the vocal biomarker stuff, which is maybe a whole other conversation, is that the technologies are really not diagnostic technologies. They're always being put out as screening technologies.
James Parker (01:18:06) - Same with COVID stuff.
Beth Semel (01:18:09) - Right. In the US, that's because if you are building a clinical decision support tool, you can actually bypass certain federal regulations. And so I'm thinking about the paramedical and the parabiological, this space at the margins of the mental health care system, this place where people are interfacing with the mental health care system, and how that encounter hooks them into all other kinds of systems in the US, like child protective services, the welfare state, the carceral system.
James Parker (01:18:43) - Insurance.
Beth Semel - Insurance, yeah, exactly. I think those are the legal, criminal legal system, like, you know, the fact that you, you know, you can, these things can be used against you in the court of law like years down the line, right? So they're in many ways like evidence producing technology. So I think kind of asking people to see how, to look beyond again, the actual technology itself, itself and see all the kinds of systems that the technology kind of draws together and will draw people through is I think the kind of at least for now the kind of you know bigger takeaway of the of the book. I mean for what it's worth um I thought that stuff came through really strongly in the finely grained sort of description like Giertzian sort of thick description in the thesis like it's just impossible to read your account of these things and not and feel like the main event here is computation. I mean, I'm not trying to say that there's nothing going on, but like, yeah, it's just like the way it's described just draws the reader draws the reader into the embroilment of these systems with everything that's ahead of the game on this stuff.
James Parker - A lot of the people we've been speaking to have just finished a PhD or are at that sort of stage, and then there are a lot of more established figures turning their minds to this as well. But it seems like there are always more people out there that we haven't spoken to or don't know about. So if you've got any ideas about who people should read or listen to, that would be great.
Beth Semel - Yeah, so Edward Kang is a PhD student at USC in the Annenberg School of Communication, and he has a really great article about voice and bodies and race, specifically contemporary voiceprint technologies, where he's critically reading through patents. Nina Sun Eidsheim is a sound studies person, but I think a lot of what she has to say about the racialization of timbre is relevant here, and there are some dimensions of her book where there is a kind of machine object in question, right? But really, the people that I like to read and bring into this conversation are increasingly sound studies people, like Dylan Robinson, who does work on Indigenous sound studies, thinking about alternative paradigms of relationality that happen through listening. I'm trying to think of others. So really, again, the computational is kind of a receding object in the way that I think and in who I want to be in conversation with. You might have already talked to Xiaochang Li. Her work is super helpful. Likewise Mara Mills, who I'm sure you've also talked to, right? Those are, you know, my people. And there are linguistic anthropologists too, I don't know if that's the jam of this interview series, but...
James Parker - Well, it's not not the jam. And you know, that's already so helpful. And I kind of put you on the spot as well. Look, that was an absolute pleasure talking with you. Thanks so much.
Beth Semel - Yeah, thank you. It's been great. And I appreciate you having done such a close reading of this material, which I now need to go back and revisit. So it's been a good call, a good inspiration to revisit. Thank you.