Adversarial Audio

James Parker

Adversarial audio belongs to the larger category of adversarial machine learning, and of ‘attack algorhythms’ more broadly. The term ‘adversarial’ was first used in this way in the early 2000s to describe the dynamic interplay between spam filters and spambots as they processed and evaluated ‘good words’ and ‘bad words’.

The adversaries here are computers, or machines, in a contest where one device tries to cause misclassification in the machine learning model of the other.
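The logic of such a contest can be sketched in a few lines of code. The example below is purely illustrative: a linear scorer stands in for a speech classifier, the weights, signal, and step size are invented for the demonstration, and the perturbation follows the well-known ‘fast gradient sign’ pattern from the adversarial machine learning literature rather than any specific deployed attack.

```python
import numpy as np

# Toy white-box setting: the attacker knows the 'listener' model's weights.
# A linear scorer stands in for a speech classifier; all values here are
# illustrative, not drawn from a real recognition system.
w = np.array([0.5, -1.2, 0.8, 0.3, -0.7, 1.0, -0.4, 0.9])

def classify(x):
    """Score > 0 means command 'A'; otherwise command 'B'."""
    return "A" if w @ x > 0 else "B"

# A signal the model currently hears as command 'B'.
x = -0.05 * w                      # score: -0.05 * ||w||^2 = -0.244

# Fast-gradient-sign-style perturbation: move every sample a small step
# eps in the direction that raises the target score. For a linear model,
# the gradient of the score with respect to the input is just w.
eps = 0.1
x_adv = x + eps * np.sign(w)       # score rises by eps * sum(|w|) = 0.58

print(classify(x), "->", classify(x_adv))  # B -> A: misclassification induced
```

The perturbation is small and bounded (each sample moves by at most `eps`), yet it flips the model’s output: the essence of an adversarial example.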

Adversarial audio is contingent on machine listening. It requires audio-enabled devices using automated speech/sound recognition systems and machine learning models.

The proliferation of smart speakers and other voice user interfaces in homes, cities, and other spaces produces a context in which adversarial audio ‘attacks’ can also proliferate.

A typical implementation of adversarial audio is the embedding of a hidden speech command within audio, legible to a ‘listening’ device but largely imperceptible to human listeners. This is done by exploiting a psychoacoustic model of human hearing: the injected speech signal is shaped so that it stays below the masking threshold of the surrounding audio, which conceals it from the ear.
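A minimal sketch of this masking logic might look as follows. Real attacks use a full psychoacoustic model (of the kind built into MPEG audio coding) to compute masking thresholds; here, a simple per-frequency-bin energy ceiling stands in for that model, and the carrier, payload, and 20 dB margin are all invented for the illustration.

```python
import numpy as np

# Crude stand-in for psychoacoustic hiding: the payload may carry energy
# in each frequency bin only up to margin_db below the carrier's energy
# in that bin. A real attack would use a proper hearing model instead of
# this flat per-bin threshold.
def hide(carrier, payload, margin_db=20.0):
    C = np.fft.rfft(carrier)
    P = np.fft.rfft(payload)
    ceiling = np.abs(C) * 10 ** (-margin_db / 20)   # per-bin masking ceiling
    mag = np.minimum(np.abs(P), ceiling)            # clamp payload magnitude
    shaped = mag * np.exp(1j * np.angle(P))         # keep payload phase
    return carrier + np.fft.irfft(shaped, n=len(carrier))

sr = 16_000
rng = np.random.default_rng(1)
carrier = rng.standard_normal(sr)                   # broadband 'music' stand-in
t = np.arange(sr) / sr
payload = np.sin(2 * np.pi * 3000 * t)              # stand-in for a hidden command

mixed = hide(carrier, payload)
residual = mixed - carrier
# By Parseval's theorem, the per-bin 20 dB ceiling bounds the hidden
# signal's total energy at 1% of the carrier's: quiet to a casual human
# listener, while a machine can still be tuned to pick it out.
```

The design point is that the hidden signal can only live where the carrier already has energy: masking borrows the carrier’s spectrum as cover, which is why adversarial commands are typically smuggled inside music or speech rather than silence.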

A good overview of adversarial audio can be found online. The scientists behind it claim that ‘it is possible to hide any transcription in any audio file with a success rate of nearly 100%’.

Adversarial audio has an aesthetic or conceptual prehistory in other forms of audio encryption, including backmasking, famously deployed by heavy metal bands to insert ‘satanic’ messages in songs.

The examples of adversarial audio given by computer scientists in studies shared online have given rise to what seems like an unintentional aesthetic or artistic genre, in which nonsensical statements are juxtaposed with one another, malicious commands are imagined (DEACTIVATE SECURITY CAMERA AND UNLOCK FRONT DOOR), and unusual sonic assemblages are composed (a scientist repeating the word ‘Alexa’ over and over against a backdrop of electronic Bach music).