How to read a dataset (notes towards a Live Action Role Play)
In this session, we will attempt to read a dataset together. The dataset in question - the PTC-SemEval20 corpus - is one of many compiled for the purposes of automatic propaganda detection. It comprises hundreds of news articles, all annotated against fourteen alleged propaganda techniques.
- How should we read a text like this?
- What can we learn about computer science or big tech by spending time with a dataset in this way?
- How might our collective reading practice differ from the forms of highly motivated dataset auditing currently practiced by industry?
- In what institution or para-institution does such a reading practice belong?
The session is part of our preparation for a Live Action Role Play we are hoping to run in late 2026 / early 2027, in which a group of us will design and then conduct a simulated dataset audit together. So the question of reading is also a question of role and positionality.
- Who reads datasets anyway?
- Who else needs to?
Who are we?
Machine Listening is a platform for collaborative research and artistic experimentation, founded in 2020 by Sean Dockray, James Parker, and Joel Stern. We work across writing, installation, performance, software, curation, pedagogy, and radio. A lot of this work has involved thinking with and about datasets, including various experiments in dataset critique. Most recently:
55 Falls / Ambient Assisted Living (2025)
For this project, we are teaming up with Connal Parsley, leader of the Future of Good Decisions project, to develop this critical practice via Live Action Role Play (LARP). Connal is a critical legal scholar working across legal and political theory and visual culture. He is developing LARP as a research method for discovering new approaches to decision-making involving algorithmic technologies. His current LARPs focus on shifting conceptions of responsibility in decision-making, participatory system design and evaluation, and the collective reinvention of value concepts.
What is a dataset?
A dataset is never just a collection of files. It is:
- primary data (labels, recordings, measurements)
- metadata describing how/when it was gathered
- the code that processes it
- the papers that cite it
- the spreadsheets that organise it
- the communities who interpret and repurpose it.
What is a dataset audit?
Every dataset undergoes some kind of auditing process. Sometimes this is more technically oriented (‘cleaning’, ‘augmentation’), sometimes more political (‘debiasing’, ‘bias mitigation’). But most of it is done in-house, by the computer scientists and engineers involved in producing the dataset, and with little regulatory oversight or public scrutiny. As a result, dataset audits tend to be self-serving, as the high-profile firings of various whistleblowers and internal critics attest (e.g. Timnit Gebru).
Computer scientists and engineers
Regulators
What is dataset critique?
We are joining a tradition of artists and technology critics interested in diversifying and expanding these techniques, and especially in widening who gets to practice them, as a form of counter-auditing. We are interested in developing more critical and inclusive forms of dataset auditing, but we do not presume to know in advance what they might be.
Artists and tech critics
- Kate Crawford and Trevor Paglen, Excavating AI
- Adam Harvey, Exposing AI
- Anna Ridler, Myriad Tulips
- Everest Pipkin, Lacework
What is LARP?
When you think of LARP, historical re-enactments might spring to mind. But in a way, LARP is the opposite.
- It takes place in a fictional constructed setting where people interact freely in character.
- There are no lines to recite, no script, no ‘original’ to reproduce (or subvert).
- But there might be generative game conditions, limitations, and objectives for the players to navigate.
LARP is becoming more widely used in academic research, especially as a method for exploring the social and political dimensions of new technologies – existing or latent. It can be particularly useful in highly constrained contexts, where there is little space to examine alternative futures. We are turning to LARP for dataset critique because it allows people without specialist knowledge to learn about machine learning systems in a practical way, and to bring their situated concerns, knowledge, and points of view to bear. At its best, it might allow us to collectively reimagine the auditing process and the kinds of perspectives it includes.
Why propaganda detection?
We don’t need to explain the technopolitical context to you. The (dis)information society. Culture wars. Polarisation. Social media. Trump. Authoritarianism. Gaza.
‘Propaganda detection’ is one response by data scientists to a version of this political problem. We think it’s interesting that they chose to call it that, even though - as you’ll see - it’s clearly bound up with other similar practices that go by other names. Fake news. Sentiment analysis. Bias detection. Fact checking. And of course their account of propaganda is maybe... odd or unfamiliar.
To be clear: automated ‘propaganda detection’ in this form is already a thing. Tanbih, for instance, is a collaboration between the Qatar Computing Research Institute (HBKU), Qatar University, MIT, Northwestern, Sofia University, and two data analytics companies, which aims to ‘make explicit media stance, bias, and propaganda, thus limiting the effect of fake news.’
Why the PTC-SemEval20 corpus?
We chose this particular propaganda dataset because it’s small, text-based, and (therefore) easily accessible to a group like this. These factors also mean it was relatively easy to build a tool to access and analyse it.
Although we don’t necessarily think it’s the best propaganda dataset, or that propaganda datasets are the best datasets to audit, the PTC-SemEval20 corpus is also a ‘classic’ dataset in lots of ways. It was part of a competition held in 2020 in which 250 teams from universities and industry competed to build the best models.
This is absolutely classic. It’s how data scientists define problems and build infrastructure together.
Proceedings of SemEval 2020 Task 11
What are we going to do today?
We’re not here to tell you how to critique a dataset. We’re here to find out. We want to create a space for people to think for themselves about how to do it. So we’ve devised three ways of approaching the dataset, which we’d like to have a go at together with the following questions in mind.
- What tools do you need in order to read this dataset?
- What do you need to know in order to read this dataset?
- Who needs to be involved in reading this dataset?
- Why read this dataset?
For each approach, please enter your thoughts/answers/comments in this chart. We will use them to help us design our LARP.
Approach 1
- Download the dataset (1.1MB) and skim-read the accompanying paper by Da San Martino et al. (a minimal loading sketch follows this list).
- Closely read parts 1 and 2 in small groups.
- You may also like to look at the introductions to some of these papers for comparison.
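If you want to poke at the downloaded files programmatically rather than in a text editor, here is a minimal Python sketch. It assumes the release follows the layout described in the SemEval-2020 Task 11 materials: one plain-text file per article, and tab-separated label files giving an article id, a technique name, and the character offsets of each annotated span. The folder and file names below are placeholders; adjust them to whatever the download actually contains.

```python
# Minimal sketch of reading the corpus, assuming one .txt file per article
# and tab-separated label rows: article_id, technique, span_start, span_end.
# Folder and file names are placeholders; check the actual download.
from pathlib import Path

ARTICLES_DIR = Path("train-articles")            # hypothetical folder name
LABELS_FILE = Path("train-labels-task2.labels")  # hypothetical file name

# Load article texts keyed by the numeric id in the filename (e.g. article123456.txt)
articles = {
    p.stem.replace("article", ""): p.read_text(encoding="utf-8")
    for p in ARTICLES_DIR.glob("article*.txt")
}

# Load span annotations: each row points to a character range within one article
spans = []
for line in LABELS_FILE.read_text(encoding="utf-8").splitlines():
    article_id, technique, start, end = line.split("\t")
    spans.append((article_id, technique, int(start), int(end)))

# Print the first few annotated spans alongside the text they label
for article_id, technique, start, end in spans[:10]:
    text = articles.get(article_id, "")
    print(f"{article_id}\t{technique}\t{text[start:end]!r}")
```

Even this crude view makes one thing visible quickly: the annotations are character offsets into the articles, so the ‘propaganda’ only exists as a relation between a label file and a text file, which is part of what makes the dataset worth reading slowly.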
Approach 2
- Use this interface to explore the dataset.
- Select an article and use the toggle to clear the annotations and then try to annotate it yourself.
- You could also try asking e.g. ChatGPT or Claude to do it instead (see the sketch after this list). Or try applying the same techniques to a text outside the dataset.
- Use the toggle to compare with the annotations in the dataset.
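For those who prefer to do the ‘ask a chatbot’ step programmatically, here is a rough sketch using the openai Python client. The model name, file name, prompt wording, and abbreviated technique list are all our assumptions rather than part of the dataset or the session materials; pasting an article into a chat window works just as well.

```python
# Rough sketch of asking an LLM to annotate an article for propaganda techniques.
# Model name, file name, and prompt are placeholders, not session materials.
from openai import OpenAI

client = OpenAI()  # expects an OPENAI_API_KEY environment variable

# Abbreviated list of techniques; take the full fourteen from the paper.
TECHNIQUES = "Loaded Language, Name Calling/Labeling, Repetition, Doubt, Flag-Waving, ..."

# Placeholder file: any article without annotations will do.
article_text = open("article_without_annotations.txt", encoding="utf-8").read()

prompt = (
    "Identify spans of propaganda in the article below. For each span, quote "
    f"the text and name one technique from this list: {TECHNIQUES}.\n\n{article_text}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute whatever you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Comparing the model’s guesses with your own annotations and with the dataset’s is part of the point, not a shortcut past it.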
Here are some example articles without annotations in case you prefer a different interface.
Example articles (1)
Approach 3
Thinking forward to our future LARP or a ‘real life’ dataset audit:
- What kinds of people would you want in the room?
- What 10 characters would you include in a scenario for a dataset LARP?
- What kinds of outcomes would you like to imagine resulting from a dataset audit? How might these outcomes be related to the kinds of people in the room?
Plenary
Collectively review the responses to the 4 questions in the spreadsheet.