A review of the Bitter Aloe Project, a machine learning approach to studying South Africa’s Truth and Reconciliation Commission, led by Stephen Davis, William Mattingly, Robert Vaughan, and Jamari Turner
Bitter Aloe Project
Nabeel Siddiqui, Susquehanna University
Stephen Davis, William Mattingly, Robert Vaughan, Jamari Turner
The Bitter Aloe Project applies advanced machine learning (ML) methods to records produced by South Africa’s Truth and Reconciliation Commission. Our project is predicated on the claim that machine learning can generate new forms of legibility in truth commission archives. Structured data extracted via machine learning grants diverse audiences the ability to read across thousands of individual testimonies. This cross-cutting analysis enables the discovery of patterns and the retelling of stories that might otherwise be obscured by the horizons of narrated personal experience.
Our work is situated in the ML subfield of natural language processing (NLP). Our primary methods are entity recognition, document embedding, and span identification. We use these methods to extract structured data that can then be further processed, searched, and read through data visualizations. To date we have extracted data from two distinct sources: (1) a list of 21,500 human rights violation (HRV) incident descriptions and (2) transcripts of hearing testimony provided by victims and perpetrators. Both sources were publicly available but had not been cleaned or structured to any significant degree. Our team completed that work to produce novel datasets. We coordinated our efforts with a variety of stakeholders within South Africa, such as the South African History Archive.
Over 20 collaborators have worked on the project. The principal investigators of the project are Stephen Davis, author of The ANC’s War Against Apartheid: Umkhonto we Sizwe and the Liberation of South Africa (Indiana University Press, 2018), and William Mattingly, author of Introduction to Python for Humanists (Chapman and Hall, 2023). Jennifer Hootman is responsible for data management, Jamari Turner produces content for social media, and Robert Vaughan created our ArcGIS map. We have also benefited from consultations with Stanley Sello, a specialist in data management for human rights archives based in Cape Town, South Africa.
Like many emergent machine learning projects, ours aims to deploy computational means to extend humanistic understanding of political violence through the production of structured data derived from a large corpus of qualitative description. We serve two audiences in this regard: the public, broadly conceived, and humanities and social sciences researchers. Although machine learning applications are becoming increasingly user friendly, a key part of our mission is to deliver datasets to the public in a relatively accessible format via intuitive search functions and data visualizations. In addition, we developed novel datasets for use by programming-literate researchers. Bitter Aloe data outputs represent the first large corpus of contemporary truth commission testimonies prepared for model training and other computational methods.
Bitter Aloe received funding from the Harry Frank Guggenheim Foundation. We are currently working on a National Endowment for the Humanities grant proposal aimed at expanding our audience through workshops. Our tools are presently used by researchers and the public, in undergraduate and faculty digital humanities courses, and by the South African government’s Missing Persons Task Team.
In 1996, South Africa’s Truth and Reconciliation Commission (TRC) initiated a process for victims of apartheid-era human rights abuses to narrate their stories. The gathered testimonies deliver chilling narratives of brutality, justice, and resilience in the form of extensive physical records and born-digital content (materials created only in digital form) that document approximately 21,500 instances of human rights violations between 1960 and 1995. Access to this wealth of materials has been hindered by logistical and infrastructural difficulties, restricting their broad utilization and recognition.
The Bitter Aloe Project, led by Stephen Davis and William Mattingly, aims to address these challenges by employing contemporary digital humanities tools to analyze these materials and provide a user-friendly interface for navigating the wealth of information they contain. By employing natural language processing and machine learning, the team not only exposes the complexities and challenges inherent in analyzing postcolonial linguistic materials through computational methodologies but also underscores the urgency and importance of undertaking such analyses.
While there are numerous components to the project, three arguably serve as its core. The first is a dashboard, created in ArcGIS, that provides a constellation of maps and filters to aid users in deciphering and navigating the voluminous incident data and descriptions. Users can sift through individual data points, delve into larger clusters, and explore regions with high incident density, conveniently weighted by population parameters. One of the dashboard’s most compelling features is its ability to spotlight geographical “hotspots” of human rights abuses committed during the apartheid era. By doing so, it presents a geospatial narrative that amplifies the profound impact these incidents had on specific communities.
The second component is a co-occurrence network graph of entities that appear together in incident descriptions. As with the map, users can filter different nodes to explore the data. This component allows users to visualize the intricate web of actors, locations, and events that intertwine the numerous accounts of human rights violations. As the authors note on the site, the “co-occurrence of two entities [in the incident descriptions] should not be interpreted as indicating membership in or complicity with any organization or act, or as an indication of the guilt or innocence of any particular party.” Yet the current default visualization includes arrows indicating directionality. What this directionality entails is unclear and can invite misinterpretation, especially since calculations for in- and out-degree (measures that take directionality into account) are included in the explorable data table itself.
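The underlying data structure is simple to illustrate. The project’s own pipeline is not published in the review, but undirected co-occurrence counting typically looks like the following sketch (the entity names below are hypothetical, chosen for illustration only); note that sorting each pair makes the edge list inherently undirected, which matches the authors’ caveat about not inferring directional relationships:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(entity_lists):
    """Count undirected entity co-occurrences across incident descriptions.

    Each element of entity_lists is the set of entities recognized in one
    incident description. Sorting each pair means (a, b) and (b, a) map to
    the same edge, so the resulting graph carries no directionality.
    """
    counts = Counter()
    for entities in entity_lists:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Hypothetical incidents, each reduced to its recognized entities.
incidents = [
    {"Cape Town", "SAP"},
    {"ANC", "Cape Town", "SAP"},
    {"ANC", "Soweto"},
]
edges = cooccurrence_counts(incidents)
# edges[("Cape Town", "SAP")] == 2: the pair appears in two incidents.
```

An edge weight here records only how often two entities were mentioned in the same description, which is why a directed rendering (and in/out-degree statistics) would have no natural interpretation for this data.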
The final and most ambitious of the three components is a text analysis app, created with Streamlit (a Python library for building web applications), that investigates how machine learning and natural language processing techniques can reveal deeper narratives within the testimonies themselves. At the time of review, this component was in the “alpha” stage but draws on the authors’ work with named entity recognition in the TRC archives. In postcolonial archives, named entity recognition often fails to account for entity ambiguity and code-switching. By using a domain-specific training set, the authors mitigated these limitations. In the app, users can explore different entities and their related narratives through interactive selections. The team has also begun to use sentence embeddings to track thematic shifts throughout the testimonies. This gives researchers the ability to search by semantic meaning rather than by keyword alone, enabling a more nuanced and contextual understanding of the testimonies.
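Semantic search of this kind rests on comparing embedding vectors rather than matching tokens. A minimal sketch of the ranking step, assuming precomputed embeddings (the three-dimensional vectors below are toy values invented for illustration; real sentence embeddings come from a trained model and have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_search(query_vec, doc_vecs, top_k=1):
    """Return indices of the top_k documents closest to the query embedding."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical embeddings for three testimony passages.
docs = [[1.0, 0.1, 0.0],   # thematically close to the query
        [0.0, 1.0, 0.2],   # unrelated theme
        [0.9, 0.2, 0.1]]   # also close to the query
query = [1.0, 0.0, 0.0]
top = semantic_search(query, docs, top_k=2)  # [0, 2]
```

Because similarity is computed in the embedding space, a query can surface passages that share a theme with it even when they share no vocabulary, which is precisely what keyword search cannot do.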
Choropleth maps, used in the project’s TRC v7 dashboard, appear to be straightforward ways to visualize geographic data, and as a result they are ubiquitous in digital humanities and spatial humanities mapping projects. However, choropleth maps have known limitations and can be misleading due to the Modifiable Areal Unit Problem (MAUP): measurements are heavily influenced by the choice of geographic units over which data are aggregated. When the event being measured is directly tied to an enumeration unit, such as a municipality, results may still be informative; it is unclear, however, whether the human rights violations documented by the TRC fit this categorization. A useful next step might be to implement discrete global grid systems, such as S2 or H3, to mitigate these issues, or at least to acknowledge them in the project’s narrative.
These issues notwithstanding, the project is an ambitious undertaking that strikes a careful balance between historical research and technological innovation. The Bitter Aloe Project serves as a vital tool for understanding and accessing human rights violation archives while pioneering a new frontier in the realm of digital humanities. The project is a testament to the transformative potential of integrating humanistic studies with computational analysis. Moreover, the project makes apparent the complexities and challenges in digitizing postcolonial linguistic materials. It moves beyond merely transposing physical records into digital formats toward leveraging digital tools to surface hidden relationships, patterns, and narratives within archival content. By shedding light on the rich narratives and stories contained within testimonies of human rights violations, it advances our collective understanding of the past while granting a voice to those who might otherwise remain silenced.