A review of BERT for Humanists, an online resource helping humanists use natural language processing, directed by Matt Wilkens, David Mimno, and Melanie Walsh
BERT for Humanists
Lidia Bocanegra Barbecho, Universidad de Granada
BERT for Humanists is a set of workshops, tutorials, and online resources that introduces an important new style of natural language processing (NLP) tool to researchers in the humanities — large pre-trained language models. We focus on BERT (Bidirectional Encoder Representations from Transformers) as a particularly successful variant of these models that has become very popular in the field of NLP. BERT and similar models can be used to represent words and documents in context, where previous tools like word vector models (e.g., word2vec) provided only a single representation for a word regardless of its use. Compared to prior methods, BERT offers humanities scholars more nuance when exploring datasets and also more accuracy when automatically classifying texts into categories.
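The contrast with static word vectors can be made concrete in a few lines. The following is a minimal sketch using the HuggingFace Transformers library with the `bert-base-uncased` checkpoint (the model choice and example sentences are illustrative, not drawn from the project's own tutorials): the same word, "bank," receives a different vector in each sentence, which a word2vec-style model cannot do.

```python
# Sketch: contextual embeddings with BERT, in contrast to a single
# static vector per word. Assumes the HuggingFace `transformers`
# library and PyTorch are installed; sentences are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    """Return BERT's contextual vector for `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

# The same word gets a different vector in each context.
river = embed_word("they sat on the bank of the river", "bank")
money = embed_word("she deposited money at the bank", "bank")
similarity = torch.cosine_similarity(river, money, dim=0).item()
print(f"cosine similarity across contexts: {similarity:.2f}")
```

Because the two vectors differ, downstream analyses can distinguish senses of a word by its use, which is the nuance the project highlights for humanities datasets.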
This project arose out of the divide we observed between current NLP techniques and those commonly used in the digital humanities. While BERT offers a lot of potential, it poses challenges for humanities scholars. Its architecture is complicated and often unfamiliar to researchers outside of NLP. The paradigm of selecting a large pre-trained model and then fine-tuning it on a target dataset is a new workflow, distinct from those more familiar to humanities scholars. Because of their size, BERT and similar models also frequently require specialized hardware like graphics processing units (GPUs), which are often not included in a basic laptop.
Our tutorials and materials provide technical training and mathematical intuition for using BERT, such as how to fine-tune a model using the open-source HuggingFace Transformers library or how to alter parameters that influence the model. They also include demonstrations of BERT applied to humanities datasets and common humanities research tasks, such as using BERT to explore associations between words in a dataset of Spanish poetry. In workshops and focus group sessions, we have talked with researchers active in digital humanities to learn about the challenges and opportunities they face when applying large language models like BERT to their datasets and interests.
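The pretrain-then-fine-tune workflow described above can be sketched in miniature. This is not the project's own tutorial code; it is a minimal illustration using HuggingFace Transformers and PyTorch, where the two-sentence "dataset," the labels, and the learning rate are all placeholder assumptions standing in for a real labeled humanities corpus.

```python
# Sketch: the pretrain-then-fine-tune workflow with HuggingFace.
# A classification head is attached to pre-trained BERT, then the
# whole model is updated on labeled examples. Dataset, labels, and
# hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g., two genres or sentiments
)

# A tiny stand-in for a labeled humanities dataset.
texts = ["a sonnet of love and loss", "an inventory of shipped goods"]
labels = torch.tensor([0, 1])

# One gradient step: encode, compute loss against labels, update weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # common choice
batch = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"training loss after one step: {outputs.loss.item():.3f}")
```

In practice one would loop this step over many batches and epochs, ideally on a GPU, which is exactly the specialized-equipment hurdle the project statement notes.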
The BERT for Humanists team includes Melanie Walsh, a professor in the Information School at the University of Washington; David Mimno, a professor in Information Science at Cornell University; Maria Antoniak, a PhD candidate in Information Science at Cornell University; and Matthew Wilkens, a professor in Information Science at Cornell University. Our team includes scholars with backgrounds in NLP, data science, U.S. literature, and literary reception, which has helped us to design materials and host conversations across these and related disciplines.
Our audience for this project includes researchers from the humanities who have some coding and machine learning background but who have limited experience with large language models. Because these models are new and often require specialized equipment, they can pose challenges to new learners. Our goal has been to increase the accessibility of these tools for researchers in the humanities. However, as much as humanists benefit from learning about NLP tools, NLP can also benefit from the perspectives of humanists, particularly from the more common humanistic focus on context, data provenance, and historical specificity.
The project is funded by an NEH ODH Level 1 Digital Humanities Advancement Grant (HAA-277275-21).
Lidia Bocanegra Barbecho
BERT is a state-of-the-art NLP method trained on a very large dataset of texts: some 3.3 billion words drawn from a large corpus of English books and from English Wikipedia. It is precisely this extensive textual training, together with its transformer-based neural architecture, that makes the model a powerful and useful tool for text analysis. Initiatives such as the BERT for Humanists project are needed to facilitate the transfer of knowledge on this topic and to help improve the model through peer-to-peer information exchange and continued use.
The BERT for Humanists project is an initiative that facilitates the transfer of knowledge about large language models (LLMs) to humanities researchers, particularly the BERT (Bidirectional Encoder Representations from Transformers) model. Its audience is humanists with basic knowledge of programming and machine learning but little experience with large language models. The project is convincing in its examination of the benefits of using BERT. Browsing through the sections of the blog or the workshops offers explanations of the extent to which the model can be used in certain analyses. For example, the model can identify whether a sentence expresses a positive or negative sentiment, predict which sentence should follow in a paragraph, and disambiguate words that carry multiple senses.
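The first of these capabilities is available off the shelf. As a hedged illustration (not code from the project itself), HuggingFace pipelines expose sentiment classification and masked-word prediction in a few lines; the default sentiment model and the example sentences are assumptions chosen for demonstration.

```python
# Sketch: two BERT-style capabilities mentioned above, via HuggingFace
# pipelines. Default model choices and sentences are illustrative.
from transformers import pipeline

# Sentiment: does a sentence express a positive or negative feeling?
sentiment = pipeline("sentiment-analysis")
result = sentiment("The archive was a joy to explore.")[0]
print(result["label"], round(result["score"], 2))

# Masked-word prediction: which words best fit this context? This is
# the pre-training task that lets BERT model words by their contexts.
fill = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill("The poet wrote a [MASK] about the sea.")[:3]:
    print(guess["token_str"], round(guess["score"], 2))
```

Each pipeline returns labels with confidence scores, which is what makes the model usable for the kinds of exploratory analyses the review describes.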
One-way knowledge transfer is one of the main strengths I highlight in the web-based training model used. The BERT for Humanists project offers code tutorials and workshops for those interested in using the model. Visitors can find code for specific projects where the model has been used, as well as the slides and videos of the various workshops held. The glossary section deserves particular mention: it contains explanations and tutorials on how to implement BERT, along with technical requirements and concepts. Another important aspect is the inclusion of humanistic case studies. One way to foster knowledge transfer at a bidirectional level would be to include a section on the project's website identifying international initiatives and scholars who have applied the model to humanities use cases. This could assist the uptake of BERT among humanists by facilitating international networking.
The BERT model was trained on a large corpus of texts and metadata in English. As BERT for Humanists indicates, the model has also been applied to other languages, such as Spanish, with positive results. Translating key materials from BERT for Humanists might help broaden its audience to digital humanists who carry out important research with non-Anglophone text corpora. Using BERT for digital humanities projects in other languages might, in turn, lead to refinements of the model for multilingual use. BERT for Humanists could contribute to this goal through additional online workshops with researchers who hold large non-English datasets and texts, and by connecting with the communication channels of major digital humanities associations.
The significance of BERT for Humanists is evident in its funding support by the National Endowment for the Humanities. All suggestions aside, BERT for Humanists is an important initiative for digital humanities. It will be especially beneficial for those wishing to analyze texts and metadata with LLMs.