A review of jsLDA, a browser-based tool for topic modeling, developed by David Mimno
Project
jsLDA
Project Director
David Mimno, Cornell University
Project URL
https://mimno.infosci.cornell.edu/jsLDA/
Project Reviewer
Quinn Dombrowski, Stanford University
David Mimno
jsLDA is a simple browser-based tool for running Latent Dirichlet allocation (LDA) topic models. While there is nothing specific to digital humanities in the jsLDA tool, it is motivated by the needs and challenges of researchers working in humanities as they learn about text analysis from a computational perspective.
LDA is a powerful and popular tool for uncovering patterns in text collections for two main reasons. First, it is widely applicable and approachable because it only requires a collection of raw, unannotated segments of text. Second, LDA model outputs represent each text segment in the collection as a combination of recognizable, interpretable groups of words. These "topics" then support both quantitative measurement of otherwise poorly defined concepts and qualitative theory-building and targeted reading.
The initial implementation of jsLDA was written by David Mimno based on experience leading workshops and tutorials. When introducing complicated and unfamiliar systems like LDA, it is often more effective to show users what systems do rather than try to explain how they do it. However, complicated setup processes meant that as much as 50% of these demonstrations was devoted to setting up environments and managing data, which is hard for participants to attend to when they have no immediate grounding. These challenges were not borne equally; Windows users consistently faced more challenges, and users with less experience setting up programming or command line interfaces often felt especially intimidated.
To address these challenges, jsLDA implements the Gibbs sampling algorithm to learn LDA topic models entirely in JavaScript. Users can start interacting with jsLDA on any device with a modern browser simply by loading a URL. It takes advantage of the interactive, graphical environment of the browser to show model outputs in a simple, engaging format. Further, the JavaScript browser environment supports more advanced explanations of the internals of the topic model and Gibbs sampling algorithm through in-browser inspection of code and variables. With a clearer sense of the goals and specific examples of outputs, participants are better able to contextualize explanations and formulate their own questions.
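For readers who want to see the shape of the algorithm, the following is a minimal sketch of a collapsed Gibbs sampler for LDA. It is written in Python for brevity and is an illustration only, not jsLDA's JavaScript code. The function names `gibbs_lda` and `top_words`, the hyperparameter defaults `alpha` and `beta`, and the toy documents are all assumptions made for this example.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over pre-tokenized documents.

    Illustrative sketch: alpha and beta are assumed symmetric Dirichlet
    hyperparameters, not values taken from jsLDA.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                old = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][v] -= 1
                topic_total[old] -= 1
                # Weight each topic by how much the document uses it and
                # how much the topic uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][v] += 1
                topic_total[new] += 1

    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=5):
    """Return the n most frequent words assigned to each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)
        result.append([vocab[v] for v in ranked[:n]])
    return result


if __name__ == "__main__":
    toy_docs = [
        "budget tax economy jobs tax".split(),
        "war peace treaty army peace".split(),
        "economy jobs budget war army".split(),
    ]
    doc_topic, topic_word, vocab = gibbs_lda(toy_docs, num_topics=2)
    for k, words in enumerate(top_words(topic_word, vocab, n=3)):
        print(k, words)
```

Each pass reconsiders one word token at a time, which is why the count tables can be watched changing as the model runs; this is the kind of internal state that in-browser inspection exposes.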
The tool supports a range of humanities researcher goals, from getting a first exposure to quantitative methods to doing advanced analytical work. jsLDA is designed to build user confidence through positive first impressions based on an example collection. However, it is also intended to support expert use: users can load their own collections and generate and save reports. The core sampling code has been optimized over several years for performance that is competitive with any other implementation. While it would not necessarily scale to massive, library-sized collections, jsLDA is well suited to the scale of collections common in humanities research, with thousands or tens of thousands of text segments.
The code is available under an open license on GitHub, where at the time of writing it has 171 stars and 46 forks. Further development of a new version is underway in collaboration with a team of students led by Xanda Schofield at Harvey Mudd College.
Quinn Dombrowski
jsLDA is a browser-based, mobile-friendly tool for exploring topic modeling both as a computational method and on one’s own set of texts. The landing page provides a straightforward set of instructions, and choosing "run a model" loads a default corpus of paragraphs from State of the Union speeches. Users can freely play with the different parameters (e.g., number of iterations, number of topics) and just as easily start over from scratch by reloading the page. This no-setup implementation removes one of the major barriers to entry for other widely-used topic modeling tools like MALLET by providing a graphical user interface (GUI).
jsLDA is not the first tool to provide a web-based GUI for Latent Dirichlet allocation (LDA) as an alternative to the command-line tool MALLET. Voyant Tools provides a configurable interface for topic modeling, which allows users to specify a stopword list, along with the number of topics, terms, and iterations. Each topic also comes with a small visualization of its frequency in each document. The topic modeling tool in Voyant isn't visible by default, can be difficult to access on mobile devices, and, unlike jsLDA, doesn't offer a way to explore connections between topics. Where Voyant offers topic modeling as an optional part of a large suite of tools, jsLDA is clearly designed to facilitate topic modeling as a deliberately chosen method.
The context of the tool's creation is significant — and different from many digital humanities projects: David Mimno and Xanda Schofield are both professors of computer science. As such, professional guidelines for evaluating digital scholarship in the humanities don't apply in the same way. There is no humanistic argument underpinning the tool as such; it elegantly and efficiently implements an algorithm. While left unstated in both the jsLDA interface and the project description, Mimno is the author of the LDA package for MALLET, and the weight of his authority on the computational side is significant. At the same time, Mimno and Schofield's long history of engagement with digital humanities is apparent in the tool, particularly through the instructions. Tools developed by computer scientists for computer scientists tend to include terse, jargon-filled documentation. jsLDA goes beyond what's strictly necessary to use the tool, including helpful explanations of things that commonly trip up scholars new to topic modeling (e.g., "Remember that 'document' really means 'segment of text'. A few hundred words is a good length; longer passages tend to shift their topical focus, making inference more difficult."). There are also clear instructions upfront for how to format the input files.
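The advice about segment length can be made concrete with a short preprocessing sketch. The Python below packs paragraphs into segments of at most a few hundred words. The function name `segment_text`, the 300-word default, and the use of blank lines as paragraph breaks are assumptions for illustration. The sketch does not produce jsLDA's input file format, which is described in the tool's own instructions.

```python
def segment_text(text, max_words=300):
    """Pack blank-line-separated paragraphs into segments of up to max_words.

    A single paragraph longer than max_words is kept whole as its own
    oversized segment rather than being cut mid-paragraph.
    """
    segments = []
    current = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Close the current segment if this paragraph would overflow it.
        if current and len(current) + len(words) > max_words:
            segments.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    sample = "\n\n".join(["word " * 120] * 5)
    for number, segment in enumerate(segment_text(sample), start=1):
        print(number, len(segment.split()))
```

Splitting on paragraph boundaries rather than at a fixed word count keeps each segment readable on its own, which matters when topics are later used for targeted reading.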
The instructions offer a small clue about potential multilingual challenges with a reference to CJK (Chinese, Japanese, and Korean) texts not working, but additional guidance or pointers for other linguistic contexts (e.g., agglutinative and/or highly-inflected languages) would support newcomers who work in a variety of non-English languages. Some further explanation of the "expected" values for topic correlations would also be helpful for scholars who are hesitant to look at the source, where it is spelled out through well-commented code. Nonetheless, jsLDA is a significant upgrade in the suite of tools available for doing topic modeling, whether for personal exploration, pedagogy, or working with text data that should not be posted online.