A review of Newspaper Navigator, a multi-phase project redefining searches for visual data in historic newspapers, directed by Benjamin Charles Germain Lee
Benjamin Charles Germain Lee, University of Washington
Lorella Viola, University of Luxembourg
Benjamin Charles Germain Lee
Created for the Library of Congress, Newspaper Navigator re-imagines how we search the rich visual content in historic newspapers. The first phase of the project used machine learning techniques to extract visual content from 16.3 million digitized newspaper pages in Chronicling America. This resulted in the Newspaper Navigator dataset, released in May 2020. The dataset and fine-tuned machine learning model are in the public domain. A paper on the dataset was presented at the 2020 ACM International Conference on Information and Knowledge Management (CIKM).
The second phase consisted of building a search application for 1.5 million photos from the dataset. The search application was launched in September 2020. In addition to supporting faceted and keyword search, it lets users search by visual similarity by training an interactive machine learning model called an “AI navigator,” which retrieves photos of topics such as “baseball players” or “sailboats” even if their captions do not contain these keywords. An AI navigator can train and predict over all 1.5 million photos in a couple of seconds. This new search affordance forms the basis for Benjamin Lee’s Ph.D. dissertation research, which re-imagines standard faceted search as “open faceted search.” A demo of the search application was presented at the 2020 ACM Symposium on User Interface Software and Technology (UIST).
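To make the idea of an interactively trained “AI navigator” concrete, the following is a minimal sketch, not the application's actual implementation. It assumes that each photo is represented by a precomputed embedding vector and that the navigator is a small logistic regression classifier trained on the photos a user has marked as relevant or irrelevant; the function names, the two-dimensional toy vectors, and the photo identifiers are all hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_navigator(examples, labels, epochs=200, learning_rate=0.5):
    """Fit a logistic regression model with plain stochastic gradient descent.

    examples: list of embedding vectors (hypothetical toy data here)
    labels:   1 for photos the user marked relevant, 0 for irrelevant
    """
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(examples, labels):
            z = sum(w * x for w, x in zip(weights, vector)) + bias
            error = sigmoid(z) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, vector)]
            bias -= learning_rate * error
    return weights, bias


def relevance(vector, weights, bias):
    """Predicted probability that a photo matches the user's topic."""
    return sigmoid(sum(w * x for w, x in zip(weights, vector)) + bias)


# Hypothetical embeddings for photos the user has already labeled.
labeled_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labeled_targets = [1, 1, 0, 0]

# Hypothetical embeddings for photos the user has not seen yet.
unlabeled = {"photo_x": [0.85, 0.15], "photo_y": [0.15, 0.85]}

weights, bias = train_navigator(labeled_vectors, labeled_targets)
ranked = sorted(unlabeled,
                key=lambda pid: relevance(unlabeled[pid], weights, bias),
                reverse=True)
print(ranked)
```

Because a model of this kind only learns one weight per embedding dimension, retraining after each new label is cheap, which is one plausible reason an interactive loop over a large collection could remain fast.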
The third phase is ongoing and concerns how the dataset and visual content recognition model can be used for downstream digital humanities research. In a collaboration with Devin Naar, Lee is using the machine learning model to extract and study visual content within Ladino newspapers, enabling the first large-scale analysis of the Ladino press. In a collaboration with print historians and periodicals researchers Jim Casey, Molly Hardy, Joshua Ortiz Baco, and Sarah Salter, Lee has been using the Newspaper Navigator dataset to study the evolution of visual content layouts in 19th-century ethnic presses in Chronicling America in order to uncover editorial practices.
Drawing from science and technology studies, critical data studies, and media archaeology, the Newspaper Navigator data archaeology investigates the ways in which digitization and machine learning algorithms affect search and discovery and perpetuate erasure. Any use of machine learning with cultural heritage must be done with an understanding of the broader socio-technical ecosystems in which the algorithms have been utilized; the data archaeology supplements the dataset construction with this critical examination.
Newspaper Navigator re-imagines search and discovery, encouraging new modes of interaction with historic newspapers and new methodologies for humanistic inquiry. The dataset enables humanists to study the visual content at a scale previously not possible. It also empowers them to study various subsets of visual content of interest: for example, maps of a specific time period or cartoons published in select newspaper titles. The visual similarity search affordances in the application empower humanists to search for visual motifs and patterns that are not capturable using standard keyword search. Thus, Newspaper Navigator enables humanists to formulate and investigate their own claims using entirely new methodologies.
Newspaper Navigator is both a dataset and a dedicated search application to navigate it. The dataset includes visual and textual content extracted from the historical newspaper pages collected in Chronicling America, a directory of American newspapers from 1777 to 1963, born from the partnership between the Library of Congress and the National Endowment for the Humanities. The visual content in the Newspaper Navigator dataset consists of headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, while the textual content includes titles and captions (Lee et al. 2020). Extracted from 16.3 million pages of historical newspapers, this corpus represents the largest dataset of its kind ever produced (ibid.). The app allows users to search the 1.56 million images identified from 1900 to 1936 through keyword searches as well as visual similarity. The way the Newspaper Navigator dataset has been created offers digital humanities scholarship a number of methodological innovations.
First, in order to disambiguate articles, headlines have been treated as visual rather than textual content. Unlike previous approaches to article disambiguation in historical newspapers (e.g., Ehrmann et al. 2020; Rautiainen 2019), this novel method exploits the graphic distinctiveness of newspapers’ headlines and advertisements to train a classifier that predicts bounding boxes around these items on each page. The subsequent extraction of textual content from headlines and advertisements makes it possible not only to disambiguate the articles on each page but also to distinguish editorial text from advertisements.
Second, the project builds on Beyond Words, a Chronicling America crowdsourcing project that aimed to identify and improve information about the images within the Chronicling America digitized newspaper collection. Newspaper Navigator repurposed the images’ bounding box annotations to fine-tune a classifier that could identify seven categories of visual content: headlines, advertisements, photographs, illustrations, comics, editorial cartoons, and maps. Third, unlike existing datasets of visual content, which are typically designed for training purposes, Newspaper Navigator is a repository of extracted visual material from a full archive.
Finally, the dataset allows for fast similarity querying for search and recommendation tasks in the Chronicling America corpus, thanks to the image embeddings that were generated from the extracted visual content. Significantly, image embeddings support visualizations of latent spaces that capture semantic similarity between text and images, thus allowing images to be studied “in context.”
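As an illustration of how embeddings make similarity querying possible, here is a minimal sketch that ranks images by cosine similarity to a query image. It is an illustrative assumption rather than the project's documented method: the three-dimensional vectors, the image identifiers, and the function names are hypothetical, and real image embeddings would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k=2):
    """Return the k images whose embeddings are closest to the query image."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings keyed by made-up image identifiers.
toy_embeddings = {
    "photo_sailboat_a": [0.9, 0.1, 0.0],
    "photo_sailboat_b": [0.8, 0.2, 0.1],
    "photo_baseball": [0.1, 0.9, 0.2],
    "map_harbor": [0.3, 0.1, 0.9],
}

results = most_similar("photo_sailboat_a", toy_embeddings, k=2)
for image_id, similarity in results:
    print(image_id, round(similarity, 3))
```

In this toy example the second sailboat photo ranks first because its vector points in almost the same direction as the query's, even though no caption or keyword is consulted.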
Through the interface, users can aggregate images of interest via the “My Collection” feature and use those images to train a model on the fly for visual similarity. While the human-in-the-loop method is very helpful for refining searches, the interface is limited in its capacity for displaying these results. In terms of accessibility, for example, the reduced size of the thumbnails makes it difficult to assess the relevance of each image for improving the model. Moreover, when hovering over the results, users are only given the option to either add or discard the image, but not to view a larger or higher-quality version of the image itself. Making these adjustments would significantly increase the value of this feature and improve the overall user experience.
In addition to its innovative digital humanities methodologies, Newspaper Navigator represents a resource of exceptional value for academics and practitioners with an interest in historical newspapers and historical visual content. The resource indeed opens up exciting avenues for historical research in fields such as cultural heritage, transatlantic studies, communication studies, media studies, and migration studies to name but a few. The reviewer also praises both the project’s critique of machine learning methods and the considerable efforts undertaken to keep the work transparent, shareable, open access, and reproducible, which greatly contribute to knowledge reuse and digital humanities scholarship advancement.
Ehrmann, Maud, Matteo Romanello, Simon Clematide, Phillip Benjamin Ströbel, and Raphaël Barman. 2020. “Language Resources for Historical Newspapers: The Impresso Collection.” In Proceedings of the 12th Language Resources and Evaluation Conference, 958–68. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.121.
Lee, Benjamin Charles Germain, Jaime Mears, Eileen Jakeway, Meghan Ferriter, Chris Adams, Nathan Yarasavage, Deborah Thomas, Kate Zwaard, and Daniel S. Weld. 2020. “The Newspaper Navigator Dataset: Extracting and Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America.” arXiv:2005.01583 [cs], May. http://arxiv.org/abs/2005.01583.
Rautiainen, Juha. 2019. “Opening Digitized Newspapers for Different User Groups - Successes and Challenges: IFLA World Library and Information Congress 2019.” IFLA WLIC 2019 - Athens, Greece - Libraries: Dialogue for Change, July. http://library.ifla.org/2485/.