Yael Netzer, Ben Gurion University
Nancy Cooey, Kepa Joseba Rodriguez, and Vladimir Alexiev
In the spring of 2017, the European Holocaust Research Infrastructure (EHRI) began a project to expand and enhance its authority records for Holocaust-era ghettos using the data integration platform Wikidata. As an international project aimed at bringing together dispersed Holocaust-related archival records held in institutions across the world, EHRI faces challenges in developing name authorities that can be applied across multiple languages and various metadata standards. Wikidata emerged as a promising tool to address these challenges since it functions as a knowledge base incorporating various sources, including encyclopedias, catalogs, name authority lists, gazetteers, published research, and knowledgeable contributors.
To begin our project, we determined the primary descriptive elements we wanted to include in the EHRI Ghettos vocabulary list: geographic coordinates, administrative territories, and alternative place names. At the time, the EHRI authority list included 1391 ghettos drawn from the online edition of The Yad Vashem Encyclopedia of the Ghettos During the Holocaust, and Wikidata held 80 entries for Holocaust-era ghettos. To expand the catalog of ghettos contained in Wikidata and to enhance the descriptions of the EHRI authority records, we extracted place names from Wikidata relating to the EHRI list of ghettos. We then collated these data and matched the appropriate city or village name and its Wikidata entity (i.e. its unique identifier) with the ghetto that existed in that place during the Holocaust. Since Wikidata is an extensive gazetteer and most ghettos carry the name of the city in which they were located, we were able to expand the variant names for ghettos listed in the EHRI authority. We did this by extracting the names of the places in different languages included in Wikidata and, using separate rules for each language, we created labels for the ghettos in up to 15 different languages. By increasing the diversity of our cross-references for each vocabulary entry, we were able to identify and label archival records in various languages more effectively within the EHRI Portal.
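The label-generation step described above can be sketched as follows. This is a minimal illustration, not the project's actual rule set: the templates in `LABEL_TEMPLATES` and the function name `make_ghetto_labels` are hypothetical, and real rules for many languages would need to handle inflection and elision that a simple template cannot.

```python
# Hypothetical per-language templates. The project's actual rules are not
# given in the text; languages that inflect place names need more than this.
LABEL_TEMPLATES = {
    "en": "{place} Ghetto",
    "de": "Ghetto {place}",
    "fr": "Ghetto de {place}",  # ignores elision before vowels
}


def make_ghetto_labels(place_labels, templates=LABEL_TEMPLATES):
    """Derive ghetto labels from place labels keyed by language code.

    Languages without a template, and empty place names, are skipped
    rather than guessed.
    """
    labels = {}
    for lang, place in place_labels.items():
        template = templates.get(lang)
        if template is None or not place.strip():
            continue
        labels[lang] = template.format(place=place.strip())
    return labels


# Place labels as they might be read from a Wikidata item for a town.
place_labels = {"en": "Lvov", "de": "Lemberg", "uk": "Львів"}
ghetto_labels = make_ghetto_labels(place_labels)
```

Skipping languages that have no rule keeps the output conservative: a missing label can be added later, whereas a wrongly formed one would have to be found and corrected.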
After this process, we imported the following information for each ghetto into Wikidata:
English name of the ghetto;
a statement qualifying the entry as an instance of a “ghetto in Nazi-occupied Europe” (Q2583015);
variant names of the ghetto in languages other than English;
Wikidata unique identifier for the place where the ghetto was located;
geographic coordinates for the ghetto;
EHRI-assigned unique identifier for the ghetto; and
unique identifiers for the ghetto from online resources (if they existed), including The Yad Vashem Encyclopedia of the Ghettos During the Holocaust and the online USHMM Holocaust Encyclopedia.
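As an illustration only, the fields listed above could be gathered into one record per ghetto and checked before import. The class name `GhettoRecord`, its field names, and the sample values are assumptions made for this sketch; only the class identifier Q2583015 comes from the text.

```python
import re
from dataclasses import dataclass, field

# "ghetto in Nazi-occupied Europe", the class cited in the list above.
GHETTO_CLASS_QID = "Q2583015"
QID_PATTERN = re.compile(r"Q[1-9]\d*")


@dataclass
class GhettoRecord:
    """One ghetto entry prepared for import (hypothetical structure)."""
    english_name: str
    place_qid: str      # Wikidata identifier of the town or village
    latitude: float
    longitude: float
    ehri_id: str        # EHRI-assigned identifier
    variant_names: dict = field(default_factory=dict)  # language -> label
    external_ids: dict = field(default_factory=dict)   # resource -> identifier
    instance_of: str = GHETTO_CLASS_QID

    def problems(self):
        """Return a list of human-readable issues; empty means importable."""
        issues = []
        if not self.english_name.strip():
            issues.append("missing English name")
        if not QID_PATTERN.fullmatch(self.place_qid):
            issues.append("malformed place QID")
        if not -90 <= self.latitude <= 90:
            issues.append("latitude out of range")
        if not -180 <= self.longitude <= 180:
            issues.append("longitude out of range")
        if not self.ehri_id.strip():
            issues.append("missing EHRI identifier")
        return issues


# Placeholder values, not real identifiers or coordinates.
record = GhettoRecord("Example Ghetto", "Q12345", 50.0, 24.0, "example-id")
```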
We anticipate that, in the long term, Wikidata will operate as a knowledge repository for the authority lists we have created. Consequently, these ghetto entries will continue to be improved through crowd-sourced initiatives, and the resulting information may in turn be vetted and assimilated into the EHRI Portal. By incorporating descriptive information from Wikidata into EHRI and the identifiers from EHRI into Wikidata, both benefit from a complementary relationship: EHRI improves its authoritative descriptions and Wikidata gains reliable references.
Digitization of archives, libraries, and museum collections requires an integrative and open approach to knowledge to enable concise, efficient, and comprehensive access and discoverability both by humans and by machines. Since 2010, EHRI has served as one such infrastructure for Holocaust research, commemoration, and education. In EHRI Ghettos, Nancy Cooey, Kepa Joseba Rodriguez, and Vladimir Alexiev present a methodology to enrich and widen the accessible digital knowledge on Jewish ghettos during the Holocaust. In this project, several sources of knowledge (The Yad Vashem Encyclopedia of the Ghettos During the Holocaust and the online USHMM Holocaust Encyclopedia) were merged and enriched with geographical and toponymic information from Wikidata.
Toponyms or, as they are more generally known, place names, tend to change across time and cultures. Names change when national borders shift, and they multiply as different languages and cultural groups render them in their own writing systems. A simple name of a town such as “Lvov,” for example, has more than twenty different forms in Wikidata (e.g., Lwów, Lemberg, L'viv, לבוב, לעמבערג, Львів, etc.). The result of this variety is that references to the same ghetto in archival material and historical literature can vary widely.
The solution in EHRI Ghettos is to use Wikidata gazetteer content to identify locations. For each location, the EHRI authority list allows for the addition of new elements for the ghettos, on the assumption that most ghettos were named after the settlement in which they were established and operated. Each item that is an instance of “ghetto in Nazi-occupied Europe” (Q2583015) has been enriched by adding labels in all languages available in the knowledge sources, as well as geographic data and a reference to the source of the information. The entry for the Lvov Ghetto now contains more than fifty labels in 22 languages, which makes it possible to find relevant archival documents in EHRI by querying the various spellings and writing systems.
Significantly, this is a bidirectional process, where an EHRI cultural heritage institution does not just utilize Wikidata to expand and enrich its own catalogue but is able to, in return, export its authorized data to Wikidata, thereby adding to the labels and associated dataset. This approach of linking data positions Wikidata, and the articles in Wikipedia that use Wikidata, as central players in the field of Linked Open Data (LOD) for cultural heritage institutions and for the study of the Holocaust. It increases access to knowledge by broadening its identification and utilization. Querying Wikidata with SPARQL and its graphical aids is relatively user-friendly, and the outputs of queries can include formats such as maps with images and labels as well as comma-separated values (CSV) and other data formats. One of the advantages of such practices is the democratization of knowledge: tools, knowledge, and practices become accessible and manageable to all, not only to experts or academics, which is achieved by, for instance, advancing practices of collective management and crowdsourcing.
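To make the SPARQL querying mentioned above concrete, here is a minimal sketch that builds a query for items that are instances of Q2583015 and prepares a request URL for the public Wikidata Query Service. The function names are hypothetical, the query is one plausible formulation rather than the one used by the project, and the code deliberately stops short of sending the request.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
GHETTO_CLASS_QID = "Q2583015"  # "ghetto in Nazi-occupied Europe"


def build_ghetto_query(language="en"):
    """Return a SPARQL query listing ghettos with labels and coordinates."""
    return f"""
SELECT ?ghetto ?ghettoLabel ?coord WHERE {{
  ?ghetto wdt:P31 wd:{GHETTO_CLASS_QID} .
  OPTIONAL {{ ?ghetto wdt:P625 ?coord . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
}}
""".strip()


def build_request_url(query):
    """Encode the query as a GET request asking for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def parse_point(wkt):
    """Convert a WKT literal such as 'Point(24.03 49.84)' to (lat, lon).

    The query service writes longitude first, so the values are swapped.
    """
    match = re.fullmatch(r"Point\((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)", wkt)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


query = build_ghetto_query("en")
url = build_request_url(query)
```

The resulting URL can be opened with any HTTP client; the coordinate values in the response can then be passed through `parse_point` before plotting on a map or writing to CSV.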
As EHRI Ghettos demonstrates, Wikidata is a most valuable resource and tool for using LOD in digital humanities projects. However, Wikidata, like Wikipedia, can be fertile ground for political battles. Its policies may change in the future, and interest groups are able to alter facts and add unauthorized data, as in the case of the Wikipedia hoax of the Warsaw Concentration Camp. It is therefore crucial that the authorized data of cultural heritage institutions remain independent, managed by the institutions themselves and accessible in machine-actionable form (i.e., as LOD, through an open API, or as CSV/JSON/XML files), to ensure the future preservation and persistent integrity of the data. Cultural heritage and academic institutions that share knowledge with Wikidata and draw knowledge from it must also aim to sustain an authoritative process even after a project is over.