Project Twitter Literature (TwitLit)
Christian Howard-Sukhil, Bucknell University
Christina Boyles, Michigan State University
Established by Christian Howard-Sukhil in 2017, Project Twitter Literature (TwitLit) sprang from a question: How is social media changing the shape of the literary world? Project TwitLit addresses this question by analyzing the growing community of amateur writers who are using Twitter as a means of publication and dissemination for their literary output. There are three parts to the project: 1) the research findings related to the global literary community on Twitter, 2) the tools and resources developed as part of the project and made openly available to other scholars, and 3) ongoing efforts to ensure the long-term preservation of the collected data.
Hashtags, rather than keywords, were chosen as the primary method of data collection for Project TwitLit because Twitter users employ hashtags to draw wider attention to a given topic. In fact, amateur writing communities on Twitter overwhelmingly identify themselves through hashtags that signify literary form or genre (e.g., #Haiku or #VeryShortStory). Project TwitLit has identified 35 commonly used English-language hashtags on Twitter that are related to writing as a form or genre (including #3Lines, #FlashFiction, #Micropoetry, #ShortStory, #Tanka, and #TwitterFiction) or to the writing community itself (e.g., #FictionCommunity, #PoetSociety, #WritersLife). Project TwitLit has collected tweets using these hashtags from 2011, the year Twitter underwent a major site redesign and grew in global popularity, through 2019.
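As an illustration of hashtag-based selection, the following is a minimal Python sketch and not the project's own code. The set TRACKED_HASHTAGS is a hypothetical subset built only from hashtags named above, and the function name and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical subset of the 35 tracked hashtags, limited to ones named in the text.
TRACKED_HASHTAGS = {
    "#3lines", "#flashfiction", "#micropoetry", "#shortstory",
    "#tanka", "#twitterfiction", "#haiku", "#veryshortstory",
}

# Assumption: a hashtag is "#" followed by letters, digits, or underscores.
HASHTAG_PATTERN = re.compile(r"#\w+")


def tracked_hashtags_in(text):
    """Return the tracked hashtags found in a tweet's text, lowercased and sorted."""
    found = {tag.lower() for tag in HASHTAG_PATTERN.findall(text)}
    return sorted(found & TRACKED_HASHTAGS)
```

Matching is case-insensitive because Twitter treats #Haiku and #haiku as the same hashtag. A tweet that returns an empty list would fall outside the collection under this sketch.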
The tools used for Project TwitLit fall into three categories: 1) data collection, 2) data cleanup, and 3) data analysis. A Python scraper, modified by Christian Howard-Sukhil, Shane Lin, and Alyssa Collins from a pre-existing script developed by Tom K. Dickinson, queries a Twitter search API and allows for the collection of Twitter data from Twitter’s inception in 2006 to the present. The data collected by this Python scraper is cleaned using tools developed by the Documenting the Now (DocNow) project, particularly the Hydrator tool and deduplicating functions. The number of results in a given file can easily be counted using the JSONL Counter Python script developed by Varundev Sukhil. Finally, Nick Caravias has created a Python script that uses the Natural Language Toolkit (NLTK) to analyze tweets collected in CSV files; this script is actively being refined for ease of use. More information about the Python scripts associated with the project, as well as detailed instructions on how to use them, is available on the project website. These scripts are open source and freely available for download on the project’s GitHub site.
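The counting and deduplication steps described above might look roughly like the following sketch. It is not the project's JSONL Counter or the DocNow tooling. The function names are hypothetical, and the choice of "id_str" as the deduplication key assumes each line holds one tweet object carrying that field.

```python
import json


def count_records(path):
    """Count non-empty lines in a JSONL file (one JSON record per line)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def deduplicate(in_path, out_path, key="id_str"):
    """Copy records to out_path, keeping the first record seen for each key value.

    Records that lack the key are always kept, since they cannot be compared.
    Returns the number of records written.
    """
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            identifier = record.get(key)
            if identifier is not None:
                if identifier in seen:
                    continue  # duplicate of an earlier record
                seen.add(identifier)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

Both functions read the file line by line, so they should cope with files too large to load into memory at once. Calling count_records before and after deduplicate gives a quick check on how many duplicates were removed.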
Social media data brings with it special ethical and legal considerations. In gathering and working with Twitter data, the TwitLit team has sought to abide by the four primary ethical imperatives outlined by social media scholars: 1) respect user privacy and intent, 2) obtain informed consent, 3) minimize potential harm, and 4) (if necessary) anonymize data for user protection. The precise ways these considerations have influenced the data practices employed by Project TwitLit are outlined on the project website. From a legal standpoint, Twitter’s Terms of Service prevent the project team from publishing the raw Twitter data collected as part of the project. The project team is, however, currently exploring legal methods for sharing this data with other scholars.
Christian Howard-Sukhil, who received her Ph.D. in English literature and is a Digital Humanities Postdoctoral Fellow at Bucknell University, is the Principal Investigator of Project TwitLit. Jimmy Pronchick and Meghan Coyle, undergraduate research assistants at Bucknell, assisted with Twitter data collection from spring to summer 2020. Nick Caravias, an undergraduate at Bucknell majoring in computer science, joined the project team in fall 2020 and developed the NLTK Python script to analyze collected files. Kyle Courtney, the Copyright Advisor for Harvard University, provides legal advice on the project.
The intellectual content of Project TwitLit is relevant to scholars of contemporary Anglophone literature, while the tools developed as part of the project are more broadly applicable to those working in the fields of new media and digital humanities. Research based on Project TwitLit has been presented at various conferences, including the 2019 American Comparative Literature Association and the Digital Humanities 2020 conferences. Project TwitLit is the recipient of several grants, namely the Prototyping Fellowship from the University of Virginia Scholars’ Lab (2018) and a Mellon Confounding Problem Grant from Bucknell University (2019-2020).
Project Twitter Literature (TwitLit), a social media analysis project, examines the use of writing-related hashtags by Twitter users. The project includes data from January 2011 to December 2019, collecting tweets pertaining to 35 unique English-language hashtags, including #3Lines, #PoetTweet, #WritersLife, and #Twifi. On its GitHub page, TwitLit summarizes its findings by providing data visualizations and analyses for two observed trends: the increased use of multiple hashtags and the increased number of users tweeting with these hashtags.
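The two trends named above could be measured along the lines of the sketch below. This is an assumption-laden illustration, not TwitLit's analysis code: the record layout (dictionaries with "year", "user", and "hashtags" keys) and the function name are hypothetical.

```python
from collections import defaultdict


def yearly_trends(tweets):
    """Summarize, per year, tweet volume, multi-hashtag share, and distinct users.

    Each tweet is assumed to be a dict with "year", "user", and "hashtags" keys.
    """
    totals = defaultdict(int)
    multi = defaultdict(int)
    users = defaultdict(set)
    for tweet in tweets:
        year = tweet["year"]
        totals[year] += 1
        # A tweet counts as multi-hashtag if it carries more than one distinct tag.
        if len(set(tweet["hashtags"])) > 1:
            multi[year] += 1
        users[year].add(tweet["user"])
    return {
        year: {
            "tweets": totals[year],
            "multi_hashtag_share": multi[year] / totals[year],
            "unique_users": len(users[year]),
        }
        for year in sorted(totals)
    }
```

Reporting the multi-hashtag figure as a share of each year's tweets, and not as a raw count, keeps growth in overall tweet volume from being mistaken for a change in hashtag habits.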
One compelling component of TwitLit is its engagement with the #TwitLit discourse community—a group of writers using shared language to engage in conversations about writing habits, practices, and publications. For rhetoricians, this project offers a starting point for asking questions about community writing practices: How do Twitter users define writing? Who counts as a writer? How do writers network with one another through social media? What do community writing practices look like on Twitter? In his blog post on the project, Nick Caravias, a student researcher for TwitLit, begins to analyze some of these questions, noting the ways in which tweets and hashtags often are posted across multiple social media platforms to reach broad writing audiences.
Breadcrumbs such as these are sprinkled throughout TwitLit’s site. For example, the visualizations reveal that use of the hashtag #TwitLit peaked in 2012—hinting that the project’s name derived from this particular moment in time. Hints such as these are one compelling—and risky—feature of projects relying upon mass data collection.
Of particular concern are the privacy and surveillance issues raised by the aggregation and analysis of social media posts. TwitLit has followed strong privacy protocols by not showing individual tweets without consent and only presenting other information in aggregate. At the same time, there are many ways in which TwitLit could strengthen its commitment to user privacy and consent. For example, the project could provide more avenues for research subjects to provide consent or opt out of being studied. TwitLit’s documentation argues that collection of user data is acceptable because Twitter’s terms of service inform users that Twitter will “make your content available to the rest of the world and…let others do the same.” However, Twitter is a commercial service designed for profit, not a platform centering values of user safety and consent. As Chris Gilliard observes, “When we look at digital technology and platforms, it’s always instructive to remember that they exist to extract data.”1
To address these challenges, digital humanists have developed recommendations for conducting ethical social media research, practices that TwitLit could incorporate into its current and future research. Dorothy Kim and Eunsong Kim’s #TwitterEthics Manifesto states, “Both academics and journalists should ask each individual user on Twitter for consent. They should explain the context and the usage of their tweets.”2 Moya Bailey similarly has outlined steps for ethical interactions with Twitter communities, including acquiring informed consent, forming an advisory board, and developing research questions with and for the community being studied.3
Web scraping projects like TwitLit also face challenges for future use. Preserving Twitter data—or any other aggregated social media data—poses questions about safety and vulnerability: Who has access to these data sets? How could they be used in ways outside of their original intent? How could they exacerbate existing inequalities or be used to promote harm? One way to address these issues is by speaking with community members to determine how the project can best meet their needs. By building stronger relationships with the Twitter writing community, TwitLit can identify opportunities for growth and develop an ethic of care that will benefit both the project and its community.