Now available online and accessible to all: “Alien Reading: Text Mining, Language Standardization, and the Humanities.”
For all the talk about how computers allow for new levels of scale in humanities research, new debates over institutional structures, and new claims to scientific rigor, it is easy to lose sight of the radical difference between the way human beings and computer programs “read” texts. Topic modeling, one of the most touted methods of finding patterns in large corpora, relies on a procedure that has little resemblance to anything a human being could do. Each text is converted into a matrix of word frequencies, transforming it into an entirely numerical dataset. The computer is directed to create a set of probability tables populated with random numbers, and then it gradually refines them by computing the same pair of mathematical functions hundreds or thousands of times in a row. After a few billion or perhaps even a few trillion multiplications, additions, and other algebraic operations, it sutures words back onto this numerical structure and presents them in a conveniently sorted form. This output, like the paper spit out by a fortune-telling machine, is supposed to tell us the “themes” of the texts being analyzed. While some of the earliest computational text-analysis projects, like Father Roberto Busa’s famous collaboration with IBM on the Index Thomisticus, began by attempting to automate procedures that scholars had already been doing for centuries, topic modeling takes us well beyond the mechanical imitation of human action (Hockey). When we incorporate text-mining software into our scholarly work, machines are altering our interpretive acts in altogether unprecedented ways.
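For readers who want to see the shape of this procedure, here is a minimal sketch in Python. It is not LDA itself but a simpler relative, an expectation-maximization routine in the style of probabilistic latent semantic analysis, chosen because it follows the same outline described above: texts become rows of word counts, two probability tables are filled with random numbers, and the same pair of steps is repeated until the tables settle. The function names, the toy texts, and the parameters `n_topics`, `n_iter`, and `seed` are all illustrative assumptions rather than part of any particular text-mining package.

```python
import random
from collections import Counter


def word_count_matrix(texts):
    """Turn each text into a row of word counts over a shared vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for doc in tokenized for word in doc})
    counts = []
    for doc in tokenized:
        tally = Counter(doc)
        counts.append([tally[word] for word in vocab])
    return vocab, counts


def normalize(row):
    """Scale a row of non-negative numbers so that it sums to 1."""
    total = sum(row) or 1.0  # guard against an all-zero row
    return [value / total for value in row]


def fit_topics(counts, n_topics, n_iter=200, seed=0):
    """Simplified EM topic model (PLSA-style, not LDA; an illustrative assumption)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Two probability tables, populated with random numbers to start.
    doc_topic = [normalize([rng.random() for _ in range(n_topics)])
                 for _ in range(n_docs)]
    topic_word = [normalize([rng.random() for _ in range(n_words)])
                  for _ in range(n_topics)]

    for _ in range(n_iter):
        new_doc_topic = [[0.0] * n_topics for _ in range(n_docs)]
        new_topic_word = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # Step 1: divide responsibility for this word among the topics.
                weights = normalize([doc_topic[d][k] * topic_word[k][w]
                                     for k in range(n_topics)])
                # Step 2: accumulate those shares to re-estimate both tables.
                for k in range(n_topics):
                    share = counts[d][w] * weights[k]
                    new_doc_topic[d][k] += share
                    new_topic_word[k][w] += share
        doc_topic = [normalize(row) for row in new_doc_topic]
        topic_word = [normalize(row) for row in new_topic_word]
    return doc_topic, topic_word


def top_words(topic_word, vocab, n=3):
    """Attach words back onto the numbers: the n most probable words per topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: -row[i])
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Toy corpus, invented purely for illustration.
texts = [
    "whale ship sea whale captain",
    "sea ship storm captain sea",
    "love letter marriage love estate",
    "estate marriage letter sister love",
]
vocab, counts = word_count_matrix(texts)
doc_topic, topic_word = fit_topics(counts, n_topics=2)
for number, words in enumerate(top_words(topic_word, vocab)):
    print("topic", number, ":", " ".join(words))
```

Nothing in the loop consults the meaning, order, or grammar of the words. The program sees only the counts, and the word lists it prints at the end are labels reattached to rows of numbers. Because the starting tables are random, a different `seed` may yield differently numbered or differently composed lists.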
Yet, as Alan Liu has argued, there has been relatively little interchange between the scholars who are applying these computational methods to literary history and those in fields like media studies who critically examine the history and culture from which this computational technology emerged (“Where Is Cultural Criticism in the Digital Humanities?”). Many scholars of technology, including Lisa Gitelman, Wendy Hui Kyong Chun, Tara McPherson, and David Golumbia, have argued that the seemingly abstract structures of computation can serve ideological ends; but scholars who apply text mining to literary and cultural history have largely skirted the question of how the technologies they use might be influenced by the military and commercial contexts from which they emerged (Gitelman, Paper Knowledge; Chun, Control and Freedom; McPherson, “Why Are the Digital Humanities So White?”; Golumbia, Cultural Logic of Computation). As a way of gesturing toward a fuller understanding of the cultural context surrounding text-mining methods, I will give a brief account of the origins of a popular technique for topic modeling, Latent Dirichlet Allocation (LDA), and attempt to situate text mining in a broader history of thinking about language. I identify a congruity between text mining and the language standardization efforts that began in the seventeenth and eighteenth centuries, when authors such as John Locke called for the stabilization of vocabularies and devalued “literary” dimensions of language such as metaphor, wordplay, and innuendo as impediments to communication. I argue that, when applied to the study of literary and cultural texts, statistical text-mining methods tend to reinforce conceptions of language and meaning that are, at best, overly dependent on the “literal” definitions of words and, at worst, complicit in the marginalization of nonstandard linguistic conventions and modes of expression.
While text-mining methods could potentially give us an ideologically skewed picture of literary and cultural history, a shift toward a media studies perspective could enable scholars to engage with these linguistic technologies in a way that keeps their alienness in sight, foregrounding their biases and blind spots and emphasizing the historical contingency of the ways in which computers “read” texts. What makes text mining interesting, in this view, is not its potential to “revolutionize” the methodology of the humanities, as Matthew Jockers claims, but the basic fact of its growing influence in the twenty-first century, given the widespread adoption of statistical methods in applications like search engines, spellcheckers, autocomplete features, and computer vision systems. Thinking of text-mining programs as objects of cultural criticism could open up an interchange between digital scholarship and the critical study of computers that is productive in both directions. The work of media theorists who study the ideological structures of technology could help us better understand the effects that computerization could have on our scholarly practice, both in explicitly digital work and in more traditional forms of scholarship that employ technologies like databases and search engines. On the other side, experimenting with techniques such as topic modeling in a critical frame could support a more robust analysis of the cultural authority that makes these technologies seem natural at the present moment, baring the ideological assumptions that underlie the quantification of language, and creating, perhaps, a renewed sense of the strangeness of the idea that words can be understood through the manipulation of numbers.