CRL Reports

When Machines Do Research, Part 2: Text-Mining and Libraries


There has been a fair amount of discussion of late in information industry circles about text mining. Researchers in academia now have access to immense corpora of text that are openly available on the Web: the millions of public domain books and serials available courtesy of Google; and vast troves of government documents courtesy of “open government” initiatives in the U.S. and U.K. and third-party actors like WikiLeaks and the National Security Archive. The growing application of text mining techniques and technologies in many fields of research has implications that are beginning to be felt by libraries.

Text mining is generally defined as the automated processing of large amounts of digital data or textual content for purposes of information retrieval, extraction, interpretation, and analysis. Modern researchers now employ proprietary and open source software and tools to process and make sense of the oceans of information at their disposal in ways never before possible. Most text mining involves downloading a fixed body of text and accompanying metadata to a local host system or platform, and running it through certain processes that can detect patterns, trends, biases, and other phenomena in the underlying content. These phenomena can then form the basis for new observations, visualizations, models, and so forth.1
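
The workflow described above (a fixed body of text plus metadata, held locally and run through a pattern-detecting process) can be illustrated with a minimal sketch. It assumes a tiny invented corpus and a simple word count; the names CORPUS, STOPWORDS, tokenize, and term_counts are hypothetical and do not come from any tool mentioned in this report.

```python
import re
from collections import Counter

# Hypothetical stand-in for a locally hosted corpus: (metadata, text) pairs.
CORPUS = [
    ({"source": "doc-a", "year": 2010},
     "Libraries license databases and libraries negotiate mining rights."),
    ({"source": "doc-b", "year": 2011},
     "Researchers want mining rights for licensed databases."),
]

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "for"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def term_counts(corpus):
    """Count how often each term occurs across the whole corpus."""
    counts = Counter()
    for _metadata, text in corpus:
        counts.update(tokenize(text))
    return counts


counts = term_counts(CORPUS)
print(counts.most_common(4))
```

Real projects replace the word count with far richer processing, but the shape is the same: the text is copied to a local system first, and the analysis runs over the whole copy.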


The wealth of content available on the open Web has created tremendous opportunities for academic and non-academic researchers alike. Some of the most sophisticated and ambitious work is in the fields of national security, business, and political media. Content mining was pioneered by U.S. intelligence agencies and their contractors and service providers, who since the 1990s have been continuously mining “open sources” for clues to public opinion abroad and for evidence of threats to American interests in foreign postings and Web and cable broadcasts.

Not surprisingly, the field of journalism is close behind the security industry. Organizations like the New York Times, Guardian, Associated Press, and others have begun to deploy text processing and mining techniques to deal with the vast corpora of government and NGO information released, or leaked, to the Web. In October 2010 WikiLeaks made available online 391,832 classified U.S. Defense Department SIGACT (“significant action”) incident reports from the Iraq War in an SQL file. This was far too large an archive to be sifted through effectively by humans on the urgent timetable dictated by the news cycle. The Associated Press, in collaboration with the Guardian and other news organizations, created an open-source system to enable the rapid identification, weighting, and visualization of topics in the massive digital archive. AP’s Jonathan Stray reported that “Whether they’re leaked, obtained under freedom of information laws, or released as part of government transparency initiatives, large document sets have become a significant part of journalism.”2
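
To give a sense of what "weighting" terms in a document set can mean, here is a minimal sketch of TF-IDF, one common weighting scheme. It is offered only as an illustration: this report does not say which method the AP system uses, and the sample texts and the names DOCS and tfidf are invented for the example.

```python
import math
import re
from collections import Counter

# Invented placeholder documents, not drawn from any real archive.
DOCS = [
    "budget hearing on library funding",
    "library funding cut after budget hearing",
    "new archive opens at the library",
]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / documents containing term)
    """
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


weights = tfidf(DOCS)
# Terms found in every document score zero; rarer terms score higher.
print(sorted(weights[2].items(), key=lambda kv: kv[1], reverse=True)[:3])
```

A term that appears in every document ("library" here) receives a weight of zero, while a term confined to one document stands out, which is why schemes of this kind are useful for surfacing what is distinctive about each item in a large set.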


Much is being done in the academic realm as well. In a 2008-2009 study, Matthew Gentzkow and Jesse M. Shapiro at the University of Chicago’s Booth School of Business used algorithms of their own devising to process the texts of speeches published in the Congressional Record and thousands of articles in big-city newspapers from a large ProQuest database. This allowed them to map systematically the demographic trends influencing editorial bias in several major U.S. metropolitan media outlets.3

Political science researchers at Harvard University’s Berkman Center for Internet and Society have also used high-powered computing to process and analyze open Web content. Some of their work relies on tools and processing services originally developed for the business world. In 2008 researchers at the Center, assisted by the firm Morningside Analytics, published an impressive, high-level analysis of political chatter on the Iranian blogosphere. Morningside harvested and tagged the texts of more than 60,000 blogs over a period of several months and “mapped” the topics discussed, detecting thought trends, spheres of influence, and other social-intellectual phenomena.4 More recently the Berkman Center created its own Media Cloud, an open data platform that will eventually enable researchers to analyze and track patterns and trends in online news.

Perhaps the most ambitious use of text mining in an academic setting is the Cline Center for Democracy’s Societal Infrastructure and Development Project (SID). For several years now, the Cline Center has been mining a digital corpus of post-World War II news texts, gathered from various commercial publishers, in order to test certain theories about economic development and civil society. The project draws upon the complete news texts published by the New York Times and the Wall Street Journal between 1946 and 2002, and the millions of broadcast transcripts produced during the same period by two intelligence agency news services: the Foreign Broadcast Information Service (CIA) and the BBC’s Summary of World Broadcasts.5

The greatest promise for text mining in the academic domain, however, may be in the sciences. The Cultural Observatory, at Harvard, plans to inaugurate a browser that mines the text of the millions of articles in the arXiv open Web high-energy physics archive, hosted by Cornell University. On a site called Bookworm-arXiv, one can track and graph the occurrence of particular terms in articles deposited in the repository between 1995 and 2007. The Cultural Observatory has also developed a tool designed to chart the frequency of words and phrases in the more than 5.2 million books digitized by Google.6
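
Tracking the occurrence of a term over time, as described above, comes down to counting the term's hits per year and normalizing by the amount of text from that year. The following is a minimal sketch under stated assumptions: the records are invented, and RECORDS and yearly_frequency are hypothetical names, not part of Bookworm-arXiv or any Cultural Observatory tool.

```python
import re
from collections import defaultdict

# Invented (year, text) records standing in for dated repository entries.
RECORDS = [
    (1995, "Lattice results for the quark mass"),
    (1995, "A note on string theory"),
    (2007, "String theory and string dualities"),
]


def yearly_frequency(records, term):
    """Return {year: occurrences of `term` per 1,000 tokens}, ordered by year."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    # Normalizing by the year's token total keeps years with more text
    # from dominating the trend line.
    return {year: 1000 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


trend = yearly_frequency(RECORDS, "string")
for year, rate in trend.items():
    print(year, round(rate, 1))
```

The resulting year-to-rate mapping is what a charting tool would plot; the normalization step matters because the volume of deposited text usually differs greatly from year to year.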


Success in mining open access materials has led to demands that commercial databases, such as science journal and humanities databases, be opened to this kind of analysis. The problem is that text mining usually entails copying or downloading and locally re-hosting large amounts of text. Without the permission of the copyright holders, such activity may violate copyright laws in most jurisdictions, and the size and scope of large text corpora normally make securing this permission prohibitively time-consuming or expensive.

Much of the recent debate about text mining, in fact, revolves around the issue of rights. That debate has risen to the national policy level in the U.K. In 2010 British Prime Minister David Cameron ordered an independent review of whether and to what degree the national intellectual property framework supports growth and innovation in the U.K. The review team, chaired by Ian Hargreaves of Cardiff University, issued its report, “Digital Opportunity: a Review of Intellectual Property and Growth,” in May 2011. The Hargreaves report called for major changes in the U.K. copyright regime. The authors asserted that current copyright law obstructed development in the U.K. technology and information sectors. In particular, Hargreaves recommended that the U.K. government “introduce a UK exception –– under the non-commercial research heading to allow use of analytics for noncommercial use –– as well as promoting at EU level an exception to support text mining and data analytics for commercial use.”

More focused on academic research was the 2011 study “Journal Article Mining: A research study into Practices, Policies, Plans … and Promises” by two Dutch consultants, Eefke Smit and Maurits van der Graaf.7 The study examined extant content mining activities in the sciences, and surveyed publishers about their views on the subject and expectations for the future.

The JISC-sponsored March 2012 report by Diane McDonald and Ursula Kelly, The Value and Benefits of Text Mining, echoed the Hargreaves report, offering a number of case studies illustrating the potential economic benefits of text mining. The JISC report also asserted that the barriers to wider research use are not only legal, but financial and technological as well.

In April 2012 investment analyst Claudio Aspesi noted the potential impact on the publishing world of the rising demand for open content among academic researchers. University researchers, Aspesi observed, are “increasingly protesting the limitations to the usage of the information and data contained in the articles published through subscription models, and –– in particular –– to the practice of text mining articles.” Most of this pressure is coming from researchers in the fields of chemistry, biology, physics, and linguistics. Aspesi even predicted that “The arguments which academics are putting forward could further inflame the Open Access debate by leading critics to conclude that commercial subscription publishers, in addition to charging excessive prices for accessing research, are hindering the work of researchers as well.”8

Sure enough, Cambridge University chemist Peter Murray-Rust is now spearheading an initiative among scientists to adopt a “Content Mining Declaration.” The declaration identifies a broad set of rights that researchers “should assert,” which include unlimited machine processing of texts, reproduction of “facts and excerpts” discovered through mining, and so forth.


Some publishers of commercial databases have already begun to respond to this demand in various ways. Some allow downloading of text (with permission) and, in some instances, even provide an API for mining the content of their text databases. In its recently released database, Nineteenth Century Collections Online, Gale makes available raw text files of the books and documents included in the database, which can be downloaded and mined on the user’s own device. The most common solution, however, is to incorporate analytical capabilities into the proprietary databases themselves. We have already seen the creation and deployment of these tools in Dow Jones’s Factiva iWorks and Bloomberg Professional. Text mining tools will also be a feature of the new Elsevier and Nature journals platforms.

Aside from the obvious limitations of proprietary tools, there is also the issue of interoperability of content mining tools from one publisher’s content to another’s. The Smit and van der Graaf study suggests some measures that commercial journal publishers might take to facilitate content mining across multiple publishers’ content. These include the creation of a shared content mining platform; commonly agreed permission rules for research-based mining requests; and standardization of mining-friendly content formats.

A particular challenge for academic libraries will also be to provide researchers access to text mining tools and engines that enable replication of this kind of analysis by peers in the field, in order to test a given researcher’s findings or results. These tools can’t be one-off or “home brewed” solutions. There will have to be some uniformity to them within (and potentially across) disciplines and fields.

Building the right kinds of conditions and terms for computer-assisted processing and analysis of commercial database content is going to be difficult without a clear sense of the practices in this kind of research activity. Librarians can play a critical role in this process but only if they fully understand the practices of their constituents and integrate that understanding into their licensing and resource development work.9

  1. Text mining is distinct from real-time processing of live Web content that uses APIs to generate new analysis, often based on interactions between Web sites, social networks, etc.
  2. Stray, Jonathan. AP on Guardian Datablog, “Wikileaks Iraq: How to Visualise the Text.” December 16, 2010 <>.
  3. Gentzkow, Matthew and Jesse M. Shapiro, “What Drives Media Slant? Evidence from U.S. Daily Newspapers,” Econometrica, 78:1 (January 2010), 35–71.
  4. See the report at <>.
  5. For more about the SID project, see <>.
  6. On these developments at the Cultural Observatory, see Anne Eisenberg, “Avalanches of Words, Sifted and Sorted,” New York Times, March 25, 2012.
  7. “Journal Article Mining: A research study into Practices, Policies, Plans … and Promises.” Amsterdam: Commissioned by the Publishing Research Consortium, May 2011 <>.
  8. Aspesi, Claudio; Rosso, Andrea. “Reed Elsevier: Is Elsevier Heading for a Political Train-Wreck?” in AB Bernstein Research, April 20, 2012.
  9. The Center for Research Libraries recently hosted a Global Resources Collections Forum, New Horizons in Primary Source Research, which featured two presentations on mining and analysis of primary text content by researchers in the humanities and social sciences: “Analysis and Visualization Using Large Bodies of Electronic Text,” by Elizabeth Long and Peter Leonard of the University of Chicago, and “Old News, New Research: Observations from the Field,” by Debora Cheney of Penn State. See <> for these presentations.