text and data mining
We’re taking part in Copyright Week, a series of actions and discussions supporting key principles that should guide copyright policy. Every day this week, various groups are taking on different elements of the law, and addressing what’s at stake, and what we need to do to make sure that copyright promotes creativity and innovation.
Today’s topic is about supporting fair use, a legal doctrine in the United States and a few other countries that permits some uses of copyrighted works without the author’s permission for purposes such as parody, criticism, teaching, and news reporting. Fair use is an important check on the exclusive bundle of rights granted to authors under copyright law. Fair use is considered a “limitation and exception” to copyright.
One area of particular importance within limitations and exceptions to copyright is the practice of text and data mining. Text and data mining typically consists of computers analyzing large amounts of text or data, and it can reveal connections between textual and other types of content that would otherwise go unnoticed. Understanding these new connections can enable new research capabilities that result in novel scholarly discoveries and critical scientific breakthroughs. Because of this, text and data mining is increasingly important for scholarly research.
Recently the United Kingdom enacted legislation specifically excepting noncommercial text and data mining from copyright. And as the European Commission conducts its review of EU copyright rules, some groups have called for the addition of a specific text and data mining exception. Copyright for Creativity’s manifesto, released Monday, urges the European Commission to add a new exception for text and data mining, in order to support new uses of technology and user needs.
Another view holds that text and data mining activities should be considered outside the purview of copyright altogether. The response from the Communia Association to the EU copyright consultation takes this approach, saying “if text and data mining would be authorized by a copyright exception, it would constitute a de facto recognition that text and data mining are not legitimate usages. We believe that mining texts and data for facts is an activity that is not and should not be protected by copyright and therefore introducing a legislative solution that takes the form of an exception should be avoided.” Similarly, there have been several actions advocating that “The right to read should be the right to mine.”
Whether text and data mining falls under a copyright exception or outside the scope of copyright, it is clearly an activity that the copyright owner should not be able to control. But unfortunately, that is exactly what some incumbent publishing gatekeepers are trying to do by setting up restrictive contractual agreements. One example we’ve seen of this practice is the deployment of a set of “open access” licenses from the International Association of Scientific, Technical & Medical Publishers (STM), many of which attempt to restrict text and data mining of the licensed publications. In jurisdictions such as the United States, users do not need to ask permission (or be granted permission through a license) to conduct text and data mining because the activity either falls outside of the scope of copyright or is squarely covered by fair use.
Ensuring that licenses give copyright owners no more control over their content than they have under copyright law is a fundamental principle of CC licensing. That’s why the licenses explicitly state that they in no way restrict uses that are under a limitation or exception to copyright. This means that users do not have to comply with the license for uses of the material permitted by an applicable limitation or exception (such as fair use) or uses that are otherwise unrestricted by copyright law, such as text and data mining in many jurisdictions.
Today’s topic of fair use rights reminds us that “for copyright to achieve its purpose of encouraging creativity and innovation, it must preserve and promote ample breathing space for unexpected and innovative uses.” To liberate the massive potential for innovation made possible by existing and future types of text and data mining, we need user-focused copyright policy that enables these new activities.
This post was written with invaluable assistance from the CC legal and policy teams.
Text and data mining (TDM) is becoming an increasingly important scientific technique for analyzing large amounts of data. The technique is used to uncover both existing and new insights in unstructured data sets that typically are obtained programmatically from many different sources.
A few of the innovative examples include GeoDeepDive, a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles; improving human curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database; and discovering a new link between genes and osteoporosis.
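To give a sense of what the simplest form of such mining looks like, here is a minimal sketch that counts how often two kinds of terms appear in the same sentence. It is not how any of the projects above work; the corpus, the term lists (GENE_A, GENE_B), and the function names are all made up for illustration, and real systems add OCR, entity recognition, and statistical testing on top of this idea.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy corpus. A real project would load lawfully accessed full texts.
DOCUMENTS = [
    "Variants in GENE_A were associated with reduced bone density in osteoporosis patients.",
    "GENE_B expression was unchanged in the osteoporosis cohort. GENE_A was not measured.",
    "We studied GENE_B in diabetes and found elevated expression.",
    "GENE_A knockout mice developed osteoporosis.",
]

# Hypothetical term lists, lowercased to match the tokenizer below.
GENES = {"gene_a", "gene_b"}
CONDITIONS = {"osteoporosis", "diabetes"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9_]+", sentence.lower()))


def cooccurrences(documents, left_terms, right_terms):
    """Count sentences in which a left term and a right term both appear."""
    counts = Counter()
    for doc in documents:
        for sentence in split_sentences(doc):
            tokens = tokenize(sentence)
            found_left = sorted(left_terms & tokens)
            found_right = sorted(right_terms & tokens)
            for pair in product(found_left, found_right):
                counts[pair] += 1
    return counts


PAIR_COUNTS = cooccurrences(DOCUMENTS, GENES, CONDITIONS)

if __name__ == "__main__":
    for (gene, condition), n in PAIR_COUNTS.most_common():
        print(f"{gene} / {condition}: {n} sentence(s)")
```

Note that even this toy version has to hold a working copy of every document in memory in order to analyze it, which is one reason the legal status of copying for the purpose of mining matters so much.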
While the science and technology of TDM are complex enough, involving information retrieval (IR), optical character recognition (OCR), and natural language processing (NLP), the legal complications are, sadly, equally dizzying. The legal status of TDM is unclear at best, both because there are a multitude of techniques to engage in TDM, and because the legal implications of those techniques vary from jurisdiction to jurisdiction. This makes cross-national collaboration, integral to science, difficult. For example, TDM is generally considered not to implicate copyright in the U.S. There are several theories as to why TDM falls outside copyright, but the most obvious is that it uses copyrighted material for a transformative purpose and is therefore a fair use. As Judge Baer wrote in Authors Guild, Inc., et al. v. HathiTrust, et al. (Case 1:11-cv-06351-HB):
“The use to which the works in the HDL are put is transformative because the copies serve an entirely different purpose than the original works: the purpose is superior search capabilities rather than actual access to copyrighted material. The search capabilities of the HDL have already given rise to new methods of academic inquiry such as text mining.”
Judge Baer goes on to state:
“I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants’ MDP and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts.”
The clarity, however, is far from universal as the situation outside the U.S. gets muddy. While there have been a few welcome developments in the U.K., the copyright laws of many other countries have little to no clarity on whether TDM falls outside of the reach of copyright and related laws. Where TDM does implicate copyright, the license status of the original material can make automated access and analysis very complicated, requiring additional checks to ensure any material is only being used as permitted by the license. And, even where the relevant licenses are free and open, and conducive to TDM, contractual agreements between research institutions and publishers, who are often the gatekeepers of the corpora, can create significant hurdles.
In comments on the proposed U.K. exception for information mining, both iCommons and the Open Knowledge Foundation (OKFN) supported the UK Government’s opinion that it is inappropriate for “Certain activities of public benefit such as medical research obtained through text mining to be in effect subject to veto by the owners of copyrights in the reports of such research, where access to the reports was obtained lawfully.” PLOS opined, “Enabling content mining is a core part of the value offering for Open Access publication services.” In its response to the EU copyright review, LIBER stated, “All exceptions related to education, learning and access to knowledge to be made mandatory. In particular, we would like to see a specific exception for text and data mining for all research purposes.” OKFN’s Working Group on Open Access stated:
“We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes.”
Support for text and data mining under the banner of “The right to read is the right to mine” has been demonstrated by other organizations, including in declarations by Copyright for Creativity (July 2013) and the International Federation of Library Associations and Institutions (December 2013). If we as a society wish to realize the incredible potential of text and data mining, the practice should not be controlled through contractual terms or licensing.
Instead of relying on contractual restrictions or licensing to engage in text and data mining, non-consumptive uses of texts should be expressly eliminated from the reach of copyright and contract. The UK’s Hargreaves Report (PDF, p. 47) suggested the adoption of an exception to copyright law for non-consumptive uses, which are “uses of a work enabled by technology which does not trade on the underlying creative and expressive purpose of the work.”
Most recently, the UK copyright reform legislation introduced changes that make it easier to engage in TDM for non-commercial purposes, allow storing of the corpus locally as long as it remains protected from general public access, and, perhaps most importantly, disallow contractual terms that would make it difficult to conduct TDM.
The above sentiments are laudable, copyright reforms friendly to TDM are very important, and we support such efforts. However, we believe that the more potential users of TDM know about the technology and related issues, the better they will be able to negotiate conditions that make their research easy and efficient. Hence, we want to push forward with education and awareness building as a bottom-up effort.
Building Bottom-Up Support
We are working with the ContentMine team to develop an agenda for a workshop that would provide training in TDM and educate participants about the legal considerations through hands-on exercises. We will introduce the topic, the tools and techniques, tackle a specific problem, and then use that to expose researchers to the legal complications that they may encounter in conducting their research and the legal considerations they should keep in mind when choosing a license for their works. We have three objectives for this series of workshops:
- Introduce participants to the basic tools and techniques of text and data mining (TDM);
- Make participants aware of the legal intricacies of TDM and the implications of choosing the right licenses that enable TDM for downstream users;
- Nurture a community of practice whose members may draw upon each other for continued help.
To be clear, we do not intend the workshop to be a detailed and comprehensive training in TDM, and it is certainly not a replacement for expertise in this deep and complex technique. Instead, the workshop is designed to be both an introduction to basic technical and legal concepts and an opportunity to network with experts and novices interested in the field. We hope participants intending to use TDM for their work will be better informed when seeking collaboration with TDM experts.
The first instance of this workshop will be held at the 2014 Open Knowledge Festival. We hope to follow it with one in Nairobi in Aug 2014 at the International Workshop on Open Data for Science and Sustainability in Developing Countries (OpenDataSSDC) organized by the CODATA Task Group on Preservation of and Access to Scientific and Technical Data in Developing Countries (CODATA PASTD), and one possibly at SciDataCon in New Delhi in Nov 2014. We hope to make these workshops a recurring event, building a roster of interesting exercises and problems to solve, and constantly improving the content based on audience feedback and ongoing research.
In cooperation with computing, legal and library experts, we will adapt the workshop agenda to make it more suitable and relatable to the host institutions. Our aim is to reach communities of researchers in countries that are otherwise under-represented in the global conversation on open science and data. We have identified researchers on both the technical and the legal side with whom we intend to start building a network, and we will continue to identify more. If you are working with TDM or intend to work with TDM, and have expertise either in its technology or in related legal issues specific to your jurisdiction, please contact us.
We also intend to develop a community of practice for TDM, either standalone or via existing platforms such as StackExchange, and will utilize online resources such as forums, mailing lists, and a roster of technical, legal and institutional experts available to provide assistance with TDM.