
Liberating the Haystack for the Needles

Puneet Kishor, June 2nd, 2014

This post was written with invaluable assistance from the CC legal and policy teams.

Text and data mining (TDM) is becoming an increasingly important scientific technique for analyzing large amounts of data. The technique is used to uncover both existing and new insights in unstructured data sets that typically are obtained programmatically from many different sources.


PBDB Navigator screenshot released under a CC0 1.0 Public Domain Dedication

A few of the innovative examples include GeoDeepDive, a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles; improving human curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database; and discovering a new link between genes and osteoporosis.

Legal Uncertainty

While the science and technology of TDM are complex enough, involving information retrieval (IR), optical character recognition (OCR), and natural language processing (NLP), the legal complications are, sadly, equally dizzying. The legal status of TDM is unclear at best, both because there is a multitude of techniques for engaging in TDM and because the implications of those techniques vary from jurisdiction to jurisdiction. This makes cross-national collaboration, integral to science, difficult. For example, TDM is generally considered not to implicate copyright in the U.S. There are several theories as to why TDM falls outside copyright, but the most obvious is that it uses copyrighted material for a transformative purpose and is therefore a fair use. As Judge Baer wrote in Authors Guild, Inc., et al. v. HathiTrust, et al. (Case 1:11-cv-06351-HB):

“The use to which the works in the HDL are put is transformative because the copies serve an entirely different purpose than the original works: the purpose is superior search capabilities rather than actual access to copyrighted material. The search capabilities of the HDL have already given rise to new methods of academic inquiry such as text mining.”

Judge Baer goes on to state:

“I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants’ MDP and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts.”

The clarity, however, is far from universal as the situation outside the U.S. gets muddy. While there have been a few welcome developments in the U.K., the copyright laws of many other countries have little to no clarity on whether TDM falls outside of the reach of copyright and related laws. Where TDM does implicate copyright, the license status of the original material can make automated access and analysis very complicated, requiring additional checks to ensure any material is only being used as permitted by the license. And, even where the relevant licenses are free and open, and conducive to TDM, contractual agreements between research institutions and publishers, who are often the gatekeepers of the corpora, can create significant hurdles.
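The "additional checks" mentioned above can be pictured with a minimal sketch. It assumes each document in a corpus carries a license identifier (written here in the SPDX short-form style) and that the project has already decided which licenses it treats as compatible with mining. The `Document` class, the `ALLOWED_LICENSES` set, and the `partition_by_license` function are all hypothetical names introduced for illustration; the choice of licenses in the set is a placeholder, not a legal conclusion.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist. Which licenses a project treats as compatible
# with TDM is that project's decision; this set is only a placeholder.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-3.0", "CC-BY-4.0"}


@dataclass
class Document:
    identifier: str
    license: Optional[str]  # None when the license is unknown
    text: str


def partition_by_license(documents, allowed=ALLOWED_LICENSES):
    """Split documents into those cleared for mining and those held back.

    Documents with an unknown or unlisted license are held back for
    manual review rather than mined.
    """
    cleared, held_back = [], []
    for doc in documents:
        if doc.license in allowed:
            cleared.append(doc)
        else:
            held_back.append(doc)
    return cleared, held_back


# Example corpus with made-up identifiers.
corpus = [
    Document("article-1", "CC-BY-3.0", "Fossil occurrences in ..."),
    Document("article-2", None, "Text of unknown provenance ..."),
    Document("article-3", "All-Rights-Reserved", "Subscription content ..."),
]
cleared, held_back = partition_by_license(corpus)
```

Even a check this simple has to run on every document before any analysis begins, and it says nothing about the contractual agreements that may also apply, which is part of why automated access becomes complicated.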

Public Sentiment

In a comment on the proposed U.K. exception for information mining, both iCommons and the Open Knowledge Foundation (OKFN) supported the U.K. Government’s opinion that it is inappropriate for “Certain activities of public benefit such as medical research obtained through text mining to be in effect subject to veto by the owners of copyrights in the reports of such research, where access to the reports was obtained lawfully.” PLOS opined, “Enabling content mining is a core part of the value offering for Open Access publication services.” In its response to the EU copyright review, LIBER stated, “All exceptions related to education, learning and access to knowledge to be made mandatory. In particular, we would like to see a specific exception for text and data mining for all research purposes.” OKFN’s Working Group on Open Access stated:

“We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes.”

Support for text and data mining under the banner “The right to read is the right to mine” has also come from other organizations, including declarations by Copyright for Creativity (July 2013) and the International Federation of Library Associations and Institutions (December 2013). If we as a society wish to realize the incredible potential of text and data mining, the practice should not be controlled through contractual terms or licensing.

Instead of relying on contractual restrictions or licensing to engage in text and data mining, non-consumptive uses of texts should be expressly eliminated from the reach of copyright and contract. The UK’s Hargreaves Report (PDF, p. 47) suggested the adoption of an exception to copyright law for non-consumptive uses, which are “uses of a work enabled by technology which does not trade on the underlying creative and expressive purpose of the work.”

Most recently, the UK copyright reform legislation introduced changes that make it easier to engage in TDM for non-commercial purposes, allow storing of the corpus locally as long as it remains protected from general public access, and, perhaps most importantly, disallow contractual terms that would make it difficult to conduct TDM.

The above sentiments are laudable, and copyright reforms friendly to TDM are very important, and we support such efforts. However, we believe the more knowledgeable potential users of TDM are about the technology and related issues, the better they will be able to negotiate conditions that make their research easy and efficient. Hence, we want to push forward with education and awareness building as a bottom-up effort.

Building Bottom-Up Support



Image by R. Mounce extracted from doi: 10.11646/phytotaxa.163.5.1, licensed under the Creative Commons Attribution (CC BY) 3.0 license

We are working with the ContentMine team to develop an agenda for a workshop that would provide training in TDM and, through hands-on exercises, educate participants about the legal considerations. We will introduce the topic, the tools, and the techniques, tackle a specific problem, and then use that problem to expose researchers to the legal complications they may encounter in conducting their research and the legal considerations they should keep in mind when choosing a license for their works. We have three objectives for this series of workshops:

  1. Introduce participants to the basic tools and techniques of text and data mining (TDM);
  2. Make participants aware of the legal intricacies of TDM and the implications of choosing the right licenses that enable TDM for downstream users;
  3. Nurture a community of practice whose members may draw upon each other for continued help.

To be clear, we do not intend the workshop to be a detailed and comprehensive training in TDM, and it is certainly not a replacement for expertise in this deep and complex technique. Instead, the workshop is designed to be both an introduction to basic technical and legal concepts and an opportunity to network with experts and novices interested in the field. We hope participants intending to use TDM for their work will be better informed when seeking collaboration with TDM experts.


Original artwork by Puneet Kishor released under CC0 Public Domain Dedication

The first instance of this workshop will be held at the 2014 Open Knowledge Festival. We hope to follow it with one in Nairobi in Aug 2014 at the International Workshop on Open Data for Science and Sustainability in Developing Countries (OpenDataSSDC) organized by the CODATA Task Group on Preservation of and Access to Scientific and Technical Data in Developing Countries (CODATA PASTD), and one possibly at SciDataCon in New Delhi in Nov 2014. We hope to make these workshops a recurring event, building a roster of interesting exercises and problems to solve, and constantly improving the content based on audience feedback and ongoing research.

In cooperation with computing, legal, and library experts, we will adapt the workshop agenda to make it more suitable and relatable to the host institutions. Our aim is to reach communities of researchers in countries that are otherwise under-represented in the global conversation on open science and data. We have identified researchers on both the technical and legal sides, and will continue to identify more, with whom we intend to start building a network. If you are working with TDM or intend to work with TDM, and have expertise either in its technology or in related legal issues specific to your jurisdiction, please contact us.

We also intend to develop a community of practice for TDM, either standalone or via existing platforms such as StackExchange, and will utilize online resources such as forums, mailing lists, and a roster of technical, legal and institutional experts available to provide assistance with TDM.
