Yesterday, NBC News published a story about IBM’s work on improving diversity in facial recognition technology and the dataset that they gathered to further this work. The dataset includes links to one million photos from Flickr, many or all of which were apparently shared under a Creative Commons license. Some Flickr users were dismayed to learn that IBM had used their photos to train the AI, and had questions about the ethics, privacy implications, and fair use of such a dataset being used for algorithmic training. We are reaching out to IBM to understand their use of the images, and to share the concerns of our community.
CC is dedicated to facilitating greater openness for the common good. In general, we believe that the use of publicly available data on the Internet has led to greater innovation, collaboration, and creativity. But there are also real concerns that data can be used for harmful activities or lead to negative outcomes.
While we do not have all the facts regarding the IBM dataset, we are aware that fair use allows all types of content to be used freely, and that all types of content are collected and used every day to train and develop AI. CC licenses were designed to address a specific constraint, which they do very well: unlocking restrictive copyright. But copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online. Those issues rightly belong in the public policy space, and good solutions will consider both the law and the community norms of CC licenses and content shared online in general.
I hope we will use this moment to build on the important principles and values of sharing, to engage in discussion with those using our content in objectionable ways, and to speak out on, and help shape positive outcomes for, the important issues of privacy, surveillance, and AI that affect the sharing of works on the web.
We are taking this opportunity to speak to this particular type of reuse – improving artificial intelligence tools designed for facial recognition through the reuse of content found on the web (not just CC-licensed content) – to help clarify how the licenses work in this context. We have published new FAQs here that we will continue to update.
If you have comments or questions, please write to CC at email@example.com. We will also be creating other opportunities to engage in public discussion in the coming weeks and months. We look forward to joining these discussions as we look for ways to resolve ethical public policy issues around data, AI, and machine learning as a community.