This is part of a series of posts introducing the projects built by open source contributors mentored by Creative Commons during Google Summer of Code (GSoC) 2020 and Outreachy. K. S. Srinidhi Krishna and Charini Nanayakkara were two of those contributors and we are grateful for their work on this project.
The Creative Commons (CC) Catalog collects and stores metadata about more than 500 million CC-licensed images scattered across the internet so that they can be made accessible to the general public via the CC Search and CC Catalog API tools. Creative Commons uses two approaches to collect metadata about CC-licensed images:
- Gather metadata directly from the HTML around the image on the webpage where it appears. We source this HTML from an open-source web-crawling project called CommonCrawl (link).
- Pull metadata about an image from a service provided by some image hosts called an API (Application Programming Interface). An API is an interface whereby a computer program can access some hosted data (or metadata) in a consistent and organized manner suitable for developing a pipeline to further process that data.
During our internship, we worked on using the second method described above to expand the number of images indexed in CC Catalog, as well as improve related infrastructure. We refer to a computer program that implements the second method for a given provider as a Provider API Script.
The CC Catalog collects and stores metadata about more than 500 million CC-licensed images scattered across the internet so that they can be made accessible.
Before diving into the specifics of our contributions, it’s worth clarifying precisely what is meant by a Provider API script. The providers are essentially the cultural or other organisations (such as museums and image hosting services) which host CC-licensed images. Some of these organisations provide APIs to allow external entities to access data such as image metadata hosted by them. The Provider API scripts which can be found in the CC Catalog repository are programs which make use of these APIs to automatically retrieve and aggregate metadata about CC-licensed images hosted by different providers.
Integration of new provider API scripts
Newly implemented Provider API scripts are:
- Science Museum: The Science Museum collection has around 60,000 CC-licensed images from the Science Museum, Science and Industry Museum, National Science and Media Museum, National Railway Museum and Locomotion.
- Statens Museum: Statens Museum for Kunst is Denmark’s leading museum for artwork. This is a new integration and 39115 images have been collected.
- Museums Victoria: Museums Victoria in Australia, features collections of zoology, geology, palaeontology, history, indigenous cultures and technology. It has around 140,000 images.
- NYPL: New York Public Library is a new integration. As of now, it has around 1296 images.
Grouping images by their source
While the organisations that host CC-licensed images are referred to as providers, there is the possibility that some of these providers obtain those images from third party organisations. We refer to such third party organisations as the image source.
For example, NASA is a source organisation that publishes their images through the provider Flickr (which is an American image and video hosting service) as well as the provider nasa.gov. Previously, when accessing CC-licensed images through the CC Search tool, users were only able to categorize those images based on the provider but not the source. However, for images made available by certain significant providers such as Flickr and the Smithsonian, the images are now grouped by their source organisations rather than the provider.
After this implementation, all images from a single source—which were previously scattered across different providers—are accessible from one place. The following screenshot shows how the images from the Smithsonian Institution (a provider) are now categorized under multiple source organizations (museums and research centers within the Smithsonian) in the CC Search tool.
Scheduled reingestion of Europeana images
Europeana is home to over 58 million artworks, artefacts, books, films and music from European museums, galleries, libraries, and archives. Data is collected from Europeana daily, but we only pull data about artefacts added within the previous day. This necessitates refreshing the data on a recurring basis.
The idea is that new data should be refreshed more frequently and as the data gets old, refreshing should become less frequent. While developing the strategy, the API key limit (the maximum number of requests that can be made within a given time period) and the maximum collection expected must be kept in mind. Considering these factors, we implemented a strategy that ensures the data in our database is at most six months old while refreshing the most recently uploaded metadata more frequently.
Retaining metadata from an old data source
Creative Commons sometimes generates metadata on the images collected from different third-party sources. When a provider is shifted from one data source to another, the metadata generated for an image is not available in the new data source. Therefore, there is a need to associate the new data with tags corresponding to that image from the old data source. A direct URL match is not possible as the data sources have different image URLs for the same image, so our goal is to match it on the number or identifier that is associated with the URL.
Expiration of outdated images
As explained under a previous section, we update the images obtained from different providers at varying frequencies depending on the magnitude of the image collection of each provider and the age of each image (i.e. newer images are updated more frequently than older images). Even though this update strategy helps to reflect changes to image metadata (such as the number of views per image), information regarding image deletions was not reflected in the database. This resulted in the accumulation of obsolete images in the database and the presenting of broken links to non-existing images through the CC Search and CC Catalog API tools.
However, since we are aware of the update frequency for each provider and their corresponding images, it is possible to determine whether an image is obsolete or not by looking at their last date of update. Therefore, to resolve this issue, an image expiration strategy which is dependent on the last date an image was implemented!
Identification of creator types
CC Search helps users filter CC-licensed images in numerous ways, including filtering results by the “creator.” However, providers often identify “creators” by different names, such as ‘author’, ‘designer’, ‘cartoonist,’ and ‘modeler’. The ambiguity of names used to identify creators is accentuated for the Smithsonian (a provider). Thus, we carefully analysed the image metadata from the Smithsonian and created a collection of various tags that identify the creators. With this implementation, the number of instances where the creator value is missed for Smithsonian images was considerably reduced. We hope this can be done for other providers as well in the future.
We are extremely proud and grateful for the work done by K. S. Srinidhi Krishna and Charini Nanayakkara throughout their 2020 Google Summer of Code internship. We look forward to their continued contributions to the CC Catalog as project core committers in the CC Open Source Community!