Those who have followed the spread of open source software (OSS) know that a bewildering thicket of OSS licenses was created in the early days. They also know that although the Open Source Initiative was formed in part to certify which of these documents should be permitted to call itself an “open source software license,” that didn’t mean that each approved license was compatible with the others. Ever since, it’s been a pain in the neck to vet code contributions to ensure that an OSS user knows what she’s getting into when she incorporates a piece of OSS into her own program.
In the intervening years, more and more entities – private, public and academic – have decided to make public the increasingly large, various and valuable data sets they are producing. One resulting bonanza is the opportunity to combine these data sets in order to accomplish more and more ambitious goals – such as informing the activities of autonomous vehicles. But what if the rules governing these databases are just as diverse and incompatible as the scores of OSS licenses unleashed on an unwitting public?
Avoiding a similar and unfortunate result is the goal of a Linux Foundation legal task force that’s been working hard to create both permissive and restrictive versions of a Community Data License Agreement (CDLA). (Disclosure: the LF is a long-term client of mine.) As with open source licenses, the permissive version imposes only minor requirements on users of the data, while the restrictive version includes “give back” obligations similar to those of GPL-type licenses.
Yesterday, LF announced and released drafts of those licenses at the Open Source Summit in Prague, noting:
In an era of expansive and often underused data, the CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing “open” data. Inspired by the collaborative software development models of open source software, the CDLA licenses are designed to enable individuals and organizations of all types to share data as easily as they currently share open source software code. Soundly drafted licensing models can help people form communities to assemble, curate and maintain vast amounts of data, measured in petabytes and exabytes, to bring new value to communities of all types, to build new business opportunities and to power new applications that promise to enhance safety and services.
Part of the impetus behind the CDLA effort arises from the understandable, but not particularly productive, experience of the early OSS days. Early on, many companies created their own quite-similar-but-different-enough licenses. The differences, variously large, small and eccentric, sometimes made determining compatibility among the licenses something of a black box. With time, it all got sorted out, but that process was in some cases achieved through consensus agreement rather than because compatibility could necessarily be clearly demonstrated (God forbid) in court. The resulting confusion, and the need to acquire sophisticated knowledge about open source licensing, held back the uptake of OSS in many quarters for years.
Whether similarly negative results can be avoided in the case of open data will depend in large part on the success of the CDLA initiative, and others like it that produce commonly accepted licenses tailored to specific use cases.
But the window of opportunity is short. Already, various owners of data sets are creating their own licenses out of necessity, because no generally accepted model has, to date, evolved. If the CDLA effort is successful, creators of data will be more likely to share that data, and users will be more likely to use it, each side having certainty over what can and can’t be done with the data, and what risks, if any, are to be borne by whom.
What kinds of beneficial results can flow from achieving that goal? Here are a couple of examples from the LF press release:
For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.
Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems. This knowledge may help improve agriculture or aid in studying extreme weather patterns.
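The vehicle figures quoted above are easy to sanity-check with back-of-the-envelope arithmetic. Here is a minimal sketch in Python; the per-second data rate and daily driving time are my own assumptions (the press release doesn’t state them), chosen to show that roughly a gigabyte per second over about an hour and a half of driving per day does indeed land near two petabytes per year.

```python
# Back-of-the-envelope check of the data volumes quoted above.
# Assumptions (not from the press release): ~1 GB/s while driving,
# and an average of ~1.5 hours of driving per day.

GB_PER_SECOND = 1.0          # assumed sensor/audio/video data rate
DRIVING_HOURS_PER_DAY = 1.5  # assumed average daily driving time
DAYS_PER_YEAR = 365

seconds_per_year = DRIVING_HOURS_PER_DAY * 3600 * DAYS_PER_YEAR
gb_per_year = GB_PER_SECOND * seconds_per_year
pb_per_year = gb_per_year / 1_000_000  # 1 PB = 1,000,000 GB (decimal units)

print(f"{gb_per_year:,.0f} GB/year = {pb_per_year:.1f} PB/year")
# -> 1,971,000 GB/year = 2.0 PB/year, consistent with the quoted figure
```

Vary the assumptions however you like; the arithmetic makes the same point either way, which is that any fleet-scale data-sharing arrangement will be moving petabytes.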
Such efforts are not over the horizon. The movement towards open science (e.g., exposing data and research early rather than late, and permitting reuse of both) has been gathering steam for years. Sharing discoveries and data immediately, instead of after years of secrecy and eventual peer-reviewed publication, can dramatically accelerate progress in vital areas like health care and drug discovery.
Those wishing to practice open science can already make use of tools such as the Open Science Framework, developed and maintained by the Center for Open Science (another client of mine). Anyone can use the Framework for free to host her research project files, data and protocols, and then designate what will be made available to whom. If researchers at sites like this broadly choose to adopt the CDLA, the beneficial use of research and data can spread far more easily and quickly.
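By way of illustration, here is a minimal sketch of reading the public files of an OSF project programmatically, assuming the third-party osfclient Python package and its API as I understand it; the project identifier below is a hypothetical placeholder, not a real project.

```python
# A minimal sketch of listing a public OSF project's shared files,
# assuming the osfclient package (pip install osfclient).
from osfclient import OSF

osf = OSF()                           # anonymous access: public projects only
project = osf.project("abcde")        # hypothetical 5-character OSF GUID
storage = project.storage("osfstorage")

for f in storage.files:
    print(f.path)                     # list what the researcher chose to share
    # To download a file: open a local file in "wb" mode and call f.write_to(fp)
```

The point of the sketch is simply that once a researcher designates material as public, anyone can retrieve it with a few lines of code; what the CDLA would add is certainty about what the retriever may then do with it.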
Will that happen? It will if collaborative initiatives take a look at the CDLAs and decide to use them. As you would expect, the CDLAs are free to download and use. You can find links to them here and here, and an explanatory FAQ is here. A context document for using them is also provided at the CDLA site.