

How to share research data fairly and sensibly

Collaboration is a natural, essential element in research. However, sharing resources amongst scientists should be much easier than it is: expected and unexpected barriers appear at every turn. In recent years, issues surrounding scientific data sharing have received a lot of attention (e.g. see [Gewin2016]), and this has led both to a better understanding of the principles and practices that should surround such sharing and to better infrastructure.

The FAIR Guiding Principles

The principles that (should) guide scientific data sharing are abbreviated as FAIR, which stands for Findable, Accessible, Interoperable, Reusable. What is meant by this is outlined below, and discussed in much greater detail in [Wilkinson2016].

To be Findable:

  • F1. (meta)data are assigned a globally unique and persistent identifier
  • F2. data are described with rich metadata (defined by R1 below)
  • F3. metadata clearly and explicitly include the identifier of the data it describes
  • F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

  • A1. (meta)data are retrievable by their identifier using a standardised communications protocol
  • A1.1 the protocol is open, free, and universally implementable
  • A1.2 the protocol allows for an authentication and authorisation procedure, where necessary
  • A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

  • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • I2. (meta)data use vocabularies that follow FAIR principles
  • I3. (meta)data include qualified references to other (meta)data

To be Reusable:

  • R1. meta(data) are richly described with a plurality of accurate and relevant attributes
  • R1.1. (meta)data are released with a clear and accessible data usage license
  • R1.2. (meta)data are associated with detailed provenance
  • R1.3. (meta)data meet domain-relevant community standards

Many different standards, databases, and policies can be adopted and combined to develop practices that comply with these general principles. These are collected on FAIRSharing, which also contains an extensive educational section.
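
To make these principles a little more concrete, the sketch below builds a machine-readable metadata record for a hypothetical dataset in Python, using the schema.org vocabulary serialised as JSON-LD. The identifier, names and URLs are placeholders rather than real records; the point is simply to show where a persistent identifier (F1/F3), rich description (F2/R1), licence (R1.1) and provenance (R1.2) fit in such a record.

```python
import json

# A minimal, hypothetical metadata record touching on several FAIR principles:
# a globally unique, persistent identifier (F1/F3), rich descriptive
# attributes (F2/R1), a clear usage licence (R1.1) and a provenance link
# (R1.2), expressed with a shared vocabulary (schema.org) for
# interoperability (I1/I2). All identifiers and URLs below are placeholders.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "identifier": "https://doi.org/10.1234/example-dataset",  # placeholder DOI
    "name": "Example re-sequencing variant calls",
    "description": "Gene variants detected in 50 re-sequenced individuals.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": [{"@type": "Person", "name": "A. Researcher"}],
    "dateCreated": "2017-08-01",
    "isBasedOn": "https://doi.org/10.1234/raw-reads",          # provenance
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data/variants.vcf.gz",
        "encodingFormat": "application/gzip",
    },
}

# Serialising the record as JSON-LD makes it straightforward to deposit
# alongside the data and to index in a searchable resource (F4).
print(json.dumps(metadata, indent=2))
```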

Assuming that we are persuaded by the wisdom of these FAIR principles, we may want to adopt them in our own research when it comes to sharing our data in various contexts. Among these contexts we can at least distinguish between the sharing of raw and intermediate data (for example, within a collaborative network) and the publishing of our “final” result data, i.e. when we have finished our analyses and want to share our conclusions with the world.

In either case, many of the FAIR principles can be implemented following guidelines we establish elsewhere on these pages. For example, to be properly Findable, our data should be treated at least according to our suggestions in the section on versioning; and to be Interoperable and Reusable we should describe our data collection process following reporting standards such as those described in the section on data capture, we should express the meaning of our data and metadata using clear, unambiguous semantics, and we should adopt open, community standards for representing our data, as we elaborate on in the section on open source.

What remains to be discussed are the options for making our data Accessible, and for this there are different technologies to choose from depending on whether we are sharing raw and intermediate data or whether we are publishing result data. In the following sections we will discuss these options.

Sharing raw and intermediate data

In the more private context of a research collaboration before publication it might seem that it does not matter much how data are shared: the collaborative network is probably small enough that most people know each other and can jointly reconstruct where the data are, what has been done to them so far, and so on. However, research projects usually take longer to arrive at their final results than planned, and meanwhile some people leave, others come in, and everyone forgets what exactly they were doing or thinking a few years ago. Even in this setting there are therefore good and bad ways to share data. Here are some of the main ways in which small research groups might share data, assessed within the context of the FAIR principles:

The above solutions are well within reach of every researcher: the server technologies that are recommended are either already installed on some operating systems or are freely available as popular, well-documented open source software with large user communities. For larger institutions that have the ICT staffing resources to maintain more advanced data sharing infrastructures, two open source systems are worth mentioning within the context of bioinformatics applications: iRODS and Globus. What these systems add is that they can track (meta)data, and thus file versions, across distributed architectures and protocols. For example, such systems can figure out that the raw data file you are trying to fetch exists as a local copy near you, so bandwidth is used more efficiently; they can allow related files to be discovered through custom metadata that researchers attach to their project files; they can integrate file management into data processing workflows; and a lot more. The downside of these systems lies in the complexity of installing, configuring and maintaining them.
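
Even without iRODS or Globus, a collaboration can make shared raw and intermediate files easier to verify. The sketch below is one simple, hypothetical approach: a Python script that writes a checksum manifest for a shared data directory, so that collaborators who fetch the files over any standard protocol (A1) can confirm that they received exactly the intended versions. The directory layout and file names are assumptions for the example.

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical layout: all shared raw/intermediate files live under ./shared_data.
data_dir = Path("shared_data")
manifest = data_dir / "MANIFEST.sha256"

# Write one "<checksum>  <relative path>" line per file; collaborators can
# then verify their copies, e.g. with `sha256sum -c MANIFEST.sha256`.
with open(manifest, "w") as out:
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path != manifest:
            out.write(f"{sha256sum(path)}  {path.relative_to(data_dir)}\n")
```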

Publishing result data

The final outcomes of a data-intensive research project, in terms of data, are referred to as “result data”. For example, for a genome re-sequencing project, these might be the specific gene variants detected in the individuals that were sequenced (e.g. see this pyramid). These are the data on which scientific conclusions - such as those presented in a scholarly publication - will probably be most intimately based. (However, these cannot be decoupled from data nearer to the base of the pyramid, so developing an approach that allows you to “drill down” from the top to these lower levels is vital.) The pressures to share result data come from more sides than in the case of raw or intermediate data:

  1. Funding agencies increasingly require that result data are shared. Research projects funded by such agencies probably need to submit, for approval, a data management plan that describes how this will be handled. (The NIH, the US funding agency for medical research, has published a list of frequently asked questions surrounding data sharing that researchers might ask a funding agency.)
  2. Many journals require that data discussed in a paper are deposited in a data repository such that the paper refers to a public, clearly identifiable, data record. (In addition to policies of individual journals, community initiatives to establish author guidelines to this end are being developed, see [Nosek2015].)
  3. A publication based on open, reusable data is more likely to be cited (e.g. by 9% in the case of microarray data [Piwowar2013]), so there is also an aspect of enlightened self-interest to depositing result data. In this vein, there is a growing trend towards “data publications”, where the paper is mostly an advertisement for a reusable data set.
  4. Data sets themselves might be citable, which is a vision that is being advanced, for example, by the Joint Declaration of Data Citation Principles and by infrastructural initiatives such as DataCite and the Data Citation Index.
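
As a small illustration of the last point, dataset DOIs registered with services such as DataCite can be resolved to citation metadata through content negotiation on doi.org. The sketch below requests CSL JSON for a DOI; the DOI shown is that of the [Wilkinson2016] paper, used here only to demonstrate the mechanism, and should be replaced with the DOI of the data set you want to cite.

```python
import json
import urllib.request

# DOI of the FAIR principles paper [Wilkinson2016]; substitute the DOI of
# the dataset you actually want to cite.
doi = "10.1038/sdata.2016.18"

# doi.org supports content negotiation: asking for CSL JSON returns
# citation metadata (title, authors, publisher, year) for the record.
request = urllib.request.Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
with urllib.request.urlopen(request) as response:
    record = json.loads(response.read().decode("utf-8"))

print(record.get("title"))
print(record.get("publisher"))
```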

These different pressures have created a need for online data repositories, and in response a bewildering array of choices has arisen. Broadly speaking, these can be subdivided into generic data repositories, which accept many different data types but do not process (e.g. validate, index, visualise) them in great detail, and domain-specific repositories, which do a lot more with the data. The choices are discussed below.

Generic data repositories

When it comes to platforms for sharing various types of data, researchers have a reasonable amount of choice. At present (August 2017), we invite you to explore a number of repositories. They are often complemented with a series of services that offer specific advantages, such as linking to publications, providing licensing options, and issuing digital object identifiers (DOIs). They differ quite considerably in their policies, level of curation and organization methods. Some repositories are also oriented towards particular communities of users; digital libraries are one example.
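
Many generic repositories also expose programmatic interfaces. As one hypothetical example (the repository is not named above), the sketch below queries Zenodo's public records search API; the query term and the assumed layout of the JSON response are illustrative, so check the repository's API documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Search Zenodo's public records API; no authentication is needed for
# published records. The query term is just an example.
query = urllib.parse.urlencode({"q": "metabarcoding", "size": 5})
url = f"https://zenodo.org/api/records?{query}"

with urllib.request.urlopen(url) as response:
    results = json.loads(response.read().decode("utf-8"))

# Assumed response layout: a "hits" object containing a list of records,
# each with a DOI and a metadata block holding the title.
for hit in results.get("hits", {}).get("hits", []):
    print(hit.get("doi", "no DOI"), "-", hit.get("metadata", {}).get("title", "untitled"))
```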

Domain-specific repositories

In addition to the generic data repositories listed above, a very large number of domain-specific data repositories and databases exist. Such repositories accept only a limited number of data types - for example, DNA sequences - that need to be provided in specific formats and with specific metadata. This means that, probably, not all of the different data and results generated by a research project can be uploaded to the same repository, and that uploading is sometimes cumbersome and complicated. On the other hand, domain-specific repositories can provide more services tailored to a specific data type (such as BLAST searching of uploaded DNA sequences), and perhaps do things with the data such as synthesizing them in a meta-analysis or a computed consensus. Nature publishes an online list of a variety of data repositories, while re3data provides a searchable database of repositories.
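
Domain-specific repositories typically also offer programmatic retrieval in the formats of their domain. As an illustration (again, not a repository named above), the sketch below fetches a DNA sequence from NCBI GenBank in FASTA format via the E-utilities efetch endpoint; the accession used is the human mitochondrial reference genome, chosen only as a stable example.

```python
import urllib.parse
import urllib.request

# Fetch a nucleotide record from NCBI GenBank via E-utilities, in FASTA
# format. NC_012920.1 is the human mitochondrial reference genome, used
# here purely as an example accession.
params = urllib.parse.urlencode({
    "db": "nucleotide",
    "id": "NC_012920.1",
    "rettype": "fasta",
    "retmode": "text",
})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{params}"

with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))
```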

Licensing, Attribution and Openness in Data Repositories

Data made available via open repositories enable a series of attractive opportunities for researchers. First, there is the opportunity to attach attribution to the work of capturing, storing and curating data. Credit is assigned to the depositor following well established and understood rules; in general this is ensured by assigning one of the variants of the Creative Commons licensing scheme. (Refer to the equivalent section on this topic for software development to see what is done with source code.)

Sharing data in this way also enables reuse and, in particular, reprocessing. This exposure of data to other researchers, students and the public is gaining popularity. Others may well find more in the same data by using different tools, parameter settings and options, and they can do so while preserving attribution. Moreover, the opportunities that arise when combining heterogeneous datasets may allow for important steps forward that would be much more difficult otherwise.

Expected outcomes

In this section we have discussed the different infrastructural solutions for sharing research data and how these relate to the developing principles for data sharing. You should now be able to: