How to share research data fairly and sensibly
Collaboration is a natural, essential element in research. However, sharing resources amongst scientists is harder than it should be, running into expected and unexpected barriers everywhere. In recent years, issues surrounding scientific data sharing have received a lot of attention (e.g. see [Gewin2016]), and this has led both to a better understanding of the principles and practices that should surround such sharing and to better infrastructure.
The FAIR Guiding Principles
The principles that (should) guide scientific data sharing are abbreviated as FAIR, which stands for Findable, Accessible, Interoperable, Reusable. What is meant by this is outlined below, and discussed in much greater detail in [Wilkinson2016].
To be Findable:
- F1. (meta)data are assigned a globally unique and persistent identifier
- F2. data are described with rich metadata (defined by R1 below)
- F3. metadata clearly and explicitly include the identifier of the data it describes
- F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
- A1. (meta)data are retrievable by their identifier using a standardised communications protocol
- A1.1 the protocol is open, free, and universally implementable
- A1.2 the protocol allows for an authentication and authorisation procedure, where necessary
- A2. metadata are accessible, even when the data are no longer available
To be Interoperable:
- I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
- I2. (meta)data use vocabularies that follow FAIR principles
- I3. (meta)data include qualified references to other (meta)data
To be Reusable:
- R1. meta(data) are richly described with a plurality of accurate and relevant attributes
- R1.1. (meta)data are released with a clear and accessible data usage license
- R1.2. (meta)data are associated with detailed provenance
- R1.3. (meta)data meet domain-relevant community standards
Many different standards, databases, and policies can be adopted and combined to develop practices that comply with these general principles. These are collected on FAIRSharing, which also contains an extensive educational section.
Assuming that we are persuaded by the wisdom of these FAIR principles, we may want to adopt them in our own research when it comes to sharing our data in various contexts. Among these contexts we can at least distinguish between the sharing of raw and intermediate data (for example, within a collaborative network) and the publishing of our “final”, result data, i.e. when we have finished our analyses and want to share our conclusions with the world.
In either case, many of the FAIR principles can be implemented following guidelines we establish elsewhere on these pages. For example, to be properly Findable, our data should be treated at least according to our suggestions in the section on versioning; and to be Interoperable and Reusable we should describe our data collection process following reporting standards such as those described in the section on data capture, we should express the meaning of our data and metadata using clear, unambiguous semantics, and we should adopt open, community standards for representing our data, as we elaborate on in the section on open source.
What remains to be discussed are the options for making our data Accessible, and for this there are different technologies to choose from depending on whether we are sharing raw and intermediate data or whether we are publishing result data. In the following sections we will discuss these options.
Sharing raw and intermediate data
In the more private context of a research collaboration before publication, it might seem that it does not matter much how data are shared: the collaborative network is probably small enough that most people know each other and can jointly reconstruct where the data are, what has been done to them so far, and so on. However, research projects usually take longer to arrive at their final results than planned, and meanwhile some people might leave, others come in, and everyone forgets what exactly they were doing or thinking a few years ago. Even in this setting there are therefore good and bad ways to share data. Here are some of the main ways in which small research groups might share data, assessed within the context of the FAIR principles:
- email - For small data sets (spreadsheets, for example) it might seem easy and quick to just email them as attachments. This is a bad idea for numerous reasons: 1) an email is not uniquely findable. Within a group of people you would have to refer to it as, for example, “that email that I sent a few weeks ago”, and this of course gets worse when other people start replying with other versions of data files, 2) there is no way to link to an email such that everyone has access to it, 3) email attachments have a size limit. Sending important files (such as data, manuscripts, or source code) by email within a research project should be discouraged under nearly all circumstances.
- sneakernet - Large data sets (such as produced by an external DNA sequencing facility) are often shipped on external hard drives, and subsequently carried around the halls of a research institution, a practice referred to as “sneakernet”. This has obvious access problems (you somehow need to get access to the actual drive), but also a deeper “findability” problem: unless the team is disciplined in taking checksums of the data on the drive (see the download-and-verify sketch after this list), there is no good way to tell whether the data have been changed or corrupted, and what the “original” data set was. Data from portable drives need to be copied onto a networked system, have checksums verified, and then shared using one of the methods below in such a way that the drive can subsequently be discarded without problems.
- peer-to-peer - Numerous technologies exist for sending data directly “peer-to-peer”. An example of this that works for large blocks of data is WeTransfer. This has similar problems as with email attachments: it is practically impossible to refer unambiguously to a specific transmission (and anyone that was not among the recipients in the first place will not be able to access it). This will make it hard to track versions. In addition, peer-to-peer transfers are ephemeral (i.e. a service such as WeTransfer will remove the transferred file from its system after a certain amount of time). Taken together, peer-to-peer file transfer of important research data is not a good approach and should be discouraged.
- HTTP servers - Online sharing systems such as DropBox and Google Drive have the great virtue that there is a unique location that all collaborators who have been granted access can refer to when talking about the data. For most of these systems, this location is an HTTP URL, which means that the uniqueness of the location is guaranteed by how the internet (i.e. DNS) works, that analytical workflows can directly access the location (HTTP access is available in all common scripting languages and workflow systems), and that semantic web applications can make machine-readable statements about the resource at the URL location. There are two main downsides: 1) given a URL for a file, it is not obvious how neighbouring files can be discovered (for example, there is no standard folder structure to navigate), 2) there is no standard way to track versions. Different commercial systems (such as the aforementioned DropBox and Drive) have their own versioning systems; however, the versions they create are simply periodic snapshots without descriptive metadata (i.e. unlike a “commit” in version control systems such as git or svn). An open protocol that extends the functionality of HTTP, WebDAV, forms the basis for how some version control systems (notably Subversion) communicate with remote servers. Such version control systems are nearly perfect for FAIR data sharing, except for the bandwidth and storage limitations that these systems (may) have. Two optional extensions to git, git-lfs and git-annex, have been developed to address these issues. Data sharing based on HTTP servers, especially when enhanced by open extensions for version and access control, is well worth considering when files meaningfully and notably change through time.
- FTP servers - Data sharing using FTP has the same advantage as HTTP in that it results in unambiguous locations based on DNS. In addition, it has the virtue that neighbouring locations on a server (e.g. the other files in a folder, and so on) can be navigated using a standard protocol supported by many tools. Bandwidth challenges are better addressed than in HTTP because downloads can be parallelised (trivially by opening different connections for different files, and by fetching multiple chunks of the same file in parallel). FTP is therefore one of the preferred methods for downloading large genomic data sets from NCBI. A downside is that there are no standards built on top of FTP for tracking file versions. Because FTP allows for navigating through file systems, this downside is commonly addressed by placing checksum files next to the data files. This allows for a quick check of whether a remote file is the same version as a local copy without having to download the remote file; however, it is not sufficient for tracking entire file histories. Data sharing based on FTP servers is well worth considering for navigating static libraries of large files.
- rsync servers - rsync is an open source utility for synchronisation and incremental transfers between file systems. The advantage of this system is that libraries of large files can be kept in sync between systems, so if you are keeping sets of files where occasionally some change (such as a library of reference genomes that go through releases), this system is more efficient with bandwidth than FTP, which is why NCBI prefers it over FTP. Using rsync means that changed files will be updated (which is why a number of backup systems are based on it), but it is not a versioning system. Data sharing using rsync is appropriate for synchronising file structures across systems, assuming that the changes in the file structures are managed by some other mechanism (such as release notes).
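To make the checksum-based practices mentioned above concrete (for sneakernet drives as well as HTTP or FTP servers), here is a minimal Python sketch that downloads a data file over HTTP and verifies it against a published SHA-256 checksum. The URLs and file names are hypothetical placeholders, not real resources; substitute the locations agreed upon within your collaboration.

```python
"""Minimal sketch: fetch a shared data file over HTTP and verify its checksum.

The URLs below are illustrative placeholders only.
"""
import hashlib
import urllib.request

DATA_URL = "https://example.org/shared/reads_sample1.fastq.gz"            # hypothetical
CHECKSUM_URL = "https://example.org/shared/reads_sample1.fastq.gz.sha256"  # hypothetical

# Download the data file in chunks, hashing as we go to avoid holding it all in memory.
sha256 = hashlib.sha256()
with urllib.request.urlopen(DATA_URL) as response, open("reads_sample1.fastq.gz", "wb") as out:
    for chunk in iter(lambda: response.read(1024 * 1024), b""):
        sha256.update(chunk)
        out.write(chunk)

# The checksum file is assumed to follow the common "<hexdigest>  <filename>" convention.
with urllib.request.urlopen(CHECKSUM_URL) as response:
    expected = response.read().decode().split()[0]

if sha256.hexdigest() == expected:
    print("Checksum OK: local copy matches the published version.")
else:
    print("Checksum MISMATCH: the file is corrupted or a different version.")
```

The same verification step works regardless of how the file arrived (portable drive, FTP, rsync), which is precisely what makes published checksums a cheap way to keep track of which version of a file you actually have.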
The above solutions are well within reach of every researcher: the server technologies that are recommended are either already installed on some operating systems, or are freely available as very popular, well-documented open source software with large user communities. For larger institutions that have the ICT staffing resources to maintain more advanced data sharing infrastructures, two open source systems are worth mentioning within the context of bioinformatics applications: iRODS and Globus. What these systems add is that they can track (meta)data, and thus file versions, across distributed architectures and protocols. For example, such systems can figure out that the raw data file you are trying to fetch exists as a local copy near you so as to use bandwidth more efficiently, allow for the discovery of related files by custom metadata that researchers attach to their project files, integrate file management into data processing workflows, and a lot more. The downside lies in the complexity of installing, configuring, and maintaining these systems.
Publishing result data
The final outcomes of a data-intensive research project, in terms of data, are referred to as “result data”. For example, for a genome re-sequencing project, these might be the specific gene variants detected in the individuals that were sequenced (e.g. see this pyramid). These are the data on which scientific conclusions - such as those presented in a scholarly publication - will probably be most intimately based. (However, these cannot be decoupled from data nearer to the base of the pyramid, so developing an approach that allows you to “drill down” from the top to these lower levels is vital.) The pressures to share result data come from more sides than in the case of raw or intermediate data:
- Funding agencies increasingly require that result data are shared. Research projects funded by such agencies probably need to submit, for approval, a data management plan that describes how this will be handled. (The NIH, the US funding agency for medical research, has published a list of frequently asked questions surrounding data sharing that researchers might ask a funding agency.)
- Many journals require that data discussed in a paper are deposited in a data repository such that the paper refers to a public, clearly identifiable data record. (In addition to policies of individual journals, community initiatives to establish author guidelines to this end are being developed, see [Nosek2015].)
- A publication based on open, reusable data is more likely to be cited (e.g. by 9% in the case of microarray data [Piwowar2013]), so there is also an aspect of enlightened self-interest to depositing result data. In this vein, there is a growing trend towards “data publications”, where the paper is mostly an advertisement for a reusable data set.
- Data sets themselves might be citable, which is a vision that is being advanced, for example, by the Joint Declaration of Data Citation Principles and by infrastructural initiatives such as DataCite and the Data Citation Index.
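Data citation can also be made machine-actionable. As a small illustration, the sketch below retrieves citation metadata for a dataset DOI via content negotiation against https://doi.org/, a mechanism supported by DataCite and Crossref. The DOI used here is a hypothetical placeholder; substitute the DOI of the dataset you want to cite.

```python
"""Minimal sketch: retrieve citation metadata for a dataset DOI via content negotiation."""
import urllib.request

doi = "10.5072/example-dataset"  # hypothetical placeholder DOI

request = urllib.request.Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/x-bibtex"},  # ask for a BibTeX-formatted citation
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode())
```

Other metadata formats (for example DataCite JSON) can be requested with different Accept headers, which makes it straightforward to pull dataset citations into reference managers or build automated reporting around them.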
These different pressures have created a need for online data repositories, and in response a bewildering array of choices has arisen. Broadly speaking, these can be subdivided into generic data repositories, which accept many different data types but do not process (e.g. validate, index, visualise) them in great detail, and domain-specific repositories, which do a lot more with the data. The choices are discussed below.
Generic data repositories
When it comes to platforms for sharing various types of data, researchers have a reasonable amount of choice. At present (August 2017), we invite you to explore a number of repositories (a small deposit sketch follows the list below). They are often complemented by services that offer specific advantages, such as linking to publications, providing licensing options, and issuing digital object identifiers (DOIs). They differ considerably in their policies, level of curation, and methods of organisation. Some repositories are oriented towards particular communities of users; digital libraries are one such example.
- Dryad Digital Repository “The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad provides a general-purpose home for a wide diversity of datatypes.” http://datadryad.org/
- FigShare “As governments and funders of research see the benefit of open content, the creation of recommendations, mandates and enforcement of mandates are coming thick and fast. figshare has always led the way in enabling academics, publishers and institutions to easily adhere to these principles in the most intuitive and efficient manner.” Mark Hahnel, Founder and CEO, figshare https://figshare.com/
- Zenodo “Zenodo helps researchers receive credit by making the research results citable and through OpenAIRE integrates them into existing reporting lines to funding agencies like the European Commission. Citation information is also passed to DataCite and onto the scholarly aggregators.” https://zenodo.org/
- Dataverse “Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others, and allows you to replicate others’ work more easily. Researchers, data authors, publishers, data distributors, and affiliated institutions all receive academic credit and web visibility.” https://dataverse.org/
- EUDAT “EUDAT offers heterogeneous research data management services and storage resources, supporting multiple research communities as well as individuals, through a geographically distributed, resilient network distributed across 15 European nations and data is stored alongside some of Europe’s most powerful supercomputers.” https://eudat.eu/
- Mendeley Data “Mendeley Data is a secure cloud-based repository where you can store your data, ensuring it is easy to share, access and cite, wherever you are.” https://data.mendeley.com/
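Most of these repositories can be used through a web interface, and several also expose an API so that deposition can be scripted as part of a data management workflow. As one concrete example, here is a minimal sketch of depositing a file on Zenodo via its REST deposit API (documented at https://developers.zenodo.org). It assumes you have a personal access token with deposit permissions; the file name and metadata are illustrative placeholders, and it is worth testing against the Zenodo sandbox (https://sandbox.zenodo.org) and checking the current API documentation before using this in earnest.

```python
"""Minimal sketch: deposit a result data file on Zenodo via its REST API (assumptions noted above)."""
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "YOUR-ACCESS-TOKEN"          # placeholder: personal access token
params = {"access_token": TOKEN}

# 1. Create an empty deposition.
r = requests.post(f"{ZENODO}/deposit/depositions", params=params, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the data file into the deposition's file bucket.
with open("variants.vcf.gz", "rb") as fh:                        # placeholder file name
    r = requests.put(f"{deposition['links']['bucket']}/variants.vcf.gz",
                     data=fh, params=params)
    r.raise_for_status()

# 3. Attach descriptive metadata (title, type, creators, ...).
metadata = {
    "metadata": {
        "title": "Variant calls for project X",                   # placeholder metadata
        "upload_type": "dataset",
        "description": "Result data underlying our re-sequencing study.",
        "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
    }
}
r = requests.put(f"{ZENODO}/deposit/depositions/{deposition['id']}",
                 params=params, json=metadata)
r.raise_for_status()

# 4. Publish the record; this step mints a DOI and cannot be undone.
r = requests.post(f"{ZENODO}/deposit/depositions/{deposition['id']}/actions/publish",
                  params=params)
r.raise_for_status()
print("Published record DOI:", r.json().get("doi"))
```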
Domain-specific repositories
In addition to the generic data repositories listed above, a very large number of domain-specific data repositories and databases exist. Such repositories accept only a limited number of data types - for example, DNA sequences - that need to be provided in specific formats and with specific metadata. This means that not all of the data and results generated by a research project can necessarily be uploaded to the same repository, and that uploading is sometimes cumbersome and complicated. On the other hand, domain-specific repositories can provide services tailored to a specific data type (such as BLAST searching of uploaded DNA sequences), and may do more with the data, such as synthesising them in a meta-analysis or a computed consensus. Nature publishes an online list of a variety of data repositories, while re3data provides a searchable database of repositories.
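As a small illustration of the tailored, programmatic access such domain-specific repositories provide, the sketch below retrieves a DNA sequence from NCBI GenBank via the public E-utilities efetch service. The accession number is just an example (the human mitochondrial reference genome); substitute one relevant to your work and respect NCBI's usage guidelines, such as request rate limits.

```python
"""Minimal sketch: retrieve a DNA sequence from NCBI GenBank via E-utilities efetch."""
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "db": "nucleotide",
    "id": "NC_012920.1",   # example accession: human mitochondrial reference genome
    "rettype": "fasta",
    "retmode": "text",
})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{params}"

with urllib.request.urlopen(url) as response:
    fasta = response.read().decode()

print(fasta[:500])  # show the header line and the first few hundred bases
```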
Licensing, Attribution and Openness in Data Repositories
Data made available via open repositories enable a series of attractive opportunities for researchers. The first is the opportunity to attach attribution to the work of capturing, storing, and curating data. Credit is assigned to the depositor following well-established and well-understood rules; in general this is ensured by applying one of the variants of the Creative Commons licensing scheme. (Refer to the equivalent section on this topic for software development to see what is done with source code.)
Sharing data in this way also enables reuse and, in particular, reprocessing. This exposure of data to other researchers, students, and the public is gaining popularity. Others may extract more from the same data by using different tools, parameter settings, and options, and they can do so while preserving attribution. Furthermore, the opportunities that arise when combining heterogeneous datasets may allow for important advances that would be much more difficult otherwise.
Expected outcomes
In this section we have discussed the different infrastructural solutions for sharing research data and how these relate to the developing principles for data sharing. You should now be able to:
- Explain what the terms of the FAIR acronym stand for in relation to data sharing
- Explain the difference between sharing raw and intermediate data, and result data
- Assess the advantages and drawbacks of different data sharing approaches within teams
- Locate the appropriate domain-specific repository for a given, common research data type (e.g. DNA sequence)
- Locate generic repositories for opaque research data and weigh their advantages and drawbacks against domain-specific ones