How to manage data and plan for it
Anyone handling data should be aware of the need to manage it: data can be lost or corrupted, it grows over time, it may need to be shared, and it must be properly attributed.
The need for data management and the need to create plans
Data from one or many projects can easily be at risk if it is not managed properly. The need for management is naturally smaller when the volume of data is low, but even small volumes of unmanaged data cause serious problems when any kind of loss occurs. Research funding agencies have begun to ask for a data management plan in grant applications. This has obliged researchers to at least become informed, but training provision in this area is clearly far below present and foreseeable levels of demand.
Data from research projects
Research projects generate enormous quantities of data every single day. In many cases it is clear that raw, unprocessed data should be kept, but in many others keeping it is not practical. In large-scale sequencing, and also in radio astronomy and particle physics, for example, the volumes of raw data are often so large that storing them permanently is unreasonable. Repeating the sequencing run may be far more practical, if the need arises.
Permanent, long-term data storage requires tailored strategic decisions: what to store, when, where, and how it is backed up. The availability of cloud resources, where storage servers are physically distributed, has brought considerable benefits in this area. But that is the technical side. Managing a team of people working on the same project is often harder. People think in very different ways, not always well structured, so the team needs a frequently updated, guided process, with plans that are goal-directed yet adaptable to a wide range of circumstances.
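One concrete way to guard against the loss or corruption mentioned earlier is to record a checksum for every stored file and compare it against the backup copy. The following is a minimal sketch of that idea, not a prescribed procedure: the function names (`sha256_of`, `build_manifest`, `compare_manifests`) are hypothetical, and it assumes the original and the backup are both reachable as local directories.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (as a relative POSIX path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(original: dict[str, str], backup: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the backup and files whose content differs."""
    missing = sorted(name for name in original if name not in backup)
    changed = sorted(
        name for name in original
        if name in backup and original[name] != backup[name]
    )
    return {"missing": missing, "changed": changed}
```

A plan could state how often such a comparison is run and who acts on the report. Storing the manifest alongside the data also documents exactly which files a backup is supposed to contain.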
Getting to know how to write a Data Management Plan
One way to get better at writing data management plans is to study well-built examples. No single plan will suit everyone’s needs, but it is common practice to use a good one as a seed and adapt it. You can find some good examples here:
http://data.library.arizona.edu/data-management-plans/data-management-plan-examples
Browsing through several of these, you will discover ideas here and there that may fit your requirements.
Examples to illustrate Data Management and sharing (from NIH)
The precise content and level of detail to be included in a data-sharing plan depends on several factors, such as whether or not the investigator is planning to share data, the size and complexity of the dataset, and the like. Below are several examples of data-sharing plans.
Example 1
The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
Example 2
The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.
Example 3
This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/
User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
Useful resources about data management
- Teachers with the Data Carpentry initiative, motivated by the concerns referenced above, have worked on a set of recommendations (best practices) in a scientific paper, “Ten Simple Rules for Digital Data Storage”.
- William K. Michener has prepared a set of recommendations in a scientific paper “Ten Simple Rules for Creating a Good Data Management Plan”.
- The Digital Curation Centre has published online resources on its website that provide guidance and examples on Data Management Plans for a variety of purposes.
- The purpose of LEARN is to take the LERU Roadmap for Research Data produced by the League of European Research Universities (LERU) and to develop this in order to build a coordinated e-infrastructure across Europe and beyond. The “Toolkit of Best Practice for Research Data Management” associated with this initiative can be downloaded from http://learn-rdm.eu/en/dissemination/
- The FAIR principles, which we discuss elsewhere, have been adopted by the Horizon 2020 framework program. Hence, Guidelines on FAIR Data Management in Horizon 2020 have been established.
- A suitable alternative is to get some online training through a simple web-based tutorial on Data Management Plans (DMP), such as the one from Penn State University.
Specific issues with sensitive data
Particularly acute cases arise with sensitive data, for example in data management plans that involve patient data, even when there is informed consent. It is not always obvious, but human genetic data consisting only of stretches of sequence from a single human sample may be sufficient to uniquely identify a person. This could even be considered good, were it not for the fact that it may be used to violate individual privacy in ways that do not protect citizens.
To a large extent, the lack of protection stems from inadequate legislation. In general terms, there is very limited or no punishment for perpetrators who abuse the fundamental rights of individuals using sensitive data, for example non-anonymised health data used by businesses such as insurance companies, by employers, etc.
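Independently of legislation, a data management plan can specify how records are treated before they leave the team. The sketch below shows one common step, pseudonymisation: direct identifiers are dropped and the patient identifier is replaced by a keyed hash. The field names (`name`, `address`, `phone`, `patient_id`) and the function name `pseudonymise` are hypothetical. This is not anonymisation: it does not prevent deductive disclosure from unusual characteristics, and it does nothing about genetic sequence that can itself identify a person.

```python
import hashlib
import hmac

# Hypothetical field names; a real dataset will define its own list.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonymise(record: dict, secret_key: bytes, id_field: str = "patient_id") -> dict:
    """Return a copy of record without direct identifiers and with a keyed-hash ID."""
    # Drop the fields that identify a person directly.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace the identifier with an HMAC-SHA256 token; without the key,
    # the token cannot be recomputed from the original identifier.
    token = hmac.new(secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256)
    cleaned[id_field] = token.hexdigest()[:16]
    return cleaned
```

The same key always yields the same token for a given identifier, so records from one patient can still be linked across files. Whoever holds the key can recreate the mapping, so the plan should say where the key is kept and who may access it.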
Expected outcomes
After exploring this module you will:
- Be aware of the factors that justify the need for a data management plan.
- Know where to get guidelines or inspiration to create a data management plan.
- Be confident about which best practices to pick from the Research Data Management resource (http://learn-rdm.eu/en/dissemination/).
- Be aware of the specific difficulties that arise when sensitive data is used.