Document & Preserve Data

Selecting Data

Storing vs. Archiving Data

There is a difference between storing and archiving data. Storing refers to putting the data in a safe location while the research is ongoing. Because you are still working on the data, they change from time to time: they are cleaned and analysed, and this analysis generates output. As the image below illustrates, storing could be likened to cooking a dish: you are cleaning and combining ingredients.

Archiving, on the other hand, refers to putting the data in a safe place after the research is finished. The data are then in a fixed state and do not change anymore. Archiving is done for verification purposes, so that others can check that your research is sound, or so that others can reuse the resulting dataset. There is also a difference between archiving and publishing, but in essence both happen at a similar moment, and in both cases the data do not change anymore.

A Scriberia sketch about archiving and storage

This illustration is created by Scriberia with The Turing Way community. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

Selecting Data for Archiving

There are various reasons to archive your data: replication, longitudinal research, data being unique or expensive to collect, re-usability and acceleration of research inside or outside your own discipline. It is VU policy to archive your data for (at least) 10 years after the last publication based on the dataset. Part of preparing your dataset for archiving is appraising and selecting your data.

Make a selection before archiving your data

During your research you may accumulate a lot of data, some of which will be eligible for archiving. It is impossible to preserve all data indefinitely. Archiving all digital data leads to high costs for storage itself and for maintaining and managing an ever-growing volume of data and metadata; it may also lead to a decline in discoverability (see the website of the Digital Curation Centre). For those reasons, it is crucial that you make a selection.

Remove redundant and sensitive data

Selecting data means making choices about what to keep for the long term, what to archive securely and what to publish openly. This means that you have to decide whether your dataset contains data that need to be removed or separated. Reasons to exclude data from publishing include (but are not limited to):

  • data are redundant
  • data concern temporary byproducts which are irrelevant for future use
  • data contain material that is sensitive: for example personal data in the sense of the GDPR (such as consent forms, voice recordings or DNA data), state secrets, or commercially sensitive data. These data need to be separated from other data and archived securely
  • preserving data for the long term is in breach of contractual arrangements with your consortium partners or other parties involved

In preparing your dataset for archiving, the first step is to determine which parts of your data are sensitive, which can then be separated from the other data. Redundant data can be removed altogether.

Different forms of datasets for different purposes

Once you have separated the sensitive data from the rest of your dataset, you have to think about what to do with these sensitive materials. In some cases they may be destroyed, but you may also opt for archiving multiple datasets, each in a form suited to its purpose. For example:

  1. One that contains the non-sensitive data and can be shared for reuse, and
  2. A second one that contains the sensitive data and needs to be handled differently.

For the first, the non-sensitive data can be stored in an archive under restricted or open access conditions, so that you can share it and link it to publications. For the second, you need to make a separate selection, so the sensitive part can be stored safely in a secure archive (a so-called offline or dark archive). In the metadata of both archives you can create stable links between the two datasets using persistent identifiers.
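
As an illustration, the sketch below shows how such a link could look in DataCite-style metadata. The DOIs, titles and field names are made-up placeholders; the exact schema depends on the archive you use.

```python
# A minimal sketch of cross-linking two archived datasets through persistent
# identifiers, using DataCite-style relatedIdentifiers. All identifiers and
# titles are placeholders; check the metadata schema of your own repository.
open_dataset = {
    "identifier": {"identifier": "10.1234/open-dataset", "identifierType": "DOI"},
    "title": "Survey results (non-sensitive part, open or restricted access)",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/dark-archive-dataset",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementedBy",  # points to the securely archived part
        }
    ],
}

dark_dataset = {
    "identifier": {"identifier": "10.1234/dark-archive-dataset", "identifierType": "DOI"},
    "title": "Survey results (sensitive part, offline/dark archive)",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/open-dataset",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",  # points back to the shareable part
        }
    ],
}
```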

What to appraise for archiving

There are several factors that determine what data to select for archiving: for example, whether the data are unique or expensive to reproduce, or whether your funder requires you to make your data publicly available. These considerations can also help you or your department draw up a standard policy or procedure for what needs to be kept and what is vital for reproducing research or for reuse in future research projects.

More information on selecting data:

Data Set Packaging: Which Files should be Part of my Dataset?

A dataset consists of more than the data files alone: it also includes the documents that describe them.

See the section Metadata for more information about documenting your data.

Depending on the research project, more than one dataset may be stored, in more than one repository. Make sure that each consortium partner that collects data also archives the data required for transparency and verification. A Consortium Agreement and Data Management Plan will include information on who is responsible for archiving the data.
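
As a rough illustration, the Python sketch below checks whether a dataset folder contains a set of expected documentation files before it is handed over to an archive. The file names are hypothetical and should be adapted to your own project.

```python
# A minimal sketch that checks whether a dataset package contains the
# documentation files you expect. The expected file names are hypothetical.
from pathlib import Path

EXPECTED_FILES = ["README.txt", "codebook.csv", "reuse_conditions.txt"]

def missing_documents(dataset_dir: str) -> list[str]:
    """Return the expected documents that are not present in the dataset folder."""
    folder = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).exists()]

print(missing_documents("my_dataset"))  # e.g. ['codebook.csv'] if the codebook is missing
```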

Data Documentation

Documenting your data

Data documentation aims to describe the collected data to make them easier to use, retrieve and manage. Data documentation takes various forms and describes the data on multiple levels. The description of the dataset and the data objects is also referred to as metadata, i.e. data about the data. One way to add metadata is to attach a readme file to your data. ResearchData NL offers guidance for this. The CESSDA has made very detailed guidance available for creating documentation and metadata for your data.

An image layering five different kinds of metadata, with the FAIR principles on the outside, followed by discipline metadata standards, project documentation, metadata data set, and metadata data object.

In addition to describing their own datasets and objects, researchers can cross-refer to the project proposal where other researchers can find information about the research, e.g. aims and goals, methodology and data collection, the persons responsible for the project etc. The type of research and the nature of the data also influence what kind of documentation is necessary.

Different types of data are governed by different standards (see also the image above), and these should be taken into account when documenting data. These requirements include, but are not limited to:

  1. FAIR data principles: the set of principles (Findable, Accessible, Interoperable, Reusable) for data exchange.
  2. Disciplinary metadata standards: guidelines for documenting data. These standards can apply to the dataset as a whole, to individual data objects (see number 5), or to both.
  3. Project documentation: the description of a project involving data collection. This documentation is often used for research verification and provenance.
  4. Metadata dataset: the description of a dataset, often used for discovering datasets within a repository.
  5. Metadata of a data object: the naming and definition of a data object, often set up by the researcher to structure data, or by the research group to support collaboration during the project.

Codebooks

A codebook is a technical description of the data that were collected for a particular purpose in one or more datasets. It describes how the data are arranged in the computer file or files and what the parts or variables (numbers and letters) mean. A good description may also include specific instructions on how to use and interpret the data properly.

A low resolution screenshot of a spreadsheet

Like any other kind of “book,” some codebooks are better than others. The best codebooks include the following elements:

  • Description of the study: who did it, why they did it, how they did it
  • Sampling information: what was the population studied, how was the sample drawn, what was the response rate
  • Technical information about the files themselves: number of observations, record length, number of records per observation, etc.
  • Structure of the data within the file: hierarchical, multiple cards, etc.
  • Details about the data: the meaning of the variables, whether they are character or numeric, and if numeric, what format
  • Text of the questions and responses: some even include how many people responded in a particular way.
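
As a rough illustration, the Python sketch below generates the variable-level part of a codebook from a tabular data file using pandas. The file name is a made-up placeholder, and the descriptions of the variables still have to be written by hand.

```python
import pandas as pd

# Hypothetical data file; adapt the name and separator to your own dataset.
df = pd.read_csv("2017-02-23_ConsumerSpending_1.2.txt", sep="\t")

codebook = pd.DataFrame({
    "variable": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "n_non_missing": df.notna().sum().to_list(),
    "example_value": [
        df[col].dropna().iloc[0] if df[col].notna().any() else ""
        for col in df.columns
    ],
})
# The human-readable descriptions and question texts cannot be derived from
# the data; add them by hand before archiving.
codebook["description"] = ""
codebook.to_csv("codebook.csv", index=False)
```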

More information about codebooks can be found on the website of the Kent State University Library (specifically useful if you want to create a codebook in SPSS) and on the website of the Data Documentation Initiative (specifically useful for researchers in the social sciences). You can also look at the Guide to Social Science Data Preparation and Archiving, 5th ed., Inter-university Consortium for Political and Social Research (ICPSR), 2012.

Metadata

Controlled Vocabularies & Classifications

When a metadata section in a Data Management Plan template includes a question on the ontology used (if any), what is usually meant is: is a specific vocabulary or classification system used? Controlled vocabularies are created by domain experts to help translate ontological concepts and to organise knowledge for subsequent (information) retrieval. Controlled vocabularies (CESSDA: “structured controlled vocabularies”) are intended to ensure consistency and to reduce the ambiguity inherent in natural languages, where the same concept can be given different names. Controlled vocabularies are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Some vocabularies are widely accepted internationally and standardized, and may even become an ISO standard or a regional standard/classification. Controlled vocabularies can be broad in scope or limited to a specific field.

Examples are:

Many examples of vocabularies and classification systems can be found on the FAIRsharing.org website, which maintains a large list for multiple disciplines. If you are working on new concepts or new ideas and are using or creating your own ontology or terminology, be sure to include it as part of the metadata documentation in your dataset (for example as part of your codebook).

Controlled vocabularies help make searching for and re-using information or data much easier when they are part of a machine-readable metadata scheme or system.
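
As a small illustration, the Python sketch below checks a list of dataset keywords against a made-up controlled vocabulary; in practice you would use the terms of an established vocabulary for your discipline.

```python
# The vocabulary terms below are made-up placeholders; in practice you would
# load the terms of an established vocabulary for your field.
CONTROLLED_VOCABULARY = {"economics", "consumer behaviour", "survey data"}

def unknown_keywords(keywords):
    """Return the keywords that do not appear in the controlled vocabulary."""
    return [k for k in keywords if k.lower() not in CONTROLLED_VOCABULARY]

print(unknown_keywords(["Survey data", "shopping habits"]))
# -> ['shopping habits']: map this term to an accepted one, or document it yourself
```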

Metadata & Datasets

Metadata is descriptive information about data. Metadata allows humans and programs to understand and interpret information or data more easily. Controlled vocabularies are often used within machine-readable metadata schemes or systems to make searching for and re-using information or data much easier.

The CESSDA has created helpful guidance about creating metadata.

There are three main levels of metadata: data assets, dataset documentation and dataset registration (more information below).

Data assets

On a micro-level there are four functional categories of metadata standards for the datasets themselves, describing elements like structure, content, values (definitions, see also the codebook) and data formats (CSV, XML, etc.). Additionally, research groups often use a discipline’s standards to also describe data objects using naming conventions. There are, however, other guidelines for naming conventions and document versioning which can be useful for all documents, independent of whether they are research data or not. The table below gives an example of this.

Data Stage     | Dataset description                            | Type of data      | Versioning
Raw data       | Consumer spending data                         | Text files        | 2017-02-23_ConsumerSpending_1.2.txt
Processed data | Anonymised transcription of patient interviews | Word files, Excel | 2014-11-17_RawTranscription_Checked1.docx
Analysed data  | Photo images with descriptions                 | TIFF files        | C:\Images\Raw\2016-07-01_Subject1-V2.tiff, C:\Images\Clean\2016-07-01_Subject1-H1c.tiff
               |                                                | Word files        | C:\Images\Clean\Descript\2016-07-01_Subject1-H1c.docx
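
As a small illustration, the Python sketch below composes file names following the date_Description_Version pattern used in the table; the helper function and its arguments are hypothetical, not part of any standard.

```python
from datetime import date
from typing import Optional

def build_filename(description: str, version: str, extension: str,
                   on: Optional[date] = None) -> str:
    """Compose a name like 2017-02-23_ConsumerSpending_1.2.txt."""
    on = on or date.today()
    return f"{on.isoformat()}_{description}_{version}.{extension}"

print(build_filename("ConsumerSpending", "1.2", "txt", date(2017, 2, 23)))
# -> 2017-02-23_ConsumerSpending_1.2.txt
```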

Dataset documentation

On this general, descriptive level, the metadata concerns the packaging of the data and the documentation of the dataset as a whole. It can include items like:

  • Readme files that list the period of research, the collaborators, a short description of the research and the elements within the dataset
  • Codebook: provides descriptions, explanations or definitions of the variables in a dataset
  • Policy documents describing the context of the research and referring to the standard operating procedures used
  • Re-use guidelines (or licences) describing whether there are re-use restrictions or limitations, including contact details.
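
As a rough illustration, the Python sketch below writes a plain-text readme skeleton whose headings mirror the items listed above; all values are placeholders to be filled in by the researcher.

```python
# The headings mirror the documentation items listed above; every value is a
# placeholder to be filled in by hand.
readme = """\
Dataset: <title>
Period of research: <start date> - <end date>
Collaborators: <names and roles>
Short description: <what was studied, how and why>

Contents of this dataset:
- <file or folder>: <what it contains>

Codebook: <file name of the codebook>
Policy documents / standard operating procedures: <references>
Re-use conditions: <licence or access restrictions, contact details>
"""

with open("README.txt", "w", encoding="utf-8") as fh:
    fh.write(readme)
```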

Dataset registration

When you want to make sure that your dataset is findable, it is recommended to describe it according to a metadata standard; this allows for easier exchange of metadata and harvesting of the metadata by search engines. Many certified archives use a metadata standard for their descriptions. If you choose a data repository or registry, find out which metadata standard they use. At the VU the following standards are used:

Many archives implement or make use of specific metadata standards. The UK Digital Curation Centre (DCC) provides an overview of metadata standards for different disciplines. The list is a useful resource when establishing and carrying out your research methodology. Go to the overview of metadata standards. More tips are available in the section Dataset & Publication.
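
As an illustration only, the sketch below shows what a minimal registration record could look like in Dublin Core terms; Dublin Core is just one widely used general-purpose standard, so check which standard your repository or registry actually uses. All values are placeholders.

```python
# A minimal, illustrative dataset registration record using Dublin Core
# element names. Whether your repository uses Dublin Core or another standard
# is something to check; all values below are placeholders.
record = {
    "dc:title": "Consumer spending survey 2017",
    "dc:creator": "Surname, Given name",
    "dc:description": "Short description of the research and the dataset.",
    "dc:subject": ["economics", "consumer behaviour"],  # ideally from a controlled vocabulary
    "dc:date": "2017-02-23",
    "dc:type": "Dataset",
    "dc:identifier": "https://doi.org/10.1234/example-dataset",  # persistent identifier from the archive
    "dc:rights": "CC BY 4.0",
}
```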

Archiving & FAIR Principles

If you want to archive your dataset in such a way that it is compatible with the FAIR principles, you can use the information in this practical guide, which describes how to implement the FAIR data policy, and this table, which matches metadata fields from different systems (these documents were written for the Faculty of Behavioural and Movement Sciences).

The Dutch Techcentre for Life Sciences has developed open-source software to enable you to make your dataset’s metadata FAIR. The software is developed on GitHub, where full details on the FAIR Data Point software are available. The Dutch eScience Center has also developed FAIR Data Point software, of which full details are, similarly, available on GitHub.