Data quality control

Douwe Molenaar

Systems Biology Lab

6/1/23

Time spent on data analysis

  • 25% getting to know the data (again …)
  • 25% wrangling data, correcting, getting it into the right format
  • 50% calculating, making figures and text

Example

Effort that pays off

  • Storing data in the right, tidy format
  • Building a routine in organising project directories
  • Combining code with figures, tables and text
  • Getting to know tidyverse (R) or pandas (Python) functions:
    • for data wrangling
    • to make figures

What will we do?

  • Give seven rules with examples for data formatting
  • Show how to manipulate data tables in R

Seven rules for formatting raw data

  1. Give each variable its own column
  2. Avoid spaces and special characters in variable names
  3. Give every observation its own row
  4. Split data into different tables if you would (often) repeat a combination of variables
  5. Use values consistently
  6. Use empty positions for missing data
  7. Add a separate metadata file

One variable - one column

  • Do not combine variables
  • Do not use cell formatting to convey information (spreadsheets)
  • Put comments in a separate column
  • Label bad or uncertain data in a separate column
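
A minimal sketch of the first point, using only the Python standard library. The column names (`sample`, `od600`) and the `strain_temperature` naming pattern are hypothetical, chosen only to show how a combined variable could be split into two columns:

```python
import csv
import io

# Hypothetical raw data: the 'sample' column combines strain and temperature.
raw = """sample,od600
wt_30C,0.42
wt_37C,0.61
mutA_30C,0.35
"""

def split_sample(row):
    """Return a new row with 'sample' split into 'strain' and 'temperature_c'."""
    strain, temperature = row["sample"].split("_")
    return {
        "strain": strain,
        "temperature_c": int(temperature.rstrip("C")),  # "30C" becomes 30
        "od600": float(row["od600"]),
    }

tidy_rows = [split_sample(row) for row in csv.DictReader(io.StringIO(raw))]

for row in tidy_rows:
    print(row)
```

With the variables separated, filtering or grouping by temperature no longer requires parsing text.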

One observation - one row
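
As a hypothetical illustration of this rule, the sketch below reshapes a "wide" table (one column per time point) into a "long" table with one measurement per row. The column names (`t0`, `t60`, `t120`, `od600`) are assumptions made for the example, and the time is parsed from those names:

```python
# Hypothetical wide table: one row per strain, one column per time point.
wide = [
    {"strain": "wt", "t0": 0.05, "t60": 0.21, "t120": 0.48},
    {"strain": "mutA", "t0": 0.05, "t60": 0.15, "t120": 0.30},
]

def to_long(rows, id_column, value_name):
    """Turn every non-id column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                # Assumes time columns are named "t<minutes>", e.g. "t60".
                "time_min": int(column.lstrip("t")),
                value_name: value,
            })
    return long_rows

long_table = to_long(wide, id_column="strain", value_name="od600")

for row in long_table:
    print(row)
```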

Split tables with redundant information

Rows 1–20 from a huge (over 30,000 rows) table with redundant information

Split tables with redundant information

The same information split into two, non-redundant linked tables
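
A rough sketch of such a split, with invented data. The table contents and the names `strain_id`, `genotype` and `origin` are hypothetical. The repeated strain information moves to its own table, and the measurements keep only the key that links the two. The check inside the loop guards against one common risk of redundancy, namely repeated values that contradict each other:

```python
# Hypothetical redundant table: genotype and origin are repeated on every row.
measurements_redundant = [
    {"strain_id": "S1", "genotype": "wild type", "origin": "lab A", "time_min": 0, "od600": 0.05},
    {"strain_id": "S1", "genotype": "wild type", "origin": "lab A", "time_min": 60, "od600": 0.21},
    {"strain_id": "S2", "genotype": "delta-x", "origin": "lab B", "time_min": 0, "od600": 0.05},
    {"strain_id": "S2", "genotype": "delta-x", "origin": "lab B", "time_min": 60, "od600": 0.15},
]

KEY = "strain_id"                        # links the two tables
STRAIN_COLUMNS = ["genotype", "origin"]  # information that belongs to the strain

strains = {}        # one entry per strain
measurements = []   # one row per measurement, without the repeated columns
for row in measurements_redundant:
    strain_info = {c: row[c] for c in STRAIN_COLUMNS}
    # Refuse to split if the repeated information contradicts itself.
    previous = strains.setdefault(row[KEY], strain_info)
    if previous != strain_info:
        raise ValueError(f"Inconsistent strain information for {row[KEY]}")
    measurements.append({c: v for c, v in row.items() if c not in STRAIN_COLUMNS})

print(strains)
print(measurements)
```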

Why should information be non-redundant?


For example, because of this:

Consistent values

Counter-examples:

  • M, m, Male, male, etc.
  • True, true, yes, 1, etc.
  • The sneaky variant with leading or trailing spaces: “ M”, “M ”, “M”
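
One possible way to clean up such values after the fact is sketched below. The mapping `SEX_CODES` is a hypothetical example. Unknown spellings raise an error rather than being passed through silently:

```python
# Hypothetical mapping from spellings seen in the raw data to one canonical code.
SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def normalise_sex(value):
    """Strip whitespace, ignore case, and map to a canonical code."""
    key = value.strip().lower()
    if key not in SEX_CODES:
        raise ValueError(f"Unexpected value: {value!r}")
    return SEX_CODES[key]

raw_values = ["M", "m", "Male", "male", " M", "M "]
cleaned = [normalise_sex(v) for v in raw_values]
print(sorted(set(cleaned)))  # prints ['M']
```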

Represent missing data by empty positions

  • Or, if that is not explicit enough for your taste, use a single, unique flag like NA (not available)
  • Avoid using numbers like 999, -999, 0, etc.
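
The sketch below, with invented data, shows why this matters. Empty positions and the `NA` flag are read as "missing" and left out of the calculation, whereas a numeric code such as -999 would be averaged in as if it were a measurement. The column name `weight_kg` and the set `MISSING_FLAGS` are assumptions for the example:

```python
import csv
import io
import statistics

# Hypothetical raw data: one weight is an empty position, one is flagged NA.
raw = """subject,weight_kg
a,70.5
b,
c,NA
d,82.0
"""

MISSING_FLAGS = {"", "NA"}

def parse_number(text):
    """Return a float, or None when the value is reported as missing."""
    text = text.strip()
    return None if text in MISSING_FLAGS else float(text)

weights = [parse_number(row["weight_kg"]) for row in csv.DictReader(io.StringIO(raw))]
observed = [w for w in weights if w is not None]
mean_weight = statistics.mean(observed)

# For contrast: the same data with missing values coded as -999.
mean_with_sentinel = statistics.mean([70.5, -999, -999, 82.0])

print(mean_weight)          # 76.25
print(mean_with_sentinel)   # -461.375
```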


Example: 101 was not a number

Add a metadata file

  • A general description of the data files
  • Describe the source of the data
    • Author, dates, places etc.
  • Describe each of the variables
    • What are their units
    • What are their types and ranges
    • How are missing values reported
  • Add a citation or DOI, if the work was published

Example:

Quality checking a data set

  • Explicitly state your assumptions about (combinations of) variable values:
    • their absence, presence, cardinality and type, range, distribution.
    • how are missing values reported?
  • Check these assumptions programmatically.
  • If you detect irregularities:
    • are your assumptions correct or are these errors?
      • should they be corrected/deleted?
      • how does quality affect your results?
      • do you still trust the data?
  • Publish your quality analysis with the results.
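
A minimal sketch of checking assumptions programmatically. The table, the column names and the limits (for example, ages between 0 and 120) are hypothetical. The function collects all problems instead of stopping at the first one, so the full list can be reviewed and reported with the results:

```python
# Hypothetical table and assumptions; adapt the column names and limits.
rows = [
    {"subject": "a", "sex": "M", "age": 34, "weight_kg": 70.5},
    {"subject": "b", "sex": "F", "age": 29, "weight_kg": None},
    {"subject": "b", "sex": "F", "age": 290, "weight_kg": 61.0},
    {"subject": "d", "sex": "male", "age": 41, "weight_kg": 82.0},
]

REQUIRED_COLUMNS = {"subject", "sex", "age", "weight_kg"}

def check_table(rows):
    """Return a list of human-readable problems found in the table."""
    problems = []
    seen_subjects = set()
    for i, row in enumerate(rows, start=1):
        # Presence: every expected variable must be there.
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        # Cardinality: each subject should occur only once.
        if row["subject"] in seen_subjects:
            problems.append(f"row {i}: duplicate subject {row['subject']!r}")
        seen_subjects.add(row["subject"])
        # Consistent values: only the agreed codes are allowed.
        if row["sex"] not in {"M", "F"}:
            problems.append(f"row {i}: unexpected sex {row['sex']!r}")
        # Type and range.
        if not isinstance(row["age"], int) or not 0 <= row["age"] <= 120:
            problems.append(f"row {i}: age {row['age']!r} outside 0-120")
        # Missing values are reported, not silently dropped.
        if row["weight_kg"] is None:
            problems.append(f"row {i}: weight_kg is missing")
    return problems

problems = check_table(rows)
for problem in problems:
    print(problem)
```

Whether each reported problem is an error or a wrong assumption still has to be decided by someone who knows the data.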