Data quality control
Douwe Molenaar
Systems Biology Lab
6/1/23
Time spent on data analysis
- 25% getting to know the data (again …)
 
- 25% wrangling data, correcting, getting it into the right format
 
- 50% calculating, making figures and text
 
Example
Effort that pays off
- Storing data in the right, tidy format
 
- Building a routine in organising project directories
 
- Combining code with figures, tables and text
 
- Getting to know tidyverse (R) or pandas (Python) functions:
  - for data wrangling
  - to make figures
 
 
What will we do?
- Give seven rules with examples for data formatting
 
- Show how to manipulate data tables in R
 
One variable - one column
- Do not combine variables
 
- Do not use cell formatting to convey information (spreadsheets)
 
- Put comments in a separate column
 
- Label bad or uncertain data in a separate column
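The rules above can be sketched with pandas. The table, its column names and the asterisk convention for uncertain values are hypothetical, made up for illustration only.

```python
import pandas as pd

# Hypothetical messy table: sex and age share one column, and an
# uncertain measurement is marked with an asterisk inside the value.
messy = pd.DataFrame({
    "sample": ["s1", "s2", "s3"],
    "sex_age": ["M_34", "F_28", "F_41"],
    "weight_kg": ["71.2", "58.9*", "64.0"],
})

tidy = messy.copy()

# One variable - one column: split the combined variable.
tidy[["sex", "age"]] = tidy["sex_age"].str.split("_", expand=True)
tidy["age"] = tidy["age"].astype(int)

# Label uncertain data in a separate column instead of inside the value.
tidy["weight_uncertain"] = tidy["weight_kg"].str.endswith("*")
tidy["weight_kg"] = tidy["weight_kg"].str.rstrip("*").astype(float)

tidy = tidy.drop(columns="sex_age")
print(tidy)
```

After this step `weight_kg` is numeric again, so it can be averaged or plotted without special handling of the marker.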
 
 
 
One observation - one row
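A minimal sketch of this rule, assuming a hypothetical "wide" table in which each row holds two optical density measurements of the same strain. Reshaping with `melt` gives one measurement per row.

```python
import pandas as pd

# Hypothetical wide table: two observations (time points) per row.
wide = pd.DataFrame({
    "strain": ["A", "B"],
    "od_t0": [0.05, 0.04],
    "od_t1": [0.21, 0.18],
})

# Reshape so that each row is a single observation.
long = wide.melt(id_vars="strain", var_name="timepoint", value_name="od")

# The time point is a variable of its own, so strip the "od_" prefix.
long["timepoint"] = long["timepoint"].str.replace("od_", "", regex=False)
print(long)
```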
Consistent values
Counter-examples:
- M, m, Male, male, etc.
 
- True, true, yes, 1, etc.
 
- The sneaky variant: “ M”, “M ”, “M”
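Inconsistencies like these can be detected and repaired in a few lines. The sketch below uses pandas; the example values and the mapping `sex_map` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings and stray whitespace.
raw = pd.Series([" M", "M ", "m", "Male", "male", "F", "female"])

# Counting distinct values reveals the variants, including the sneaky ones.
print(raw.value_counts())

# Normalise: strip whitespace, lower-case, then map to one agreed code.
sex_map = {"m": "M", "male": "M", "f": "F", "female": "F"}
cleaned = raw.str.strip().str.lower().map(sex_map)

# Values that the mapping does not cover become missing; list them
# rather than guessing what they mean.
unmapped = raw[cleaned.isna()]
print(cleaned.tolist())
```

Anything left in `unmapped` needs a decision by a person, not by the script.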
 
Represent missing data by empty positions
- Or, if that is not explicit enough for your taste, use a single, unique flag like NA (not available)
- Avoid using numbers like 999, -999, 0, etc.
Example: 101 was not a number
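A small sketch of why numeric codes for missing data are risky, assuming a hypothetical CSV file in which -999 was used as such a code. pandas reads empty fields and NA as missing by default, but it treats -999 as a real measurement unless told otherwise.

```python
import io
import pandas as pd

# Hypothetical file content: one empty field, one NA flag, one -999 code.
csv_text = "sample,height_cm\ns1,172\ns2,\ns3,NA\ns4,-999\n"

# Default reading: empty and NA become missing, -999 stays a number.
df = pd.read_csv(io.StringIO(csv_text))
print(df["height_cm"].mean())        # distorted by the -999 code

# Declaring the code explicitly turns it into a missing value as well.
df_fixed = pd.read_csv(io.StringIO(csv_text), na_values=["-999"])
print(df_fixed["height_cm"].mean())  # based on the one real measurement
```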
 
 
 
Quality checking a data set
- Explicitly state your assumptions about (combinations of) variable values:
  - their absence, presence, cardinality and type, range, distribution.
  - how are missing values reported?
- Check these assumptions programmatically.
- If you detect irregularities:
  - are your assumptions correct or are these errors?
  - should they be corrected/deleted?
  - how does quality affect your results?
  - do you still trust the data?
- Publish your quality analysis with the results.
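Checking assumptions programmatically could look like the following pandas sketch. The table, the expected columns, the allowed codes and the age range of 0 to 120 are all assumptions invented for this example. The checks collect irregularities instead of stopping at the first one, so the result can be published with the analysis.

```python
import pandas as pd

# Hypothetical data set with a few deliberate irregularities.
df = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s3"],
    "sex": ["M", "F", "f", "F"],
    "age": [34, 28, 141, None],
})

problems = {}

# Presence: all expected columns exist.
expected = {"sample", "sex", "age"}
problems["missing_columns"] = sorted(expected - set(df.columns))

# Cardinality: each sample identifier occurs exactly once.
is_dup = df["sample"].duplicated(keep=False)
problems["duplicate_samples"] = sorted(set(df.loc[is_dup, "sample"]))

# Consistent values: only the agreed codes are used.
problems["bad_sex"] = df.loc[~df["sex"].isin(["M", "F"]), "sex"].tolist()

# Range: ages lie between 0 and 120 (assumed plausible range).
out_of_range = (df["age"] < 0) | (df["age"] > 120)
problems["age_out_of_range"] = df.loc[out_of_range, "age"].tolist()

# Missing values: count them explicitly.
problems["n_missing_age"] = int(df["age"].isna().sum())

for name, found in problems.items():
    print(name, found)
```

Each entry in `problems` is a starting point for the questions listed above, not an automatic correction.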