Data quality control
Douwe Molenaar
Systems Biology Lab
6/1/23
Time spent on data analysis
- 25% getting to know the data (again …)
- 25% wrangling data, correcting, getting it into the right format
- 50% calculating, making figures and text
Example
Effort that pays off
- Storing data in the right, tidy format
- Building a routine in organising project directories
- Combining code with figures, tables and text
- Getting to know tidyverse (R) or pandas (Python) functions:
  - for data wrangling
  - to make figures
What will we do?
- Give seven rules with examples for data formatting
- Show how to manipulate data tables in R
One variable - one column
- Do not combine variables
- Do not use cell formatting to convey information (spreadsheets)
- Put comments in a separate column
- Label bad or uncertain data in a separate column
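A minimal sketch of these rules in plain Python, so that it runs without extra packages. The records, the column names (`sex_age`, `weight_kg`, `comment`, `uncertain`) and the rule that any remark marks a value as uncertain are assumptions made up for illustration.

```python
# Hypothetical records in which one field mixes two variables (sex and age)
# and a remark is embedded in a measurement.
raw_rows = [
    {"id": "s1", "sex_age": "M_34", "weight_kg": "71.2"},
    {"id": "s2", "sex_age": "F_29", "weight_kg": "64.0 (scale unstable)"},
]


def tidy_row(row):
    """Split combined fields so that every variable has its own column."""
    sex, age = row["sex_age"].split("_")
    # Separate the numeric value from an optional remark in parentheses.
    value, _, remark = row["weight_kg"].partition("(")
    comment = remark.rstrip(")").strip()
    return {
        "id": row["id"],
        "sex": sex,
        "age": int(age),
        "weight_kg": float(value),
        "comment": comment,
        # Assumption for this sketch: any remark marks the value as uncertain.
        "uncertain": comment != "",
    }


tidy_rows = [tidy_row(r) for r in raw_rows]
for r in tidy_rows:
    print(r)
```

After this step the measurement column holds only numbers, and the remark and the uncertainty flag can be filtered on like any other variable.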
One observation - one row
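A possible illustration, assuming a hypothetical "wide" table with one column per time point. The helper `to_long` and the names `sample`, `time_point` and `od600` are invented for this sketch.

```python
# Hypothetical wide table: one row per sample, one column per time point.
wide = [
    {"sample": "A", "t0": 0.10, "t1": 0.35, "t2": 0.80},
    {"sample": "B", "t0": 0.12, "t1": 0.30, "t2": 0.75},
]


def to_long(rows, id_column, key_name, value_name):
    """Return one row per (id, key) pair, i.e. one row per observation."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], key_name: key, value_name: value}
            )
    return long_rows


long_rows = to_long(wide, id_column="sample", key_name="time_point",
                    value_name="od600")
for r in long_rows:
    print(r)
```

The same reshaping is available as `pivot_longer()` in tidyr (R) and `melt()` in pandas (Python).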
Consistent values
Counter-examples:
- M, m, Male, male, etc.
- True, true, yes, 1, etc.
- The sneaky variant: “ M”, “M ”, “M”
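One way to repair such values, sketched in plain Python. The mapping tables `SEX_LABELS` and `BOOL_LABELS` are assumptions: which spellings occur, and what they should map to, depends on the data set.

```python
# Assumed mapping from observed spellings to one canonical value per category.
SEX_LABELS = {"m": "M", "male": "M", "f": "F", "female": "F"}
BOOL_LABELS = {"true": True, "yes": True, "1": True,
               "false": False, "no": False, "0": False}


def normalise(value, labels):
    """Strip whitespace, ignore case, and map to the canonical label."""
    key = str(value).strip().lower()
    if key not in labels:
        # Fail loudly instead of guessing what an unknown spelling means.
        raise ValueError(f"unexpected value: {value!r}")
    return labels[key]


print([normalise(v, SEX_LABELS) for v in ["M", "m", "Male", " M", "M "]])
print([normalise(v, BOOL_LABELS) for v in ["True", "true", "yes", 1]])
```

Raising an error on unknown spellings is a deliberate choice here: a new variant should be noticed and added to the mapping, not silently passed through.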
Represent missing data by empty positions
- Or, if that is not explicit enough to your taste, use a single, unique flag like NA (not available)
- Avoid using numbers like 999, -999, 0, etc.
Example: 101 was not a number
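A small sketch of why numeric placeholders are risky, using Python's standard `csv` module. The CSV text, the column `length_mm` and the set `MISSING` are made up for illustration.

```python
import csv
import io

# Hypothetical CSV text: rows b and c are missing and recognisable as such,
# row d uses -999 as a placeholder.
text = "id,length_mm\na,12.5\nb,\nc,NA\nd,-999\n"

# Assumption: an empty position and the flag "NA" both mean "missing".
MISSING = {"", "NA"}


def parse_length(field):
    """Return a float, or None when the value is missing."""
    field = field.strip()
    return None if field in MISSING else float(field)


rows = list(csv.DictReader(io.StringIO(text)))
lengths = [parse_length(r["length_mm"]) for r in rows]
print(lengths)  # [12.5, None, None, -999.0]

# The placeholder -999 is indistinguishable from a real measurement,
# so it silently enters the mean.
present = [x for x in lengths if x is not None]
print(sum(present) / len(present))  # -493.25
```

Empty positions and `NA` are recognised and skipped; the numeric placeholder is not, and it distorts the summary statistic.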
Quality checking a data set
- Explicitly state your assumptions about (combinations of) variable values:
  - their absence, presence, cardinality and type, range, distribution.
  - how are missing values reported?
- Check these assumptions programmatically.
- If you detect irregularities:
  - are your assumptions correct or are these errors?
  - should they be corrected/deleted?
  - how does quality affect your results?
  - do you still trust the data?
- Publish your quality analysis with the results.
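Checking assumptions programmatically could look roughly like this. The records, the allowed values for `sex` and the age range 0 to 120 are assumptions chosen for the sketch, not rules from the course.

```python
# Hypothetical records and assumptions, for illustration only.
records = [
    {"id": "s1", "sex": "M", "age": 34},
    {"id": "s2", "sex": "F", "age": None},
    {"id": "s2", "sex": "male", "age": 212},
]


def check(records):
    """Return a list of (row index, message) for every violated assumption."""
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # Cardinality: each id is assumed to occur exactly once.
        if rec["id"] in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(rec["id"])
        # Allowed values: sex is assumed to be coded as "M" or "F".
        if rec["sex"] not in {"M", "F"}:
            problems.append((i, "unexpected sex value"))
        # Missing values are assumed to be reported as None;
        # otherwise age must be an integer between 0 and 120.
        age = rec["age"]
        if age is None:
            problems.append((i, "age missing"))
        elif not isinstance(age, int) or not 0 <= age <= 120:
            problems.append((i, "age out of range"))
    return problems


problems = check(records)
for row, message in problems:
    print(row, message)
```

The function only reports irregularities and does not correct them. Whether a flagged value is an error or a wrong assumption remains a decision for the analyst, and the resulting list can be published together with the results.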