Practicing data wrangling

The Portal Project data

The Portal Project is a long-term project on desert ecology, for example on the banner-tailed kangaroo rat. Part of the data can be found in the folder data as a tab separated values file. We are going practice our data wrangling skills using these data. We will use the tidyverse packages for this.

Banner-tailed kangaroo rat (Dipodomys spectabilis)

We load and display the first few lines of the data. Alternatively, you open the data in excel or a text editor.

library(tidyverse)                        # loads several packages from the tidyverse "ecosystem"
d <- read_tsv('data/portal_project.tsv')  # read_tsv() is from the readr package
d |>
  slice_head(n=20) |>                     # select first 20 rows
  knitr::kable() |>                       # the kable() function outputs html tables 
  kableExtra::kable_classic(font_size=12) # nicer formatting
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
1 7 16 1977 2 NL M 32 NA Neotoma albigula Rodent
2 7 16 1977 3 NL M 33 NA Neotoma albigula Rodent
3 7 16 1977 2 DM F 37 NA Dipodomys merriami Rodent
4 7 16 1977 7 DM M 36 NA Dipodomys merriami Rodent
5 7 16 1977 3 DM M 35 NA Dipodomys merriami Rodent
6 7 16 1977 1 PF M 14 NA Perognathus flavus Rodent
7 7 16 1977 2 PE F NA NA Peromyscus eremicus Rodent
8 7 16 1977 1 DM M 37 NA Dipodomys merriami Rodent
9 7 16 1977 1 DM F 34 NA Dipodomys merriami Rodent
10 7 16 1977 6 PF F 20 NA Perognathus flavus Rodent
11 7 16 1977 5 DS F 53 NA Dipodomys spectabilis Rodent
12 7 16 1977 7 DM M 38 NA Dipodomys merriami Rodent
13 7 16 1977 3 DM M 35 NA Dipodomys merriami Rodent
14 7 16 1977 8 DM NA NA NA Dipodomys merriami Rodent
15 7 16 1977 6 DM F 36 NA Dipodomys merriami Rodent
16 7 16 1977 4 DM F 36 NA Dipodomys merriami Rodent
17 7 16 1977 3 DS F 48 NA Dipodomys spectabilis Rodent
18 7 16 1977 2 PP M 22 NA Chaetodipus penicillatus Rodent
19 7 16 1977 4 PF NA NA NA Perognathus flavus Rodent
20 7 17 1977 11 DS F 48 NA Dipodomys spectabilis Rodent

Split the table

There’s quite some redundancy in the data. In particular, genus, species and taxa values are repeated many times for the same species_ID. We’ll split the original table into two: species and observations, which are linked via the species identifier species_id.

species <- select(d, species_id, species, genus, taxa) |>
  distinct() |>
  mutate(species_name = paste(genus, species)) # add a proper name for the species

observations <- select(d, -c(species, genus, taxa))
observations |>
  slice_head(n=10) |>
  knitr::kable() |>
  kableExtra::kable_classic(font_size=12, full_width=FALSE)
record_id month day year plot_id species_id sex hindfoot_length weight
1 7 16 1977 2 NL M 32 NA
2 7 16 1977 3 NL M 33 NA
3 7 16 1977 2 DM F 37 NA
4 7 16 1977 7 DM M 36 NA
5 7 16 1977 3 DM M 35 NA
6 7 16 1977 1 PF M 14 NA
7 7 16 1977 2 PE F NA NA
8 7 16 1977 1 DM M 37 NA
9 7 16 1977 1 DM F 34 NA
10 7 16 1977 6 PF F 20 NA
species |>
  knitr::kable() |>
  kableExtra::kable_classic(font_size=12, full_width=FALSE)
species_id species genus taxa species_name
NL albigula Neotoma Rodent Neotoma albigula
DM merriami Dipodomys Rodent Dipodomys merriami
PF flavus Perognathus Rodent Perognathus flavus
PE eremicus Peromyscus Rodent Peromyscus eremicus
DS spectabilis Dipodomys Rodent Dipodomys spectabilis
PP penicillatus Chaetodipus Rodent Chaetodipus penicillatus
SH hispidus Sigmodon Rodent Sigmodon hispidus
OT torridus Onychomys Rodent Onychomys torridus
DO ordii Dipodomys Rodent Dipodomys ordii
OX sp. Onychomys Rodent Onychomys sp.
SS spilosoma Spermophilus Rodent Spermophilus spilosoma
OL leucogaster Onychomys Rodent Onychomys leucogaster
RM megalotis Reithrodontomys Rodent Reithrodontomys megalotis
SA audubonii Sylvilagus Rabbit Sylvilagus audubonii
PM maniculatus Peromyscus Rodent Peromyscus maniculatus
AH harrisi Ammospermophilus Rodent Ammospermophilus harrisi
DX sp. Dipodomys Rodent Dipodomys sp.
AB bilineata Amphispiza Bird Amphispiza bilineata
CB brunneicapillus Campylorhynchus Bird Campylorhynchus brunneicapillus
CM melanocorys Calamospiza Bird Calamospiza melanocorys
CQ squamata Callipepla Bird Callipepla squamata
RF fulvescens Reithrodontomys Rodent Reithrodontomys fulvescens
PC chlorurus Pipilo Bird Pipilo chlorurus
PG gramineus Pooecetes Bird Pooecetes gramineus
PH hispidus Perognathus Rodent Perognathus hispidus
PU fuscus Pipilo Bird Pipilo fuscus
CV viridis Crotalus Reptile Crotalus viridis
UR sp. Rodent Rodent Rodent sp.
UP sp. Pipilo Bird Pipilo sp.
ZL leucophrys Zonotrichia Bird Zonotrichia leucophrys
UL sp. Lizard Reptile Lizard sp.
CS scutalatus Crotalus Reptile Crotalus scutalatus
SC clarki Sceloporus Reptile Sceloporus clarki
BA taylori Baiomys Rodent Baiomys taylori
SF fulviventer Sigmodon Rodent Sigmodon fulviventer
RO montanus Reithrodontomys Rodent Reithrodontomys montanus
AS savannarum Ammodramus Bird Ammodramus savannarum
SO ochrognathus Sigmodon Rodent Sigmodon ochrognathus
PI intermedius Chaetodipus Rodent Chaetodipus intermedius
ST tereticaudus Spermophilus Rodent Spermophilus tereticaudus
CU uniparens Cnemidophorus Reptile Cnemidophorus uniparens
SU undulatus Sceloporus Reptile Sceloporus undulatus
RX sp. Reithrodontomys Rodent Reithrodontomys sp.
PB baileyi Chaetodipus Rodent Chaetodipus baileyi
PL leucopus Peromyscus Rodent Peromyscus leucopus
PX sp. Chaetodipus Rodent Chaetodipus sp.
CT tigris Cnemidophorus Reptile Cnemidophorus tigris
US sp. Sparrow Bird Sparrow sp.

What do we want to know about these data?

Suggestions by audience?

My suggestions + code:

Mean length of hindfoot per species
# overall mean length
mean_length <- mean(observations$hindfoot_length, na.rm=TRUE) # have to remove NA's
mean_lenght_ps <- observations |>
  group_by(species_id) |>
  summarise(mean_hf_length = mean(hindfoot_length, na.rm = TRUE)) |>
  inner_join(species)
mean_lenght_ps |>
  knitr::kable() |>
  kableExtra::kable_classic(font_size=12, full_width=FALSE)
species_id mean_hf_length species genus taxa species_name
AB NaN bilineata Amphispiza Bird Amphispiza bilineata
AH 33.00000 harrisi Ammospermophilus Rodent Ammospermophilus harrisi
AS NaN savannarum Ammodramus Bird Ammodramus savannarum
BA 13.00000 taylori Baiomys Rodent Baiomys taylori
CB NaN brunneicapillus Campylorhynchus Bird Campylorhynchus brunneicapillus
CM NaN melanocorys Calamospiza Bird Calamospiza melanocorys
CQ NaN squamata Callipepla Bird Callipepla squamata
CS NaN scutalatus Crotalus Reptile Crotalus scutalatus
CT NaN tigris Cnemidophorus Reptile Cnemidophorus tigris
CU NaN uniparens Cnemidophorus Reptile Cnemidophorus uniparens
CV NaN viridis Crotalus Reptile Crotalus viridis
DM 35.98235 merriami Dipodomys Rodent Dipodomys merriami
DO 35.60755 ordii Dipodomys Rodent Dipodomys ordii
DS 49.94887 spectabilis Dipodomys Rodent Dipodomys spectabilis
DX NaN sp. Dipodomys Rodent Dipodomys sp.
NL 32.29423 albigula Neotoma Rodent Neotoma albigula
OL 20.53261 leucogaster Onychomys Rodent Onychomys leucogaster
OT 20.26741 torridus Onychomys Rodent Onychomys torridus
OX 19.12500 sp. Onychomys Rodent Onychomys sp.
PB 26.11592 baileyi Chaetodipus Rodent Chaetodipus baileyi
PC NaN chlorurus Pipilo Bird Pipilo chlorurus
PE 20.19554 eremicus Peromyscus Rodent Peromyscus eremicus
PF 15.58339 flavus Perognathus Rodent Perognathus flavus
PG NaN gramineus Pooecetes Bird Pooecetes gramineus
PH 25.77419 hispidus Perognathus Rodent Perognathus hispidus
PI 22.22222 intermedius Chaetodipus Rodent Chaetodipus intermedius
PL 20.02778 leucopus Peromyscus Rodent Peromyscus leucopus
PM 20.42621 maniculatus Peromyscus Rodent Peromyscus maniculatus
PP 21.75157 penicillatus Chaetodipus Rodent Chaetodipus penicillatus
PU NaN fuscus Pipilo Bird Pipilo fuscus
PX 19.50000 sp. Chaetodipus Rodent Chaetodipus sp.
RF 17.52055 fulvescens Reithrodontomys Rodent Reithrodontomys fulvescens
RM 16.44185 megalotis Reithrodontomys Rodent Reithrodontomys megalotis
RO 15.37500 montanus Reithrodontomys Rodent Reithrodontomys montanus
RX 18.50000 sp. Reithrodontomys Rodent Reithrodontomys sp.
SA NaN audubonii Sylvilagus Rabbit Sylvilagus audubonii
SC NaN clarki Sceloporus Reptile Sceloporus clarki
SF 26.70732 fulviventer Sigmodon Rodent Sigmodon fulviventer
SH 28.54962 hispidus Sigmodon Rodent Sigmodon hispidus
SO 25.65854 ochrognathus Sigmodon Rodent Sigmodon ochrognathus
SS NaN spilosoma Spermophilus Rodent Spermophilus spilosoma
ST NaN tereticaudus Spermophilus Rodent Spermophilus tereticaudus
SU NaN undulatus Sceloporus Reptile Sceloporus undulatus
UL NaN sp. Lizard Reptile Lizard sp.
UP NaN sp. Pipilo Bird Pipilo sp.
UR NaN sp. Rodent Rodent Rodent sp.
US NaN sp. Sparrow Bird Sparrow sp.
ZL NaN leucophrys Zonotrichia Bird Zonotrichia leucophrys
# Nicer formatting
mean_lenght_ps |>
  select(species_name, mean_hf_length) |>        # only display name and hindfoot length
  filter(is.finite(mean_hf_length)) |>           # only display numbers for observed species
  knitr::kable() |>
  kableExtra::kable_classic(font_size=12, full_width=FALSE)
species_name mean_hf_length
Ammospermophilus harrisi 33.00000
Baiomys taylori 13.00000
Dipodomys merriami 35.98235
Dipodomys ordii 35.60755
Dipodomys spectabilis 49.94887
Neotoma albigula 32.29423
Onychomys leucogaster 20.53261
Onychomys torridus 20.26741
Onychomys sp. 19.12500
Chaetodipus baileyi 26.11592
Peromyscus eremicus 20.19554
Perognathus flavus 15.58339
Perognathus hispidus 25.77419
Chaetodipus intermedius 22.22222
Peromyscus leucopus 20.02778
Peromyscus maniculatus 20.42621
Chaetodipus penicillatus 21.75157
Chaetodipus sp. 19.50000
Reithrodontomys fulvescens 17.52055
Reithrodontomys megalotis 16.44185
Reithrodontomys montanus 15.37500
Reithrodontomys sp. 18.50000
Sigmodon fulviventer 26.70732
Sigmodon hispidus 28.54962
Sigmodon ochrognathus 25.65854
Ratio of males / females per species and per year
ratios <- observations |>
  group_by(species_id, year, sex) |>
  summarise(nr = n()) |>
  pivot_wider(names_from = sex, values_from = nr) |>
  mutate(mfratio = M/F)
ratios |>
  ungroup() |>
  filter(!is.na(mfratio)) |>
  inner_join(species) |>
  select(species_name, year, mfratio) |>
  arrange(species_name, year) |>
  slice_head(n=20) |>
  knitr::kable() |>
  kableExtra::kable_classic(font_size=12, full_width=FALSE)
species_name year mfratio
Baiomys taylori 1990 0.3750000
Baiomys taylori 1991 0.3157895
Baiomys taylori 1992 0.5000000
Chaetodipus baileyi 1996 0.9500000
Chaetodipus baileyi 1997 1.3272727
Chaetodipus baileyi 1998 0.8947368
Chaetodipus baileyi 1999 0.4462366
Chaetodipus baileyi 2000 0.8461538
Chaetodipus baileyi 2001 0.7818792
Chaetodipus baileyi 2002 0.5927928
Chaetodipus penicillatus 1977 0.7500000
Chaetodipus penicillatus 1978 1.1818182
Chaetodipus penicillatus 1979 0.4285714
Chaetodipus penicillatus 1980 0.2142857
Chaetodipus penicillatus 1981 0.4285714
Chaetodipus penicillatus 1982 0.2400000
Chaetodipus penicillatus 1983 0.2800000
Chaetodipus penicillatus 1984 1.2727273
Chaetodipus penicillatus 1985 0.6666667
Chaetodipus penicillatus 1986 0.2500000
Number of observations per species per year
obspspy <- observations |>
  group_by(species_id, year) |>
  summarise(obs = n()) |>
  mutate(totobs = sum(obs)) |>
  filter(totobs > 300) |>
  inner_join(species) |>
  ungroup()

Make a graph of this

p <- ggplot(obspspy, mapping=aes(x=year, y=obs, color=species_name)) +
  geom_line()
print(p)

# Or a better one
p1 <- ggplot(obspspy, mapping=aes(x=year, y=obs)) +
  geom_line() +
  facet_wrap(vars(species_name)) +
  scale_y_log10()
print(p1)

Make a table with numbers of observations with species in columns and years in rows.

obspspy |>
  select(species_name, year, obs) |>
  pivot_wider(names_from = species_name, values_from = obs) |>
  arrange(year) |>
  knitr::kable() |>
  kableExtra::kable_classic(font_size=12)
year Amphispiza bilineata Ammospermophilus harrisi Dipodomys merriami Dipodomys ordii Dipodomys spectabilis Neotoma albigula Onychomys leucogaster Onychomys torridus Chaetodipus baileyi Peromyscus eremicus Perognathus flavus Peromyscus maniculatus Chaetodipus penicillatus Reithrodontomys megalotis
1977 NA NA 264 12 98 31 10 17 NA 6 31 NA 7 2
1978 NA 1 389 22 320 48 47 66 NA 17 40 3 24 3
1979 NA NA 209 31 204 30 46 68 NA 17 19 6 20 6
1980 5 NA 493 89 346 57 85 96 NA 20 95 1 17 46
1981 7 NA 559 123 354 63 55 65 NA 43 59 16 20 29
1982 34 1 609 117 354 111 83 94 NA 107 152 72 32 154
1983 41 2 528 167 280 98 65 86 NA 38 114 42 32 127
1984 12 2 396 108 76 64 71 49 NA 20 7 12 26 60
1985 14 14 667 225 98 45 94 62 NA 51 1 17 35 72
1986 5 19 406 115 88 60 43 25 NA 26 2 34 26 53
1987 35 41 469 213 104 128 79 43 NA 142 4 133 45 189
1988 39 26 365 144 54 102 51 86 NA 190 2 54 53 211
1989 31 30 321 119 33 67 57 102 NA 177 13 27 45 398
1990 27 22 462 171 17 29 25 77 NA 75 51 NA 34 231
1991 15 21 404 122 11 30 36 107 NA 65 62 NA 88 307
1992 10 16 307 94 18 15 45 42 NA 41 57 NA 131 158
1993 9 15 253 29 18 24 35 22 NA 13 70 NA 100 103
1994 7 19 293 25 9 10 21 21 NA 3 73 NA 74 40
1995 4 36 436 58 NA 8 29 38 7 14 158 8 277 81
1996 NA 25 492 174 NA 7 13 108 39 36 330 50 298 90
1997 NA 37 576 253 6 48 8 258 259 57 186 271 325 158
1998 2 33 503 111 9 32 NA 164 329 24 26 103 208 13
1999 NA 14 348 84 7 20 NA 105 272 7 NA 44 167 28
2000 3 12 233 91 NA 29 NA 154 555 15 NA 2 381 15
2001 1 23 305 81 NA 48 NA 167 538 35 27 3 273 15
2002 2 28 309 249 NA 48 8 127 892 60 18 1 385 20