cleaning penguins with the janitor package

The janitor package by Sam Firke contains probably my FAVOURITE R function: clean_names(). By default when I am reading data into R, I pipe clean_names() onto the end of my read_csv(). I never have to look at inconsistently formatted variable names. But janitor package includes lots of other useful functions that make it easier to deal with dirty data and count stuff.

new_df <- read_csv(here("data", "df.csv") %>%
        clean_names())
        

Exploring package functions

Are you keen to dig into the little known functions of a package that you use all the time? Here is a tip: in console type the name of the package with a double colon (i.e. janitor::), all the functions in the package will pop up and you can explore them by scrolling up and down the list.

Alternatively you can load the package in the console and then use ls("package:packgename") to get a list of all the objects in the package.

library(janitor)

ls("package:janitor")
##  [1] "%>%"                   "add_totals_col"        "add_totals_row"       
##  [4] "adorn_crosstab"        "adorn_ns"              "adorn_pct_formatting" 
##  [7] "adorn_percentages"     "adorn_rounding"        "adorn_title"          
## [10] "adorn_totals"          "as_tabyl"              "chisq.test"           
## [13] "clean_names"           "compare_df_cols"       "compare_df_cols_same" 
## [16] "convert_to_date"       "convert_to_datetime"   "convert_to_NA"        
## [19] "crosstab"              "describe_class"        "excel_numeric_to_date"
## [22] "fisher.test"           "get_dupes"             "make_clean_names"     
## [25] "remove_constant"       "remove_empty"          "remove_empty_cols"    
## [28] "remove_empty_rows"     "round_half_up"         "round_to_fraction"    
## [31] "row_to_names"          "signif_half_up"        "tabyl"                
## [34] "top_levels"            "untabyl"               "use_first_valid_of"

Lets try a few these functions.

read some dirty data

The penguin data isn’t very dirty out of the package, but I added some funky things for illustrative purposes.

dirty <- read_csv("penguin_raw_dirty.csv")

1. clean_names()

The penguin variable names are not great. A nasty mix of capital and little letters, gaps and brackets- not fun to type over and over again.

names(dirty)
##  [1] "studyName"           "Sample Number"       "Species"            
##  [4] "Region"              "Island"              "Stage"              
##  [7] "Individual ID"       "Clutch Completion"   "Empty Column"       
## [10] "Date Egg"            "Culmen Length (mm)"  "Culmen Depth (mm)"  
## [13] "Flipper Length (mm)" "Body Mass (g)"       "Sex"                
## [16] "Delta 15 N (o/oo)"   "Delta 13 C (o/oo)"   "Comments"

clean_names() will take all the variable names and make them lower case and replace gaps/brackets with underscores.

clean <- dirty %>%
  clean_names()

names(clean)
##  [1] "study_name"        "sample_number"     "species"          
##  [4] "region"            "island"            "stage"            
##  [7] "individual_id"     "clutch_completion" "empty_column"     
## [10] "date_egg"          "culmen_length_mm"  "culmen_depth_mm"  
## [13] "flipper_length_mm" "body_mass_g"       "sex"              
## [16] "delta_15_n_o_oo"   "delta_13_c_o_oo"   "comments"

2. remove_empty()

Sometimes dirty data contains whole rows or columns that are empty. You can quickly remove them with remove_empty(). By default it is a “quiet” function, but specify quiet = FALSE and it will give you a little feedback about what it has done.

empty <- clean %>%
  remove_empty(which = c("rows", "cols"), quiet = FALSE)
## Removing 1 empty rows of 347 rows total (0.288%).
## Removing 1 empty columns of 18 columns total (Removed: empty_column).

3. get dupes()

Lets imagine an RA made a mistake and entered the data for a couple of penguins twice. get_dupes() will tell you if there are duplicate entries in your dataset.

empty %>%
  get_dupes(sample_number, species)
## # A tibble: 4 x 18
##   sample_number species dupe_count study_name region island stage individual_id
##           <dbl> <chr>        <int> <chr>      <chr>  <chr>  <chr> <chr>        
## 1            18 Adelie…          2 PAL0708    Anvers Torge… Adul… N9A2         
## 2            18 Adelie…          2 PAL0708    Anvers Torge… Adul… N9A2         
## 3            60 Chinst…          2 PAL0910    Anvers Dream  Adul… N95A2        
## 4            60 Chinst…          2 PAL0910    Anvers Dream  Adul… N95A2        
## # … with 10 more variables: clutch_completion <chr>, date_egg <chr>,
## #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## #   comments <chr>

Then you can remove the duplicates with the distinct() function from dplyr. It only keeps distinct observations.

dupes_gone <- empty %>%
  distinct()

dupes_gone %>%
  get_dupes(sample_number, species)
## No duplicate combinations found of: sample_number, species
## # A tibble: 0 x 18
## # … with 18 variables: sample_number <dbl>, species <chr>, dupe_count <int>,
## #   study_name <chr>, region <chr>, island <chr>, stage <chr>,
## #   individual_id <chr>, clutch_completion <chr>, date_egg <chr>,
## #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## #   comments <chr>

4. tabyl()

But perhaps the most surprisingly awesome function in the janitor package is tabyl(). Counting things in R is surprisingly hard see post, but tabyl() is a huge help.

Advantages…

  • works with %>%
  • takes a dataframe
  • outputs a dataframe
  • is compatible with other table packages like kableExtra and gt

one variable (gets you %)

dupes_gone %>%
  tabyl(species) 
##                                    species   n   percent
##        Adelie Penguin (Pygoscelis adeliae) 152 0.4418605
##  Chinstrap penguin (Pygoscelis antarctica)  68 0.1976744
##          Gentoo penguin (Pygoscelis papua) 124 0.3604651

two variables

dupes_gone %>%
  tabyl(species, sex)
##                                    species . FEMALE MALE NA_
##        Adelie Penguin (Pygoscelis adeliae) 0     73   73   6
##  Chinstrap penguin (Pygoscelis antarctica) 0     34   34   0
##          Gentoo penguin (Pygoscelis papua) 1     58   61   4

Hmmmm why is there a “.” column? Turns out for one penguin their sex is entered as “.” instead of NA.

Use na_if() from dplyr to convert pesky values to NA.

dupes_gone$sex <-  na_if(dupes_gone$sex, ".")

dupes_gone %>%
  tabyl(species, sex)
##                                    species FEMALE MALE NA_
##        Adelie Penguin (Pygoscelis adeliae)     73   73   6
##  Chinstrap penguin (Pygoscelis antarctica)     34   34   0
##          Gentoo penguin (Pygoscelis papua)     58   61   5

three variables

dupes_gone %>%
  tabyl(species, sex, island) 
## $Biscoe
##                                    species FEMALE MALE NA_
##        Adelie Penguin (Pygoscelis adeliae)     22   22   0
##  Chinstrap penguin (Pygoscelis antarctica)      0    0   0
##          Gentoo penguin (Pygoscelis papua)     58   61   5
## 
## $Dream
##                                    species FEMALE MALE NA_
##        Adelie Penguin (Pygoscelis adeliae)     27   28   1
##  Chinstrap penguin (Pygoscelis antarctica)     34   34   0
##          Gentoo penguin (Pygoscelis papua)      0    0   0
## 
## $Torgersen
##                                    species FEMALE MALE NA_
##        Adelie Penguin (Pygoscelis adeliae)     24   23   5
##  Chinstrap penguin (Pygoscelis antarctica)      0    0   0
##          Gentoo penguin (Pygoscelis papua)      0    0   0

adorn_ ing things

totals by row and/or col

dupes_gone %>%
  tabyl(species, sex) %>%
  adorn_totals(c("row", "col"))
##                                    species FEMALE MALE NA_ Total
##        Adelie Penguin (Pygoscelis adeliae)     73   73   6   152
##  Chinstrap penguin (Pygoscelis antarctica)     34   34   0    68
##          Gentoo penguin (Pygoscelis papua)     58   61   5   124
##                                      Total    165  168  11   344

percentages

dupes_gone %>%
  tabyl(species, sex) %>%
  adorn_percentages("row")
##                                    species    FEMALE      MALE        NA_
##        Adelie Penguin (Pygoscelis adeliae) 0.4802632 0.4802632 0.03947368
##  Chinstrap penguin (Pygoscelis antarctica) 0.5000000 0.5000000 0.00000000
##          Gentoo penguin (Pygoscelis papua) 0.4677419 0.4919355 0.04032258

percent formatting

  dupes_gone %>%
  tabyl(species, sex) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting()
##                                    species FEMALE  MALE  NA_
##        Adelie Penguin (Pygoscelis adeliae)  48.0% 48.0% 3.9%
##  Chinstrap penguin (Pygoscelis antarctica)  50.0% 50.0% 0.0%
##          Gentoo penguin (Pygoscelis papua)  46.8% 49.2% 4.0%

5. tabyl + other nice tables (kableExtra, gt)

The nice thing is that the output of tabyl() can be assigned as a dataframe object in your environment OR you can pipe on a kable()

dupes_gone %>%
  tabyl(species, sex) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  kable() 
species FEMALE MALE NA_
Adelie Penguin (Pygoscelis adeliae) 48.0% 48.0% 3.9%
Chinstrap penguin (Pygoscelis antarctica) 50.0% 50.0% 0.0%
Gentoo penguin (Pygoscelis papua) 46.8% 49.2% 4.0%

… or a gt() to get a really nicely formatted summary table

dupes_gone %>%
  tabyl(species, sex) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  gt()
species FEMALE MALE NA_
Adelie Penguin (Pygoscelis adeliae) 48.0% 48.0% 3.9%
Chinstrap penguin (Pygoscelis antarctica) 50.0% 50.0% 0.0%
Gentoo penguin (Pygoscelis papua) 46.8% 49.2% 4.0%