The janitor package by Sam Firke contains probably my FAVOURITE R function: clean_names()
. By default when I am reading data into R, I pipe clean_names()
onto the end of my read_csv()
. I never have to look at inconsistently formatted variable names. But janitor package includes lots of other useful functions that make it easier to deal with dirty data and count stuff.
new_df <- read_csv(here("data", "df.csv") %>%
clean_names())
Are you keen to dig into the little known functions of a package that you use all the time? Here is a tip: in console type the name of the package with a double colon (i.e. janitor::), all the functions in the package will pop up and you can explore them by scrolling up and down the list.
Alternatively you can load the package in the console and then use ls("package:packgename")
to get a list of all the objects in the package.
library(janitor)
ls("package:janitor")
## [1] "%>%" "add_totals_col" "add_totals_row"
## [4] "adorn_crosstab" "adorn_ns" "adorn_pct_formatting"
## [7] "adorn_percentages" "adorn_rounding" "adorn_title"
## [10] "adorn_totals" "as_tabyl" "chisq.test"
## [13] "clean_names" "compare_df_cols" "compare_df_cols_same"
## [16] "convert_to_date" "convert_to_datetime" "convert_to_NA"
## [19] "crosstab" "describe_class" "excel_numeric_to_date"
## [22] "fisher.test" "get_dupes" "make_clean_names"
## [25] "remove_constant" "remove_empty" "remove_empty_cols"
## [28] "remove_empty_rows" "round_half_up" "round_to_fraction"
## [31] "row_to_names" "signif_half_up" "tabyl"
## [34] "top_levels" "untabyl" "use_first_valid_of"
Lets try a few these functions.
The penguin data isn’t very dirty out of the package, but I added some funky things for illustrative purposes.
dirty <- read_csv("penguin_raw_dirty.csv")
The penguin variable names are not great. A nasty mix of capital and little letters, gaps and brackets- not fun to type over and over again.
names(dirty)
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Empty Column"
## [10] "Date Egg" "Culmen Length (mm)" "Culmen Depth (mm)"
## [13] "Flipper Length (mm)" "Body Mass (g)" "Sex"
## [16] "Delta 15 N (o/oo)" "Delta 13 C (o/oo)" "Comments"
clean_names()
will take all the variable names and make them lower case and replace gaps/brackets with underscores.
clean <- dirty %>%
clean_names()
names(clean)
## [1] "study_name" "sample_number" "species"
## [4] "region" "island" "stage"
## [7] "individual_id" "clutch_completion" "empty_column"
## [10] "date_egg" "culmen_length_mm" "culmen_depth_mm"
## [13] "flipper_length_mm" "body_mass_g" "sex"
## [16] "delta_15_n_o_oo" "delta_13_c_o_oo" "comments"
Sometimes dirty data contains whole rows or columns that are empty. You can quickly remove them with remove_empty()
. By default it is a “quiet” function, but specify quiet = FALSE and it will give you a little feedback about what it has done.
empty <- clean %>%
remove_empty(which = c("rows", "cols"), quiet = FALSE)
## Removing 1 empty rows of 347 rows total (0.288%).
## Removing 1 empty columns of 18 columns total (Removed: empty_column).
Lets imagine an RA made a mistake and entered the data for a couple of penguins twice. get_dupes()
will tell you if there are duplicate entries in your dataset.
empty %>%
get_dupes(sample_number, species)
## # A tibble: 4 x 18
## sample_number species dupe_count study_name region island stage individual_id
## <dbl> <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 18 Adelie … 2 PAL0708 Anvers Torge… Adul… N9A2
## 2 18 Adelie … 2 PAL0708 Anvers Torge… Adul… N9A2
## 3 60 Chinstr… 2 PAL0910 Anvers Dream Adul… N95A2
## 4 60 Chinstr… 2 PAL0910 Anvers Dream Adul… N95A2
## # … with 10 more variables: clutch_completion <chr>, date_egg <chr>,
## # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## # comments <chr>
Then you can remove the duplicates with the distinct()
function from dplyr
. It only keeps distinct observations.
dupes_gone <- empty %>%
distinct()
dupes_gone %>%
get_dupes(sample_number, species)
## No duplicate combinations found of: sample_number, species
## # A tibble: 0 x 18
## # … with 18 variables: sample_number <dbl>, species <chr>, dupe_count <int>,
## # study_name <chr>, region <chr>, island <chr>, stage <chr>,
## # individual_id <chr>, clutch_completion <chr>, date_egg <chr>,
## # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## # comments <chr>
But perhaps the most surprisingly awesome function in the janitor package is tabyl()
. Counting things in R is surprisingly hard see post, but tabyl()
is a huge help.
Advantages…
kableExtra
and gt
dupes_gone %>%
tabyl(species)
## species n percent
## Adelie Penguin (Pygoscelis adeliae) 152 0.4418605
## Chinstrap penguin (Pygoscelis antarctica) 68 0.1976744
## Gentoo penguin (Pygoscelis papua) 124 0.3604651
dupes_gone %>%
tabyl(species, sex)
## species . FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 0 73 73 6
## Chinstrap penguin (Pygoscelis antarctica) 0 34 34 0
## Gentoo penguin (Pygoscelis papua) 1 58 61 4
Hmmmm why is there a “.” column? Turns out for one penguin their sex is entered as “.” instead of NA.
Use na_if()
from dplyr
to convert pesky values to NA.
dupes_gone$sex <- na_if(dupes_gone$sex, ".")
dupes_gone %>%
tabyl(species, sex)
## species FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 73 73 6
## Chinstrap penguin (Pygoscelis antarctica) 34 34 0
## Gentoo penguin (Pygoscelis papua) 58 61 5
dupes_gone %>%
tabyl(species, sex, island)
## $Biscoe
## species FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 22 22 0
## Chinstrap penguin (Pygoscelis antarctica) 0 0 0
## Gentoo penguin (Pygoscelis papua) 58 61 5
##
## $Dream
## species FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 27 28 1
## Chinstrap penguin (Pygoscelis antarctica) 34 34 0
## Gentoo penguin (Pygoscelis papua) 0 0 0
##
## $Torgersen
## species FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 24 23 5
## Chinstrap penguin (Pygoscelis antarctica) 0 0 0
## Gentoo penguin (Pygoscelis papua) 0 0 0
dupes_gone %>%
tabyl(species, sex) %>%
adorn_totals(c("row", "col"))
## species FEMALE MALE NA_ Total
## Adelie Penguin (Pygoscelis adeliae) 73 73 6 152
## Chinstrap penguin (Pygoscelis antarctica) 34 34 0 68
## Gentoo penguin (Pygoscelis papua) 58 61 5 124
## Total 165 168 11 344
dupes_gone %>%
tabyl(species, sex) %>%
adorn_percentages("row")
## species FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 0.4802632 0.4802632 0.03947368
## Chinstrap penguin (Pygoscelis antarctica) 0.5000000 0.5000000 0.00000000
## Gentoo penguin (Pygoscelis papua) 0.4677419 0.4919355 0.04032258
dupes_gone %>%
tabyl(species, sex) %>%
adorn_percentages("row") %>%
adorn_pct_formatting()
## species FEMALE MALE NA_
## Adelie Penguin (Pygoscelis adeliae) 48.0% 48.0% 3.9%
## Chinstrap penguin (Pygoscelis antarctica) 50.0% 50.0% 0.0%
## Gentoo penguin (Pygoscelis papua) 46.8% 49.2% 4.0%
The nice thing is that the output of tabyl()
can be assigned as a dataframe object in your environment OR you can pipe on a kable()
…
dupes_gone %>%
tabyl(species, sex) %>%
adorn_percentages("row") %>%
adorn_pct_formatting() %>%
kable()
species | FEMALE | MALE | NA\_ |
---|---|---|---|
Adelie Penguin (Pygoscelis adeliae) | 48.0% | 48.0% | 3.9% |
Chinstrap penguin (Pygoscelis antarctica) | 50.0% | 50.0% | 0.0% |
Gentoo penguin (Pygoscelis papua) | 46.8% | 49.2% | 4.0% |
… or a gt()
to get a really nicely formatted summary table
dupes_gone %>%
tabyl(species, sex) %>%
adorn_percentages("row") %>%
adorn_pct_formatting() %>%
gt()
species | FEMALE | MALE | NA_ |
---|---|---|---|
Adelie Penguin (Pygoscelis adeliae) | 48.0% | 48.0% | 3.9% |
Chinstrap penguin (Pygoscelis antarctica) | 50.0% | 50.0% | 0.0% |
Gentoo penguin (Pygoscelis papua) | 46.8% | 49.2% | 4.0% |