Thanks to Tom Mock for saving me from this blogpost on twitter. I was monster number 6… beginning to convince myself that there were gremlins in the across()
function.
In fact, I had forgotten the golden rule:
don’t make objects that have the same names a functions
That is for another post…
I have been playing with the new across()
function in dplyr
which is a magical solution when you are looking to get several summary statistics across a range of variables. BUT…. working out how to get those summary stats nicely formatted in a table is a bit of a challenge.
library(tidyverse)
library(palmerpenguins)
library(skimr)
library(tableone)
library(gt)
Here I am using data from the palmer penguins package.
penguins <- penguins
df <- tibble(penguin_bits = c("bill", "flipper"),
mean = c(43.92, 200.92),
sd = c(5.46, 14.06),
min = c(32.1, 172),
max = c(59.6, 231))
df %>%
gt()
penguin_bits | mean | sd | min | max |
---|---|---|---|---|
bill | 43.92 | 5.46 | 32.1 | 59.6 |
flipper | 200.92 | 14.06 | 172.0 | 231.0 |
I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. But this manual dataframe construction is a bit silly and not at all reproducable.
Is there an easy way to get summary stats in this format so they are compatible with the gt package?
penguins %>%
summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
bill_sd = sd(bill_length_mm, na.rm = TRUE),
bill_min = min(bill_length_mm, na.rm = TRUE),
bill_max = max(bill_length_mm, na.rm = TRUE),
flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
flipper_min = min(flipper_length_mm, na.rm = TRUE),
flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
gt()
bill_mean | bill_sd | bill_min | bill_max | flipper_mean | flipper_sd | flipper_min | flipper_max |
---|---|---|---|---|---|---|---|
43.92193 | 5.459584 | 32.1 | 59.6 | 200.9152 | 14.06171 | 172 | 231 |
The summarise function spits out summary stats in a SUPER wide format.
across()
The new across()
function gets you the same thing in many fewer lines of code, but still, it is VERY wide.
penguins %>%
summarise(across(contains("length"),
list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
gt()
bill_length_mm_mean | bill_length_mm_sd | bill_length_mm_min | bill_length_mm_max | flipper_length_mm_mean | flipper_length_mm_sd | flipper_length_mm_min | flipper_length_mm_max |
---|---|---|---|---|---|---|---|
43.92193 | 5.459584 | 32.1 | 59.6 | 200.9152 | 14.06171 | 172 | 231 |
The skimr package is nice because it is tidyverse compatible (i.e. you can select with contains) AND the output can be a dataframe, which you can then edit.
skimtable <- penguins %>%
select(contains("length")) %>%
skim()
skimtable
Name | Piped data |
Number of rows | 344 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Table 1: Data summary
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
For my purpose it is more than I need though. There is some selecting and renaming to do after the fact.
skimtable_renamed <- skimtable %>%
select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100)
You do end up with a gt compatible dataframe.
skimtable_renamed %>%
gt()
penguin_bits | mean | sd | min | max |
---|---|---|---|---|
bill_length_mm | 43.92193 | 5.459584 | 32.1 | 59.6 |
flipper_length_mm | 200.91520 | 14.061714 | 172.0 | 231.0 |
The TableOne package gives you mean and SD, is there a way to add other summary stats (like min & max to TableOne?) And make the formatted nicer?
variables <- c("bill_length_mm", "flipper_length_mm")
CreateTableOne(vars = variables, data = penguins)
##
## Overall
## n 344
## bill_length_mm (mean (SD)) 43.92 (5.46)
## flipper_length_mm (mean (SD)) 200.92 (14.06)
Ideally, I want to be able to use summarise()
and across()
and somehow make the wide output long. The problem is that pivot_longer()
will take more than 1 “names_to” argument, but not more than a single “values_to” argument.
I would like to be able to pivot wide summary stats long like this…
penguins %>%
summarise(across(contains("length"),
list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")
… but not sure if that is possible.