Thanks to Tom Mock for saving me from this blogpost on twitter. I was monster number 6… beginning to convince myself that there were gremlins in the
In fact, I had forgotten the golden rule:
don’t make objects that have the same names a functions
That is for another post…
I have been playing with the new
across() function in
dplyr which is a magical solution when you are looking to get several summary statistics across a range of variables. BUT…. working out how to get those summary stats nicely formatted in a table is a bit of a challenge.
library(tidyverse) library(palmerpenguins) library(skimr) library(tableone) library(gt)
Here I am using data from the palmer penguins package.
penguins <- penguins
df <- tibble(penguin_bits = c("bill", "flipper"), mean = c(43.92, 200.92), sd = c(5.46, 14.06), min = c(32.1, 172), max = c(59.6, 231)) df %>% gt()
I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. But this manual dataframe construction is a bit silly and not at all reproducable.
Is there an easy way to get summary stats in this format so they are compatible with the gt package?
penguins %>% summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE), bill_sd = sd(bill_length_mm, na.rm = TRUE), bill_min = min(bill_length_mm, na.rm = TRUE), bill_max = max(bill_length_mm, na.rm = TRUE), flipper_mean = mean(flipper_length_mm, na.rm = TRUE), flipper_sd = sd(flipper_length_mm, na.rm = TRUE), flipper_min = min(flipper_length_mm, na.rm = TRUE), flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>% gt()
The summarise function spits out summary stats in a SUPER wide format.
across() function gets you the same thing in many fewer lines of code, but still, it is VERY wide.
penguins %>% summarise(across(contains("length"), list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>% gt()
The skimr package is nice because it is tidyverse compatible (i.e. you can select with contains) AND the output can be a dataframe, which you can then edit.
skimtable <- penguins %>% select(contains("length")) %>% skim() skimtable
|Number of rows||344|
|Number of columns||2|
|Column type frequency:|
Table 1: Data summary
Variable type: numeric
For my purpose it is more than I need though. There is some selecting and renaming to do after the fact.
skimtable_renamed <- skimtable %>% select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>% rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100)
You do end up with a gt compatible dataframe.
skimtable_renamed %>% gt()
The TableOne package gives you mean and SD, is there a way to add other summary stats (like min & max to TableOne?) And make the formatted nicer?
variables <- c("bill_length_mm", "flipper_length_mm") CreateTableOne(vars = variables, data = penguins)
## ## Overall ## n 344 ## bill_length_mm (mean (SD)) 43.92 (5.46) ## flipper_length_mm (mean (SD)) 200.92 (14.06)
Ideally, I want to be able to use
across() and somehow make the wide output long. The problem is that
pivot_longer() will take more than 1 “names_to” argument, but not more than a single “values_to” argument.
I would like to be able to pivot wide summary stats long like this…
penguins %>% summarise(across(contains("length"), list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>% pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")
… but not sure if that is possible.