Sometimes things that are really easy to do in Excel are not so intuitive in R. Like counting things. Most of the time I am working with data in long format, where you can end up with hundreds of observations, so functions like length() aren’t useful: they count rows, not participants. Today I just wanted to check how many participants were in this dataset and it took me some significant googling.
library(tidyverse)
library(ggbeeswarm)
library(janitor)
# Toy dataset: 16 participants, one row each.
# The shorter vectors (group, delay, condition) are recycled to fill 16 rows.
df <- data.frame("pp_no" = 1:16,
                 "group" = c("control", "control", "control", "control", "exp", "exp", "exp", "exp"),
                 "delay" = c("short", "long"),
                 "condition" = c("easy", "easy", "difficult", "difficult"),
                 "score" = c(82, 75, 76, 72, 86, 89, 85, 87, 87, 76, 78, 85, 97, 87, 94, 87))
Having data in long format makes it difficult to count things because values repeat. What you really want is to count how many distinct values there are. My first instinct was to use the distinct() function from dplyr, but it selects the distinct rows without counting them.
It is the n_distinct() function that gives you a count of the distinct values in a variable.
n_distinct(df$pp_no)
## [1] 16
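The toy df above has only one row per participant, so here is a minimal sketch of the difference on data that actually repeats. The long_df data frame and its trial column are made up for illustration (4 participants, 3 trials each); they are not part of my dataset.

```r
library(dplyr)

# Hypothetical long-format data: 4 participants x 3 trials = 12 rows
long_df <- data.frame(
  pp_no = rep(1:4, each = 3),
  trial = rep(1:3, times = 4),
  score = c(82, 75, 76, 72, 86, 89, 85, 87, 87, 76, 78, 85)
)

length(long_df$pp_no)           # counts rows (12), not participants
distinct(long_df, pp_no)        # returns a data frame with one row per participant
n_distinct(long_df$pp_no)       # returns the number of participants (4)
nrow(distinct(long_df, pp_no))  # same answer as n_distinct(), the long way round
```

So distinct() gets you the unique rows, and n_distinct() skips straight to the number you wanted.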
To count the number of participants in each group, you need to combine group_by() and summarise() with n_distinct(), like this:
df %>%
  group_by(group) %>%
  summarise(pp_count = n_distinct(pp_no))
## # A tibble: 2 x 2
## group pp_count
## <chr> <int>
## 1 control 8
## 2 exp 8
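Another way to get the same per-group participant count, as far as I can tell, is to drop the duplicate participant rows with distinct() first and then count what is left. This is a sketch of that alternative rather than a better method; the data frame is rebuilt here (with rep() instead of recycling, and without the condition column) only so the snippet runs on its own, and pp_per_group is just a name I picked.

```r
library(dplyr)

# Same toy data as above, rebuilt so this snippet is self-contained
df <- data.frame(
  pp_no = 1:16,
  group = rep(c("control", "exp"), each = 4, times = 2),
  delay = rep(c("short", "long"), times = 8),
  score = c(82, 75, 76, 72, 86, 89, 85, 87, 87, 76, 78, 85, 97, 87, 94, 87)
)

pp_per_group <- df %>%
  distinct(group, pp_no) %>%        # keep one row per participant within each group
  count(group, name = "pp_count")   # then count the rows left in each group

pp_per_group
```

With real long data, the distinct() step is what stops a participant with many rows from being counted many times.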
The other counting task I do a lot is counting observations by group (or another categorical variable). Although it takes a few lines of code, combining group_by() and summarise() is useful because you create a df that combines both the count and other summary stats.
df %>%
  group_by(delay) %>%
  summarise(count = n(), mean_score = mean(score))
## # A tibble: 2 x 3
## delay count mean_score
## <chr> <int> <dbl>
## 1 long 8 82.2
## 2 short 8 85.6
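If the count is all you need and you don't want the other summary stats, dplyr's count() does the group_by() plus summarise(n = n()) step in one call. A minimal sketch, with the toy data rebuilt so it runs on its own (delay_counts and cell_counts are names I made up):

```r
library(dplyr)

# Same toy data as above, rebuilt so this snippet is self-contained
df <- data.frame(
  pp_no = 1:16,
  group = rep(c("control", "exp"), each = 4, times = 2),
  delay = rep(c("short", "long"), times = 8),
  score = c(82, 75, 76, 72, 86, 89, 85, 87, 87, 76, 78, 85, 97, 87, 94, 87)
)

# One row per level of delay, with the number of observations in a column called n
delay_counts <- df %>% count(delay)
delay_counts

# count() also takes several variables, giving one row per combination
cell_counts <- df %>% count(group, delay)
cell_counts
```

I would still reach for group_by() and summarise() when I want the mean alongside the count.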
If you just want a fast count, table() will count the observations at each level of a categorical variable.
table(df$delay)
##
## long short
## 8 8
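table() also takes more than one variable, which gives a quick cross-tabulation. A small sketch on the same toy data (rebuilt here so it runs on its own; group_by_delay is just an illustrative name):

```r
# Same toy data as above, rebuilt so this snippet is self-contained
df <- data.frame(
  pp_no = 1:16,
  group = rep(c("control", "exp"), each = 4, times = 2),
  delay = rep(c("short", "long"), times = 8),
  score = c(82, 75, 76, 72, 86, 89, 85, 87, 87, 76, 78, 85, 97, 87, 94, 87)
)

# Rows are groups, columns are delays; each cell is a count of observations.
# Using with() keeps the variable names as labels on the table's dimensions.
group_by_delay <- with(df, table(group, delay))
group_by_delay
```

Every cell comes out the same here because the toy data is perfectly balanced.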
When things are less evenly distributed, janitor::tabyl() is useful because it gives the proportion of observations (the percent column) as well as n.
janitor::tabyl(df$delay)
## df$delay n percent
## long 8 0.5
## short 8 0.5
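The toy data is perfectly balanced, so the proportions above are not very interesting. Here is a sketch of a less even case: I filter to scores of 85 or more (an arbitrary cut-off chosen only for illustration) and pipe the result into tabyl(), then use janitor's adorn_pct_formatting() to turn the proportions into readable percentages. The names high_scores and high_scores_pct are made up.

```r
library(dplyr)
library(janitor)

# Same toy data as above, rebuilt so this snippet is self-contained
df <- data.frame(
  pp_no = 1:16,
  group = rep(c("control", "exp"), each = 4, times = 2),
  delay = rep(c("short", "long"), times = 8),
  score = c(82, 75, 76, 72, 86, 89, 85, 87, 87, 76, 78, 85, 97, 87, 94, 87)
)

# Count high scorers per group; the cut-off of 85 is an assumption for this example
high_scores <- df %>%
  filter(score >= 85) %>%
  tabyl(group)
high_scores

# Same table with the percent column formatted as percentages
high_scores_pct <- high_scores %>% adorn_pct_formatting(digits = 1)
high_scores_pct
```

Passing the data frame and a bare column name to tabyl() also gives a tidier column header than df$delay.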