Getting your data ready for analysis usually takes much longer than actually analysing it. Apparently up to 80% of data analysis time is spent wrangling data (and cursing and swearing).
Did you know up to 80% of data analysis is spent on the process of cleaning and preparing data? - cf. Wickham, 2014 and Dasu and Johnson, 2003
So here is an excellent approach to data wrangling in #rstats https://t.co/tqogwNSSjN
(Miguel Á. Armengol, @miguearmengol, September 10, 2018)
And if you need a rationale for why it is a good idea to acquire some wrangling skills, here is a quote from Jenny Bryan:
“Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth”
In addition to gather() and spread(), the tidyr package can also be used to separate(), i.e. pull parts of a single variable apart into separate columns, and unite(), i.e. combine several columns into one.
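A minimal sketch of separate() and unite() follows. The data frame, its column names and the underscore separator are made up for illustration.

```r
library(tidyr)

# Hypothetical data: one column holds both condition and session
scores <- data.frame(
  id = c(1, 2, 3),
  cond_session = c("control_1", "control_2", "treatment_1"),
  score = c(10, 12, 15),
  stringsAsFactors = FALSE
)

# separate(): split one column into two at the underscore
scores_sep <- separate(scores, cond_session,
                       into = c("condition", "session"), sep = "_")

# unite(): paste the two columns back together into a single column
scores_uni <- unite(scores_sep, "cond_session", condition, session, sep = "_")
```

By default both functions drop the input columns, so scores_uni ends up with the same columns as the original scores.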
When using filter() from dplyr, specify group membership using %in%. Also, distinct() will remove duplicate rows, and slice(3:5) will keep rows 3 to 5 by position.
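Here is a small sketch of those three verbs on a toy data frame. The trials data and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical trial data; the second p1 row is an exact duplicate
trials <- data.frame(
  participant = c("p1", "p1", "p2", "p3", "p3", "p4"),
  group = c("a", "a", "b", "c", "c", "d"),
  rt = c(300, 300, 410, 520, 515, 390),
  stringsAsFactors = FALSE
)

# filter() with %in%: keep rows whose group is one of the listed values
ab_only <- filter(trials, group %in% c("a", "b"))

# distinct(): drop rows that are exact duplicates of an earlier row
no_dups <- distinct(trials)

# slice(): keep rows 3 to 5 by position
rows_3_to_5 <- slice(trials, 3:5)
```

Using %in% rather than == matters when you are matching against more than one value: == recycles the vector and compares element by element, which is rarely what you want.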
When using summarise() from dplyr, sometimes you want to count the number of participants, but n() will give you the number of observations (rows). The n_distinct() function counts unique values, so it is useful for counting participants when each participant contributes more than one row.
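The difference is easiest to see side by side. This is a sketch on made-up data in which each participant can have several rows.

```r
library(dplyr)

# Hypothetical long-format data: several observations per participant
obs <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2", "p3"),
  group = c("a", "a", "a", "a", "a", "b"),
  stringsAsFactors = FALSE
)

counts <- obs %>%
  group_by(group) %>%
  summarise(
    n_obs = n(),                              # number of rows in each group
    n_participants = n_distinct(participant)  # number of unique participants
  )
```

In this toy data group a has 5 observations from 2 participants, and group b has 1 observation from 1 participant.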