It is definitely true that it takes much longer to get your data ready for analysis than it does to actually analyse it. Apparently up to 80% of the data analysis time is spent wrangling data (and cursing and swearing).
Did you know up to 80% of data analysis is spent on the process of cleaning and preparing data? - cf. Wickham, 2014 and Dasu and Johnson, 2003
— Miguel Á. Armengol (@miguearmengol) September 10, 2018
So here is an excellent approach to data wrangling in #rstats https://t.co/tqogwNSSjN
Here is another great wrangling resource, this time by Bradley Boehmke.
And if you need a rationale for why it is a good idea to acquire some wrangling skills, here is a quote from Jenny Bryan:
“Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth”
tidyr and dplyr
In addition to gather() and spread(), the tidyr package can also separate(), i.e. pull parts of a single variable apart into separate columns, and unite(), i.e. combine several columns into one.
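Here is a minimal sketch of separate() and unite(). The `scores` data frame and its column names are made up for illustration, and I am assuming the two parts of the variable are joined by an underscore.

```r
library(tidyr)

# Hypothetical data: `id` holds both a site code and a participant number
scores <- data.frame(
  id = c("A_1", "A_2", "B_1"),
  score = c(10, 12, 9),
  stringsAsFactors = FALSE
)

# separate(): split `id` at the underscore into two new columns
split_scores <- separate(scores, id, into = c("site", "participant"), sep = "_")

# unite(): paste the two columns back together into a single `id` column
rejoined <- unite(split_scores, "id", site, participant, sep = "_")
```

Note that separate() returns the new columns as character by default, so `participant` is text here rather than a number.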
When using filter() from dplyr, specify group membership using %in%. Also, distinct() will remove duplicate rows, and slice(3:5) will subset rows by position (here, rows 3 to 5).
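The three functions can be sketched like this. The `trials` data frame and the group labels are invented for the example; rows 2 and 3 are deliberately identical so that distinct() has something to remove.

```r
library(dplyr)

# Hypothetical data with one fully duplicated row (rows 2 and 3)
trials <- data.frame(
  id = c(1, 2, 2, 3, 4, 5),
  group = c("control", "drug_a", "drug_a", "drug_b", "placebo", "drug_a"),
  stringsAsFactors = FALSE
)

# Keep rows whose group matches any value in the vector
drug_rows <- filter(trials, group %in% c("drug_a", "drug_b"))

# Drop rows that are exact duplicates of an earlier row
unique_rows <- distinct(trials)

# Keep rows 3 to 5, selected by position rather than by value
middle_rows <- slice(trials, 3:5)
```

Using %in% rather than == matters when you match against more than one value: == would recycle the vector and compare element by element, which is rarely what you want.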
When using dplyr summarise(), sometimes you want to count the number of participants, but n() will give you the number of observations (rows) in each group. If each participant contributes more than one row, n_distinct() applied to the participant ID column counts the unique participants instead.
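To see the difference between the two counts, here is a small sketch with repeated measures. The `obs` data frame and the column names `participant`, `condition` and `rt` are assumptions made up for the example.

```r
library(dplyr)

# Hypothetical repeated-measures data: p1 and p2 each contribute two rows
obs <- data.frame(
  participant = c("p1", "p1", "p2", "p2", "p3"),
  condition = c("a", "a", "a", "a", "b"),
  rt = c(510, 495, 620, 600, 580),
  stringsAsFactors = FALSE
)

counts <- obs %>%
  group_by(condition) %>%
  summarise(
    n_obs = n(),                              # rows in each group
    n_participants = n_distinct(participant)  # unique participants in each group
  )
```

In condition "a" there are four observations but only two participants, so `n_obs` and `n_participants` disagree there, which is exactly the case where n() alone would mislead you.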