I have been doing lots of data wrangling recently and decided I needed a quick rundown of data cleaning in R. Here are notes on useful things I learned recently.
class() will let you know whether you are working with a data frame or not.
dim() tells you the dimensions of your data: how many rows and columns you have.
names() will print the column names.
str() (short for structure) and glimpse(), from dplyr, give you slightly different versions of the same thing: a list of variables, each with a data type and a preview of the data.
summary() will give you a “sometimes useful, sometimes not” summary of each variable. You get a distribution of numeric things and frequencies of categorical things (if they are stored as factors), but it doesn’t group_by like you would like it to.
skim(), from the skimr package, gives a slightly more meaningful summary that includes summary stats for each variable by data type and details about missing values, and it draws the cutest little histograms you have EVER seen.
head() and tail() will show you the top and bottom 6 rows. You can add an argument to specify how many rows you want to see, e.g. head(data, n = 12) would show the top 12 rows.
print() will show you all the data in the console, which could be useful depending on how big your data is.
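To see these side by side, here is a minimal sketch that runs the base R functions above on mtcars, a dataset that ships with R. I am using it only as a stand-in for your own data; glimpse() and skim() are left out so the example needs no extra packages.

```r
# mtcars is a built-in data frame, used here as a stand-in for real data
data(mtcars)

class(mtcars)          # "data.frame"
dim(mtcars)            # 32 rows, 11 columns
names(mtcars)          # the column names
str(mtcars)            # each variable with its type and a preview
summary(mtcars$mpg)    # min, quartiles, mean and max of one numeric variable
head(mtcars, n = 12)   # the top 12 rows
tail(mtcars, n = 3)    # the bottom 3 rows
```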
If skim() doesn’t do it for you, there are fast ways to visualise your data without diving head first into ggplot.
Histogram: to get an idea of the distribution of data in a particular variable, use hist(). The optional breaks argument specifies roughly how many buckets the data are broken into.
hist(dataset$variable1, breaks = 20)
Scatterplot: to plot the relationship between variable1 and variable2 from the dataset, use
plot(dataset$variable1, dataset$variable2)
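Here is a small runnable version of both plots, again using the built-in mtcars data as a stand-in (the axis labels are my own additions). One thing worth knowing: when breaks is a single number, R treats it as a suggestion and picks "pretty" break points, so you may not get exactly 20 buckets.

```r
# Histogram of one variable; keep the returned object to inspect the buckets
h <- hist(mtcars$mpg, breaks = 20,
          main = "Distribution of mpg", xlab = "Miles per gallon")
length(h$counts)   # how many buckets R actually used

# Scatterplot of two variables
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```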
How do we know data is not tidy? Often because column headers are values, not variable names.
Use gather(data, key, value, …)
data = the dataframe you want to morph from wide to long
key = the name of the new column that will hold the old column headers (the levels that the wide format spreads across many columns)
value = the name of the new column that will contain the values
… = columns to gather, or to leave out (use -column to gather all except that one)
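A minimal sketch of gather() on a made-up data frame (the wide table and its column names are hypothetical, just for illustration). Note that tidyr now marks gather() as superseded in favour of pivot_longer(), but it still works.

```r
library(tidyr)

# Hypothetical wide data: the column headers 2019 and 2020 are values of "year"
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21),
  check.names = FALSE   # keep the numeric-looking column names as they are
)

# Gather every column except country into key/value pairs
long <- gather(wide, key = "year", value = "cases", -country)
long
```

The result has one row per country and year, with the old column headers stored in the new year column.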
Use spread(data, key, value)
data = the dataframe you want to morph from long to wide
key = the name of the column whose values will become the new column names
value = the name of the column that contains the values
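And a sketch of spread() going the other way, from long to wide, on a hypothetical data frame of my own. As with gather(), tidyr now recommends pivot_wider() for new code, but spread() still works.

```r
library(tidyr)

# Hypothetical long data: one row per country and year
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c("2019", "2020", "2019", "2020"),
  cases   = c(10, 11, 20, 21)
)

# Each distinct value of year becomes its own column, filled from cases
wide <- spread(long, key = year, value = cases)
wide
```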
Use separate(data, col, into)
data = dataframe
col = name of column to separate
into = character vector of new column names [note: these need quotes]
separate(treatments, year_no, c("year", "month"))
By default, separate() splits wherever it finds a non-alphanumeric character, such as a space, period, forward slash or dash. You can give it an extra argument, sep = "-", to specify what to separate on.
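To make that runnable, here is a sketch with a tiny made-up treatments data frame (the patient column and the values are hypothetical). The new columns come out as character strings unless you ask for conversion.

```r
library(tidyr)

# Hypothetical data: year and month are stuck together in one column
treatments <- data.frame(
  patient = c("p1", "p2"),
  year_no = c("2010-10", "2012-03")
)

# Split year_no on the dash into two new columns
split_up <- separate(treatments, year_no, into = c("year", "month"), sep = "-")
split_up
```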
Use unite(data, col, …)
data = dataframe
col = name of the new column that will hold the united values
… = bare names of columns to unite
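A quick sketch of unite(), using a hypothetical data frame of my own. By default the pieces are joined with an underscore; the sep argument changes that, and the original columns are dropped from the result.

```r
library(tidyr)

# Hypothetical data with year and month in separate columns
dates <- data.frame(
  year  = c("2010", "2012"),
  month = c("10", "03")
)

# Paste year and month together into one new column called year_month
united <- unite(dates, year_month, year, month, sep = "-")
united
```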
1. Column headers are values, not variable names: fix it by using the gather() function to combine the many columns into one.
2. Variables are stored in both rows and columns: fix it by using the spread() function to spread the key and value pairs out into separate columns.
3. Multiple variables are stored in one column: fix it by using the separate() function to split a column into many.
Other problems:
4. A single observational unit is stored in multiple tables.
5. Multiple types of observational units are stored in the same table.
Changing the data type is called coercing.
as.character
as.numeric
as.integer
as.factor
as.logical
Use class() to learn what kind of data you are dealing with.
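A few coercions in action, as a rough sketch on made-up vectors. The factor case is the classic trap: as.numeric() on a factor returns the internal level codes, so go through as.character() first.

```r
x <- c("1", "2", "three")
class(x)                              # "character"

# Anything that cannot be converted becomes NA (R also warns about it)
nums <- suppressWarnings(as.numeric(x))
nums                                  # 1 2 NA

# Factors: converting directly gives the level codes, not the labels
f <- factor(c("10", "20", "10"))
as.numeric(f)                         # 1 2 1
as.numeric(as.character(f))           # 10 20 10

as.logical(c("TRUE", "false", "yes")) # TRUE FALSE NA
```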
Often R thinks that dates are strings, so functions from the lubridate package are useful for coercing them to date format.
e.g. ymd("2015-August-25") will parse a string and return a date in standard year-month-day format (2015-08-25)
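The function name tells lubridate the order of the pieces in your string, so there is a whole family: ymd(), mdy(), dmy() and so on. A small sketch with made-up date strings (parsing month names assumes an English locale):

```r
library(lubridate)

ymd("2015-August-25")    # year, month, day
mdy("August 25, 2015")   # month, day, year
d <- dmy("25/08/2015")   # day, month, year

class(d)                 # "Date"
format(d)                # "2015-08-25"
```

All three strings describe the same day, so each call should return the same date.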
str_trim(" string with lots of white space ")
trims away leading and trailing white space from character input
str_pad() adds padding to the left or right of the string.
e.g. this example takes an ID number and makes it 7 characters wide by padding the left side with 0s, result = 0024493
str_pad("24493", width = 7, side = "left", pad = "0")
str_detect(data, "stringname")
determines whether a particular string is present; returns TRUE or FALSE for each element
str_replace(data, "string1", "string2")
finds the first instance of string1 in each element and replaces it with string2 (use str_replace_all() to replace every instance)
tolower() and toupper() are functions from base R that convert strings to all lower case or all upper case.
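Putting the string helpers together: the str_ functions above come from the stringr package, and they work on whole vectors at once. A small sketch on made-up values (the vectors are hypothetical):

```r
library(stringr)

# Hypothetical messy ID numbers with stray spaces
ids <- c(" 24493", "1001 ", "777")

# Trim the white space, then pad to 7 characters with leading zeros
clean <- str_pad(str_trim(ids), width = 7, side = "left", pad = "0")
clean                                           # "0024493" "0001001" "0000777"

str_detect(c("apple pie", "banana"), "apple")   # TRUE FALSE
str_replace(c("F", "M", "F"), "F", "Female")    # "Female" "M" "Female"

toupper("mixed Case")                           # "MIXED CASE"
tolower("mixed Case")                           # "mixed case"
```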
Use is.na() to check for NAs; it will give TRUE or FALSE for each value.
is.na(dataframe)
Are there any NAs? Wrapping is.na() in any() will give TRUE if there are any NAs.
any(is.na(dataframe))
How many NAs? Use sum to count. This works because TRUE is represented as 1 and FALSE as 0.
sum(is.na(dataframe))
Don’t forget that summary() also counts how many NAs are in each variable, as does skim().
complete.cases(dataframe)
complete.cases() will give a TRUE or FALSE for each row, where TRUE means the row has no missing values. You can subset the data, keeping only complete cases, using…
dataframe[complete.cases(dataframe), ]
OR use na.omit(dataframe) to keep only NA-free data.
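Here is the whole missing-value routine on a tiny made-up data frame, so you can see what each step returns (df and its columns are hypothetical). I added colSums() as a handy way to count NAs per column.

```r
# Hypothetical data frame with two missing values
df <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA)
)

is.na(df)               # TRUE/FALSE for every value
any(is.na(df))          # TRUE
sum(is.na(df))          # 2
colSums(is.na(df))      # NAs per column: a = 1, b = 1

complete.cases(df)      # TRUE FALSE FALSE
kept <- df[complete.cases(df), ]
kept                    # only the first row survives
na.omit(df)             # same rows, different route
```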
Use boxplots, summary stats, and histograms to view outliers.
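A rough sketch of what that looks like on a made-up vector with one suspicious value. boxplot.stats() is a base R helper that returns the points a boxplot would draw as outliers (anything beyond 1.5 times the box length from the box).

```r
# Hypothetical measurements; 95 looks like a data entry error
x <- c(10, 12, 11, 13, 12, 11, 95)

summary(x)    # the max sits far away from the quartiles
boxplot(x)    # the outlier is drawn as a separate point
hist(x)       # one lonely bar far to the right

# Pull out the values the boxplot flags as outliers
outliers <- boxplot.stats(x)$out
outliers      # 95
```

Whether a flagged value is a real error or just an extreme observation is a judgment call, so look before you delete.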