Tidy Data Preparation

Package Dependencies

The core packages for tidy data preparation are listed below:
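The package list itself is not reproduced here. As a minimal sketch, the block below loads the packages whose functions appear in this section; the exact set is an assumption, and all three are taken to be installed already.

```r
# Assumed package set: the packages whose functions are used in this section.
library(tibble)  # tibble(), as_tibble(), add_column(), add_row(), rownames_to_column()
library(tidyr)   # drop_na(), fill(), replace_na(), gather(), spread(), unite(), separate()
library(dplyr)   # %>%, mutate(), arrange(), rename(), select(), bind_rows(), bind_cols()
```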

Of these, the tibble and tidyr packages are core to data consistency and preparation.[1]

Creating tibble Data

The tibble package provides a new data class for storing tabular data, the tibble. Tibbles inherit from the data.frame class but improve three behaviors:

  • Subsetting – Subsetting with [ always returns a new tibble, maintaining data consistency
  • No partial matching – Subsetting requires full column or variable names, which RStudio’s autocomplete feature makes easy to supply
  • Display – Prints a more concise view of data that fits on one screen. Use the glimpse() or View() functions for alternative data views

Creating tibble objects of class tbl_df is simple enough:
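A minimal sketch of building a tibble with tibble(); the data and the column names (id, name, score) are hypothetical and chosen only for illustration.

```r
library(tibble)

# Hypothetical example data: three people and their scores.
scores <- tibble(
  id    = 1:3,
  name  = c("Ana", "Ben", "Cy"),
  score = c(90, 85, 77)
)

# A tibble carries the tbl_df class on top of data.frame.
class(scores)

# Printing shows the dimensions and the type of each column.
scores
```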

Packages in the tidyverse will also create tibble objects by default. The same object is created below using dplyr functions and the pipe operator:
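One way this might look, assuming dplyr is installed; the column names and values are hypothetical. Each dplyr verb in the pipeline receives a tibble and returns a tibble.

```r
library(tibble)
library(dplyr)

# Start from a one-column tibble and add the remaining columns with mutate().
scores_piped <- tibble(id = 1:3) %>%
  mutate(
    name  = c("Ana", "Ben", "Cy"),   # hypothetical values
    score = c(90, 85, 77)
  )

# The result of the pipeline is still a tibble.
class(scores_piped)
```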

You can also use the following options() command to control the default appearance of tibble data:
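A sketch of the kind of call meant here, using the print options documented by the tibble package (tibble.print_max, tibble.print_min, tibble.width). The specific values are arbitrary choices, and newer releases may also expose equivalent pillar.* options.

```r
library(tibble)

options(
  tibble.print_max = 20,  # if a tibble has more than 20 rows ...
  tibble.print_min = 5,   # ... print only the first 5
  tibble.width     = Inf  # always print all columns, regardless of screen width
)
```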

Finally, you can use as.data.frame() to coerce tibble data back to a standard data.frame.
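A short sketch of the round trip, using a hypothetical two-column tibble:

```r
library(tibble)

scores <- tibble(id = 1:3, score = c(90, 85, 77))  # hypothetical data

# Coerce back to a base data.frame; the tbl_df and tbl classes are dropped.
scores_df <- as.data.frame(scores)
class(scores_df)
```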

Manage Missing Data

One of the most basic needs in creating tidy data is handling missing values. The first problem arises when missing values are not represented by R's standard missing value, NA. Instead, they might be represented as a coded value like 999 or as a character value like “*”. Converting the coded values to NA is straightforward:
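One possible approach, sketched with dplyr's na_if(); the original may have used a different method, and the data frame and its columns (temp, grade) are hypothetical.

```r
library(tibble)
library(dplyr)

# Hypothetical raw data: 999 and "*" both stand for "missing".
raw <- tibble(
  temp  = c(21.5, 999, 19.0),
  grade = c("A", "*", "B")
)

# na_if(x, y) replaces every element of x that equals y with NA.
clean <- raw %>%
  mutate(
    temp  = na_if(temp, 999),
    grade = na_if(grade, "*")
  )

clean
```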

Once all missing values have the standard NA value expected in R, the drop_na() function can be used to drop rows containing missing values:
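A minimal sketch with hypothetical data. Called without column arguments, drop_na() checks every column; naming columns restricts the check to those columns.

```r
library(tibble)
library(tidyr)

df <- tibble(x = c(1, NA, 3), y = c("a", "b", NA))  # hypothetical data

# Drop rows with an NA in any column: only the first row remains.
complete_rows <- drop_na(df)

# Drop rows with an NA in x only: rows 1 and 3 remain.
complete_x <- drop_na(df, x)
```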

Meanwhile, the fill() function fills in NAs with the nearest non-missing value in the column; the fill can proceed downward from the previous value (the default) or upward from the next value:
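A sketch of both directions, using a hypothetical table in which the year is recorded only on some rows:

```r
library(tibble)
library(tidyr)

df <- tibble(
  quarter = c("Q1", "Q2", "Q3", "Q4"),
  year    = c(2020, NA, NA, 2021)   # hypothetical data
)

# Default direction is "down": each NA takes the last value seen above it.
filled_down <- fill(df, year)

# With .direction = "up", each NA takes the next value found below it.
filled_up <- fill(df, year, .direction = "up")
```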

The replace_na() function replaces NAs (and NULL values in list-columns) with a value specified for each column:
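A minimal sketch with hypothetical data. For a data frame, the replacement values are supplied as a named list with one entry per column.

```r
library(tibble)
library(tidyr)

df <- tibble(x = c(1, NA, 3), y = c("a", NA, "c"))  # hypothetical data

# Replace NA in x with 0 and NA in y with "unknown".
filled <- replace_na(df, list(x = 0, y = "unknown"))
filled
```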

Finally, it can be helpful to impute missing values by replacing NAs with the mean or median column value using the replace() function:
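One way to sketch this is base R's replace() inside dplyr's mutate(); the column names are hypothetical. Note that na.rm = TRUE is needed so the mean and median are computed from the observed values only.

```r
library(tibble)
library(dplyr)

df <- tibble(x = c(2, NA, 4, 10))  # hypothetical data

imputed <- df %>%
  mutate(
    # replace(vector, positions, values): fill the NA positions with the column mean
    x_mean   = replace(x, is.na(x), mean(x, na.rm = TRUE)),
    # same idea with the median, which is less sensitive to outliers
    x_median = replace(x, is.na(x), median(x, na.rm = TRUE))
  )

imputed
```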

Of course, missing values can also be imputed using other methods.

Reshaping Data

Reshaping data is a basic step in creating tidy data sets. The goal is to change the layout of the data so that every column is a variable and every row is an observation.

To this end, the gather() function creates “long data” by moving column names into a “key” column and gathering the column values into a single “value” column:
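A minimal sketch with hypothetical data in which the year values sit in the column names. Newer tidyr releases mark gather() as superseded by pivot_longer(), but gather() remains available.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per year.
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21)
)

# Column names go into "year", cell values go into "cases".
long <- gather(wide, key = "year", value = "cases", `2019`, `2020`)
long
```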

Long data may seem odd to those used to spreadsheets, but it is well suited for functional programming and data visualizations in R.

The spread() function creates “wide data” by moving the unique values of the “key” column into column names and spreading the values of a “value” column across the new columns:
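A minimal sketch with hypothetical long data. Newer tidyr releases mark spread() as superseded by pivot_wider(), but spread() remains available.

```r
library(tibble)
library(tidyr)

# Hypothetical long data: one row per country and year.
long <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c("2019", "2020", "2019", "2020"),
  cases   = c(10, 11, 20, 21)
)

# Unique values of "year" become column names; "cases" fills the cells.
wide <- spread(long, key = year, value = cases)
wide
```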

Wide data is more typical of what is found in spreadsheets. However, when column names are values of a variable rather than variable names, the data is not tidy and requires reshaping.

The unite() function serves to combine columns into one.  This is particularly helpful when one variable has been split into many and when trying to trim down column count:
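A short sketch, assuming a hypothetical table where a date has been split across three columns:

```r
library(tibble)
library(tidyr)

df <- tibble(year = c(2020L, 2021L), month = c(1L, 12L), day = c(15L, 31L))

# Paste year, month and day into a single "date" column, separated by "-".
# The input columns are removed by default (remove = TRUE).
united <- unite(df, col = "date", year, month, day, sep = "-")
united
```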

The separate() function, as the name implies, separates each cell in a column into several columns.  This is useful when “messy” data combines variables.  In the example below, a single variable is split to create two variables:
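A sketch with a hypothetical rate column that stores cases and population together as "cases/population":

```r
library(tibble)
library(tidyr)

df <- tibble(rate = c("745/19987071", "37737/172006362"))  # hypothetical data

# Split on "/" into two new columns; convert = TRUE turns the pieces into numbers.
split_df <- separate(df, rate, into = c("cases", "population"),
                     sep = "/", convert = TRUE)
split_df
```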

Other reshaping functions are listed below:

  • add_column() – A convenient way to add one or more columns to an existing data frame.
  • add_row() – A convenient way to add one or more rows of data to an existing data frame.
  • arrange() – Order rows by values in a column from low to high. Example: arrange(mtcars, mpg)
  • arrange(desc()) – Order rows by values in a column from high to low. Example: arrange(mtcars, desc(mpg))
  • bind_rows() – Bind many data.frames together by row. Similar to do.call(rbind, dfs)
  • bind_cols() – Bind many data.frames together by column. Similar to do.call(cbind, dfs)
  • rename() – Rename the columns of a data.frame. Example: rename(tbl, y = year)
  • rownames_to_column() – Especially helpful for creating a new column from row names. Other functions detect, remove and manage row names.
  • select() – Defines which columns to keep and their order. The functions c() and : can be used inside select(). Several other helper functions exist for use within select() only.
  • separate_rows() – Separate each cell in a column to make several rows. Example: separate_rows(x, rate)
  • set_tidy_names() – Ensures data input has non-missing and unique names (duplicated names get a suffix of the format ..# where # is the position in the vector).
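Several of the functions listed above can be chained together. The sketch below uses the built-in mtcars data set plus two small hypothetical tibbles; the new column names (model, cylinders) are illustrative choices.

```r
library(tibble)
library(tidyr)
library(dplyr)

cars <- mtcars %>%
  rownames_to_column(var = "model") %>%  # row names become a "model" column
  as_tibble() %>%
  select(model, mpg, cyl) %>%            # keep three columns, in this order
  rename(cylinders = cyl) %>%            # new_name = old_name
  arrange(desc(mpg))                     # highest mpg first

head(cars, 3)

# separate_rows(): one row per delimited value in the "rate" column.
rates <- tibble(x = c("a", "b"), rate = c("1,2", "3"))  # hypothetical data
rates_long <- separate_rows(rates, rate)

# bind_rows(): stack tibbles that share column names.
stacked <- bind_rows(tibble(id = 1), tibble(id = 2))
```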



  1. Kirill Mueller is the creator of the tibble package, along with Hadley Wickham as author and Romain Francois as collaborator. Hadley Wickham is the creator and author of the tidyr package, along with Lionel Henry as author.