Tidy Data Transformations

Package Dependencies

The core packages for tidy data transformations are listed below:

The dplyr package is by far the most important of the packages in the “tidyverse” for data transformation and manipulation.1  Verb-based functions are one of the advantages of the package.  The syntax is much easier to use when compared to the cryptic syntax of base R.

Example Data

Basic data transformations will be explored using the nycflights13::flights dataset supplied by the US Bureau of Transportation Statistics. The data contains all 336,776 flights that departed from New York City in 2013.  The data can be obtained as follows:

Subset Observations (Rows)

The filter() function is used to subset the rows in a data.frame. The first argument is the tibble data (or data frame).  The subsequent arguments are logical expressions that refer to variables within the data.frame, selecting rows where the logicals are TRUE.

There are a number of functions to subset observations:

[table id=79 /]

Subset Variables (Columns)

Typically, only a few columns of a large dataset are required for analysis.  The select() function narrows the variables to a useful subset and avoids the use of numeric variable positions when indexing:

Several helper functions exist for use within the select() function:

[table id=80 /]

The pull() function extracts a column data, but returns a vector.  This can be very helpful in many cases when tibbles are not desirable. For example:

Summarize Data

The summarise() function collapses data variables to a single row summary statistic:

summarise() uses summary functions, be it base, package or custom functions.  Some standard summary functions are listed below:

[table id=81 /]

Group Data

Perhaps the most significant data transformation is the ability to take a large dataset and to create sub-groups on which distinct data analysis is conducted.  In the following example, all planes leaving New York are grouped by destination, the summary number of planes and flights for each destinations is calculated and then the results are ordered by the highest number of flights.  The result for 105 destinations is obtained using the pipe operator and in the absence of loops:

In the next example, the dataset is split into individual planes and for each plane a set of summary stats is calculated, including the number of flights, the average distance traveled and the observed arrival delay. Results are filtered to focus on planes with at least 20 flights and traveling less than 2,000 miles.  The output covers almost 3,000 airplanes and is presented in a plot using the ggplot2 package.

Make New Variables

It is often necessary to add new columns that are functions of existing columns. This is the job of the mutate() function.

The mutate() function is similar to the transform() function in base R, but has the advanatage that you can refer to columns that you’ve just created:

If you only want to keep the new variables, use transmute():

There are  set of functions common to many database tools that can be used with mutate for enhanced data transformations and insight:

[table id=82 /]

Combine Data Sets

Currently dplyr supports four types of mutating joins and two types of filtering joins.

Mutating joins combine variables from two data.frames:

  • left_join(): return all rows from x, and all columns from a and b, as shown in the diagram below. Rows in a with no match in b will have NA values in the new columns. If there are multiple matches between a and b, all the matches are returned.
  • right_join(): return all rows from b, and all columns from a and b. Rows in b with no match in a will have NA values in the new columns. If there are multiple matches between a and b, all matches are returned.
  • inner_join(): return all rows from a where there are matching values in b, and all columns from a and b. If there are multiple matches between a and b, all matches are returned.
  • full_join(): return all rows and all columns from both a and b. Where there are not matching values, returns NA for the one missing.

Filtering joins keep cases from the left-hand data.frame:

  • semi_join(): return all rows from a where there are matching values in b, keeping just columns from a, as shown below. A semi join differs from an inner join because an inner join will return one row of a for each matching row of b, where a semi join will never duplicate rows of a.
  • anti_join(): return all rows from a where there are not matching values in b, keeping just columns from a.

Back | Next

  1. Hadley Wickham is the creator and author of the dplyr package along with Lionel Henry Romain Francois as co-authors