Principles of Tidy Data

Introduction to Tidy Data

Despite the enormous amount of data available, there is surprisingly little agreement or guidance on how to create clean, consistent, and easy-to-use data.

The way people work with data and code benefits from a few simple principles that support reproducible research and results. The “tidy” approach to data requires that:

  • Data is structured consistently and is reusable;
  • Code flow relies on simple function calls using the pipe;
  • Functional programming is embraced;
  • Code is written for human readability.

Data Consistency

Standard data object types promote data consistency in R. Above all, avoid custom data classes. For example, two data classes are sufficient for most data analysis:

  • Rectangular data in R is best managed using a data.frame formatted as a tibble. The tbl_df class, or tibble, is a data.frame that provides stricter checking and better print formatting than the traditional data.frame.
  • Geospatial data sets are best managed using sp object classes. The sp data classes include spatial points, lines, polygons, and grids, all of which contain 2D or 3D location information.

Both data classes will cover the majority of analysis needs.
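
As a small illustration of the stricter checking mentioned above, the sketch below compares a traditional data.frame with the same data converted to a tibble. It assumes the tibble package is installed; the object names cars_df and cars_tbl and the toy values are made up for the example.

```r
library(tibble)

# A traditional data.frame and the same data converted to a tibble.
# (cars_df and cars_tbl are illustrative names, not from the text.)
cars_df  <- data.frame(model = c("a", "b"), mpg = c(21.0, 22.8))
cars_tbl <- as_tibble(cars_df)

# A tibble is still a data.frame, with the tbl_df class layered on top.
class(cars_tbl)
inherits(cars_tbl, "data.frame")

# Stricter checking: `$` on a data.frame partially matches column names,
# so the misspelled `mp` silently returns the `mpg` column.
cars_df$mp

# A tibble does not partially match: the same call returns NULL and warns.
suppressWarnings(cars_tbl$mp)
```

Silent partial matching can hide typos in column names, which is one reason the tibble behavior is often preferred for reusable code.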

Code Flow


Writing code that is simple, efficient, and readable sounds easy to achieve, but these goals often pull against each other. For example, the nested expression f(g(x)) is compact and efficient, but it may not be simple or readable, especially when each function is complex or involves nested loops and iteration.

In response, the infix operator %>%, known as “the pipe”, was created. Coding with the pipe decomposes nested functions into sequential steps, so that f(g(x)) becomes x %>% g() %>% f(). Code flow is simplified using pipes, and each step is often clearer.
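
The equivalence between the nested and piped forms can be checked with a small sketch. It assumes the magrittr package, which provides %>%; the vector x and the choice of sqrt(), mean(), and round() are purely illustrative.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested form: evaluated and read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: each result is passed as the first argument of the next call,
# so the steps read left to right.
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value.
identical(nested, piped)
```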

Example of a nested function call: The following code is a simple nested function call written in standard R call syntax:
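
A minimal sketch of such a nested call is shown here. It assumes the dplyr verbs filter(), group_by(), summarize(), and arrange(), applied to the built-in mtcars data; the names avg_mpg and nested_result are illustrative.

```r
library(dplyr)

# Nested call: the innermost function runs first, so the code reads inside out.
nested_result <- arrange(
  summarize(
    group_by(
      filter(mtcars, carb > 1),  # keep cars with a carburetor value above 1
      cyl                        # group the remaining cars by cylinder count
    ),
    avg_mpg = mean(mpg)          # average miles per gallon for each group
  ),
  desc(avg_mpg)                  # sort the group means, largest first
)

nested_result
```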

The function reads as follows:

  • use automobile data (mtcars),
  • filter cars with a carburetor value greater than 1,
  • then group the cars by cylinder count,
  • summarize each car group by average miles per gallon, and
  • sort the group means in descending order (largest to smallest).

The example function is simple in nature and the code chunk is efficient. However, the nested code is not simple to read, complicating broad collaboration and code maintenance.

Example of pipe operator: The same functionality is presented below using the pipe operator %>%:
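
A minimal sketch of the piped version, under the same assumptions (dplyr verbs on the built-in mtcars data, with the illustrative names avg_mpg and piped_result):

```r
library(dplyr)

# Piped call: each line is one step, read from top to bottom.
piped_result <- mtcars %>%
  filter(carb > 1) %>%                  # keep cars with a carburetor value above 1
  group_by(cyl) %>%                     # group by cylinder count
  summarize(avg_mpg = mean(mpg)) %>%    # average miles per gallon per group
  arrange(desc(avg_mpg))                # sort the group means, largest first

piped_result
```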

The code is shorter and easier to read. In summary, the pipe is a valuable tool for chaining function calls.

Functional Programming

The pipe operator supports a functional style of programming in R. However, functional programming has other elements:

  • Create functions in support of object-oriented programming, and use function semantics to make code simpler and easier to read;
  • Use R’s indexing or vector operations to abstract away from for() and while() loops. The apply() family of functions in base R and the various map() functions in the purrr package are the core tools for this purpose.
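
The second point can be sketched in base R alone. The example below replaces an explicit loop with a vector operation and with sapply(); the variable names are illustrative.

```r
values <- c(1, 2, 3, 4)

# Loop version: allocate the result, then fill it one element at a time.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized version: the arithmetic applies to every element at once.
squares_vec <- values^2

# Both approaches give the same result.
identical(squares_loop, squares_vec)

# sapply() applies a function to each column without an explicit loop.
col_means <- sapply(mtcars[c("mpg", "wt")], mean)
col_means
```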

For example, the following code shows a classic “split-apply-combine” strategy that avoids the use of loops.  First, the data is split into groups.  Next, a linear model is applied to each group and summary stats estimated.  Finally, the results are combined for each group.
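
A minimal sketch of this strategy follows. It assumes the purrr package, which also makes %>% available. The grouping variable (cyl), the model formula (mpg ~ wt), the choice of R-squared as the summary statistic, and the name r_squared are illustrative assumptions rather than details given in the text.

```r
library(purrr)

r_squared <- mtcars %>%
  split(.$cyl) %>%                        # split: one data frame per cylinder count
  map(~ lm(mpg ~ wt, data = .x)) %>%      # apply: fit a linear model to each group
  map(summary) %>%                        # apply: compute summary statistics
  map_dbl("r.squared")                    # combine: one R-squared value per group

r_squared
```

The result is a named numeric vector with one element per group, so no loop or manual bookkeeping of group labels is needed.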

The example shows functional coding that is compact and easy to read, while avoiding the use of for() or while() loops. Vectorized operations are also often faster than explicit loops, which matters when working with large data sets, although apply() and map() calls mainly improve clarity rather than speed.

Human Readability

Code is a form of literacy. Code is for people to read, not just for machines to execute. For instance:

  • Comments in code are almost always helpful;
  • Variable names must be clear. Most important, avoid acronyms, even if the resulting names are lengthy;
  • Functional programming with pipes is usually easier to read than nested functions or iteration based on loops.
