Split-Apply-Combine Techniques

Introduction

Many data projects follow a split-apply-combine strategy: a big dataset is split into manageable pieces, a function is applied to each piece, and the results are combined to put the pieces back together. The same three steps recur in many analysis projects.

Knowing when the split-apply-combine strategy applies, and the most efficient way to carry it out, can be very useful. Data can be split in many ways, both structurally and with logical operators. Applying complex functions and combining results across multiple data chunks also need to be done thoughtfully, so that machine resources are used well and code is easier to read and maintain. This awareness has given rise to new R packages, which extend the capabilities of base R and make the language terser and better suited to the needs of split-apply-combine. The innovations are material: they are bending and elevating the language of R to solve an essential problem.

Package Dependencies and Data

The core packages of the “tidyverse” include ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats. Of these, purrr and its apply-style functions have extended R’s functional programming tools for split-apply-combine strategies.[1]

Meanwhile, the repurrrsive package provides data sets useful for understanding the functions in purrr.[2]
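As a minimal sketch of the setup, assuming both packages are already installed from CRAN, the two dependencies can be loaded and one of the repurrrsive data sets inspected as follows. The choice of got_chars here is only illustrative.

```r
# Load the two packages this section relies on
library(purrr)
library(repurrrsive)

# got_chars is a list with one element per character
n_chars <- length(got_chars)

# Each element is itself a named list of fields
fields <- names(got_chars[[1]])

n_chars
fields
```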

map() Functions

The map() functions in purrr play the same role as the apply functions of base R. They provide a replacement for loops and abstract code away from the details of the underlying data structure. The case for an alternative is not that loops are slow, but that loop code can be fragile and prone to “off-by-one” errors because of its incidental book-keeping. Complex loops can also be hard to read and maintain.[3]

Fortunately, there are alternatives to loops for many applications.  Here is an example of the map() function that uses Edgar Anderson’s iris data.  The iris data is first split by species, then a basic regression model is applied to each species, after which result data is  extracted and combined:
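The three steps can be sketched as follows. The model formula (Sepal.Length regressed on Sepal.Width) and the extracted statistic (R-squared) are assumptions chosen for illustration; any per-species model and summary value would follow the same pattern.

```r
library(purrr)

# Split: one data frame per species
by_species <- split(iris, iris$Species)

# Apply: fit the same simple regression to each piece
models <- map(by_species, function(df) lm(Sepal.Length ~ Sepal.Width, data = df))

# Combine: extract R-squared from each model into one named numeric vector
r_squared <- map_dbl(models, function(m) summary(m)$r.squared)

r_squared
```

The result is a named vector with one value per species, so the split labels carry through to the combined output.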

map() functions come in different variations based on need and given different input/output requirements:

| Function | Description |
| --- | --- |
| map_if(.x, .p, .f, ...) | Applies .f only to those elements of .x for which the predicate .p is true. |
| map_at(.x, .at, .f, ...) | Applies .f only to the elements selected by .at, given as names or integer positions. |
| map2(.x, .y, .f, ...); map3(.x, .y, .z, .f, ...); map_n(.x, .y, .z, ..., .f, ...) | map2() applies a function to pairs of elements from two lists or vectors. map3() does the same with three parallel objects, while map_n() is the generalized case for more than three objects. Both date from early purrr releases; in current versions pmap() covers these cases. Example: map2(x, y, sum) |
| pmap(.l, .f, ...) | Applies .f to groups of elements from a list of vectors. Example: pmap(list(x, y, z), sum, na.rm = TRUE) |
| lmap(.x, .f, ...) | Applies .f to each list-element of a list or vector. |
| imap(.x, .f, ...) | Applies .f to each element of a list or vector and its index. |
| invoke(); invoke_map() | invoke() is a wrapper around do.call() that makes it easy to use in a pipe. invoke_map() makes it easier to call lists of functions with lists of parameters. |
| map_lgl(), map_int(), map_dbl(), map_chr() | map(), map2(), pmap(), imap() and invoke_map() each return a list. Use a suffixed version to return the results as a specific type of flat vector, e.g. map2_chr(), pmap_lgl(), etc. |

map_chr: character vector
map_dbl: double (numeric) vector
map_dfc: data frame (column bind)
map_dfr: data frame (row bind)
map_int: integer vector
map_lgl: logical vector
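A short sketch of how the suffix changes the return type, using a small made-up list of words (the data is an assumption for illustration only):

```r
library(purrr)

words <- list("split", "apply", "combine")

# map() always returns a list
lens_list <- map(words, nchar)

# Suffixed variants return a flat vector of the stated type
lens_int  <- map_int(words, nchar)
upper_chr <- map_chr(words, toupper)
half_dbl  <- map_dbl(words, function(w) nchar(w) / 2)
long_lgl  <- map_lgl(words, function(w) nchar(w) > 5)
```

The suffixed variants raise an error if .f returns a value that does not fit the requested type, which makes type mismatches visible early.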

Base R versus map()

The following snippet compares map() with base R functions to show how purrr simplifies the syntax. The data used in the example is got_chars, a large and complex list with data on the point-of-view characters of the “A Song of Ice and Fire” books, the source of the TV series “Game of Thrones.”
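One way to make the comparison concrete, assuming repurrrsive is installed and that each element of got_chars has a name field, is to extract every character name with both approaches:

```r
library(purrr)
library(repurrrsive)

# Base R: an anonymous function plus an explicit return template for vapply()
names_base <- vapply(got_chars, function(x) x[["name"]], character(1))

# purrr: a string is shorthand for "extract the element with this name"
names_purrr <- map_chr(got_chars, "name")

# Both approaches should give the same character vector
identical(names_base, names_purrr)
```

The purrr version drops the anonymous function and encodes the return type in the function name rather than in a template argument.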

map_if(), map2(), invoke()

The following code chunks provide simple examples of the other map() functions profiled:
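The sketch below uses small made-up inputs to illustrate each of the three functions in turn. Note that invoke() is deprecated as of purrr 1.0.0, so it may emit a deprecation warning in recent versions; it is shown here only because the section covers it.

```r
library(purrr)

# map_if(): double only the numeric elements; others pass through unchanged
mixed <- list(1, "a", 3)
doubled <- map_if(mixed, is.numeric, function(x) x * 2)

# map2(): walk two vectors in parallel and add each pair
x <- c(1, 2, 3)
y <- c(10, 20, 30)
pair_sums <- map2_dbl(x, y, function(a, b) a + b)

# invoke(): call a function with a list of arguments, like do.call()
joined <- invoke(paste, list("a", "b"), sep = "-")
```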

Modifying Function Behavior

Applying the split-apply-combine strategy sometimes requires special handling, such as if..else contingencies or error processing. The list below defines functions in the purrr package that modify the behavior of other functions, for example to return enhanced output or to keep messages, warnings and errors from interrupting the code:

| Function | Description |
| --- | --- |
| compose() | Combine multiple functions into a single function; by default they are applied from right to left. |
| lift(), lift_dl(), lift_dv(), lift_ld(), lift_lv(), lift_vd(), lift_vl() | Reconfigure functions by lifting their input from one kind to another. The inputs can be changed from and to a list (l), a vector (v) and dots (d). For example, lift_ld(fun) transforms a function taking a list into a function taking dots. |
| negate() | Negate a predicate function that returns a single TRUE or FALSE. |
| partial() | Create a version of a function that has some arguments preset to values. |
| safely() | Modify a function to return a list with the result and the error, so that execution does not stop when something goes wrong. |
| quietly() | Modify a function to return a list with the result, output, messages and warnings. |
| possibly() | Modify a function to return a default value whenever an error occurs, instead of stopping with the error. |
| rerun() | Rerun an expression n times and return a list. This is a good way of generating sample data. |
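A brief sketch of three of these function modifiers, using made-up inputs. The helper noisy_double() is hypothetical and exists only to produce a warning for quietly() to capture.

```r
library(purrr)

# partial(): preset na.rm = TRUE so missing values are always ignored
mean_no_na <- partial(mean, na.rm = TRUE)
avg <- mean_no_na(c(1, NA, 3))

# negate(): turn is.null() into "is not NULL" and use it as a filter
present <- keep(list(1, NULL, 2), negate(is.null))

# quietly(): capture a warning instead of printing it
noisy_double <- function(x) {
  warning("careful")
  x * 2
}
captured <- quietly(noisy_double)(5)
```

Here captured is a list whose result element holds the function output and whose warnings element holds the captured warning text.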

compose(), safely(), possibly()

Where invoke_map() calls each function in a list separately on the same data, compose() chains functions together into a single new function. The functions are passed as arguments and, by default, run from right to left, so each function receives the output of the one to its right:
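A minimal sketch with two base R functions, chosen only for illustration:

```r
library(purrr)

# Right to left: sqrt() runs first, then round() is applied to its output
round_sqrt <- compose(round, sqrt)
backward <- round_sqrt(10)

# The .dir argument reverses the reading order without changing the result
sqrt_then_round <- compose(sqrt, round, .dir = "forward")
forward <- sqrt_then_round(10)
```

Both calls are equivalent to round(sqrt(10)).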

If you have a function that sometimes throws an error, use safely(). safely() takes a function f() and returns a new function, safe_f(), which always returns a list with the elements result and error. When the call succeeds, result holds the output of f() and error is NULL; when it fails, result is NULL and error holds the captured error. For functions that emit warnings or messages rather than errors, quietly() plays a similar role.
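As a small sketch, log() stands in for f() here; the inputs are made up to trigger one success and one failure:

```r
library(purrr)

safe_log <- safely(log)

# Success: result is filled in, error is NULL
ok <- safe_log(10)

# Failure: log() cannot handle a string, so result is NULL and error is captured
bad <- safe_log("ten")

ok$result
conditionMessage(bad$error)
```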

The following example keeps a map() function going in case of an error.
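One way to do this is with possibly(), sketched here on a made-up input list in which the second element would normally stop the whole map() call:

```r
library(purrr)

inputs <- list(1, "ten", 100)

# possibly() substitutes NA for any element on which log() fails,
# so the remaining elements are still processed
logs <- map_dbl(inputs, possibly(log, otherwise = NA_real_))

logs
```

Choosing a typed default such as NA_real_ keeps the output compatible with map_dbl().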

Working with Lists

The last section on apply functions with purrr focuses on handling and transforming data in lists. Many original data objects are structured as lists, notably spatial data objects and data sourced from the internet (in JSON format, among others). The reference material for this section was prepared by RStudio.[4]
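Separately from that reference material, the sketch below shows a few purrr helpers that are commonly used on nested lists. The record object is hypothetical and is shaped like the result of parsing a small JSON document.

```r
library(purrr)

# Hypothetical record, shaped like parsed JSON
record <- list(
  id = 42L,
  tags = list("a", "b"),
  owner = list(name = "Ana", email = NULL)
)

# pluck(): reach into nested levels by name or position
owner_name <- pluck(record, "owner", "name")
first_tag <- pluck(record, "tags", 1)

# compact(): drop NULL or empty elements
owner_clean <- compact(record$owner)

# keep(): retain only the elements that satisfy a predicate
scalars <- keep(record, function(el) length(el) == 1)
```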

  1. Lionel Henry and Hadley Wickham are the creators and authors of the purrr package.
  2. Jennifer Bryan is the creator and author of the repurrrsive package, and Charlotte Wickham is listed as an important collaborator.
  3. Special note: map() functions assume that each piece of data will be processed only once and independently of all other pieces. This means that you cannot use these tools when each iteration requires overlapping data (like a running mean) or depends on the previous iteration (as in recursion routines or dynamic simulation). Loops are still most appropriate for these tasks.
  4. RStudio, Apply Functions with purrr :: Cheat Sheet, 2018.