Introduction
Many data projects involve a split-apply-combine strategy, where a big dataset is split into manageable pieces, a function is applied to operate on each piece and the results are then combined to put all the pieces back together. These are common actions that are repeated in many analysis projects.
Basic awareness of the split-apply-combine strategy, when it occurs and the most efficient way to proceed can be very useful. Data can be split many ways, both structurally and using logical operators. Applying complex functions and combining results across multiple data chunks also need to be done thoughtfully, so that machine resources are used well and code is easier to read and maintain. This awareness has given rise to new R packages, which extend the capabilities of base R and make R code more terse and better suited to the needs of split-apply-combine. The innovations are material: they are bending and elevating the language of R to solve an essential problem.
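As a minimal sketch of the three steps, the base R snippet below splits one column of the built-in mtcars data by cylinder count, applies mean() to each piece and combines the results into a named vector. The choice of mtcars and of the variable names is an assumption made purely for illustration.

```r
# Split-apply-combine with base R only, using the built-in mtcars data
pieces <- split(mtcars$mpg, mtcars$cyl)   # split: mpg values grouped by cylinder count
means  <- lapply(pieces, mean)            # apply: one mean per piece
result <- unlist(means)                   # combine: a named numeric vector

result
```

Each step is a separate, replaceable piece: a different grouping variable changes only the split, and a different summary function changes only the apply.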
Package Dependencies and Data
The “tidyverse” packages used here are loaded below. Of these, the map functions of purrr have extended R’s functional programming tools for split-apply-combine strategies.1
```r
# Data format/shape
library(tibble)     # simple data.frames
library(tidyr)      # data cleaning/reshaping

# Data transforms
library(dplyr)      # data transforming
library(forcats)    # data factor mgmt
library(lubridate)  # date/time objects
library(hms)        # time-of-day values
library(stringr)    # string mgmt

# Programming
library(purrr)      # functional programming tools
library(purrrlyr)   # intersection of purrr and dplyr
library(magrittr)   # pipe operators
```
Meanwhile, the repurrrsive package provides data sets useful for understanding the functions in purrr.2
```r
install.packages("repurrrsive")
library(repurrrsive)
```
map() Functions
The map() functions in purrr play the same role as the apply functions of base R. They provide a replacement for loops and abstract code development away from the details of the underlying data structure. An alternative to loops is worthwhile not because loops are slow, but because loop code can be fragile and prone to “off-by-one” errors in its book-keeping code. Complex loops can also be hard to read and maintain.3
Fortunately, there are alternatives to loops for many applications. Here is an example of the map() function that uses Edgar Anderson’s iris data. The iris data is first split by species, then a basic regression model is applied to each species, after which result data is extracted and combined:
```r
iris %>%
  split(.$Species) %>%
  map(~ lm(Sepal.Length ~ Petal.Length, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")

    setosa versicolor  virginica
0.07138289 0.56858983 0.74688439
```
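For contrast, the same per-species regression can be written as a for loop. The sketch below uses base R only, and the variable names are illustrative. It shows the book-keeping a loop needs (pre-allocating the output, indexing and naming the results) that the map() pipeline handles implicitly.

```r
# Loop version of the per-species regression, base R only
species <- levels(iris$Species)
r_squared <- numeric(length(species))     # pre-allocate the output vector
names(r_squared) <- species

for (i in seq_along(species)) {           # index book-keeping that map() avoids
  piece <- iris[iris$Species == species[i], ]
  fit <- lm(Sepal.Length ~ Petal.Length, data = piece)
  r_squared[i] <- summary(fit)$r.squared
}

r_squared
```

The loop produces the same three R-squared values, but the split, the apply step and the combine step are interleaved rather than stated one after another.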
map() functions come in different variations based on need and given different input/output requirements:
Function | Description |
---|---|
map_if(.x, .p, .f, ...) | map_if() only applies .f to those elements of the list where .p is true. |
map_at(.x, .at, .f, ...) | map_at() only applies .f to the elements selected by .at, given as a character vector of names or an integer vector of positions. |
map2(.x, .y, .f, ...) | map2() applies a function to pairs of elements from two lists or vectors, e.g. map2(x, y, sum). Early purrr releases also provided map3() for three parallel inputs and map_n() for the general case; current releases handle three or more inputs with pmap(). |
pmap(.l, .f, ...) | Apply .f to groups of elements from a list of vectors. pmap(list(x, y, z), sum, na.rm = TRUE) |
lmap(.x, .f, ...) | Apply .f to each list-element of a list or vector. |
imap(.x, .f, ...) | Apply .f to each element of a list or vector and its index |
invoke() invoke_map() | invoke() is a wrapper around do.call that makes it easy to use in a pipe. invoke_map() makes it easier to call lists of functions with lists of parameters. |
map_lgl(), map_int(), map_dbl(), map_chr(), map_dfr(), map_dfc() | map(), map2(), pmap(), imap() and invoke_map() each return a list. Use a suffixed version to return the results as a specific type of flat vector or as a data frame, e.g. map2_chr(), pmap_lgl(). Suffixes: _chr (character vector), _dbl (double vector), _int (integer vector), _lgl (logical vector), _dfc (data frame, column bind), _dfr (data frame, row bind). |
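A few of the variants in the table can be sketched on a small named list. The `scores` list and the result names are hypothetical and chosen only for illustration.

```r
library(purrr)

scores <- list(a = 1:3, b = 4:6, c = 7:9)

# map_at(): apply rev() only to the elements named "a" and "c"
reversed <- map_at(scores, c("a", "c"), rev)

# imap(): .x is the element, .y is its name (or its position if unnamed)
labels <- imap_chr(scores, ~ paste0(.y, ": ", sum(.x)))

# map2(): walk two vectors in parallel
products <- map2_dbl(1:3, 4:6, ~ .x * .y)

# pmap(): the same idea for any number of parallel inputs
totals <- pmap_dbl(list(1:3, 4:6, 7:9), sum)
```

The suffix on each call (`_chr`, `_dbl`) fixes the return type, so a result of the wrong type raises an error instead of silently producing a list.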
Base R versus map()
The following snippet compares map() with base R functions to show how purrr simplifies the syntax. The data used in the example is got_chars from the repurrrsive package, a large and complex list with data on characters from the TV series “Game of Thrones.”
```r
# base R: split list, extract and bind into a data.frame
l <- lapply(got_chars[23:25], extract, c("name", "playedBy"))
mat <- do.call(rbind, l)
(df <- as.data.frame(mat, stringsAsFactors = FALSE))

             name      playedBy
1        Jon Snow Kit Harington
2   Aeron Greyjoy Michael Feast
3 Kevan Lannister    Ian Gelder

# split, extract and bind using map()
map_df(got_chars[23:25], extract, c("name", "playedBy"))

# A tibble: 3 x 2
  name            playedBy
  <chr>           <chr>
1 Jon Snow        Kit Harington
2 Aeron Greyjoy   Michael Feast
3 Kevan Lannister Ian Gelder

# base R to split, extract and bind multiple variables and class types
(data.frame(
  name = vapply(got_chars[23:25], `[[`, character(1), "name"),
  id = vapply(got_chars[23:25], `[[`, integer(1), "id"),
  stringsAsFactors = FALSE
))

             name  id
1        Jon Snow 583
2   Aeron Greyjoy  60
3 Kevan Lannister 605

# split, extract and bind multiple variables and class types using map()
(tibble::tibble(
  name = map_chr(got_chars[23:25], "name"),
  id = map_int(got_chars[23:25], "id")
))

# A tibble: 3 x 2
  name               id
  <chr>           <int>
1 Jon Snow          583
2 Aeron Greyjoy      60
3 Kevan Lannister   605
```
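The same contrast can be tried without installing repurrrsive. The sketch below uses a small hypothetical list, `people`, as a stand-in for got_chars. The names and values are invented for illustration.

```r
library(purrr)

# Hypothetical stand-in for a nested list such as got_chars
people <- list(
  list(name = "Ada",   id = 1L, langs = c("R", "C")),
  list(name = "Grace", id = 2L, langs = "COBOL"),
  list(name = "Linus", id = 3L, langs = c("C", "Shell"))
)

# base R: the extractor function and the return type are spelled out
base_names <- vapply(people, `[[`, character(1), "name")

# purrr: a string is shorthand for "extract the element with this name"
purrr_names <- map_chr(people, "name")
ids <- map_int(people, "id")

# a formula or function still works where the shorthand does not fit
n_langs <- map_int(people, ~ length(.x$langs))
```

Both approaches check the return type, but the purrr version moves the type into the function name and drops the explicit extractor.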
map_if(), map2(), invoke()
The following code chunks provide simple examples of the other map() functions profiled:
```r
# map_if(): to convert factors to characters
iris %>%
  map_if(is.factor, as.character) %>%
  str()

List of 5
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...

# map2(): apply a function across 2 lists
map2_dfc(1:3, 2:4, ~ .x * (.y - 1))

# A tibble: 1 x 3
     V1    V2    V3
  <dbl> <dbl> <dbl>
1     1     4     9

# invoke_map(): apply a list of functions to a dataset
set.seed(1)
list(m1 = mean, m2 = median) %>%
  invoke_map_df(x = rcauchy(100))

# A tibble: 1 x 2
     m1    m2
  <dbl> <dbl>
1  2.59 0.130
```
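Recent purrr releases (1.0.0 onward, which is worth confirming against the installed version) flag the invoke() family as deprecated. A minimal sketch of the same idea with plain map_dbl() follows, under that assumption. The anonymous function simply calls each list element on the shared data.

```r
library(purrr)

set.seed(1)
x <- rcauchy(100)

# Apply each function in a named list to the same data;
# the list names carry over to the result
stats <- map_dbl(list(m1 = mean, m2 = median), function(f) f(x))
stats
```

The result is a named double vector rather than a one-row tibble; wrapping it in `tibble::as_tibble_row()` would be one way to recover the tabular form.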
Modifying Function Behavior
Application of the split-apply-combine strategy may require special handling at times. Special handling could involve if/else contingencies, error handling and other needs. The table below lists functions in the purrr package that modify the behavior of other functions, for example by returning enhanced output or by capturing messages, warnings, and errors instead of letting them interrupt execution:
Function | Description |
---|---|
compose() | Combine multiple functions into a single function; by default the functions are applied from right to left. |
lift(), lift_dl(), lift_dv(), lift_ld(), lift_lv(), lift_vd(), lift_vl() | Helps to reconfigure functions by lifting their input from one kind to another kind. The inputs can be changed from and to a list (l), a vector (v) and dots (d). For example, lift_ld(fun) transforms a function taking a list to a function taking dots. |
negate() | Negate a predicate function that returns a single TRUE or FALSE |
partial() | Create a version of a function that has some args preset to values. |
safely() | Don’t stop execution of your function if something goes wrong and capture the error. Modify function to return list of results and errors. |
quietly() | Modify a function to return a list of result, output, messages and warnings instead of printing them. |
possibly() | Don’t stop execution of your function if something goes wrong. Modify function to return a default value whenever an error occurs (instead of the error). |
rerun() | Rerun expression "n" times. This is a good way of generating sample data. Returns a list. |
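Three of the adverbs in the table can be sketched briefly. The wrapped functions and variable names below are illustrative choices, not examples from the package documentation.

```r
library(purrr)

# partial(): preset an argument so it need not be repeated
mean_no_na <- partial(mean, na.rm = TRUE)
avg <- mean_no_na(c(1, NA, 3))

# negate(): flip a predicate, here to keep the non-NULL elements
is_not_null <- negate(is.null)
kept <- keep(list(1, NULL, "a"), is_not_null)

# quietly(): capture the coercion warning instead of emitting it
quiet_int <- quietly(as.integer)
out <- quiet_int("x")
out$result     # the NA produced by the failed coercion
out$warnings   # the captured warning text
```

Each adverb takes a function and returns a new function, so the wrapped versions can be passed straight to map() and its variants.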
compose(), safely(), possibly()
Previously we saw that invoke_map() applies each function in a list independently to the same data. compose() also takes a collection of functions, but chains them into a single new function: the functions are supplied as arguments and applied from right to left, so the output of one becomes the input of the next:
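```r
# pipeline version: roll, tabulate, sort, print, then return the face names
sample(x = 1:6, size = 50, replace = TRUE) %>%
  table %>%
  sort(., decreasing = TRUE) %>%
  print %>%
  names

# composed version: the functions are listed right to left
dice1 <- function(n) sample(size = n, x = 1:6, replace = TRUE)
# partial() presets decreasing = TRUE so the sort matches the pipeline above
dice_rank <- compose(names, print, partial(sort, decreasing = TRUE), table, dice1)
dice_rank(50)
```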
If you have a function that sometimes throws an error, wrap it with safely(). safely() takes a function f() and returns a new function, say safe_f(), that never stops on an error. Instead it returns a list with the elements result and error. When the call succeeds, result holds the output of f() and error is NULL; when it fails, result is NULL and error holds the captured error object.
```r
# safely()
safe_log <- safely(log)
safe_log(10)
$result
[1] 2.302585

$error
NULL

safe_log("a")
$result
NULL

$error
<simpleError in log(x = x, base = base): non-numeric argument to mathematical function>

list("a", 10, 100) %>%
  map(safe_log) %>%
  transpose() %>%
  extract("result")
```
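One way to use the captured errors is to separate the successes from the failures after mapping. The sketch below assumes the same safe_log() wrapper; the names `ok`, `results` and `failed` are illustrative.

```r
library(purrr)

safe_log <- safely(log)
inputs <- list("a", 10, 100)
out <- map(inputs, safe_log)

ok <- map_lgl(out, ~ is.null(.x$error))   # TRUE where no error was captured
results <- map_dbl(out[ok], "result")     # values from the successful calls only
failed <- inputs[!ok]                     # inputs that raised an error
```

Keeping the failed inputs alongside the results makes it possible to inspect or retry them without rerunning the whole map.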
possibly() keeps a map() call going when an element raises an error by returning a default value in its place, as the following example shows.
```r
# possibly()
possible_sqrt <- possibly(sqrt, otherwise = NA_real_)
numbers_with_error <- list(1, 2, 3, "spam", 4)
map(numbers_with_error, possible_sqrt)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
[1] NA

[[5]]
[1] 2
```
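Because the default value here has the same type as a successful result, the wrapped function also works with the typed variants. A minimal sketch with map_dbl() follows; the input list is chosen for illustration.

```r
library(purrr)

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

# map_dbl() returns a flat double vector; the failed element becomes NA
roots <- map_dbl(list(1, 4, "spam", 9), possible_sqrt)
roots

# downstream summaries can then skip the failures explicitly
mean(roots, na.rm = TRUE)
```

Choosing `NA_real_` rather than a plain `NA` keeps every element a double, which is what map_dbl() expects.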
Working with Lists
The last section on apply functions with purrr focuses on handling and transforming data in lists. Many original data objects are structured as lists, notably spatial data objects and data sourced from the internet (in JSON format, among others). For reference, see the cheat sheet prepared by RStudio.4
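The snippet below is not taken from the RStudio material. It is a minimal sketch of a few purrr list helpers (pluck(), keep(), discard() and extraction with a default) applied to a hypothetical nested list, `records`, shaped like parsed JSON.

```r
library(purrr)

# Hypothetical nested list, similar in shape to parsed JSON
records <- list(
  list(id = 1L, tags = c("x", "y"), geo = list(lat = 51.5, lon = -0.1)),
  list(id = 2L, tags = character(0), geo = list(lat = 40.7, lon = -74.0)),
  list(id = 3L, tags = "z", geo = NULL)
)

# pluck(): reach into a nested position by index and name
first_lat <- pluck(records, 1, "geo", "lat")

# keep() / discard(): filter list elements with a predicate
with_geo <- keep(records, ~ !is.null(.x$geo))
untagged <- discard(records, ~ length(.x$tags) > 0)

# nested extraction with a default for records that lack the field
lats <- map_dbl(records, list("geo", "lat"), .default = NA_real_)
```

Supplying `.default` avoids an error for the record without coordinates and leaves an explicit NA in its place.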
- Lionel Henry and Hadley Wickham are the creators and authors of the purrr package.
- Jennifer Bryan is the creator and author of the repurrrsive package, and Charlotte Wickham is listed as an important collaborator.
- Special note: map() functions assume that each piece of data will be processed only once and independently of all other pieces. This means that you cannot use these tools when each iteration requires overlapping data (like a running mean) or depends on the previous iteration (as in recursive routines or dynamic simulation). Loops are still most appropriate for these tasks.
- RStudio, Apply Functions with purrr :: Cheat Sheet, 2018.