- Creating Data Frames in R
- Expanding a Data Grid in R
- Naming Data Frames and Using Names
- Subscripting R Data Frames
- Sorting Data Frames in R
- Combining and Modifying Data Frames in R
- Merging R Data Frames and Redundant Data
- Splitting and Analyzing Data Frames in R
- Analyzing R Data Frames with by()
- Analyzing Data Frames in R with aggregate()
Arrays generalize the dimensional aspect of a matrix but hold only one data mode. Data frames generalize the mode aspect instead: each column can have a different mode. Because they allow this mixing, data frames are the most widely used data objects in R.
Creating Data Frames in R
You can create data frames in R several ways:
- read.table() and its wrappers, such as read.csv(), read data from an external file and return a data frame.
- data.frame() binds together R objects of various kinds.
- as.data.frame() coerces objects of a particular type to objects of class data.frame.
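As a minimal sketch of the file-reading route, the example below passes inline text to read.table() through its text argument, so no external file is needed. The contents of csv.text are invented for illustration.

```r
# Inline text stands in for an external file (contents are illustrative)
csv.text <- "name,age
ann,31
bob,45"

# header = TRUE takes column names from the first line; sep = "," splits on commas
people <- read.table(text = csv.text, header = TRUE, sep = ",")

class(people)   # the result is a data frame
names(people)   # column names come from the header line
```

With a real file, the first argument would be the file path instead of text.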
The data.frame() function creates a data frame from existing objects, provided they all have the same number of rows (shorter vectors are recycled only if their length divides the row count exactly):
    # Generate raw data
    > my.logical <- sample(c(TRUE, FALSE), size = 20, replace = TRUE)
    > my.complex <- rnorm(20) + runif(20) * 1i
    > my.numeric <- rnorm(20)
    > my.matrix <- matrix(rnorm(40), ncol = 2)
    # Create a data frame
    > my.df2 <- data.frame(my.logical, my.complex, my.numeric, my.matrix)
    > my.df2
      my.logical    my.complex my.numeric     X1     X2
    1       TRUE -0.360+0.287i     -0.820  0.180 -0.201
    2       TRUE  0.028+0.149i      1.932  0.049 -1.345
    3      FALSE  1.695+0.969i      1.547 -1.431 -0.567
    …
The names of the input objects become the column names of the data frame. The matrix is the exception: because it supplies several columns, its own column names are used, and since my.matrix has none, R generates X1 and X2. Row names are taken from the first input that has names, dimnames, or row names; otherwise the rows are numbered.
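The attributes of the data objects are not lost when they are combined in a data frame. However, in R versions before 4.0.0 character vectors were converted to factors by default to facilitate data analysis; since R 4.0.0 the default is stringsAsFactors = FALSE, and conversion happens only when you request it. Logical vectors are never converted. To protect a vector from conversion regardless of that setting, wrap it in I() when passing it to data.frame(); I() returns the vector unchanged but with the added class "AsIs".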
    > my.data <- data.frame(MPG, Dist, Climb, Day = I(day))
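The objects MPG, Dist, Climb, and day are not defined in the text, so the sketch below fills them with made-up values to show what I() changes. It sets stringsAsFactors = TRUE explicitly so that the comparison behaves the same way in any R version.

```r
# Hypothetical trip data; these values are invented for illustration
MPG   <- c(28.1, 31.4, 25.9)
Dist  <- c(120, 95, 210)
Climb <- c(300, 150, 820)
day   <- c("Mon", "Tue", "Wed")

# Request factor conversion explicitly so the result does not depend on the R version
converted <- data.frame(MPG, Dist, Climb, Day = day, stringsAsFactors = TRUE)
protected <- data.frame(MPG, Dist, Climb, Day = I(day), stringsAsFactors = TRUE)

class(converted$Day)         # "factor": the character vector was converted
class(protected$Day)         # "AsIs": I() protected the vector
is.character(protected$Day)  # TRUE: the underlying data is still character
```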
It is also possible to supply matrices and lists when creating data frames. If a matrix is submitted to data.frame(), it is the same as if the columns were supplied as individual objects. If a list is supplied, it is treated as if its components had been supplied individually. In both cases, suitable names are concocted if none are supplied.
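The following sketch illustrates the matrix and list cases with small invented inputs. It also shows that as.data.frame() generates different default names (V1, V2) from data.frame() (X1, X2) for an unnamed matrix.

```r
# A matrix without column names: each column becomes a variable
m <- matrix(1:6, ncol = 2)
df.m <- data.frame(m)
names(df.m)               # "X1" "X2"

# as.data.frame() on the same matrix generates V1, V2 instead
names(as.data.frame(m))   # "V1" "V2"

# A list: each component becomes a variable and keeps its name
l <- list(id = 1:3, score = c(2.5, 3.1, 4.8))
df.l <- data.frame(l)
names(df.l)               # "id" "score"
```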
Expanding a Data Grid in R
Another way to create a data frame in R is to build a data grid with expand.grid(). The result contains one row for every combination of the input values.
    > my.grid <- expand.grid(col1 = 1:3, col2 = LETTERS[1:3])
    > my.grid
      col1 col2
    1    1    A
    2    2    A
    3    3    A
    4    1    B
    5    2    B
    6    3    B
    7    1    C
    8    2    C
    9    3    C
    > class(my.grid)
    [1] "data.frame"
Naming Data Frames and Using Names
Column names are set when the data is passed to data.frame(). The row.names argument to data.frame() sets the row names, provided the supplied vector has one unique entry per row. The attach() function adds a data frame to the search path so that its columns can be referenced directly by name, and detach() removes it from the search path again.
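A short sketch of these naming tools, using an invented data frame called scores. The with() call at the end is not mentioned above; it is shown here as one common alternative to attach() that leaves the search path untouched.

```r
# Row names supplied through the row.names argument (values are illustrative)
scores <- data.frame(height = c(150, 162, 171),
                     weight = c(52, 60, 68),
                     row.names = c("ann", "bob", "cy"))

names(scores)       # column names
row.names(scores)   # row names

# Rename the second column after creation
names(scores)[2] <- "mass"

# with() evaluates an expression using the columns as variables,
# without changing the search path the way attach() does
ratio <- with(scores, mass / height)
```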
Subscripting R Data Frames
Extracting a single column with $ returns the column itself, here a vector of class numeric. Extracting by name with single brackets keeps the data.frame class:
    # Extract column X1
    > a <- my.df2$X1
    > a
    [1] -1.739  1.017 -1.374 -1.846  0.421  1.005  0.067 …
    # Interrogate data
    > mode(a)
    [1] "numeric"
    > class(a)
    [1] "numeric"
    # Extract column X1 using []
    > b <- my.df2["X1"]
    > b
          X1
    1 -1.739
    2  1.017
    3 -1.374
    4 -1.846
    5  0.421
    6  1.005
    7  0.067
    …
    # Interrogate data
    > mode(b)
    [1] "list"
    > class(b)
    [1] "data.frame"
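The sketch below collects the common extraction forms side by side on a small invented data frame, including the drop = FALSE argument, which keeps a single selected column as a data frame.

```r
df <- data.frame(X1 = c(-1.739, 1.017, -1.374), X2 = c(0.5, 0.2, 0.9))

v1 <- df$X1                      # numeric vector
v2 <- df[["X1"]]                 # numeric vector, equivalent to $
v3 <- df[, "X1"]                 # numeric vector: a single column is dropped to a vector
d1 <- df["X1"]                   # data frame with one column
d2 <- df[, "X1", drop = FALSE]   # data frame: drop = FALSE keeps the structure
r1 <- df[df$X1 > 0, ]            # rows where X1 is positive, all columns
```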
Sorting Data Frames in R
The sort() function takes a vector and returns its values in sorted order, and rev() reverses a vector. To sort a larger data structure, where several variables must be rearranged in parallel, use the index vectors returned by order() and sort.list():
    # Load library data and sort
    > library(MASS)
    > painters[sort.list(row.names(painters)), ]
            Composition Drawing Colour Expression School
    Albani           14      14     10          6      E
    Barocci          14      15      6         10      C
    Bassano           6       8     17          0      D
    Bellini           4       6     14          0      D
    ...
The sort.list() function returns a positive integer index vector that arranges its argument in increasing order. To sort by a numeric vector x in decreasing order, use sort.list(-x) or sort.list(x, decreasing = TRUE).
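A small sketch of the index vectors sort.list() produces, using an invented three-element vector. Besides negation, sort.list() also accepts a decreasing argument, which gives the same index here.

```r
x <- c(3, 1, 2)

idx.up   <- sort.list(x)                      # 2 3 1
idx.down <- sort.list(x, decreasing = TRUE)   # 1 3 2

x[idx.up]     # 1 2 3
x[idx.down]   # 3 2 1

# Negating a numeric vector yields the same decreasing index
identical(sort.list(-x), idx.down)   # TRUE
```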
The function order() generalizes sort.list() to an arbitrary number of arguments: each additional argument breaks ties left by the ones before it. The following example sorts painters by composition score (descending, as indicated by the negative sign) and then by school (ascending):
    > painters[order(-painters$Composition, painters$School), ]
                  Composition Drawing Colour Expression School
    Guercino               18      10     10          4      E
    Rubens                 18      13     17         17      G
    Raphael                17      18     12         18      A
    Cortona                16      14     12          6      C
    Le Brun                16      16      8         16      H
    Da Vinci               15      16      4         14      A
    Guilio Romano          15      16      4         14      A
sort(), sort.list(), and order() all have an argument na.last that determines the handling of missing values. With na.last = NA (the default for sort()), missing values are removed; with na.last = TRUE (the default for order() and sort.list()), they are put last.
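The effect of na.last is easiest to see on a short vector with one missing value. The vector below is invented for illustration.

```r
x <- c(2, NA, 1)

sort(x)                     # 1 2      (NA removed)
sort(x, na.last = TRUE)     # 1 2 NA   (NA kept, placed last)
order(x)                    # 3 1 2    (index of NA comes last)
order(x, na.last = NA)      # 3 1      (index of NA removed)
order(x, na.last = FALSE)   # 2 3 1    (index of NA comes first)
```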
Combining and Modifying Data Frames in R
You can combine data frames with data.frame(), cbind(), rbind(), or merge(). In practice, use rbind() only when you have complete data frames. Do not use it in a loop to add one row at a time: each call copies the whole data frame, so the cost grows quickly with the number of rows.
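One common way to avoid row-by-row rbind() is to collect the pieces in a list and bind them in a single call. The sketch below assumes a hypothetical helper, make.row(), that stands in for whatever produces each piece.

```r
# Hypothetical helper: builds the one-row result of iteration i
make.row <- function(i) data.frame(id = i, square = i^2)

# Collect the pieces in a list, then bind them in a single rbind() call
pieces <- lapply(1:5, make.row)
result <- do.call(rbind, pieces)

# cbind() adds a column to a data frame with the same number of rows
result <- cbind(result, cube = result$id^3)
```

The single do.call(rbind, pieces) call allocates the combined data frame once instead of once per row.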
Merging R Data Frames and Redundant Data
The merge() function joins two data frames that share one or more key columns, matching rows by the values in those columns. Use the by argument when the key columns have the same name in both data frames, or by.x and by.y when the names differ:
    # Generate data
    > odds.data <- data.frame(label = letters[1:5], odds = seq(1, 9, 2))
    > evens.data <- data.frame(label = letters[5:1], evens = seq(2, 10, 2))
    > odds.data
      label odds
    1     a    1
    2     b    3
    3     c    5
    4     d    7
    5     e    9
    > evens.data
      label evens
    1     e     2
    2     d     4
    3     c     6
    4     b     8
    5     a    10
    # Join the two data frames on the shared label column
    > merge(odds.data, evens.data, by = "label")
      label odds evens
    1     a    1    10
    2     b    3     8
    3     c    5     6
    4     d    7     4
    5     e    9     2
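When the key columns have different names, by.x and by.y name them separately. The sketch below uses two invented data frames, left and right, and also shows the all argument, which keeps rows that have no match on the other side.

```r
left  <- data.frame(key   = c("a", "b", "c"), odds  = c(1, 3, 5))
right <- data.frame(label = c("b", "c", "d"), evens = c(4, 6, 8))

# Key columns have different names, so name each one explicitly
inner <- merge(left, right, by.x = "key", by.y = "label")

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA
outer <- merge(left, right, by.x = "key", by.y = "label", all = TRUE)
```

The key column in the result takes its name from the first data frame, here key.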
The following table summarizes some of the basic rules for combining objects into data frames:
Data Type | Sub Type | Combination Rule(s) |
---|---|---|
vector | numeric complex logical factor ordered rte its cts | 1. Combine a single variable as is |
character | character category | 1. Convert to a factor data type when stringsAsFactors = TRUE (the default before R 4.0.0) 2. Contribute a single variable |
array | matrix array | 1. Each column creates a separate variable 2. Column names used for variable names |
list | list | 1. Each component creates one or more unique variables 2. Variable names assigned as usual for each component |
model.matrix | model.matrix | 1. Object becomes a single variable in result |
data.frame | data.frame | 1. Each variable becomes a variable in result design 2. Variable names used for variable names |
Splitting and Analyzing Data Frames in R
Splitting data frames is a common manipulation. The split() function takes the data to be divided (here, selected columns of a data frame) and a grouping vector or factor, and returns a list with one element per group:
    # Generate data
    > x <- data.frame(odds = seq(1, 9, 2), logical = c(TRUE, FALSE, TRUE, TRUE, FALSE), letters = letters[1:5])
    > x
      odds logical letters
    1    1    TRUE       a
    2    3   FALSE       b
    3    5    TRUE       c
    4    7    TRUE       d
    5    9   FALSE       e
    # Split data
    > split.data <- split(x[, c(1, 3)], x[, 2])
    > split.data
    $`FALSE`
      odds letters
    2    3       b
    5    9       e

    $`TRUE`
      odds letters
    1    1       a
    3    5       c
    4    7       d
A common use for split() is to create the list of groups that boxplot() accepts.
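As a sketch of that use, the example below splits one numeric column of the built-in iris data by species. The resulting list can be summarized with sapply() or handed to boxplot(); the plotting call is left commented out so the example runs without a graphics device.

```r
# Split a numeric column of the built-in iris data by species
petal.by.species <- split(iris$Petal.Length, iris$Species)

names(petal.by.species)   # one list element per species

# Summarize each group
group.means <- sapply(petal.by.species, mean)

# The same list can be passed straight to boxplot():
# boxplot(petal.by.species)
```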
Analyzing R Data Frames with by()
It is often more convenient to split a data frame using the by() function. by() takes a data frame and splits it by rows into smaller data frames, one for each combination of values of one or more factors (the INDICES argument). Supply the indices as a factor or a list of factors; the function FUN is then applied to each subset in turn. The result has class "by"; the example below flattens it into a matrix with do.call() for cleaner printing:
    do.call("rbind", as.list(
      by(iris, list(Species = iris$Species), function(x) {
        y <- subset(x, select = -Species)
        apply(y, 2, mean)
      })
    ))
               Sepal.Length Sepal.Width Petal.Length Petal.Width
    setosa            5.006       3.428        1.462       0.246
    versicolor        5.936       2.770        4.260       1.326
    virginica         6.588       2.974        5.552       2.026
Analyzing Data Frames in R with aggregate()
The aggregate() function also allows you to partition a data frame or a matrix by one or more grouping vectors. It then applies a summary function that returns a single value (e.g. sum() or mean()) to each column within each group.
    > iris.x <- subset(iris, select = -Species)
    > iris.s <- subset(iris, select = Species)
    > aggregate(iris.x, iris.s, mean)
         Species Sepal.Length Sepal.Width Petal.Length Petal.Width
    1     setosa        5.006       3.428        1.462       0.246
    2 versicolor        5.936       2.770        4.260       1.326
    3  virginica        6.588       2.974        5.552       2.026
aggregate() returns a data frame with one row per group. It has one column for each grouping variable, followed by one column for each aggregated variable holding the value the function returned for that group.
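aggregate() also has a formula interface, which avoids building the data and grouping objects separately. The sketch below applies it to the built-in iris data; the left side of the formula names the variables to summarize and the right side names the grouping variables.

```r
# Mean of one variable per species
one.var <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Mean of every other variable per species ("." stands for all remaining columns)
all.vars <- aggregate(. ~ Species, data = iris, FUN = mean)

names(one.var)   # grouping column first, then the summarized variable
dim(all.vars)    # one row per species, one column per variable plus Species
```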
The following table lists functions that can be used to analyze a data frame.
Function | Arguments | Input Object | Output Object | Comment |
---|---|---|---|---|
aggregate() | (x, by=, FUN=, ...) | data.frame | data.frame | FUN should return a scalar |
apply() | (x, MARGIN=, FUN=, ...) | data.frame matrix array | vector array | n/a |
by() | (x, INDICES=, FUN=, ...) | data.frame | by | INDICES should be a factor or a list of factors |
lapply() | (x, FUN=, ...) | any object | list | n/a |
sapply() | (x, FUN=, ..., simplify = TRUE) | any object | vector matrix list | n/a |
sweep() | (x, MARGIN=, STATS=, FUN=, ...) | data.frame matrix array | matrix array | n/a |
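A brief sketch of four of the functions in the table, applied to the built-in iris data, may help show how their output types differ. The variable names are chosen for illustration only.

```r
# Numeric part of iris as a matrix
m <- as.matrix(iris[, 1:4])

col.means <- apply(m, 2, mean)        # MARGIN = 2: one mean per column
centered  <- sweep(m, 2, col.means)   # subtract each column's mean (FUN defaults to "-")

classes  <- lapply(iris, class)       # always returns a list
n.unique <- sapply(iris, function(col) length(unique(col)))   # simplified to a named vector
```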
These functions ship with base R. Additional functions from open-source packages will be introduced in the chapter on large data objects.