Category Archives: R Basics
What is R?
Statement of Purpose
Documentation on the R programming language has been developed to provide a comprehensive answer to the question “What is R?” The approach aims to appeal to new users, while the reliance on practical examples provides an applied, long-term reference for seasoned users.
What is R?
R is an open-source implementation of the S programming language, which was developed at Bell Labs “to improve data manipulation, analysis, and visualization.”
Project Template
A project template provides a common content structure for data analysis projects. A template can offer several benefits:
- Define a familiar workspace
- Enable collaboration
- Ensure consistency across machines and time.
The project template is simple in nature. It has everything a good project should have to achieve repeatable research and results.
Projects in RStudio
RStudio makes it easy to create a project. In particular, each project has its own working directory, source files, workspace settings and history.
R Dates and Times
Preprocessing work to maintain R dates and times requires synchronizing data and formats across data sources. R dates and times justify care and attention.
Current Date/Time in R
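The functions date(), Sys.Date() and Sys.time() all return the current system date and/or time. date() returns a character string, whereas Sys.Date() returns a Date object and Sys.time() returns a POSIXct date-time object: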
```r
> date()
[1] "Tue Oct 22 18:43:27 2013"

> Sys.Date()
[1] "2013-10-22"

> Sys.time()
[1] "2013-10-22 18:45:54 AST"
```
Each of these functions returns a slightly different result, which raises the obvious question: how best to manage and format dates in large data objects?
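One common answer to that question is to store dates as Date or POSIXct objects and convert them to text only for display. The following is a minimal sketch in base R, not necessarily the approach the full post takes; the variable names and example values are illustrative assumptions.

```r
# Parse a character string into a Date object (ISO 8601 is the default layout)
d <- as.Date("2013-10-22")
class(d)                          # "Date"

# Parse a non-default layout by supplying a format string
d2 <- as.Date("22/10/2013", format = "%d/%m/%Y")
d == d2                           # TRUE

# Convert back to text in a chosen layout for display
as_text <- format(d, "%Y/%m/%d")  # "2013/10/22"

# Date-times are stored as POSIXct; naming the time zone avoids surprises
dt <- as.POSIXct("2013-10-22 18:45:54", tz = "UTC")
clock <- format(dt, "%H:%M")      # "18:45"

# Arithmetic works on Date objects directly (result is in days)
span <- as.numeric((d + 30) - d)  # 30
```

Keeping the stored class separate from the display format means the same column can be sorted, differenced, and printed in several layouts without re-parsing.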
Data Concatenation and Coercion in R
Data concatenation and coercion are common operations in R.
Data Concatenation
The concatenate c() function is used to combine elements into a vector.
```r
> c(T, F, T)
[1]  TRUE FALSE  TRUE

> c(8.3, 9.2, 11)
[1]  8.3  9.2 11.0
```
When elements are combined from different classes, the c() function coerces to a common type, which is the type of the returned value:
```r
> x <- c(100, "A", TRUE, as(1, "complex"))
> x
[1] "100" "A" "TRUE" "1+0i"
> class(x)
[1] "character"
```
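The coercion in the example above follows a fixed ordering of types. The sketch below illustrates that ordering and the explicit as.*() conversions in base R; the variable names are illustrative.

```r
# Implicit coercion follows the hierarchy:
# logical < integer < double (numeric) < complex < character
lgl_int <- class(c(TRUE, 2L))          # "integer"
int_dbl <- class(c(TRUE, 2L, 3.5))     # "numeric"
dbl_cpl <- class(c(2L, 3.5, 1i))       # "complex"
any_chr <- class(c(TRUE, 3.5, "a"))    # "character"

# Explicit coercion uses the as.*() family
nums  <- as.numeric(c("1.5", "2"))     # 1.5 2.0
one   <- as.integer(TRUE)              # 1
flags <- as.logical(c("TRUE", "no"))   # TRUE NA

# Values that cannot be converted become NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("abc"))   # NA
```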
Data Formatting in R
There are a number of ways to accomplish data formatting in R.
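As a minimal sketch of a few of those ways, the base R functions round(), sprintf(), format() and formatC() cover most everyday formatting needs. The values and variable names below are illustrative assumptions, not taken from the original post.

```r
x <- 1234.56789

# round() changes the stored value; the result is still numeric
rounded <- round(x, 2)                                       # 1234.57

# sprintf() builds a character string from a C-style template
fixed <- sprintf("%.2f", x)                                  # "1234.57"

# format() can insert a thousands separator
grouped <- format(1234567, big.mark = ",")                   # "1,234,567"

# formatC() gives control over width and padding
padded <- formatC(42, width = 6, format = "d", flag = "0")   # "000042"
```

Note that only round() returns a number; the other three return character strings, so they belong at the display stage rather than in intermediate calculations.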
Data Options in R
R supports a range of data formats and controls. The options() function accesses the default settings R establishes at start-up. Session options that can be changed from the command line include:
```r
> names(options())
 [1] "add.smooth"            "bitmapType"            "browser"
 [4] "browserNLdisabled"     "check.bounds"          "continue"
 [7] "contrasts"             "defaultPackages"       "demo.ask"
[10] "device"                "device.ask.default"    "digits"
[13] "dvipscmd"              "echo"                  "editor"
[16] "encoding"              "example.ask"           "expressions"
[19] "help_type"             "help.search.types"     "help.try.all.packages"
[22] "HTTPUserAgent"         "internet.info"         "keep.source"
[25] "keep.source.pkgs"      "locatorBell"           "mailer"
[28] "max.print"             "menu.graphics"         "na.action"
[31] "nwarnings"             "OutDec"                "pager"
[34] "papersize"             "pdfviewer"             "pkgType"
[37] "printcmd"              "prompt"                "repos"
[40] "rl_word_breaks"        "scipen"                "show.coef.Pvalues"
[43] "show.error.messages"   "show.signif.stars"     "str"
[46] "str.dendrogram.last"   "stringsAsFactors"      "texi2dvi"
[49] "timeout"               "ts.eps"                "ts.S.compat"
[52] "unzip"                 "useFancyQuotes"        "verbose"
[55] "warn"                  "warning.length"        "width"
```
Each of these options can be changed to modify R's behavior. For more details on each element, see the HTML help for the options() function. A practical example is given below.
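A minimal sketch of changing and then restoring a session option, using digits as the example. The choice of option is an illustrative assumption and may differ from the example in the full post.

```r
# options() returns the previous values, so they can be saved for later
old <- options(digits = 4)

print(pi)                          # prints 3.142
current <- getOption("digits")     # 4

# Restore the settings that were in place before the change
options(old)
restored <- getOption("digits")    # back to the earlier value (7 by default)
```

Saving the return value of options() and passing it back is a common pattern inside functions, where a changed setting should not leak into the rest of the session.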
Project Reporting with RMarkdown
Introduction
RMarkdown provides an authoring system for project and data science reporting. RMarkdown is a core component of the RStudio IDE. It braids together narrative text with embedded chunks of R code. The R code serves to demonstrate the model concepts in the text. RMarkdown produces elegantly formatted document output, including publication-quality data plots and tables.
Data Sequences and Repetition in R
Sequence and repetition functions are useful for defining data objects, creating new objects, controlling extraction or replacement, and managing function routines.
Data Sequences
The seq() function can be used in several ways depending on its argument structure:
```r
seq(from, to)
seq(from, to, by= )
seq(from, to, length.out= )
seq(along.with= )
seq(from)
seq(length.out= )
```
The first form generates the sequence from a number to a number and is identical to from:to:
```r
> seq(-3, 3)
[1] -3 -2 -1  0  1  2  3
> -3:3
[1] -3 -2 -1  0  1  2  3
```
The second form generates a sequence from:to with the step length by:
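A minimal sketch of the second form and of the remaining forms listed earlier, followed by rep() for repetition. The values and variable names are illustrative.

```r
# Second form: fixed step length
by_quarter <- seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
countdown  <- seq(10, 1, by = -3)           # 10 7 4 1

# Third form: fixed number of points, step computed by R
five_pts <- seq(0, 1, length.out = 5)       # 0.00 0.25 0.50 0.75 1.00

# Index sequence along an existing object
x <- c("a", "b", "c")
idx <- seq(along.with = x)                  # 1 2 3

# Single-argument forms count up from 1
up_to_4 <- seq(4)                           # 1 2 3 4
first_3 <- seq(length.out = 3)              # 1 2 3

# Repetition with rep()
cycled  <- rep(1:2, times = 3)              # 1 2 1 2 1 2
doubled <- rep(1:2, each = 2)               # 1 1 2 2
uneven  <- rep(c("x", "y"), times = c(2, 1))  # "x" "x" "y"
```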
Data Infix Operators in R
Intro to Infix Operators in R
Infix operators in R are unique functions and methods that facilitate basic data expressions or transformations.
Infix refers to the placement of an operator between its operands. For example, an infix operation is written (a + b), whereas the prefix and postfix forms are written (+ a b) and (a b +), respectively.
The types of infix operators used in R include functions for data extraction, arithmetic, sequences, comparison, logical testing, variable assignments, and custom data functions.
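A minimal sketch of several of these operator types in base R. Because every infix operator is an ordinary function, it can also be called in prefix form by quoting its name with backticks. The custom operator %+% is a hypothetical name chosen for illustration.

```r
# Arithmetic: infix form and the equivalent prefix call
infix_sum  <- 3 + 4            # 7
prefix_sum <- `+`(3, 4)        # 7

# Sequence, extraction, and matching
v <- 10:15                     # `:` builds a sequence
second <- v[2]                 # 11, extraction with `[`
found  <- v %in% c(11, 99)     # FALSE TRUE FALSE FALSE FALSE FALSE

# Integer division and modulo
quotient  <- 5 %/% 2           # 2
remainder <- 5 %% 2            # 1

# Comparison and logical testing
in_range <- (second > 10) & (second < 12)   # TRUE

# Custom infix operators are functions whose names are wrapped in % signs
`%+%` <- function(lhs, rhs) paste(lhs, rhs)   # hypothetical operator
joined <- "infix" %+% "operator"              # "infix operator"
```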
R Data Subscripting
Intro to R Data Subscripting
Data subscripting in R is a key “motor skill” for extracting data by row, column or element. Subscripting is achieved using numeric or character indices, logical conditions, or pattern matching. Subscripting is also used to assign values to data object elements.
The syntax for data subscripting can take several forms depending on data structure and data object type. Examples are provided below.
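A minimal sketch of the main subscripting forms on a named vector and a data frame. The objects scores and df are hypothetical examples, not data from the original post.

```r
scores <- c(ann = 90, bob = 72, cat = 85)

by_pos  <- scores[2]                               # numeric index: bob
dropped <- scores[-1]                              # negative index drops ann
by_name <- scores["cat"]                           # character index
high    <- scores[scores > 80]                     # logical condition: ann, cat
a_names <- scores[grep("^a", names(scores))]       # pattern matching: ann

# Data frames take a row index and a column index
df <- data.frame(id = 1:3, grp = c("x", "y", "x"))
row2    <- df[2, ]                                 # one row, all columns
grp_col <- df[, "grp"]                             # one column as a vector
x_ids   <- df[df$grp == "x", "id"]                 # 1 3

# Subscripting on the left-hand side assigns values
scores["bob"] <- 75
```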
R Basics
The R Kernel
The R kernel is composed of code written in the R and C programming languages. It includes a set of core function libraries, an interpreter that runs R scripts and functions, and a set of powerful graphical devices. Together, these elements are referred to as base R.
R Data Import
Introduction
Quantitative analysis depends on the ability to load and manage many different types of data and file formats. There are many R data import functions. Some functions ship with base R and others can be found in R packages.
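As a minimal sketch of an import function that ships with base R, read.csv() can read delimited text from a file or from an inline string. The data and variable names here are hypothetical.

```r
# Inline CSV text stands in for a file on disk (hypothetical data)
csv_text <- "id,name,score
1,Ann,90.5
2,Bob,72
3,Cat,85"

# The text= argument reads from a string instead of a file path
scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(scores)

# Round trip through a temporary file
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)
again <- read.csv(path, stringsAsFactors = FALSE)
same <- isTRUE(all.equal(scores, again))   # TRUE
unlink(path)
```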
Data Available in R
R is pre-installed with many data sets in the datasets package, which is included in the base distribution of R. R datasets are automatically loaded when the application is started. A list of all data sets in the package is obtained using the following command:
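The command itself is missing from this excerpt. One way to obtain such a listing in base R is data(package = "datasets"); treating this as the command the post intended is an assumption.

```r
# List the data sets in the datasets package
listing <- data(package = "datasets")

# The listing stores its entries in a character matrix called "results"
head(listing$results[, c("Item", "Title")])

# Any data set from the package can then be used directly by name
head(mtcars, 3)
```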
Tidy Data Transformations
Package Dependencies
The core packages for tidy data transformations are listed below:
```r
# Data format/shape
library(tibble)     # simple data.frames
library(tidyr)      # data cleaning/reshaping

# Data transforms
library(dplyr)      # data transforming
library(forcats)    # data factor mgmt
library(lubridate)  # date/time objects
library(hms)        # time-of-day values
library(stringr)    # string mgmt

# Programming
library(purrr)      # functional programming tools
library(purrrlyr)   # intersection of purrr and dplyr
library(magrittr)   # pipe operators
```
The dplyr package is by far the most important of the “tidyverse” packages for data transformation and manipulation. Verb-based functions are one of its advantages: the syntax is much easier to use than the often cryptic syntax of base R.
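A minimal sketch of the verb-based style, assuming the dplyr package is installed. It uses the built-in mtcars data set; the threshold and the variable name by_cyl are illustrative choices.

```r
library(dplyr)

# Each verb does one job; the pipe passes the result to the next step
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                        # keep rows matching a condition
  group_by(cyl) %>%                           # define groups
  summarise(mean_mpg = mean(mpg), n = n()) %>%  # one summary row per group
  arrange(desc(mean_mpg))                     # sort by the summary column

by_cyl
```

The same result in base R would typically combine logical subscripting with aggregate() or tapply() and a separate order() call, which is where the readability difference shows.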
Data Object Management
Data Object Management
The following functions are useful for data object management in R:
Function | Description |
---|---|
class() | Identify the class of a named object. |
colnames(); rownames() | Retrieve or set the column or row names of an object. |
dim() | Retrieve or set the dimensions of a rectangular data object. |
dimnames() | Retrieve or set the dimension names of an object. |
head() | Return the first n rows of a data object. |
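A minimal sketch exercising the functions in the table on a small matrix and on the built-in mtcars data frame. The object names are illustrative.

```r
m <- matrix(1:6, nrow = 2)

dims <- dim(m)                       # 2 3
colnames(m) <- c("a", "b", "c")      # set column names
rownames(m) <- c("r1", "r2")         # set row names
dn <- dimnames(m)                    # list: row names, then column names

# Setting dim() on a plain vector turns it into a matrix
v <- 1:6
dim(v) <- c(3, 2)
is_mat <- is.matrix(v)               # TRUE

cls <- class(mtcars)                 # "data.frame"
first_rows <- head(mtcars, n = 3)    # first 3 rows
```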
Tidy Data Preparation
Package Dependencies
The core packages for tidy data preparation are listed below:
```r
# Data format/shape
library(tibble)     # simple data.frames
library(tidyr)      # data cleaning/reshaping

# Data transforms
library(dplyr)      # data transforming
library(forcats)    # data factor management
library(lubridate)  # date/time objects
library(hms)        # time-of-day values
library(stringr)    # string mgmt

# Programming
library(purrr)      # functional programming tools
library(purrrlyr)   # intersection of purrr and dplyr
library(magrittr)   # pipe operators
```
Of these, the tibble and tidyr packages are core to data consistency and preparation.
Creating tibble Data
The tibble package provides a new data class for storing tabular data, the tibble. Tibbles inherit from the data.frame class but improve three behaviors:
- Subsetting – Always returns a new tibble, maintaining data consistency
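A minimal sketch of the subsetting behavior, assuming the tibble package is installed. It contrasts single-column subsetting with [ on a data.frame and on a tibble; the objects df and tb are hypothetical.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(tb)                    # "tbl_df" "tbl" "data.frame"

# With [ , a data.frame drops a single column to a vector; a tibble does not
df_col <- df[, "x"]          # integer vector
tb_col <- tb[, "x"]          # still a tibble with one column

# [[ or $ are the explicit ways to pull a column out as a vector
tb_vec <- tb[["x"]]          # 1 2 3
```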
Data Modes and Classes in R
In R, data modes and classes define the fundamental attributes and behavior of a data object. For example, different modes and classes are handled differently by core functions like print(), summary(), and plot().
Data Object Modes
All data in R is an object and all objects have a “mode.” The mode determines what type of information can be found within the object and how that information is stored. Atomic “modes” are the basic building blocks for data objects in R. There are 6 basic atomic modes:
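The list of modes is cut off in this excerpt. As a minimal sketch, the code below inspects one value of each of R's six atomic vector types; assuming these correspond to the six modes the post goes on to list is an assumption. Note that mode() reports both integer and double values as "numeric", while typeof() distinguishes them.

```r
values <- list(
  logical   = TRUE,
  integer   = 1L,
  double    = 1.5,
  complex   = 1 + 2i,
  character = "a",
  raw       = as.raw(1)
)

types   <- sapply(values, typeof)  # logical integer double  complex character raw
modes   <- sapply(values, mode)    # logical numeric numeric complex character raw
classes <- sapply(values, class)   # logical integer numeric complex character raw
```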