Category Archives: R Basics

Principles of Tidy Data

Introduction to Tidy Data

Despite the enormous amount of data available, there is surprisingly little alignment or information on how to create clean, consistent and easy to use data.

Working with data and code benefits from a few simple principles that facilitate repeatable research and results. The “tidy” approach to data requires that:

  • Data is structured consistently and is reusable;
  • Code flow relies on simple function calls using the pipe;
Posted in Data, R Basics, R Data Objects, R Data Syntax, Scientific Computing | Comments Off on Principles of Tidy Data

What is R?

Statement of Purpose

Documentation on the R programming language has been developed to provide a comprehensive answer to the question “What is R?”  The approach aims to appeal to new users, while the reliance on practical examples aims to provide an applied, long-term reference for seasoned users.

What is R?

R is an open-source implementation of the S programming language, which was developed at Bell Labs “to improve data manipulation, analysis, and visualization.”

Posted in R Basics | Comments Off on What is R?

Project Template

A project template provides a common content structure for data analysis projects.  A template can offer several benefits:

  • Define a familiar workspace
  • Enable collaboration
  • Ensure consistency across machines and time.

The project template is simple in nature.  It has everything a good project should have to achieve repeatable research and results.

Projects in RStudio

RStudio makes it easy to create a project.  In particular, each project has its own working directory, source files, workspace settings and history.

Posted in R Basics, R Programming | Comments Off on Project Template

R Dates and Times

Preprocessing work to maintain R dates and times requires synchronization of data and formats across data sources. R dates and times justify care and attention.

Current Date/Time in R

The functions date(), Sys.Date() and Sys.time() all return the current system date and time. date() returns a character string, Sys.Date() returns a Date object and Sys.time() returns a POSIXct object.
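As a minimal sketch using only base R, the three calls can be compared side by side. The variable names d1, d2 and d3 are illustrative and not from the original post:

```r
# Current date and time from three base R functions
d1 <- date()        # character string in a fixed, human-readable layout
d2 <- Sys.Date()    # object of class "Date" (no time component)
d3 <- Sys.time()    # object of class "POSIXct" (date and time)

class(d1)
class(d2)
class(d3)

# format() converts a Date or POSIXct object to a character string
# in an explicitly chosen layout
d2_chr <- format(d2, "%Y-%m-%d")
d3_chr <- format(d3, "%Y-%m-%d %H:%M:%S")
```

Converting with an explicit format string, as in the last two lines, is one way to keep date representations consistent across data sources.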

Each of these functions returns a slightly different result, which raises the obvious question: how best to manage and format dates in large data objects?

Posted in R Basics, R Data Objects | Comments Off on R Dates and Times

Data Concatenation and Coercion in R

Data concatenation and coercion are common operations in R.

Data Concatenation

The concatenate c() function is used to combine elements into a vector.

When elements are combined from different classes, the c() function coerces to a common type, which is the type of the returned value:
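The coercion behavior described above can be illustrated with a short base R sketch. The vector names are illustrative only:

```r
# Mixing types in c() coerces every element to the most general type
x_num <- c(1L, 2.5)           # integer and double -> double
x_chr <- c(1, "a", TRUE)      # any character element -> character
x_int <- c(TRUE, FALSE, 2L)   # logical and integer -> integer

typeof(x_num)   # "double"
x_chr           # "1" "a" "TRUE"
x_int           # 1 0 2
```

In general, the coercion order runs from logical to integer to double to character, so a single character element turns the whole vector into character.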

Posted in R Basics, R Data Syntax | Comments Off on Data Concatenation and Coercion in R

Data Formatting in R

There are a number of ways to accomplish data formatting in R.

Data Options in R

R supports a range of data formats and controls.  The options() function accesses the default settings R establishes at start-up.  Session options that can be changed from the command line include:

Each of these variables can be changed to modify R performance.  For more details on each element see the HTML help for the options() function.  A practical example is given below.
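As one hedged illustration, the digits option controls how many significant digits are printed. The sketch below changes it temporarily and then restores the previous setting; the names old and pi_short are illustrative:

```r
# options() returns the previous values, so they can be restored later
old <- options(digits = 4)

getOption("digits")       # 4
pi_short <- format(pi)    # formatted with 4 significant digits
print(pi)                 # 3.142

# Restore the settings that were in place before the change
options(old)
```

Saving and restoring the old values keeps a temporary formatting change from leaking into the rest of the session.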

Posted in R Basics, R Data Objects, R Data Syntax | Comments Off on Data Formatting in R

Project Reporting with RMarkdown

Introduction

RMarkdown provides an authoring system for project and data science reporting.  RMarkdown is an R package that is tightly integrated with the RStudio IDE.  It braids together narrative text with embedded chunks of R code.  The R code serves to demonstrate the model concepts in the text.  RMarkdown produces elegantly formatted document output, including publication-quality data plots and tables.

Posted in Data Science, LaTeX, R Basics, R Programming, Scientific Computing | Comments Off on Project Reporting with RMarkdown

Data Sequences and Repetition in R

Functions for data sequences and repetition are useful to define data objects, create new objects, control extractions or replacement, and manage function routines.

Data Sequences

The seq() function can be used several ways depending on its argument structure:

The first form, seq(from, to), generates the sequence from one number to another in steps of 1 (or -1) and is identical to from:to:
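A minimal sketch of the first form, with illustrative values:

```r
# seq(from, to) steps by 1 and gives the same values as the colon operator
s1 <- seq(2, 6)   # 2 3 4 5 6
s2 <- 2:6         # 2 3 4 5 6

all(s1 == s2)     # TRUE
```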

The second form, seq(from, to, by), generates a sequence between from and to with step length by:
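A short sketch of the second form, again with illustrative values; a negative by produces a decreasing sequence:

```r
# Increasing sequence with a fractional step
s3 <- seq(0, 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00

# Decreasing sequence: by must be negative when from > to
s4 <- seq(10, 1, by = -3)    # 10 7 4 1
```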

Posted in R Basics, R Data Objects | Comments Off on Data Sequences and Repetition in R

Data Infix Operators in R

Intro to Infix Operators in R

Infix operators in R are unique functions and methods that facilitate basic data expressions or transformations.

Infix refers to the placement of the arithmetic operator between variables.  For example, an infix operation is given by (a+b), whereas prefix and postfix operators are given by (+ab) and (ab+), respectively.  

The types of infix operators used in R include functions for data extraction, arithmetic, sequences, comparison, logical testing, variable assignments, and custom data functions. 
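A brief sketch of these ideas in base R follows. The custom operator %+% is a hypothetical example defined here for illustration and is not part of base R:

```r
a <- 3
b <- 4

# Infix form and the equivalent prefix (function call) form
sum_infix  <- a + b
sum_prefix <- `+`(a, b)

# Built-in infix operators for sequences, comparison and matching
one_to_a <- 1:a                 # 1 2 3
is_less  <- a < b               # TRUE
is_member <- 2 %in% c(1, 2, 3)  # TRUE

# A user-defined infix operator: any function named %...% can be used infix
`%+%` <- function(lhs, rhs) paste0(lhs, rhs)
joined <- "tidy" %+% "data"     # "tidydata"
```

Because every infix operator is an ordinary function, it can be called in prefix form by quoting its name with backticks.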

Posted in R Basics, R Data Syntax | Comments Off on Data Infix Operators in R

R Data Subscripting

Intro to R Data Subscripting

Data subscripting in R is a key “motor skill” to extract data by row, column or element.  Subscripting is achieved using numeric indices, character names, logical conditions or pattern matching.  Subscripting is also used to assign values to data object elements.

The syntax for data subscripting can take several forms depending on data structure and data object type. Examples are provided below.
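The sketch below shows a few common subscripting forms on a named vector and a small matrix. The objects v and m are illustrative and not from the original post:

```r
v <- c(a = 10, b = 20, c = 30)

by_position <- v[2]                              # numeric index
by_name     <- v["c"]                            # character name
by_logical  <- v[v > 15]                         # logical condition
by_pattern  <- v[grep("^[ab]$", names(v))]       # pattern matching on names

m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("x", "y", "z")))

m_row  <- m[1, ]      # first row
m_col  <- m[, "y"]    # column named "y"
m_elem <- m[2, 3]     # single element: row 2, column 3

# Subscripting on the left-hand side assigns a new value
v["a"] <- 99
```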

Posted in R Basics, R Data Objects | Comments Off on R Data Subscripting

R Basics

The R Kernel

The R kernel is composed of scripts written in the R and C programming languages.  It includes a set of core function libraries, an interpreter for machine interface to run R scripts or functions, and a set of powerful graphical devices.  In total, these elements are referred to as base R.

Posted in R Basics | Comments Off on R Basics

R Data Import

Introduction

Quantitative analysis depends on the ability to load and manage many different types of data and file formats.  There are many R data import functions. Some functions ship with base R and others can be found in R packages.

Data Available in R

R ships with many data sets in the datasets package, which is included in the base distribution of R. The datasets package is attached automatically when R is started.  A list of all data sets in the package is obtained using the following command:
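One way to obtain that list, sketched with the base data() function. The variable name ds is illustrative:

```r
# data(package = ...) returns an object describing the available data sets
ds <- data(package = "datasets")

# The results component is a character matrix; the Item and Title
# columns hold each data set's name and short description
ds_names <- ds$results[, "Item"]
head(ds$results[, c("Item", "Title")])
```

Printing ds directly in an interactive session shows the same listing in a pager or a separate window.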

Posted in R Basics, R Data Import, R Data Objects | Comments Off on R Data Import

Tidy Data Transformations

Package Dependencies

The core packages for tidy data transformations are listed below:

The dplyr package is by far the most important of the packages in the “tidyverse” for data transformation and manipulation.  Verb-based functions are one of the advantages of the package.  The syntax is much easier to use when compared to the cryptic syntax of base R.

Posted in Data Science, R Basics, R Data Syntax, R Programming | Comments Off on Tidy Data Transformations

Data Object Management

Data Object Management

The following functions are useful for data object management in R:

Function                  Description
class()                   Identify the class of a named object.
colnames(); rownames()    Retrieve or set the column or row names of an object.
dim()                     Retrieve or set the dimensions of a rectangular data object.
dimnames()                Get or set the dim names of an object.
head()                    Return the first n rows of a data object.
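
The functions listed above can be tried on a small matrix and on the built-in mtcars data set. This is a minimal sketch; the object names m and top3 are illustrative:

```r
m <- matrix(1:6, nrow = 2)

class(m)   # includes "matrix"
dim(m)     # 2 3

# Set column and row names, then retrieve both at once
colnames(m) <- c("x", "y", "z")
rownames(m) <- c("r1", "r2")
dimnames(m)

# First three rows of a data frame from the datasets package
top3 <- head(mtcars, n = 3)
```
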
Posted in R Basics, R Data Objects | Comments Off on Data Object Management

Tidy Data Preparation

Package Dependencies

The core packages for tidy data preparation are listed below:

Of these, the tibble and tidyr packages are core to data consistency and preparation.

Creating tibble Data

The tibble package provides a new data class for storing tabular data, the tibble. Tibbles inherit from the data.frame class but improve three behaviors:

  • Subsetting – Always returns a new tibble, maintaining data consistency
Posted in Data Science, R Basics, R Data Objects, R Data Syntax | Comments Off on Tidy Data Preparation