Category Archives: R Programming

Project Template

A project template provides a common content structure for data analysis projects.  A template can offer several benefits:

  • Define a familiar workspace
  • Enable collaboration
  • Ensure consistency across machines and time.

The project template is simple in nature.  It has everything a good project should have to achieve repeatable research and results.
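
As a minimal sketch, a template of this kind can be scaffolded with base R.  The folder names (data, R, output, docs), the project name and the temporary location are illustrative assumptions, not the layout prescribed by the post.

```r
# Hypothetical project root; a temporary directory keeps the sketch side-effect free.
project_root <- file.path(tempdir(), "my_analysis")

# Assumed folder layout: raw inputs, scripts, generated results, and write-ups.
folders <- c("data", "R", "output", "docs")

for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# A README records what the project is for.
writeLines("Project description goes here.", file.path(project_root, "README.md"))

list.files(project_root)
```

Creating the same folders in every project is what makes the workspace familiar to collaborators.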

Projects in RStudio

RStudio makes it easy to create a project.  In particular, each project has its own working directory, source files, workspace settings and history.

Posted in R Basics, R Programming

Web Scraping in R

The World Wide Web presents enormous amounts of data.  Unfortunately, the majority of the data is not directly available for download.  In response, web scraping uses indirect means to harvest data from websites.  In practice, web scraping is neither exotic nor inherently illicit, although a site's terms of use still apply.  For example, web browsers rely on the Hypertext Transfer Protocol (HTTP) to fetch data, and so does web scraping.  The difference is that with web scraping the user retrieves, selects and extracts website content and data that were intended for browser display.  This article shows how web scraping works and presents tools available in the R programming language for both manual and automated web scraping.
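
The select and extract steps can be sketched in base R.  The inline HTML below is a made-up stand-in for a page that would normally be retrieved over HTTP, and the regular expression is a simplification; dedicated parsers such as the xml2 or rvest packages are more robust on real pages.

```r
# A small inline page stands in for a document fetched over HTTP.
# The content and file paths are invented for illustration.
html <- c(
  "<html><body>",
  "<h1>Station readings</h1>",
  "<a href=\"/data/2019.csv\">2019</a>",
  "<a href=\"/data/2020.csv\">2020</a>",
  "</body></html>"
)
page <- paste(html, collapse = "\n")

# Select: locate every href attribute in the markup.
matches <- regmatches(page, gregexpr("href=\"[^\"]*\"", page))[[1]]

# Extract: strip the attribute name and quotes to keep only the link targets.
links <- gsub("^href=\"|\"$", "", matches)
links
```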

Posted in Data, R Programming, Web Scrapping

Geospatial Data and Mapping in R

I share slides presented at a recent meeting of Doha R users on geospatial data and mapping in R.

Posted in Data Science, R Data Objects, R Data Syntax, R Programming, Spatial Analysis

Split-Apply-Combine Techniques

Introduction

Many data projects involve a split-apply-combine strategy, where a big dataset is split into manageable pieces, a function is applied to operate on each piece and the results are then combined to put all the pieces back together. These are common actions that are repeated in many analysis projects.
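
A minimal base R sketch of the three steps follows.  The sales data frame and its values are invented for illustration.

```r
# Small illustrative data set (values are made up for the sketch).
sales <- data.frame(
  region = c("north", "south", "north", "south", "east"),
  amount = c(10, 20, 30, 40, 50)
)

# Split: one data frame per region.
pieces <- split(sales, sales$region)

# Apply: summarise each piece independently.
summaries <- lapply(pieces, function(piece) {
  data.frame(region = piece$region[1], total = sum(piece$amount), n = nrow(piece))
})

# Combine: stack the per-region results into a single data frame.
result <- do.call(rbind, summaries)
result
```

The same pattern scales from this toy case to larger data because each piece is processed without reference to the others.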

Posted in Data Science, R Programming, Scientific Computing

Tidy Data Transformations

Package Dependencies

The core packages for tidy data transformations are listed below:

The dplyr package is by far the most important of the packages in the “tidyverse” for data transformation and manipulation.  Verb-based functions are one of the advantages of the package.  The syntax is much easier to read than the often cryptic syntax of base R.
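
As a hedged illustration of the verb-based style, the sketch below chains a few dplyr verbs on the mtcars data set that ships with R.  It assumes the dplyr package is installed; the weight threshold is arbitrary.

```r
library(dplyr)

# mtcars ships with R; mpg, cyl and wt are columns of that data set.
summary_by_cyl <- mtcars %>%
  filter(wt < 4) %>%                          # keep lighter cars
  group_by(cyl) %>%                           # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))                     # highest average first

summary_by_cyl
```

Each verb names the operation it performs, so the pipeline reads top to bottom as a description of the transformation.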

Posted in Data Science, R Basics, R Data Syntax, R Programming

Layered Plots

Layered plots are one way to achieve new insight and actionable intelligence when working with complex data.  The ggplot2 package is well suited for layered plots.

Data Pre-Processing

To make graphs with ggplot(), the data must be in a data frame and in “long” (as opposed to wide) format.  Converting between “wide” and “long” data formats is facilitated with the reshape2 package.  Specifically, the melt() function converts wide to long format, and the dcast() function converts long to wide format.  The following code block presents examples of the two data formats.
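
A minimal sketch of the two formats, assuming the reshape2 package is installed.  The subjects and measurements are made-up values, and the column names measure and value are choices made for this example.

```r
library(reshape2)

# Wide format: one row per subject, one column per measurement (made-up values).
wide <- data.frame(
  subject = c("a", "b"),
  height  = c(170, 180),
  weight  = c(65, 80)
)

# Wide to long: one row per subject/measurement pair.
long <- melt(wide, id.vars = "subject",
             variable.name = "measure", value.name = "value")

# Long back to wide: subjects in rows, measures in columns.
wide_again <- dcast(long, subject ~ measure, value.var = "value")

long
wide_again
```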

Posted in ggplot2, R Graphics, R Programming

Conditionals in R

Conditionals are expressions that perform different computations or actions depending on whether a predefined boolean condition is TRUE or FALSE.  Conditional statements include if(), the combination if()/else, ifelse(), and switch().  Each statement supports source code branching by altering the control flow.
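
As a brief sketch, ifelse() and switch() differ in what they test: ifelse() works element by element on a vector, while switch() picks one branch from a single value.  The scores and the describe() helper are hypothetical.

```r
scores <- c(45, 72, 90)

# ifelse() is vectorised: it tests every element and returns a vector.
outcome <- ifelse(scores >= 50, "pass", "fail")

# switch() selects one branch from a single value; the unnamed last
# argument is the default when nothing matches.
describe <- function(type) {
  switch(type,
         mean   = "arithmetic average",
         median = "middle value",
         "unknown statistic")
}

outcome
describe("median")
describe("mode")
```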

The if() Statement

The if() statement is common to most programming languages.  The if() statement performs operations based on a simple condition:
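
A minimal sketch with an arbitrary value of x:

```r
x <- 7

# The body runs only when the condition evaluates to a single TRUE.
if (x > 5) {
  message("x is greater than 5")
}

# Adding else supplies the alternative branch; if/else also returns a value.
parity <- if (x %% 2 == 0) "even" else "odd"
parity
```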

Posted in R Programming

Creating R Functions

Creating functions and object-oriented scripts is the preferred way to use R.  R functions expand the capabilities of R.  By nature, R scripts are a way to organize and save data, complicated expressions, or sequences of operations for reuse.  Well-designed R functions rely on proper use of R language concepts and object-oriented structures.
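
As an illustrative sketch, a function wraps a sequence of operations behind named arguments and a return value.  The helper rescale01() is hypothetical and not taken from the post.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
rescale01 <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))              # validate the input early
  rng <- range(x, na.rm = na.rm)        # minimum and maximum
  (x - rng[1]) / (rng[2] - rng[1])      # last expression is the return value
}

rescale01(c(2, 4, 6))
```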

R Scripts vs. R Functions

Scripts and functions have several distinguishing characteristics:

Posted in R Programming

Data Sorting in R

Data sorting in R is simple and straightforward.  Key functions include sort() and order().  The variable by which you sort can be a numeric, a string or a factor variable.  Argument options also provide flexibility in how missing values are handled:  they can be listed first, listed last or removed.
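
A short sketch of both functions and of the na.last argument, using made-up values:

```r
x <- c(3, NA, 1, 2)

sort(x)                   # missing value removed (the default)
sort(x, na.last = TRUE)   # missing value listed last
sort(x, na.last = FALSE)  # missing value listed first

# order() returns row positions, which makes it suitable for sorting a data frame.
df <- data.frame(name = c("c", "a", "b"), score = c(10, 30, 20))
df[order(df$name), ]
```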

Data Sorting Examples

It is also possible to sort in reverse order by using a minus sign ( - ) in front of a numeric sort variable.  For example:
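
A minimal sketch with an invented data frame.  The second call is an addition for contrast: the minus sign applies to numeric variables, so character variables use decreasing = TRUE instead.

```r
df <- data.frame(name = c("c", "a", "b"), score = c(10, 30, 20))

# A minus sign reverses the order for a numeric sort variable.
by_score_desc <- df[order(-df$score), ]

# Character variables cannot be negated; decreasing = TRUE reverses them.
by_name_desc <- df[order(df$name, decreasing = TRUE), ]

by_score_desc
by_name_desc
```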

Posted in R Data Objects, R Programming

Debugging in R

There are several utilities for debugging in R.

Debugging with traceback()

Whenever a custom function generates an error, the traceback() function is a good way to focus initial problem solving.  The function lists the nested function calls that led to the most recent uncaught error, starting with the function from which the error was returned and working outward to the original calling function.
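
traceback() is meant for the interactive console, so the sketch below shows that usage as a comment and then records the same kind of call stack with withCallingHandlers() and sys.calls(), a substitute technique that also works in a script.  The three functions are hypothetical.

```r
# Hypothetical three-level call chain; check_value() raises the error.
check_value  <- function(x) stop("x must be numeric")
clean_input  <- function(x) check_value(x)
run_analysis <- function(x) clean_input(x)

# Interactive use: run run_analysis("a") at the console, then call traceback().

# Non-interactive sketch: record the call stack at the moment of the error.
captured <- NULL
msg <- tryCatch(
  withCallingHandlers(
    run_analysis("a"),
    error = function(e) captured <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

# Name of the function in each recorded call, outermost first.
call_names <- vapply(as.list(captured),
                     function(cl) deparse(cl[[1]])[1], character(1))
call_names
```

Note that sys.calls() lists the outermost call first, whereas traceback() prints the innermost call first.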

Posted in R Programming

Iteration in R

Iteration is core to many calculations.  The use of iteration in R is common, but explicit loops should be avoided wherever vectorized methods achieve the same goal.

Iteration, or traditional looping, is a brute force approach to data management that is effective, but costly.  When a large object is modified inside a loop, R may copy it in memory on each pass.  Thus, iteration can consume considerable time and memory.  R supports the following functions that apply an operation across data without an explicit loop: apply(), lapply(), tapply(), sapply() and by().  More traditional functions for iteration in R are described below.
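
The sketch below computes the same result three ways, with an arbitrary input vector, to contrast an explicit loop with vectorized and apply-style alternatives.

```r
x <- 1:10

# Explicit loop: the result vector is preallocated, then filled element by element.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorised: the arithmetic operator works on the whole vector at once.
squares_vec <- x^2

# Apply family: sapply() hides the loop and simplifies the result to a vector.
squares_sapply <- sapply(x, function(v) v^2)

# tapply() applies a function within groups.
group <- rep(c("a", "b"), each = 5)
group_sums <- tapply(x, group, sum)
group_sums
```

Preallocating squares_loop avoids growing the vector inside the loop, which is one common source of repeated copying.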

Posted in R Programming

Local vs Global Objects

Local vs global objects in R serve to distinguish temporary and permanent data.

Local Objects and Frames

Data objects assigned within the body of a function are temporary.  That is, they are local to the function only.  Local objects have no effect outside the function, and they disappear when function evaluation is complete.    
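
A minimal sketch of this behaviour.  The function add_tax() and the values are hypothetical.

```r
total <- 100   # global object

add_tax <- function(price) {
  total <- price * 1.2   # local object; masks the global total inside the function
  total                  # returned to the caller, then the local copy disappears
}

with_tax <- add_tax(50)

with_tax   # value computed from the local object
total      # the global object is unchanged
```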

Posted in R Data Objects, R Programming

R Graphics: Multi-Graph Layouts

The layout() Function

The ability to manage multiple plots in one graphical device or window is a key capability to enhance data visualization and analysis.

The layout() function in base R is the most straightforward method to divide a graphical device into rows and columns.  The function requires an input matrix definition.  Column widths and row heights can be defined using additional input arguments.  layout.show() can then be used to preview a multi-graph layout and see how the graphical device is being split.
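
A short sketch of a three-figure layout, using the mtcars data set that ships with R.  The matrix shape and the relative widths are arbitrary choices for illustration.

```r
# Layout matrix: figure 1 spans the top row; figures 2 and 3 share the bottom row.
m <- matrix(c(1, 1,
              2, 3), nrow = 2, byrow = TRUE)

# widths and heights give the relative sizes of the columns and rows.
n_fig <- layout(m, widths = c(2, 1), heights = c(1, 1))

layout.show(n_fig)   # outline the regions before plotting

# Each high-level plot call fills the next region in order.
plot(mtcars$wt, mtcars$mpg, main = "Figure 1")
hist(mtcars$mpg, main = "Figure 2")
boxplot(mtcars$mpg, main = "Figure 3")
```

layout() returns the number of figures it defines, which is why its result can be passed straight to layout.show().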

Posted in R Graphics, R Programming

Project Control with Git and GitHub

Introduction

GitHub is a web-based hosting service for file archiving and version control.  GitHub builds on Git, a version control tool that is installed locally.  Data science projects use Git and GitHub to provide access and control of project data, source code and narrative text files.  In practice, RStudio provides an interface to Git.

Benefits

Version control is an essential feature of any project and the benefits are simple:

  • Collaboration: Provide a central repository of files for collaboration;
Posted in Git/GitHub, R Programming

R Programming

R shares many programming constructs with other programming languages, but it also offers coding and memory management efficiencies, which simplify scripting and model building.  Fortunately, R programming is easy to learn.  This chapter on R programming is structured as follows:

Creating R Functions
Local vs Global Objects
Conditionals
Iterations
Special Functions
Debugging

Posted in R Programming