Category Archives: Data Science

Tidy Data Preparation

Package Dependencies

The core packages for tidy data preparation are listed below:

Of these, the tibble and tidyr packages are core to data consistency and preparation.

Creating tibble Data

The tibble package provides a new data class for storing tabular data, the tibble. Tibbles inherit from the data.frame class but improve three behaviors:

  • Subsetting – Always returns a new tibble, maintaining data consistency
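The subsetting behavior can be illustrated with a short sketch. It assumes the tibble package is installed; the data and column names are made up for this example.

```r
library(tibble)

# Illustrative data; the column names are invented for this example.
df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
tb <- as_tibble(df)

# Single-column subsetting with `[`:
# a data.frame drops to a bare vector, while a tibble stays a tibble.
df_col <- df[, "value"]
tb_col <- tb[, "value"]

class(df_col)  # a plain numeric vector
class(tb_col)  # still carries the tibble classes

# `[[` (or `$`) extracts the underlying vector when that is what you want.
tb_vec <- tb[["value"]]
```

Because `[` never changes the return type on a tibble, code that expects a table keeps working even when a selection happens to contain a single column.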
Posted in Data Science, R Basics, R Data Objects, R Data Syntax

R Work Flow

Intro to R Work Flow

Large data projects in R require consistent work flow principles. The goal is to improve project management. Above all, the work flow process has a clear priority: to shift time spent from low-value to high-value activity. The solution is simple: (1) use a basic project template to manage project files and directories; (2) write code with a common set of tools to improve code flow and efficiency; and (3) extend base R with well-accepted R packages that support improved work flow and code syntax.
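As a rough sketch of point (1), a project template can be created with a few lines of base R. The helper name `create_project` and the folder names are hypothetical and chosen only for illustration; the article does not prescribe a particular layout.

```r
# Hypothetical helper: folder names are illustrative, not a prescribed layout.
create_project <- function(root, dirs = c("data", "R", "output", "reports")) {
  paths <- file.path(root, dirs)
  for (p in paths) {
    # recursive = TRUE also creates the project root if it does not exist yet
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Build the skeleton in a temporary location so nothing permanent is touched.
project_root <- file.path(tempdir(), "demo_project")
created <- create_project(project_root)
list.files(project_root)
```

Keeping the same skeleton across projects means scripts can rely on relative paths such as `data/` and `output/` without per-project adjustments.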

Posted in Data Science, Git/GitHub

Geospatial Data and Mapping in R

I share slides presented at a recent meeting of Doha R users on geospatial data and mapping in R.

Posted in Data Science, R Data Objects, R Data Syntax, R Programming, Spatial Analysis

Project Reporting with RMarkdown

Introduction

RMarkdown provides an authoring system for project and data science reporting. RMarkdown is a core component of the RStudio IDE. It braids together narrative text with embedded chunks of R code. The R code serves to demonstrate the model concepts in the text. RMarkdown produces elegantly formatted document output, including publication-quality data plots and tables.
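To make the idea of braiding text and code concrete, the sketch below writes a minimal RMarkdown source file from R. The file name, title, and content are hypothetical. Rendering is left as a commented-out call because it assumes the rmarkdown package and pandoc are installed.

```r
# Hypothetical file name and content, for illustration only.
rmd_lines <- c(
  "---",
  "title: \"Demo Report\"",
  "output: html_document",
  "---",
  "",
  "The average fuel economy in `mtcars` is `r round(mean(mtcars$mpg), 1)` mpg.",
  "",
  "```{r mpg-plot, echo=FALSE}",
  "plot(mtcars$wt, mtcars$mpg)",
  "```"
)

rmd_file <- file.path(tempdir(), "demo_report.Rmd")
writeLines(rmd_lines, rmd_file)

# To build the report (assumes rmarkdown and pandoc are available):
# rmarkdown::render(rmd_file)
```

The narrative sentence contains an inline R expression, and the fenced chunk produces a plot, so both text and results come from a single source file.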

Posted in Data Science, LaTeX, R Basics, R Programming, Scientific Computing

Tidy Data Transformations

Package Dependencies

The core packages for tidy data transformations are listed below:

The dplyr package is by far the most important of the packages in the “tidyverse” for data transformation and manipulation. Verb-based functions are one of the advantages of the package: each function is named for the action it performs, which makes the syntax much easier to read and write than the often cryptic syntax of base R.
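A small comparison shows what verb-based syntax looks like in practice. This is a sketch that assumes the dplyr package is installed and uses the built-in mtcars data; the summary chosen here is arbitrary.

```r
library(dplyr)

# Average mpg by cylinder count, restricted to cars with more than 100 hp.
by_cyl <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))

# One base R way to get the same group means, for comparison.
keep <- mtcars$hp > 100
base_means <- tapply(mtcars$mpg[keep], mtcars$cyl[keep], mean)
```

Each dplyr step reads as a verb (filter, group, summarise, arrange), while the base R version relies on logical indexing and `tapply()`.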

Posted in Data Science, R Basics, R Data Syntax, R Programming

Split-Apply-Combine Techniques

Introduction

Many data projects involve a split-apply-combine strategy, where a big dataset is split into manageable pieces, a function is applied to operate on each piece and the results are then combined to put all the pieces back together. These are common actions that are repeated in many analysis projects.
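The three steps map directly onto base R functions. The sketch below uses the built-in iris data; the choice of summary statistic is only an example.

```r
# Split: one data frame per species.
pieces <- split(iris, iris$Species)

# Apply: compute a small summary for each piece.
summaries <- lapply(pieces, function(piece) {
  data.frame(
    Species = as.character(piece$Species[1]),
    n = nrow(piece),
    mean_sepal_length = mean(piece$Sepal.Length)
  )
})

# Combine: stack the per-piece summaries back into one data frame.
result <- do.call(rbind, summaries)
result
```

The same pattern underlies helpers such as `tapply()` and `aggregate()`, which wrap the split, apply, and combine steps in a single call.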

Posted in Data Science, R Programming, Scientific Computing

Plotting Forecast Data Objects Using ggplot

Rob Hyndman is the author of the forecast package in R. I’ve been using the package for long-term time series forecasts. The package comes with some built-in methods for plotting forecast data objects in R that I’ve wanted to customize for improved clarity and presentation. The following article achieves that goal and shares two scripts for plotting forecast data objects using ggplot.

Posted in Data Science, ggplot2, Modeling, R Programming

From Least Squares to k-Nearest Neighbor (kNN)

The linear model is one of the most widely used and most important data science tools. A second basic tool, the k-nearest neighbor method (kNN), takes a contrasting approach. Both models are used for prediction and classification. In practice, classification results (i.e., feature classes) are used by machines in many ways: to recognize faces in a crowd, to “read” road signs by distinguishing one letter from another, and to set voter registration districts by separating population groups. This article applies and compares linear and non-linear classification methods.
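The contrast between the two approaches can be sketched in base R. The data below are simulated purely for illustration, the value k = 15 is an arbitrary choice, and the kNN routine is a simple hand-written version rather than the article's own code.

```r
set.seed(42)

# Simulated two-class data (illustrative only): two Gaussian clouds in 2-D.
n <- 100
train <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  y  = rep(c(0, 1), each = n)
)

# Linear classifier: regress the 0/1 label on the features, threshold at 0.5.
fit <- lm(y ~ x1 + x2, data = train)
predict_linear <- function(newdata) {
  as.numeric(predict(fit, newdata) > 0.5)
}

# kNN classifier: majority vote among the k nearest training points.
predict_knn <- function(newdata, k = 15) {
  train_x <- as.matrix(train[, c("x1", "x2")])
  new_x <- as.matrix(newdata[, c("x1", "x2")])
  apply(new_x, 1, function(pt) {
    d <- sqrt(rowSums(sweep(train_x, 2, pt)^2))  # Euclidean distances
    nearest <- order(d)[seq_len(k)]
    as.numeric(mean(train$y[nearest]) > 0.5)
  })
}

# Training-set accuracy of each classifier.
acc_linear <- mean(predict_linear(train) == train$y)
acc_knn <- mean(predict_knn(train) == train$y)
```

The linear fit draws one straight decision boundary from a global model, while kNN decides each point from its local neighborhood, so its boundary can bend to follow the data.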

Posted in Data Science, Modeling, R Programming, Website

R Functions for Best Subset Regression

Best subset regression is a technique for model building and variable selection. The method looks at all combinations of independent predictor variables for use in a multiple regression model. Model developers and analysts often struggle with variable selection, especially when the number of predictors is high. Ideally, each set of predictors is run and the best set is selected using a criterion for model performance. The following article provides custom functions for best subset selection that are fast and easy to use.
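The basic idea can be sketched as an exhaustive search in base R. This is not the article's custom function: the name `best_subset`, the use of adjusted R-squared as the criterion, and the mtcars predictors are all assumptions made for illustration, and no attempt is made at speed.

```r
# Hypothetical helper: fit every non-empty subset of predictors and
# rank the fits by adjusted R-squared (one possible selection criterion).
best_subset <- function(data, response, predictors) {
  results <- list()
  for (k in seq_along(predictors)) {
    for (vars in combn(predictors, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = response), data = data)
      results[[length(results) + 1]] <- data.frame(
        predictors = paste(vars, collapse = " + "),
        size = k,
        adj_r2 = summary(fit)$adj.r.squared
      )
    }
  }
  out <- do.call(rbind, results)
  out[order(-out$adj_r2), ]
}

ranked <- best_subset(mtcars, "mpg", c("wt", "hp", "disp", "qsec"))
head(ranked, 3)
```

With p candidate predictors the search fits 2^p - 1 models, which is why the number of predictors quickly becomes the limiting factor.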

Posted in Data Science, Faster R, Modeling

Popularity of R Programming Language

The popularity of R is rapidly increasing and is well on its way to being a top 10 programming language. The TIOBE index is a standard indicator of the popularity of all programming languages. The TIOBE index confirms that a subset of languages – those for computational statistics and data analysis – are gaining increased attention. The clear winner of the pack is the open source programming language R.

Posted in Data Science, R Programming

Binary Data In R

There are many reasons to work with binary data in R.  Solar resource data, solar PV performance data, and real-time grid monitoring data are typically stored and transmitted in binary data formats.  

In practice, accessing binary data in R is not possible without a vendor-specific or format-specific “can opener” and a properly configured scientific programming environment. As a result, many business applications bypass binary data use altogether or, instead, rely on secondary sources and summary statistics with no ability to validate data integrity and accuracy.
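When the byte layout of a file is known, base R can read it directly with `readBin()`. The sketch below writes and then reads a small file in a hypothetical layout invented for this example: a 4-byte integer record count followed by that many 8-byte doubles, little-endian. Real vendor formats are typically far more involved.

```r
# Hypothetical flat binary layout (an assumption for illustration):
# a 4-byte integer record count, then that many 8-byte doubles.
path <- tempfile(fileext = ".bin")
readings <- c(512.5, 498.25, 730.0)

# Write the file in binary mode.
con <- file(path, "wb")
writeBin(length(readings), con, size = 4L, endian = "little")
writeBin(readings, con, size = 8L, endian = "little")
close(con)

# Read it back: first the count, then the values.
con <- file(path, "rb")
n <- readBin(con, what = "integer", n = 1L, size = 4L, endian = "little")
values <- readBin(con, what = "numeric", n = n, size = 8L, endian = "little")
close(con)
```

The reader has to supply the type, size, and byte order of every field, which is exactly the knowledge a format-specific “can opener” encodes.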

Posted in Data, Data Science, GDAL, R Data Import

Correlation Plots in R

The standard function for correlation plots in R is pairs(), which generates a matrix of scatter plots based on all pairwise combinations of variables in a data object. The standard graph looks something like this after a little color enhancement:

[Figure: plot13]
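For readers who want to try `pairs()` themselves, here is a minimal sketch. It uses the built-in iris data as a stand-in, with arbitrary colors, and is not the code that produced the figure in the original post. The plot is written to a temporary PDF so the example also runs without a display.

```r
# Built-in iris data as a stand-in; colors are arbitrary choices.
species_cols <- c("steelblue", "darkorange", "forestgreen")[as.integer(iris$Species)]

# Write to a temporary PDF; in an interactive session, call pairs() directly.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 8, height = 8)
pairs(iris[, 1:4], col = species_cols, pch = 19,
      main = "Iris scatter plot matrix")
dev.off()
```

Each off-diagonal panel plots one variable against another, so four variables give a 4 x 4 grid with the variable names on the diagonal.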

The code behind this plot is simple:

Posted in Data Science, ggplot2, R Graphics, R Programming