Category Archives: R Programming
Creating functions and object-oriented scripts is the preferred way to use R. Functions expand the capabilities of R. By nature, R scripts are a way to organize and save data, complicated expressions, or sequences of operations for reuse. Well-designed R functions rely on proper use of R language concepts and object-oriented structures.
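As a minimal sketch of the difference, the same calculation is written below first as script-style statements and then wrapped in a reusable function. The temperature values and the name f_to_c() are made up for illustration and are not from the original post.

```r
# Script style: a fixed sequence of statements tied to one data object.
temps_f <- c(32, 68, 212)
temps_c <- (temps_f - 32) * 5 / 9

# Function style: the same logic, reusable with any numeric input.
# f_to_c() is a hypothetical helper name.
f_to_c <- function(f) {
  (f - 32) * 5 / 9
}

converted <- f_to_c(temps_f)
print(converted)
```

The function version can be called again with new data without editing the code, which is the main practical advantage over a script.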
R Scripts vs. R Functions
Scripts and functions have several distinguishing characteristics:
Data sorting in R is simple and straightforward. Key functions include sort() and order(). The variable by which you sort can be a numeric, a string or a factor variable. Argument options also control how missing values are handled: they can be listed first, listed last or removed.
Data Sorting Examples
> x <- c(0.868, -0.066, -0.075, -1.002, 0.646)
> sort(x)
[1] -1.002 -0.075 -0.066  0.646  0.868
> order(x)
[1] 4 3 2 5 1
> x[order(x)]
[1] -1.002 -0.075 -0.066  0.646  0.868
> x <- rep(1:4, each = 2)
> x
[1] 1 1 2 2 3 3 4 4
> unique(x)
[1] 1 2 3 4
> rev(unique(x))
[1] 4 3 2 1
It is also possible to sort in reverse order by placing a minus sign (-) in front of a numeric sort variable inside order(). For character variables, use the argument decreasing = TRUE instead. For example:
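The sketch below uses a small made-up vector and data frame (neither comes from the original post) to show descending sorts and the na.last argument, which controls whether missing values are dropped, listed last or listed first.

```r
# Made-up numeric vector with one missing value.
x <- c(3, 1, NA, 2)

asc_default <- sort(x)                    # NA is removed by default
asc_na_last <- sort(x, na.last = TRUE)    # NA listed last
asc_na_first <- sort(x, na.last = FALSE)  # NA listed first
desc_sorted <- sort(x, decreasing = TRUE) # descending, NA removed

# order() returns positions; a minus sign reverses a numeric sort.
desc_index <- order(-x)
desc_values <- x[desc_index]

# Sorting the rows of a data frame by a numeric column, largest first.
df <- data.frame(name = c("a", "b", "c"), score = c(2, 3, 1),
                 stringsAsFactors = FALSE)
df_desc <- df[order(-df$score), ]
print(df_desc)
```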
There are several utilities for debugging in R.
Debugging with traceback()
Whenever a custom function generates an error, the traceback() function is a good way to focus initial problem solving. The function lists the nested function calls currently being evaluated, starting with the function from which the error was returned and working outward to the original calling function.
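A minimal sketch of the situation traceback() is designed for is shown below. The functions inner_fn() and outer_fn() are hypothetical. Because traceback() is meant to be called interactively right after an uncaught error, the interactive calls appear as comments and the error is caught with tryCatch() so the script keeps running.

```r
# Hypothetical nested functions: outer_fn() calls inner_fn(), which fails
# when it receives non-numeric input.
inner_fn <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}
outer_fn <- function(x) {
  inner_fn(x) + 1
}

# In an interactive session you would run:
#   outer_fn("a")   # raises the error inside inner_fn()
#   traceback()     # prints the call stack, most recent call first
# Here the error is caught so that the example runs from top to bottom.
msg <- tryCatch(outer_fn("a"), error = function(e) conditionMessage(e))
print(msg)
```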
Iteration is core to many calculations. Explicit loops are common in R, but vectorized methods often achieve the same goal more efficiently and should be preferred wherever they apply.
Iteration, or traditional looping, is a brute-force approach to data management that is effective but can be costly. When a loop modifies or grows a large object, R may copy that object in memory on each pass. Thus, iteration can consume considerable time and memory. R supports the following functions that replace explicit loops: apply(), lapply(), tapply(), sapply() and by(). More traditional functions for iteration in R are described below.
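As a small illustration, the sketch below computes the same result three ways: with an explicit for loop, with sapply(), and with a fully vectorized expression. It also shows tapply() summing values by group. The data are made up for the example.

```r
x <- 1:10

# Explicit loop, with the result vector preallocated to avoid regrowing it.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# sapply() applies a function to each element and simplifies to a vector.
squares_sapply <- sapply(x, function(v) v^2)

# Vectorized arithmetic needs no loop at all.
squares_vec <- x^2

# tapply() applies a function to values grouped by a second vector.
amounts <- c(1, 2, 3, 4)
groups <- c("a", "a", "b", "b")
group_sums <- tapply(amounts, groups, sum)
print(group_sums)
```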
Local vs global objects in R serve to distinguish temporary and permanent data.
Local Objects and Frames
Data objects assigned within the body of a function are temporary. That is, they are local to the function only. Local objects have no effect outside the function, and they disappear when function evaluation is complete.
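A short sketch of this behavior follows. The names total, double_it() and local_only_value are hypothetical. The function assigns to total and creates a new object, yet neither assignment is visible after the call returns.

```r
# A global object defined at the top level.
total <- 1

double_it <- function(price) {
  total <- price * 2        # local 'total'; the global one is untouched
  local_only_value <- 99    # exists only while the function runs
  total
}

result <- double_it(50)

print(result)                        # value returned by the function
print(total)                         # the global object is unchanged
print(exists("local_only_value"))    # the local object is gone
```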
The layout() Function
The ability to manage multiple plots in one graphical device or window is a key capability to enhance data visualization and analysis.
The layout() function in base R is the most straightforward method to divide a graphical device into rows and columns. The function requires an input matrix that assigns each cell to a figure number. Column widths and row heights can be defined using the widths and heights arguments. layout.show() can then be used to preview a multi-graph layout and see how the graphical device is being split.
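The sketch below shows one possible layout: a wide plot across the top row and two plots side by side underneath. The matrix, the relative widths and heights, and the plotted data are all made-up choices for illustration. A null PDF device is opened so that the example also runs in a non-interactive session.

```r
# Open a PDF device that writes no file, so the example runs anywhere.
pdf(NULL)

# Figure 1 spans the whole top row; figures 2 and 3 share the bottom row.
m <- matrix(c(1, 1,
              2, 3), nrow = 2, byrow = TRUE)

# widths and heights are relative sizes of the columns and rows.
n_fig <- layout(m, widths = c(2, 1), heights = c(1, 2))

# Preview how the device has been divided.
layout.show(n_fig)

# Each high-level plot call fills the next figure region in order.
plot(1:10, main = "Figure 1")
hist(c(1, 2, 2, 3, 3, 3, 4, 4, 5), main = "Figure 2")
boxplot(c(2, 4, 4, 5, 7, 9), main = "Figure 3")

invisible(dev.off())
```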
GitHub is a web-based hosting service for file archiving and version control. GitHub is built on Git, a version control tool that is installed locally. Data science projects use Git and GitHub to provide access to and control of project data, source code and narrative text files. In practice, RStudio provides a built-in interface to Git.
Version control is an essential feature of any project and the benefits are simple:
- Collaboration: Provide a central repository of files for collaboration;
R shares many programming constructs with other programming languages, but it also offers coding and memory management efficiencies, which simplify scripting and model building. Fortunately, R programming is easy to learn. This chapter on R programming is structured as follows:
RMarkdown provides an authoring system for project and data science reporting. RMarkdown is tightly integrated with the RStudio IDE. It weaves together narrative text with embedded chunks of R code. The R code serves to demonstrate the model concepts in the text. RMarkdown produces elegantly formatted document output, including publication-quality data plots and tables.
A project template provides a common content structure for data analysis projects. A template can offer several benefits:
- Define a familiar workspace
- Enable collaboration
- Ensure consistency across machines and time.
The project template is simple in nature. It has everything a good project should have to achieve repeatable research and results.
Projects in RStudio
RStudio makes it easy to create a project. In particular, each project has its own working directory, source files, workspace settings and history.
The World Wide Web presents enormous amounts of data. Unfortunately, the majority of the data is not directly available for download. In response, web scraping uses indirect means to harvest data from websites. In practice, web scraping is not exotic and is generally legal for publicly available data, although a site's terms of service and local law may restrict it. For example, web browsers rely on the Hypertext Transfer Protocol (HTTP) to fetch data and so does web scraping. The difference with web scraping is that the user retrieves, selects and extracts website content and data intended for browser display. This article shows how web scraping works and presents tools available in the R programming language for both manual and automated web scraping.
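As a minimal sketch of the select-and-extract step, the example below uses the rvest package, which is an assumption here since the excerpt does not name a specific tool. To stay self-contained it parses a small made-up HTML string instead of fetching a live page; passing a URL to read_html() would fetch the page over HTTP instead.

```r
library(rvest)

# Made-up HTML standing in for a downloaded web page.
html_text_input <- paste0(
  "<html><body><h1>Prices</h1>",
  "<table>",
  "<tr><th>item</th><th>price</th></tr>",
  "<tr><td>gas</td><td>3.10</td></tr>",
  "<tr><td>oil</td><td>71.5</td></tr>",
  "</table></body></html>"
)

# Parse the document, then select elements with CSS selectors.
page <- read_html(html_text_input)
title <- html_text2(html_element(page, "h1"))
prices <- html_table(html_element(page, "table"))

print(title)
print(prices)
```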
Many data projects involve a split-apply-combine strategy, where a big dataset is split into manageable pieces, a function is applied to operate on each piece and the results are then combined to put all the pieces back together. These are common actions that are repeated in many analysis projects.
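The three steps can be sketched in base R with split(), lapply() and do.call(rbind, ...). The sales data frame below is made up for illustration.

```r
# Made-up data: sales amounts recorded by region.
sales <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Split: one data frame per region.
pieces <- split(sales, sales$region)

# Apply: summarize each piece independently.
totals <- lapply(pieces, function(piece) {
  data.frame(region = piece$region[1], total = sum(piece$amount),
             stringsAsFactors = FALSE)
})

# Combine: stack the per-region summaries into one data frame.
result <- do.call(rbind, totals)
print(result)
```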
The core packages for tidy data transformations are listed below:
# Data format/shape
library(tibble) # simple data.frames
library(tidyr) # data cleaning/reshaping
# Data transforms
library(dplyr) # data transforming
library(forcats) # data factor mgmt
library(lubridate) # date/time objects
library(hms) # time-of-day values
library(stringr) # string mgmt
library(purrr) # functional programming tools
library(purrrlyr) # intersection of purrr and dplyr
library(magrittr) # pipe operators
The dplyr package is by far the most important of the packages in the “tidyverse” for data transformation and manipulation [1]. Verb-based functions are one of the advantages of the package. The syntax is much easier to use when compared to the cryptic syntax of base R.
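To illustrate the verb-based style, the sketch below chains filter(), group_by(), summarise() and arrange() on a made-up sales data frame. The data and column names are hypothetical.

```r
library(dplyr)

# Made-up data: sales amounts recorded by region.
sales <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Each verb does one job, and the pipe passes the result to the next verb.
summary_tbl <- sales %>%
  filter(amount > 10) %>%                            # keep rows above 10
  group_by(region) %>%                               # split by region
  summarise(total = sum(amount), .groups = "drop") %>%  # one row per region
  arrange(desc(total))                               # largest total first

print(summary_tbl)
```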
Layered plots are one way to achieve new insight and actionable intelligence when working with complex data. ggplot2 is well suited for layered plots.
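A minimal sketch of layering is shown below: a point layer and a line layer are added to the same base plot with the + operator. The fuel data frame and its values are made up for illustration.

```r
library(ggplot2)

# Made-up monthly prices for two fuel grades, in long format.
fuel <- data.frame(
  month = rep(1:6, times = 2),
  price = c(3.1, 3.2, 3.4, 3.3, 3.5, 3.6,
            3.6, 3.7, 3.9, 3.8, 4.0, 4.1),
  grade = rep(c("regular", "premium"), each = 6)
)

# The base plot maps variables to aesthetics; each geom adds one layer.
p <- ggplot(fuel, aes(x = month, y = price, colour = grade)) +
  geom_point() +   # layer 1: observed values
  geom_line()      # layer 2: lines connecting each grade

# Call print(p) to draw the plot on the active graphics device.
```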
To make graphs with ggplot(), the data must be in a data frame and in “long” (as opposed to wide) format. Converting between “wide” and “long” data formats is facilitated with the reshape2 package. Specifically, the melt() function converts wide to long format, and the dcast() function converts long to wide format. The following code block presents examples of the two data formats.
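A minimal sketch, assuming a small made-up data frame with one row per id and one column per quarter:

```r
library(reshape2)

# Wide format: one row per id, one column per measured quarter.
wide <- data.frame(
  id = c(1, 2),
  q1 = c(10, 20),
  q2 = c(30, 40)
)

# Long format: one row per id and quarter combination.
long <- melt(wide, id.vars = "id",
             variable.name = "quarter", value.name = "sales")

# Back to wide format: ids as rows, quarters as columns.
wide_again <- dcast(long, id ~ quarter, value.var = "sales")

print(long)
print(wide_again)
```

The long data frame has the shape that ggplot() expects, with the quarter column available for mapping to an aesthetic such as colour.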