Category Archives: R Data Import
Quantitative analysis depends on the ability to load and manage many different types of data and file formats. There are many R data import functions. Some functions ship with base R and others can be found in R packages.
Data Available in R
R ships with many data sets in the datasets package, which is included in the base distribution of R. The package is attached automatically when R starts, so its data sets are available by name without any further setup. A list of all data sets in the package is obtained using the following command:
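The command itself does not appear in this excerpt. A likely candidate is base R's data() function: typing data(package = "datasets") at the console opens the listing, and the same call can be captured as a matrix, as in this short sketch (the variable name ds is only illustrative):

```r
# Capture the listing that data(package = "datasets") would display.
# The result has one row per data set, with "Item" and "Title" columns.
ds <- data(package = "datasets")$results

# Show the first few data set names and descriptions.
head(ds[, c("Item", "Title")])

# Any listed data set can then be used directly by name.
str(iris)
```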
A new method to extract data tables from PDF files is introduced. The solution combines the R programming language with the open-source Java program Tabula. The result is a convenient method that transforms documents into databases.
The ability to train a machine to extract data tables from PDF files has several benefits:
Aerosol Optical Depth (AOD) defines the degree to which aerosols prevent the transmission of sunlight by absorption or scattering. AOD is measured using an integrated extinction coefficient over a vertical column of air. The extinction coefficient can be used to analyze solar extinction and the performance of solar power systems as a function of location and time.
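As a rough illustration of the column integral described above, the sketch below integrates a made-up extinction profile numerically. The exponential profile, the surface value beta0, and the scale height H are assumptions chosen for the example, not values from the post.

```r
# Hypothetical extinction profile: exponential decay with height.
beta0 <- 0.1                      # surface extinction coefficient, 1/km (assumed)
H     <- 2                        # aerosol scale height, km (assumed)
z     <- seq(0, 50, by = 0.01)    # height grid, km
beta  <- beta0 * exp(-z / H)      # extinction coefficient at each height

# AOD: trapezoidal integration of the extinction coefficient over the column.
aod <- sum(diff(z) * (head(beta, -1) + tail(beta, -1)) / 2)

# Fraction of the direct beam transmitted at vertical incidence (Beer-Lambert law).
transmittance <- exp(-aod)

aod             # close to the analytic value beta0 * H = 0.2
transmittance
```

For this profile the integral has the closed form beta0 * H, which gives a quick check on the numerical result.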
The maptools package has a pruneMap() function to crop map objects in R. In practice, the function extracts data from SpatialPolygon or SpatialLine objects given a bounding box or specific area of interest. Unfortunately, there is no equivalent function for large, high-resolution raster images, which are common in many Earth Science applications. The following post defines a custom function to crop raster images in R and to extract data from SpatialGridDataFrames. The function is tested using a raster image from the Shuttle Radar Topography Mission (SRTM; shown at left). The resulting data is then mapped using the image() function in R.
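The post's own function is not included in this excerpt. The general idea of cropping a grid to a bounding box can be sketched in base R as follows; crop_grid() is a hypothetical helper written for this illustration, and the synthetic surface stands in for a real SRTM tile.

```r
# Hypothetical helper: crop a gridded raster to a bounding box.
# x, y : cell-centre coordinates in ascending order
# z    : matrix with dim c(length(x), length(y)), the layout image() expects
crop_grid <- function(x, y, z, xlim, ylim) {
  stopifnot(nrow(z) == length(x), ncol(z) == length(y))
  keep_x <- x >= xlim[1] & x <= xlim[2]
  keep_y <- y >= ylim[1] & y <= ylim[2]
  list(x = x[keep_x], y = y[keep_y], z = z[keep_x, keep_y, drop = FALSE])
}

# Synthetic "elevation" surface used in place of real raster data.
x <- seq(-110, -100, by = 0.1)
y <- seq(30, 40, by = 0.1)
z <- outer(x, y, function(lon, lat) sin(lon / 2) + cos(lat / 2))

# Crop to an area of interest and map the result.
sub <- crop_grid(x, y, z, xlim = c(-106, -104), ylim = c(33, 36))
dim(sub$z)
image(sub$x, sub$y, sub$z, xlab = "Longitude", ylab = "Latitude")
```

Logical indexing keeps the sketch short; for very large rasters it may be preferable to compute index ranges first and read only the required block from disk.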
The standard way to read text files into R is to use the read.table() command. However, many users struggle with time delays when loading large data sets. An alternative command that offers significant speed improvements is fread(), or fast read, which can be found in the data.table package. The following code loads a tab-delimited file with a million elements and reveals that fread() reduces load time by almost 99%, as confirmed by the benchmark performance stats at left. The function is still under development, but it is available for download and doesn’t suffer from stability issues. Expect, however, that its argument structure and command syntax may change over time.
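The original benchmark code is not part of this excerpt. A comparable timing test might look like the sketch below, which assumes the data.table package is installed and uses a smaller, randomly generated file; the file contents and size are made up, and the timings will differ from the figure quoted above.

```r
# Write a tab-delimited test file to a temporary location (synthetic data).
n   <- 1e5
tmp <- tempfile(fileext = ".txt")
df  <- data.frame(id = seq_len(n), value = rnorm(n))
write.table(df, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Baseline: read.table() from base R.
t_base <- system.time(
  base_df <- read.table(tmp, header = TRUE, sep = "\t")
)
cat("read.table elapsed (s):", t_base[["elapsed"]], "\n")

# fread() from data.table, run only if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  t_fread <- system.time(
    fast_dt <- data.table::fread(tmp, sep = "\t")
  )
  cat("fread elapsed (s):     ", t_fread[["elapsed"]], "\n")
}

unlink(tmp)
```

Note that fread() returns a data.table rather than a plain data frame, which matters if downstream code relies on data frame indexing semantics.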
In practice, binary data cannot be accessed in R without a vendor- or format-specific “can opener” and a properly configured scientific programming environment. As a result, many business applications bypass binary data altogether or rely instead on secondary sources and summary statistics, with no ability to validate data integrity and accuracy.
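To show why the format has to be known in advance, the sketch below writes and then reads a tiny binary file with base R's writeBin() and readBin(). The layout (a 4-byte record count followed by 8-byte doubles, little-endian) is invented for this example and does not correspond to any vendor format.

```r
# Write a small binary file with a known, made-up layout:
# one 4-byte integer record count, then that many 8-byte doubles.
tmp    <- tempfile(fileext = ".bin")
values <- c(1.5, 2.25, 3.75)
con <- file(tmp, open = "wb")
writeBin(length(values), con, size = 4, endian = "little")
writeBin(values, con, size = 8, endian = "little")
close(con)

# Read it back. Types, sizes, order, and byte order must all match the
# layout used when writing, or the values come back as garbage.
con <- file(tmp, open = "rb")
n <- readBin(con, what = "integer", n = 1, size = 4, endian = "little")
x <- readBin(con, what = "double",  n = n, size = 8, endian = "little")
close(con)
unlink(tmp)

x
```

Without the layout description, the same bytes could be interpreted in many different ways, which is the role the format-specific reader plays.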