In practice, accessing binary data in R is difficult without a vendor- or format-specific “can opener” and a properly configured scientific programming environment. As a result, many business applications bypass binary data altogether or rely instead on secondary sources and summary statistics, with no ability to validate data integrity and accuracy.
This article describes ways to access and load binary data in R for scientific, engineering and business applications. Three common binary formats are discussed: GRIB, netCDF and HDF5. By way of example, the chart below confirms the benefit of integrating binary data objects from multiple sources and formats.
GRIB (GRIdded Binary) is a binary data format originally promoted by the World Meteorological Organization for exchanging spatially gridded data objects. GRIB Edition 2 is an extension of GRIB, with a much higher degree of flexibility and expandability. Both formats have been widely used to package spatial wind, solar, land use, and environmental data.
The intrinsic value of the GRIB data format is that it provides an efficient vehicle for transmitting large volumes of gridded data from automated data centers over high-speed telecommunication lines directly to end users. GRIB and GRIB2 also serve as data storage formats, generating the same efficiencies in local data management.
For a definition of the GRIB2 format see the WMO GRIB2 description on the WMO codes page.
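As a small illustration of the format itself, the sketch below checks whether a file starts with a GRIB message and reports its edition. It relies on the layout of the indicator section, where octets 1 to 4 spell "GRIB" and octet 8 holds the edition number in both editions. The function name grib_edition and the demonstration file are hypothetical, and the sketch assumes the first message starts at the first byte of the file, which is not guaranteed for every data provider.

```r
# Return the GRIB edition number of the first message in a file,
# or NA if the file does not begin with the "GRIB" indicator.
# Assumption: the message starts at byte 1 (no leading transmission header).
grib_edition <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 8)
  if (length(header) < 8 || !identical(header[1:4], charToRaw("GRIB"))) {
    return(NA_integer_)
  }
  as.integer(header[8])  # octet 8 holds the edition number
}

# Demonstration on a synthetic 8-byte header (not real GRIB data)
demo_file <- tempfile(fileext = ".grb")
writeBin(c(charToRaw("GRIB"), as.raw(c(0, 0, 0, 2))), demo_file)
grib_edition(demo_file)
```

A check like this can be useful before handing a file to a heavier decoder, since it fails fast on files that are not GRIB at all.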
GRIB vs. netCDF
GRIB was first introduced in the mid-1980s and updated to GRIB2 in the early 2000s. As a result, a significant history of earth science data is stored in this format. While the format enjoys widespread use under the WMO standard, it is old relative to innovations in data technology and data science.
The University Corporation for Atmospheric Research (UCAR) has since pioneered an alternative binary format for spatially gridded data sets. The netCDF format is now widely preferred and is supported by UCAR's Unidata program. Its popularity can be traced to ease of use, expanded metadata control, more flexible data structures, and improved compression technology. NetCDF also enjoys increasing support among equipment vendors and is easily accessed from the R programming language.
The netCDF format is described in more detail here.
Reading GRIB Data Directly into R
There are many “can openers” to be found online for GRIB data. In R, the most direct route is to use the rgdal package, which is a core package for spatial data analysis. rgdal provides a decoder that reads GRIB data directly into R. The following example imports GRIB data into R, handles missing data, and generates a spatial data image:
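library(rgdal)  # provides readGDAL(); needs GDAL and PROJ.4 installed outside R
my.grib <- readGDAL("middle.east.solar.grb")
is.na(my.grib[["GHI"]]) <- my.grib[["GHI"]] > 100  # treat values above 100 as missing
image(my.grib, attr = "GHI")  # plot call assumed; the original plotting line is not shown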
This approach appears simple: one line to read the data, one line to clean the data, and one line to generate a plot. Goal achieved, but something is missing: rgdal does not read or expose the file's metadata, so other software is required for that. Metadata includes units of measure by variable, long-form text data descriptions, and data structure summaries. The absence of metadata is a deal breaker, since metadata is an efficient way to interrogate and manage large data objects. Finally, for the rgdal package to work in R, the GDAL/PROJ.4 framework must be installed external to R. This is not a huge issue, but it is a challenge for people new to spatial data analysis. Instructions for downloading and installing the open-source GDAL/PROJ.4 framework can be found here.
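The data-cleaning line in the example uses the replacement form of is.na(), which can look odd at first sight. The toy sketch below shows the same idiom on a plain numeric vector. The values and the fill value 9999 are made up for illustration and are not taken from the GRIB file.

```r
# Toy irradiance values; 9999 stands in for a hypothetical fill value
ghi <- c(5.2, 6.1, 9999, 4.8, 9999)

# Replacement form of is.na(): every element above the threshold becomes NA
is.na(ghi) <- ghi > 100

ghi                        # the two fill values are now NA
mean(ghi, na.rm = TRUE)    # summary statistics can then skip them
```

The threshold of 100 mirrors the example above; in practice the cut-off would depend on the units and fill value documented for the data set.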
Converting GRIB to netCDF
A better approach is to install the NCAR Command Language (NCL) and the netCDF framework external to R. In combination, these tools support conversion of GRIB to the preferred binary data format, netCDF. Installation is relatively easy since no source code compiling is required for either framework. A simple command line instruction will then convert GRIB to netCDF as follows:
ncl_convert2nc middle.east.solar.grb
This creates a file called middle.east.solar.nc, along with the appropriate attributes, metadata, and spatial coordinates. The command line instruction can also be run from R directly using the system() function. In the example below, GRIB conversion is followed by the import commands from the ncdf package:
> library(ncdf)
> system("ncl_convert2nc middle.east.solar.grb", intern = TRUE)
> my.nc <- open.ncdf("middle.east.solar.nc")
> print(my.nc)
The final print() command does not display the data itself, but instead its internal structure and metadata. The result: direct access to binary data and improved data ownership.
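The ncdf package has since been archived on CRAN in favor of ncdf4. As a hedged sketch rather than the author's method, the same open-and-inspect step might look like this with ncdf4. To stay self-contained it first writes a small synthetic file; the variable name GHI, its units, and the grid values are illustrative assumptions, not data from the converted GRIB file.

```r
library(ncdf4)  # successor to the ncdf package

# Build a small synthetic netCDF file so the example runs on its own.
# Dimension values, variable name, and units are illustrative assumptions.
lon <- ncdim_def("lon", "degrees_east", c(35, 36, 37))
lat <- ncdim_def("lat", "degrees_north", c(30, 31))
ghi <- ncvar_def("GHI", "kWh/m2/day", list(lon, lat), missval = -9999,
                 longname = "Global horizontal irradiance", prec = "double")
nc_file <- tempfile(fileext = ".nc")
nc_out <- nc_create(nc_file, ghi)
ncvar_put(nc_out, ghi, matrix(c(5.1, 5.3, 5.6, 6.0, 6.2, 6.4), nrow = 3))
nc_close(nc_out)

# Read it back: print() shows structure and metadata, not the data values
my.nc <- nc_open(nc_file)
print(my.nc)
ghi_units  <- ncatt_get(my.nc, "GHI", "units")$value  # metadata for one variable
ghi_values <- ncvar_get(my.nc, "GHI")                 # the data as a 3 x 2 matrix
nc_close(my.nc)
```

With a real converted file, the nc_open() call would simply point at middle.east.solar.nc, and the variable names would be whatever the conversion produced.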
The Hierarchical Data Format (HDF, HDF4, or HDF5) is another binary file format designed to store and organize large amounts of numerical data. HDF5 simplifies the file structure to include only two major types of object:
- Datasets, which are multidimensional arrays of a homogeneous type, and
- Groups, which are container structures that hold datasets and other groups, along with associated metadata.
The latest version of netCDF, version 4, is based on HDF5. Reading HDF5 files into R follows the same process used for netCDF files. Alternatively, the import can rely on the rgdal package and the readGDAL() function, as shown previously.
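To make the group and dataset structure concrete, here is a minimal sketch using the rhdf5 package from Bioconductor. This package is an assumption of the sketch and is not one of the routes named above. The file, the group name solar, and the dataset name ghi are all hypothetical.

```r
library(rhdf5)  # Bioconductor package; an assumption, not named in the article

# Create a small HDF5 file with one group holding one dataset
h5_file <- tempfile(fileext = ".h5")
h5createFile(h5_file)
h5createGroup(h5_file, "solar")                        # a group (container)
h5write(matrix(1:6, nrow = 2), h5_file, "solar/ghi")   # a dataset inside it

# List the file's structure, then read the dataset back into R
contents <- h5ls(h5_file)   # one row per group or dataset
print(contents)
ghi <- h5read(h5_file, "solar/ghi")
h5closeAll()
```

The listing returned by h5ls() plays a role similar to the print() of a netCDF object: it describes the file's layout without loading the data.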