Category Archives: Data
Introduction to Tidy Data
Despite the enormous amount of data available, there is surprisingly little agreement or guidance on how to create clean, consistent and easy-to-use data.
Human interaction with data and code can benefit from some simple principles that facilitate repeatable research and results. The “tidy” approach to data requires that:
- Data is structured consistently and is reusable;
- Code flow relies on simple function calls using the pipe;
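The two principles above can be sketched in a few lines. The post itself works in R; the version below is a minimal Python illustration using only the standard library. The `pipe`, `to_long` and `drop_missing` helpers and the station table are hypothetical names and data invented for this sketch, not part of any tidy toolkit.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass `value` through each function in `steps`, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def to_long(rows, id_key, value_name="value"):
    """Reshape wide rows (one column per variable) into long rows
    with one observation per row."""
    return [
        {id_key: row[id_key], "variable": key, value_name: val}
        for row in rows
        for key, val in row.items()
        if key != id_key
    ]


def drop_missing(rows, value_name="value"):
    """Remove observations whose value is missing."""
    return [row for row in rows if row[value_name] is not None]


# Hypothetical wide table: one row per station, one column per variable.
wide = [
    {"station": "A", "wind": 5.2, "solar": 610.0},
    {"station": "B", "wind": None, "solar": 580.0},
]

# Each step is a simple function call; the pipe makes the flow explicit.
tidy = pipe(
    wide,
    lambda rows: to_long(rows, id_key="station"),
    drop_missing,
)

for row in tidy:
    print(row)
```

Each row of `tidy` holds one observation, so later steps such as filtering, grouping or plotting can rely on the same structure every time.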
The World Wide Web presents enormous amounts of data. Unfortunately, most of that data is not directly available for download. In response, web scraping uses indirect means to harvest data from websites. In practice, web scraping is not unusual and is generally legal. For example, web browsers rely on the Hypertext Transfer Protocol (HTTP) to fetch data, and so does web scraping. The difference is that, with web scraping, the user retrieves, selects and extracts website content and data intended for browser display. This article shows how web scraping works and presents tools available in the R programming language for both manual and automated web scraping.
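The "select and extract" step can be sketched as follows. The article uses R; this is a minimal Python illustration built on the standard-library `html.parser` module. To keep it runnable offline, the page is an inline string with made-up values; in practice the HTML would first be fetched over HTTP (for example with `urllib.request.urlopen`). The `TableExtractor` class is a hypothetical helper written for this sketch.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content, standing in for a fetched HTTP response body.
page = """
<html><body>
<h1>Prices</h1>
<table>
<tr><th>Fuel</th><th>Price</th></tr>
<tr><td>Crude</td><td>52.10</td></tr>
<tr><td>Gas</td><td>2.95</td></tr>
</table>
</body></html>
"""

parser = TableExtractor()
parser.feed(page)
for row in parser.rows:
    print(row)
```

The browser would render the same markup for display; the scraper instead keeps only the table cells and discards the rest of the page.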
Introduction to Satellite Observation Networks
Satellite observation networks provide invaluable data on the climate and the layered atmosphere. Space satellite data is a key input to assess the feasibility and operational integrity of renewable energy power systems.
Ground station sensors for weather and climate observation are listed below. The list is limited to station networks that provide verification of wind and solar resource data.
Energy Content Explained
The energy content of any organic fuel is defined as the fuel’s primary energy. Primary energy is measured by the fuel’s calorific value, that is, the heat generated by the complete combustion of one unit of fuel under well-defined conditions. The calorific value can be a gross or net number, depending on whether the heat released includes the latent heat recovered when the water vapor formed during combustion condenses. Power production efficiency is typically calculated using the Net Calorific Value (NCV), which excludes the latent heat of vaporization of that water.
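The relationship between the gross and net values can be sketched numerically. The Python snippet below is an illustration under stated assumptions, not the post's own method: it subtracts the latent heat of the water formed from the fuel's hydrogen (plus any moisture in the fuel) from the gross calorific value. The constants are approximate textbook values at 25 °C, and the methane inputs are rounded reference figures used only as an example.

```python
# Approximate latent heat of vaporization of water at 25 degrees C (assumption).
LATENT_HEAT_WATER_MJ_PER_KG = 2.442
# Mass of water formed per kg of hydrogen burned (ratio of molar masses H2O / H2).
WATER_PER_KG_HYDROGEN = 8.936


def net_calorific_value(gcv_mj_per_kg, hydrogen_fraction, moisture_fraction=0.0):
    """Estimate NCV (MJ/kg) from GCV by removing the latent heat of the
    water vapor produced by combustion and carried as fuel moisture."""
    water_kg_per_kg_fuel = WATER_PER_KG_HYDROGEN * hydrogen_fraction + moisture_fraction
    return gcv_mj_per_kg - LATENT_HEAT_WATER_MJ_PER_KG * water_kg_per_kg_fuel


def efficiency_ncv(energy_out_mj, fuel_mass_kg, ncv_mj_per_kg):
    """Conversion efficiency on a net calorific value basis."""
    return energy_out_mj / (fuel_mass_kg * ncv_mj_per_kg)


# Example: methane, GCV about 55.5 MJ/kg, hydrogen mass fraction about 0.2513.
gcv_methane = 55.5
ncv_methane = net_calorific_value(gcv_methane, hydrogen_fraction=0.2513)
print(round(ncv_methane, 1))
```

Because the NCV is smaller than the GCV, the same plant output gives a higher efficiency figure on an NCV basis, so the basis should always be stated alongside the number.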
Crude oil prices by delivery period define the term structure of the market. The term structure changes shape over time given shifts in price level and slope. Term structure behavior becomes clear by combining discrete futures contracts with similar maturities into a continuous time series. R code is supplied to create continuous prices by delivery period. The purpose is to show term structure behavior and to derive risk and profitability measures for oil production, marketing and trading strategies. The resulting data is tidy and well suited for model training and out-of-sample testing.
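One simple way to build such a series is to rank, on each trading day, the contracts that have not yet expired and take the one at a fixed position (1 for the front contract, 2 for the next, and so on). The post supplies R code; the Python sketch below is an independent illustration of that ranking idea with no roll adjustment. The `continuous_series` function and all quotes are hypothetical.

```python
from datetime import date


def continuous_series(quotes, position):
    """Build a continuous price series for the contract at `position`
    (1 = nearest to expiry) on each trading day.

    quotes: iterable of (trade_date, expiry_date, price) tuples.
    Returns a dict mapping trade_date to price.
    """
    by_date = {}
    for trade_date, expiry, price in quotes:
        if expiry >= trade_date:  # ignore contracts that have already expired
            by_date.setdefault(trade_date, []).append((expiry, price))

    series = {}
    for trade_date, contracts in sorted(by_date.items()):
        contracts.sort()  # nearest expiry first
        if len(contracts) >= position:
            series[trade_date] = contracts[position - 1][1]
    return series


# Hypothetical quotes: (trade date, contract expiry, settlement price).
quotes = [
    (date(2020, 1, 15), date(2020, 1, 21), 58.0),
    (date(2020, 1, 15), date(2020, 2, 20), 58.2),
    (date(2020, 1, 15), date(2020, 3, 20), 58.3),
    (date(2020, 1, 22), date(2020, 1, 21), 58.9),  # stale quote, already expired
    (date(2020, 1, 22), date(2020, 2, 20), 56.7),
    (date(2020, 1, 22), date(2020, 3, 20), 56.9),
]

front = continuous_series(quotes, position=1)
second = continuous_series(quotes, position=2)

# Slope of the term structure between the first two delivery periods.
slope = {day: round(second[day] - front[day], 2) for day in front}
print(slope)
```

On the second trading day the expired contract drops out and the next delivery month becomes the front of the curve, which is what keeps each series continuous through time.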
An animation showing the term structure of NYMEX crude oil. For source code, go here.
A new method to extract data tables from PDF files is introduced. The solution combines the R programming language with the open-source Java program Tabula. The result is a convenient method that transforms documents into databases.
The ability to train a machine to extract data tables from PDF files has several benefits:
A common task in spatial data analysis is extracting SpatialPoints inside a set of polygons or buffer zones. Analysts can use standard GIS or map tools to extract a set of points within an area of interest using manual “point-and-click” routines. This method is easy but will probably prove impractical, especially in cases involving big data. The alternative is to train a machine to automatically extract the points in a polygon or buffer zone. This post achieves that task and presents a case study with R code.
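The core test behind automated extraction can be sketched with the classic ray-casting rule: a point is inside a polygon if a horizontal ray from the point crosses the boundary an odd number of times. The post uses R; the Python version below is a generic illustration in planar coordinates, not the post's implementation, and the square and points are made-up data.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside `polygon`, a list of (x, y)
    vertices in order. Uses ray casting in planar coordinates."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical area of interest and candidate points.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(1, 1), (5, 1), (2, 3), (-1, 2)]

selected = [p for p in points if point_in_polygon(p[0], p[1], square)]
print(selected)
```

Points lying exactly on an edge are a boundary case that this sketch does not treat specially, and geographic data would normally be projected before a planar test like this is applied.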
Aerosol Optical Depth (AOD) defines the degree to which aerosols prevent the transmission of sunlight by absorption or scattering. AOD is measured using an integrated extinction coefficient over a vertical column of air. The extinction coefficient can be used to analyze solar extinction and the performance of solar power systems as a function of location and time.
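The definition above can be turned into a short calculation. The Python sketch below integrates an extinction coefficient profile over height with the trapezoidal rule to obtain AOD, then applies the Beer-Lambert law to estimate the fraction of the direct beam that gets through. The profile values are invented for illustration, and the `1 / cos(zenith)` path-length factor is a plane-parallel approximation that is an assumption of this sketch and degrades near the horizon.

```python
import math


def aerosol_optical_depth(heights_km, extinction_per_km):
    """Integrate the extinction coefficient (1/km) over a vertical
    column (km) using the trapezoidal rule. The result is unitless."""
    aod = 0.0
    for i in range(1, len(heights_km)):
        dz = heights_km[i] - heights_km[i - 1]
        aod += 0.5 * (extinction_per_km[i] + extinction_per_km[i - 1]) * dz
    return aod


def direct_transmittance(aod, zenith_deg=0.0):
    """Fraction of the direct solar beam transmitted through the aerosol
    layer (Beer-Lambert law, plane-parallel atmosphere)."""
    return math.exp(-aod / math.cos(math.radians(zenith_deg)))


# Hypothetical profile: extinction decreasing with height.
heights_km = [0.0, 1.0, 2.0, 3.0]
extinction_per_km = [0.20, 0.10, 0.05, 0.00]

aod = aerosol_optical_depth(heights_km, extinction_per_km)
print(round(aod, 3))
print(round(direct_transmittance(aod, zenith_deg=0.0), 3))
print(round(direct_transmittance(aod, zenith_deg=60.0), 3))
```

A lower sun lengthens the path through the aerosol layer, so the same AOD removes more of the direct beam; this is the link between AOD, location, time of day and solar power output.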
- Simple trigonometry is defined to assess the resolution of the satellite coverage area;
- A land surface analysis is conducted to visualize the geographic coordinates of the satellite pixels across the State of Qatar;
In practice, accessing binary data in R is impossible without a vendor- or format-specific “can opener” and a properly configured scientific programming environment. As a result, many business applications bypass binary data altogether or rely instead on secondary sources and summary statistics, with no ability to validate data integrity and accuracy.
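Why a format-specific decoder is needed can be shown with a toy example. A binary file is only a run of bytes; the layout (byte order, field widths, where the data starts) has to be known in advance. The Python sketch below uses the standard-library `struct` module with an entirely hypothetical layout (a `GRD1` marker, row and column counts, then 32-bit floats); it does not correspond to any real satellite or vendor format.

```python
import struct

# Hypothetical layout: 4-byte marker, rows and cols as unsigned 16-bit
# integers, then rows * cols little-endian 32-bit floats.
HEADER = struct.Struct("<4sHH")
MAGIC = b"GRD1"


def encode_grid(rows, cols, values):
    """Pack a grid of values into the hypothetical binary layout."""
    return HEADER.pack(MAGIC, rows, cols) + struct.pack(f"<{rows * cols}f", *values)


def decode_grid(blob):
    """Unpack the hypothetical layout back into a list of rows."""
    magic, rows, cols = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("unknown format: no decoder for this file")
    values = struct.unpack_from(f"<{rows * cols}f", blob, HEADER.size)
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]


blob = encode_grid(2, 3, [0.5, 1.25, 2.0, 3.75, 4.5, 8.0])
grid = decode_grid(blob)
print(len(blob), "bytes")
print(grid)
```

Without the layout, the same bytes are unreadable, which is why summary statistics from secondary sources are often used instead even though they cannot be checked against the underlying data.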