Web Scraping in R

The world-wide web holds enormous amounts of data. Unfortunately, most of that data is not directly available for download. In response, web scraping uses indirect means to harvest data from websites. In practice, web scraping is neither exotic nor, as a rule, illegal: web browsers rely on the Hypertext Transfer Protocol (HTTP) to fetch data, and so does web scraping. The difference is that with web scraping the user retrieves, selects and extracts website content and data that were intended for browser display. This article shows how web scraping works and presents tools available in the R programming language for both manual and automated web scraping.

What is Web Scraping?

Web scraping involves fetching a web page and extracting data from it. Specifically, a request message is sent to fetch a web page addressed by a Uniform Resource Locator (URL). A basic URL structure appears below:
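As a sketch, consider a hypothetical URL with each component written out explicitly; the httr package's parse_url() function decomposes it into those parts (the URL itself is invented for illustration):

    library(httr)

    # A hypothetical URL with every component spelled out explicitly
    url <- "http://www.example.com:80/data/prices?region=us&product=crude"

    parts <- parse_url(url)
    parts$scheme     # "http"
    parts$hostname   # "www.example.com"
    parts$port       # "80"
    parts$path       # "data/prices"
    parts$query      # list(region = "us", product = "crude")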

The request components include the protocol, typically http or, for secure communication, https. The host is the target website's domain name. The default port is 80, but a port can be set explicitly, as shown above. The resource path specifies the server path to the data, and the query string carries the parameters that refine the request.

A web page request is only the first piece of web scraping. Next, the server response delivers a status code along with the web page content. That content must then be searched, be it manually or automatically; hence, web page crawling is a key feature of web scraping. Finally, the web page content is parsed, extracted and reformatted.
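A minimal sketch of that request/response cycle using the httr package (the URL is only a placeholder):

    library(httr)

    # Issue an HTTP GET request for a placeholder web page
    resp <- GET("http://www.example.com/")

    status_code(resp)           # e.g. 200 when the request succeeds
    content(resp, as = "text")  # the raw HTML delivered by the server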

Web scraping, for example, might first retrieve a web page and then extract contact names and phone numbers. In another instance, the focus might be to grab a research data table or a collection of tables. Yet another effort might seek to extract intelligence from text strings found in multiple news or social media reports.

Web Page Content – The Basics

The response to a URL request is the web page, delivered as an HTML document. HTML stands for HyperText Markup Language and defines the content and structure of a webpage.

“Hyper Text” in HTML refers to hyperlinks that connect webpages to one another, either within a single website or between websites, to populate page content. Meanwhile, “markup” refers to special page elements such as <head>, <title>, <section>, <body>, <div>, <table>, <p>, <img> and many others; CSS selectors are the patterns used to pick those elements out. Web crawling and scraping involve finding these elements and extracting content from them as needed.
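To make the element idea concrete, here is a small sketch: rvest (via xml2) can parse an inline HTML fragment and pull out individual elements with CSS selectors. The fragment below is invented for illustration:

    library(rvest)

    # A tiny, made-up HTML document
    html <- read_html('<html><body>
      <div id="prices">
        <p class="product">Brent Crude</p>
        <p class="price">84.12</p>
      </div>
    </body></html>')

    html %>% html_nodes("p.price") %>% html_text()    # "84.12"
    html %>% html_nodes("#prices p") %>% html_text()  # both paragraph texts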

Other language technologies besides HTML also populate a web page. These include CSS, which describes a webpage’s presentation and appearance, and JavaScript, which provides web page functionality. An analyst inspects all of this web page content using a web browser tool to find which parts will be scraped.

Web Page Inspection

Inspect web page content using web browser tools to find which parts will be scraped. For example, the following web site image (https://oilprice.com/oil-price-charts) shows petroleum product prices from around the world. The lower half of the image shows the web site’s HTML content in the web browser inspector:

In Firefox, the inspector is launched by right-clicking the data table and choosing “Inspect Element.” Other web browsers offer similar options, such as “View Page Source.” The next image shows the page source for the first table:

If you don’t know HTML, this may look daunting! But look closely: all the table text and data are there to be grabbed. Another essential tool for web page inspection is SelectorGadget, an open-source tool that makes it easy to identify the CSS selectors for page content. Just install the browser extension as instructed. The tool reduces inspection of complicated web sites to a few point-and-click actions and makes it a breeze to locate content to scrape.

Available R Packages

Several R packages are available for web scraping:

Package Name    Retrieve?   Parse?    Must Know?
rvest           YES         YES       YES
RCurl           YES         NO        YES
XML             Limited     YES       YES
rjson           NO          YES       YES
RJSONIO         NO          YES       Optional
httr            YES         YES       Optional
xml2            NO          YES       Optional
selectr         NO          YES       Optional
rvest is a set of wrapper functions around the xml2 and httr packages. The rest of the tutorial focuses on rvest exclusively. The main functions are listed below, followed by a brief sketch of how they fit together:

  • read_html():  read a webpage into R as XML (document and nodes)
  • html_nodes(): extract pieces out of HTML documents using XPath and/or CSS selectors
  • html_attr(): extract attributes from HTML, such as href
  • html_text(): extract text content
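
As a quick illustration (using a placeholder URL), these functions combine to collect every link on a page:

    library(rvest)

    page <- read_html("http://www.example.com/")  # placeholder URL

    links <- page %>% html_nodes("a")  # every <a> element on the page
    html_text(links)                   # the visible link text
    html_attr(links, "href")           # the href attribute, i.e. the link target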

Web Scraping Examples

The first step is to load the rvest package and to read the target web page:
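A minimal sketch of this step, assuming the oil-price page shown earlier as the target:

    library(rvest)

    # Capture the entire HTML content of the target web page
    url  <- "https://oilprice.com/oil-price-charts"
    page <- read_html(url)

    page  # printing the object echoes the top-level HTML nodes of the page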

Printing the result echoes the first HTML lines of the target web page, confirming that rvest has captured the webpage’s entire HTML content. Next, the webpage nodes are searched to extract the data table content: the product names and prices. The rvest command html_nodes() is told to seek specific nodes defined by the CSS selectors passed in the function calls. The selector strings were identified with SelectorGadget by clicking on a price or name element on the web page. The remaining command, html_text(), converts node content to text, and basic R syntax then reshapes the data for use in R:
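A sketch of that extraction; the selector strings here (.product_name, .last_price) are illustrative stand-ins for whatever SelectorGadget reports on the live page:

    # The selector strings below are placeholders; substitute the selectors
    # SelectorGadget reports for the live page
    products <- page %>% html_nodes(".product_name") %>% html_text()
    prices   <- page %>% html_nodes(".last_price")   %>% html_text()

    # Basic R reshaping: trim whitespace, coerce prices to numeric,
    # and bind the vectors into a data frame
    oil <- data.frame(product = trimws(products),
                      price   = as.numeric(trimws(prices)),
                      stringsAsFactors = FALSE)
    head(oil)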

The next example uses the pipe operator to link node parsing, data extraction and formatting. In this case, the extraction targets the 13th data table returned by the html_table() function. The table scraping extracts the spot price and price change data for US crude oil blends:
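A sketch of that pipeline, reusing the page object from above; the table index (13) matches the description here, though the table's position on the live page may change:

    us_crude <- page %>%
      html_nodes("table") %>%  # every <table> node on the page
      .[[13]] %>%              # the 13th table: US crude oil blends
      html_table()             # parse the HTML table into a data frame

    head(us_crude)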