Wednesday, 25 June 2014

Web page scraper design using blueprints

This post is available as a presentation here

Services and apps that aggregate information from multiple sites often get their data by scraping web pages. For example, an application that provides third-party retail product information gets the data from the retailers' websites. Crawler programs visit the web pages, and scraper programs extract the required information from each page.

Scraping takes advantage of how a site structures its web pages. Sites structure pages using HTML and control the look and feel with Cascading Style Sheets (CSS). Every site has a particular structure or template used by its web developers. For example, the news website CNN has its own structure for presenting information, and its HTML page structure is different from the BBC's. Retail websites are no different. To get to the required information, a scraper program can follow the HTML or CSS elements of interest one by one until it reaches the data. For example, the following snapshot shows the product name in an element within an HTML web page.

This works until a website changes the structure of its pages. Scrapers therefore need to be designed to adapt quickly to changes in the structure of the target pages. Pages populated lazily with JavaScript pose another challenge, but this one can be addressed with tools like Selenium. Adapting to a page change should be simple: once the elements containing the target data have been identified again, the scraper should go after those elements.

Scrapers can be designed to take metadata describing the information that needs to be extracted. We can call this metadata the blueprint: the scraper works on a particular set of pages based on it. For each piece of data, the metadata contains the list of elements to be scraped, i.e. a data piece and its list of elements. The blueprint for all the information to be obtained from a page is then a list of data pieces and their corresponding elements. Another program can generate this blueprint, and once tested it can be given to the scraper. A scraper library in Java which uses this blueprint approach is shown below. It also includes a small API to generate the blueprint and to use the scraper.

The core idea in this scraper design is devolution of responsibilities. The scraper library is a program that takes a web page and a blueprint. It scrapes the page based on the blueprint and returns the information. It should not have to know the type or nature of the information scraped; that is, it should scrape an abstract object. It is the job of the program requesting the information to give the scraped information a concrete identity.
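The split of responsibilities described above can be sketched in a few lines of Java. This is only an illustration, not the library's actual code: `BlueprintScraper` is a hypothetical name, the blueprint is assumed to be a map from attribute name to an ordered list of selectors, and the HTML engine is abstracted as a function from (fragment, selector) to a narrower fragment so that the sketch does not depend on any particular parser.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch: the scraper knows only attribute names and selectors,
// never the domain type (for example RetailItem) that the caller builds later.
final class BlueprintScraper {
    // Stand-in for a real HTML engine (an assumption of this sketch):
    // (fragment, selector) -> narrowed fragment, or null if nothing matches.
    private final BiFunction<String, String, String> extractor;

    BlueprintScraper(BiFunction<String, String, String> extractor) {
        this.extractor = extractor;
    }

    // Returns attribute name -> scraped value. For each attribute, the
    // selectors are followed in order, each one narrowing the previous result.
    Map<String, String> scrape(String pageHtml, Map<String, List<String>> blueprint) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : blueprint.entrySet()) {
            String fragment = pageHtml;
            for (String selector : entry.getValue()) {
                fragment = extractor.apply(fragment, selector);
                if (fragment == null) {
                    break; // the path is broken; skip this attribute
                }
            }
            if (fragment != null) {
                result.put(entry.getKey(), fragment);
            }
        }
        return result;
    }
}
```

The scraper returns plain names and values; turning them into a typed object is left entirely to the caller.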

For example, when a retail web page needs to be scraped, the scraper need not know that the result is, say, a Java object of type RetailItem. It only needs to know that the scraped object has an attribute "name" and an attribute "price", plus their values, i.e. the actual name and price (HydroFlask and $50). Nothing more. A scraper like this builds key-value pairs to represent the scraped object. In programming terms, the object is a collection of attributes held in a map.
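As a rough sketch of that idea (the class and method names here are assumptions, not the library's real ones), the scraped object can be a thin wrapper around a map, and the caller decides what it really is. `RetailItem` is the type named above; its fields and the `from` factory are invented for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// The scraper's output: attribute names and values, with no domain type.
final class ScrapedObject {
    private final Map<String, String> attributes = new LinkedHashMap<>();

    void put(String name, String value) {
        attributes.put(name, value);
    }

    String get(String name) {
        return attributes.get(name);
    }

    // Read-only view of all scraped attributes.
    Map<String, String> asMap() {
        return Collections.unmodifiableMap(attributes);
    }
}

// Caller-side type (hypothetical fields): identity is assigned here,
// by the program that requested the data, not by the scraper.
final class RetailItem {
    final String name;
    final String price;

    private RetailItem(String name, String price) {
        this.name = name;
        this.price = price;
    }

    static RetailItem from(ScrapedObject scraped) {
        return new RetailItem(scraped.get("name"), scraped.get("price"));
    }
}
```

A caller would fill a `ScrapedObject` with "name" and "price" (HydroFlask and $50 in the example above) and then call `RetailItem.from(scraped)` to obtain the typed object.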

Moreover, the blueprint itself can be a set of key-value pairs where the keys are attribute names and the values are the lists of elements to be pulled.

A blueprint screen with the key "brand" and one HTML/CSS element to be pulled is shown below.

Selectors, as shown in the image, specify the elements to be followed or pulled to get to the information. In this example, a div element with id product-tabs needs to be pulled from the page.
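The "brand" example can be written down as a small key-value holder. This is a minimal sketch under stated assumptions: the `Blueprint` class and its `addSelector` and `selectorsFor` methods are hypothetical names, and writing the selector as the CSS string `div#product-tabs` is one possible encoding of "a div element with id product-tabs", not necessarily the one the library uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical blueprint holder: attribute name -> ordered list of selectors.
final class Blueprint {
    private final Map<String, List<String>> selectorsByAttribute = new LinkedHashMap<>();

    // Appends one selector to the path for the given attribute.
    // Returns this blueprint so that calls can be chained.
    Blueprint addSelector(String attribute, String selector) {
        selectorsByAttribute
                .computeIfAbsent(attribute, key -> new ArrayList<>())
                .add(selector);
        return this;
    }

    // Selectors for an attribute in insertion order; empty if the attribute is unknown.
    List<String> selectorsFor(String attribute) {
        List<String> selectors = selectorsByAttribute.get(attribute);
        return selectors == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(selectors);
    }
}
```

With this shape, the example above becomes `new Blueprint().addSelector("brand", "div#product-tabs")`, and further selectors for the same key can be appended one at a time.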

Now that the scraper just takes a blueprint and generates an abstract object as keys and values, all that remains is sharing the blueprint. This can be done using the JSON format. The GSON library can be used to convert between JSON and Java objects.
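To show roughly what a shared blueprint looks like on the wire, the sketch below writes one out as JSON by hand. It is a dependency-free stand-in for what GSON would do, not the library's code: `BlueprintJson` is a hypothetical helper, the blueprint is assumed to be a map from attribute name to a list of selector strings, and only quotes and backslashes are escaped (control characters are not handled).

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: serializes attribute -> selectors as a JSON object of arrays.
final class BlueprintJson {
    static String toJson(Map<String, List<String>> blueprint) {
        StringJoiner object = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, List<String>> entry : blueprint.entrySet()) {
            StringJoiner array = new StringJoiner(",", "[", "]");
            for (String selector : entry.getValue()) {
                array.add(quote(selector));
            }
            object.add(quote(entry.getKey()) + ":" + array);
        }
        return object.toString();
    }

    // Wraps a string in double quotes, escaping backslashes and quotes only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Because the blueprint is plain text in this form, it can be generated and tested by one program and handed to the scraper by another.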

The scraped object is a simple Java object which holds a map, as shown below.

The blueprint is similar in structure: it is also a holder of key-value pairs.

Altogether, the final API looks like this.

The blueprint can be built one selector at a time using the API as follows.

Finally, a sample blueprint in JSON is as follows.