
Data acquisition¶

Data science projects typically start with the acquisition of data. In many cases, such data sets consist of secondary data made available by commercial or non-commercial organisations. This part of the tutorial explains how you can obtain such online data sets.

In this tutorial, we distinguish three methods of data acquisition: downloading data files, accessing data through APIs, and web scraping. Which of these methods you use usually depends on what the data provider offers.

Direct downloads¶

If the resources that you are interested in are available directly via the web (i.e. via HTTP or HTTPS), you can download these files by making use of the requests library. As is the case for all libraries, the requests library needs to be imported before you can use it.

import requests

To download data from a certain web address, you can make a GET request. In Python, such a request can be sent using the get() method in requests, as demonstrated below. Evidently, it is important that you are online when you run the code below.

response = requests.get( 'https://www.universiteitleiden.nl')

This method returns a so-called Response object. It is an object which represents information about the downloaded web resource. In the example above, the result of the method is assigned to a variable named response.

Once this Response object has been created successfully, you can use various pieces of information about the resource that was downloaded. The property status_code, for instance, indicates the HTTP status code that was returned by the server. The status code 200 indicates that the request was successful and the infamous status code 404 indicates that the file was not found.

If the status code is indeed 200, the contents of the resource are accessible through the response's content property. This property holds the contents as bytes, which is what you need for binary files such as images, but it can be ignored in most text-based use cases.

Typically, we want to use the data as text, for example, when we have downloaded a webpage or a .txt file. In these cases, the text property of the Response object contains the full contents of the downloaded webpage, dataset or other kind of file as a string.

There is something to look out for, however: requests does not always detect a file's character encoding correctly. You can set the correct character encoding explicitly using the encoding property, before you read the text property.

When you run the code that is given below, the contents of the webpage that is specified at the beginning (or more specifically, the HTML code that was created to build the webpage) becomes available as a string, assigned to the variable named contents.

import requests

contents = ""
response = requests.get('https://www.universiteitleiden.nl')
print(response.status_code)

if response.status_code == 200:
    # Tell requests how to decode the downloaded bytes
    response.encoding = 'utf-8'
    contents = response.text
    print(contents)
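
To see why the encoding setting matters, the following minimal sketch decodes the same bytes twice, without any network access. The sample string is made up for illustration, and the bytes merely stand in for what a server would send.

```python
# A made-up sample string containing accented characters (illustration only)
original = "Café Leiden: één kopje koffie"

# A server sends bytes; here they are produced by encoding the string as UTF-8
raw_bytes = original.encode("utf-8")

# Decoding with the right encoding restores the text
decoded_ok = raw_bytes.decode("utf-8")

# Decoding with a wrong guess (here Latin-1) silently garbles the accented letters
decoded_wrong = raw_bytes.decode("latin-1")

print(decoded_ok)
print(decoded_wrong)
```

Setting response.encoding plays the same role as choosing the right decoder here: it determines how the downloaded bytes are turned into the string in response.text.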

Using the requests library, you can download any type of file from the web, as long as it is retrievable via HTTP(S). The code below, for instance, downloads a specific text from the Project Gutenberg website.

url = "https://www.gutenberg.org/files/98/98-0.txt"

response = requests.get(url)

if response:
    response.encoding = 'utf-8'
    print(response.text)

Note that the if keyword in the code above does not explicitly test whether the status code is 200. When a Response object is used as a condition, it evaluates to True if the status code is lower than 400 (that is, if the server did not report a client or server error), and to False otherwise.
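
As a rough illustration of how status codes relate to this truth test, the sketch below uses Python's standard http module to print the official phrase for a few codes. The helper looks_ok() is a hypothetical stand-in and not part of requests; it assumes the rule documented for the ok property of a Response object, namely that codes below 400 count as successful.

```python
from http import HTTPStatus


def looks_ok(status_code):
    """Hypothetical helper: treat every status code below 400 as a success."""
    return status_code < 400


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase gives the standard description of the code
    print(code, HTTPStatus(code).phrase, looks_ok(code))
```

If you need to be strict, you can still compare response.status_code with 200 explicitly, as in the earlier example.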

The requests library can also be used to retrieve data from an API.

Acquiring data via APIs¶

Organisations which aim to make their data available for reuse often do this through an Application Programming Interface (API). An API, simply put, is the interface through which (online) services and applications provide access to their information and functionality. It enables organisations to share some of their data in a structured format, so that external parties can make use of these data in new applications.

The communication between the sender and the recipient of such requests needs to take place according to a specific protocol; the requests need to be formulated according to certain rules. Many APIs allow you to request a selection of the data, so that you don't have to download the full data set.

For many APIs, you need to create an access key before you can send requests. This is the case, for instance, for the Twitter API. There are also many APIs which are fully open, however, such as the Wikipedia API. You can send requests to this API without having to provide an access key.

Example: Wikipedia search API¶

To find information about Wikipedia pages via the Wikipedia API, for instance, you need to send a number of parameters to the endpoint of the API, which is available at https://en.wikipedia.org/w/api.php. You need to provide values for the following parameters:

action = opensearch
search = [search term]
limit = [number]
format = [json or xml]

The search term, to be provided after ‘search’, is the word that must occur in the title of the Wikipedia page. If the ‘limit’ parameter is omitted, the API will return 10 results by default.

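The following API call returns up to 20 Wikipedia pages whose titles contain the word ‘Leiden’. The additional ‘namespace’ parameter, with the value 0, restricts the search to regular articles. The call returns the requested data in the JSON format.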

https://en.wikipedia.org/w/api.php?action=opensearch&search=Leiden&limit=20&namespace=0&format=json
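
Instead of typing such a URL by hand, you can assemble it from a dictionary of parameters. The sketch below does this with urlencode() from Python's standard library; the variable names are chosen for illustration only.

```python
from urllib.parse import urlencode

endpoint = "https://en.wikipedia.org/w/api.php"

# The parameters used in the example call above
params = {
    "action": "opensearch",
    "search": "Leiden",
    "limit": 20,
    "namespace": 0,
    "format": "json",
}

# urlencode() joins the pairs with '&' and escapes special characters
api_call = endpoint + "?" + urlencode(params)
print(api_call)
```

An advantage of this approach is that search terms containing spaces or other special characters are escaped automatically. The get() method in requests also accepts such a dictionary directly, via its params argument.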

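The json() method of the Response object parses the JSON data into a regular Python data structure for further processing. For this API, the result is a list of four elements: the search term, a list of titles, a list of descriptions and a list of URLs. Note that the results are thus grouped by field, not by result.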

import requests

base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'

search = 'leiden'
limit = 20
data_format = 'json'  # not named 'format', because format() is a built-in Python function

# Append the remaining parameters to the base URL
api_call = '{}&search={}&limit={}&format={}'.format(base_url, search, limit, data_format)
print(api_call)

response = requests.get(api_call)
wiki_results = response.json()

# wiki_results[1], [2] and [3] hold the titles, descriptions and URLs
for i in range(len(wiki_results[1])):
    print('Title: ' + wiki_results[1][i])
    print('Tagline: ' + wiki_results[2][i])
    print('Url: ' + wiki_results[3][i] + '\n')
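
Because the results are grouped by field, it can be convenient to regroup them so that each item describes one result. The sketch below shows one way to do this with zip(). The JSON string is a hand-written sample in the shape of an opensearch response, not real API output, so the code runs without a network connection.

```python
import json

# Hand-written sample in the shape of an opensearch response (not real API output)
sample = '''
["Leiden",
 ["Leiden", "Leiden University"],
 ["", ""],
 ["https://en.wikipedia.org/wiki/Leiden",
  "https://en.wikipedia.org/wiki/Leiden_University"]]
'''

# The four elements: search term, titles, descriptions, URLs
search_term, titles, descriptions, urls = json.loads(sample)

# zip() regroups the parallel lists so that each dictionary describes one result
results = [
    {"title": t, "description": d, "url": u}
    for t, d, u in zip(titles, descriptions, urls)
]

for result in results:
    print(result["title"], "->", result["url"])
```

With real data, you would replace json.loads(sample) with the value returned by response.json().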

Webscraping¶

When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on the site by making use of web scraping. This is a process in which a computer program processes the contents of a given webpage and extracts the data values that are needed. The aim of such an application is generally to copy information from a web page and to store it in a local database.

To get the most out of webscraping, you need to have a basic understanding of HTML. Many basic introductions can be found on the web.

Web scraping should be used with caution, because website owners may not allow you to download large quantities of data from their sites.

In this tutorial we will only look at extracting information from single pages.

One of the libraries that you can use in Python for scraping online resources is Beautiful Soup.

To scrape webpages, you first need to download them. This can be done using requests (explained above) or similar libraries. The code further below scrapes data from a page on the Project Gutenberg website.

Once you have obtained the contents of a webpage, in the form of an HTML document, you can process these contents effectively by transforming them into a BeautifulSoup object. After importing BeautifulSoup from the bs4 library, you can call BeautifulSoup(). As its first argument, it requires the full contents of an HTML document. As the second argument, you need to provide the name of one of the parsers that are available. Generally, a parser is an application which can process and analyse data. In this context, it refers to a program which can analyse the HTML file. One of the parsers that we can use is lxml, which needs to be installed separately. Using this parser, BeautifulSoup() converts the downloaded HTML page into a BeautifulSoup object.

The prettify() method of this object creates a more readable version of the HTML file by adding indents and end of line characters.

from bs4 import BeautifulSoup
import requests
import re

# Project Gutenberg Bookshelves can be found at
# https://www.gutenberg.org/wiki/Category:Bookshelf
url = 'https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)'
response = requests.get(url)

soup = BeautifulSoup(response.text, "lxml")

print(soup.prettify())

The BeautifulSoup object that was created above has a find_all() method, which you can use to find all occurrences of a specific HTML tag.

The web page that was downloaded from Project Gutenberg lists a number of titles in the field of philosophy. All the links are encoded using an <a> element. The actual link is given in the href attribute of this element. So, to obtain all the links that are visible on the page, we first need to find all the <a> elements in the body of the HTML page. More specifically, we need to provide the name of the tag, 'a', as a parameter to the find_all() method. The method returns a result set containing all the occurrences of the tag in the HTML file. Each item in this result set is a Tag object which represents the full element, including its attributes and its contents.

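To retrieve only the text of the tag (i.e. the text which is encoded using the tag), we can use the string property. Note that this property is None when the element contains several child nodes, for example text mixed with nested tags. To retrieve the value of an attribute of this element, we can use the get() method. As an argument, this method requires the name of the attribute we are interested in, href in this case. If the element does not have this attribute, get() returns None.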

links = soup.find_all("a")

for link in links:
    linktext = link.string
    href = link.get("href")  # None if the <a> element has no href attribute

    # if the URL contains 'gutenberg', ignoring upper and lower case differences
    if re.search('gutenberg', str(href), re.IGNORECASE):

        # The links in the HTML file are protocol-relative: they start with two forward slashes
        # These leading slashes are removed in the next line
        href = re.sub(r'^/*', '', href)
        print(f"{linktext}: {href}")  # you could also write "{}: {}".format(linktext, href)

Note that the code above also makes use of the re library, to examine the form of the URLs and to get rid of unwanted characters.
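
Links that start with two forward slashes are so-called protocol-relative URLs. Stripping the slashes leaves an address without a scheme, which requests cannot download directly. As an alternative, the sketch below uses urljoin() from Python's standard library to turn each link into a complete URL. The href values are hypothetical examples of the forms a link can take.

```python
from urllib.parse import urljoin

# Address of the page that the links were found on
page_url = "https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)"

# Hypothetical href values in the three forms a link can take
hrefs = [
    "//www.gutenberg.org/ebooks/1497",        # protocol-relative
    "/ebooks/1497",                           # relative to the site root
    "https://www.gutenberg.org/ebooks/1497",  # already absolute
]

# urljoin() resolves each href against the address of the page
absolute = [urljoin(page_url, href) for href in hrefs]

for link in absolute:
    print(link)
```

All three forms resolve to the same absolute URL, which can then be passed on to requests.get().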

Once you have created a list of URLs using the method outlined above, you can also download all the texts that were found, using the get() method from the requests library.
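
A minimal sketch of such a download loop is given below. The function download_all() and its parameters are hypothetical names introduced for this illustration. The function receives the download step as an argument, so that the demonstration can run offline with a stand-in, and it pauses between requests to avoid burdening the server.

```python
import time


def download_all(urls, fetch, delay=1.0):
    """Download every URL in turn, pausing between consecutive requests.

    fetch is any function that takes a URL and returns its text,
    for example: lambda u: requests.get(u).text
    """
    texts = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # be polite: do not send requests back to back
        texts[url] = fetch(url)
    return texts


# Offline demonstration with a stand-in fetch function (no network needed)
demo = download_all(["url-a", "url-b"], fetch=lambda u: u.upper(), delay=0)
print(demo)
```

For real downloads, you would pass a function based on requests.get() as fetch, together with a list of complete URLs.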

Advanced scraping: Scrapy¶

This tutorial has only scratched the surface of web scraping. To get specific data from webpages or APIs, you will need to dig into the data that you get and probably learn more about the data formats. But once you start to understand the data, the Web offers many sources to get data from.

A more advanced framework (or toolkit) for web scraping with Python is Scrapy.

This framework makes it easy to build a scraper/crawler by providing basic functionalities out of the box.

Although Scrapy doesn't understand what parts of webpages are of interest to you, it does many things for you, like making sure you don't send too many requests at the same time or retrying requests that fail.

Feel free to look at the Scrapy tutorial to try it.
