Natural Language Processing

Natural Language Processing (NLP) is an area of research in computer science and in computational linguistics which aims to make computers understand natural languages as spoken and written by human beings. Researchers in the field of NLP have developed sophisticated tools for the recognition of grammatical and syntactic categories, and for the conversion of inflected word forms into their dictionary forms, among many other purposes. More advanced current research focuses on the development of software for tasks such as named entity recognition, machine translation and unsupervised summarisation. Tasks such as these all demand an understanding not only of the grammar and the syntax, but also of the logical structure and the semantic contents of the text.

nltk

One of the ways in which you can analyse natural languages in Python is by making use of nltk, the Natural Language Toolkit. This library can be imported as follows:

In [ ]:
import nltk

Having imported nltk, you can also import specific methods from this library.

In [ ]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
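
Note that several of the methods discussed below depend on data packages which are not always installed together with the nltk library itself. Whenever a method raises a LookupError, the missing package can be downloaded using the nltk.download() method. The cell below is a sketch which downloads the packages assumed by this tutorial; the exact package names may vary across versions of nltk.

In [ ]:
import nltk

# 'punkt' supports tokenisation, 'averaged_perceptron_tagger'
# supports POS tagging, 'wordnet' supports lemmatisation and
# 'tagsets' supports the output of nltk.help.upenn_tagset()
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('tagsets')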

Tokenisation

As explained, tokenisation is a process in which a full linear text is broken down into smaller linguistic units, such as words, sentences or paragraphs. The nltk method word_tokenize() can be used to tokenise a text into words. The method sent_tokenize() can divide a text into sentences. Listing 5.1 contains an illustration of how these methods can be used.

In [ ]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

quote = '''
In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains. In the bed of the river there were pebbles and boulders, dry and white in the sun, and the water was clear and swiftly moving and blue in the channels. Troops went by the house and down the road and the dust they raised powdered the leaves of the trees. The trunks of the trees too were dusty and the leaves fell early that year and we saw the troops marching along the road and the dust rising and leaves, stirred by the breeze, falling and the soldiers marching and afterward the road bare and white except for the leaves.
'''

words = word_tokenize(quote)

# Print the individual word tokens, followed by their number
for w in words:
    print(w)

print( len(words) )

sentences = sent_tokenize(quote)

# Print the number of sentences, followed by the sentences themselves
print( len(sentences) )

for s in sentences:
    print(s)

The method word_tokenize() divides the string into words on the basis of spaces and punctuation. Punctuation marks are treated as separate tokens as well. When you simply count the number of tokens found using this method, the results may therefore be somewhat unexpected: the count includes not only the words, but also the commas, full stops and other punctuation marks in the text.
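
This can be made concrete with a brief example; the sentence below was chosen for illustration only.

In [ ]:
from nltk.tokenize import word_tokenize

# The sentence contains five words, but word_tokenize() finds
# eight tokens: the comma and the full stop become separate
# tokens, and "wasn't" is split into "was" and "n't"
tokens = word_tokenize("No, it wasn't the end.")
print(tokens)
print( len(tokens) )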

Part of speech tagging

Part of speech (POS) taggers are applications which can produce data about the syntactic categories of words. Once you have imported the nltk library, you can generate such POS tags by making use of the pos_tag() method. The input for this method needs to be a list of words. pos_tag() is therefore typically used in combination with word_tokenize().

In [ ]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

quote = '''
The studio was filled with the rich odour of roses, and when the light summer wind stirred amidst the trees of the garden, there came through the open door the heavy scent of the lilac, or the more delicate perfume of the pink-flowering thorn.
'''

words = word_tokenize(quote)
pos = nltk.pos_tag(words)

# Each item in the list returned by pos_tag() combines a word
# with its POS tag
for p in pos:
    print( p[0] + ' => ' + p[1] )

The pos_tag() method returns a list of tuples. Each of these tuples combines two values, which can both be accessed using the square brackets syntax. The first value is the word that was tagged and the second value is the POS tag that was assigned to this word.
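
Since the elements of this list are tuples, they can also be unpacked directly in the for loop, as in the short sketch below.

In [ ]:
import nltk
from nltk.tokenize import word_tokenize

# Unpack each (word, tag) tuple into two separate variables
pos = nltk.pos_tag( word_tokenize("The soldiers marched along the road") )

for word, tag in pos:
    print( word + ' => ' + tag )

The meaning of all of the POS tags can be found by running the following method: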

In [ ]:
nltk.help.upenn_tagset()

The meaning of these POS codes can also be found online: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
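
If you are only interested in a specific tag or group of tags, upenn_tagset() also accepts a regular expression which restricts the output to the matching tags; the pattern below is merely an example.

In [ ]:
import nltk

# Print the documentation for the noun tags NN, NNS, NNP and NNPS
nltk.help.upenn_tagset('NN.*')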

Lemmatisation

Lemmatisation is a process in which the inflected forms of the words that are found in a text are converted into their base dictionary forms. This base form is referred to as the lemma. English verbs can be used, for instance, in the past tense, in the present tense, in the continuous form or in the perfect form, and these different forms can evidently make it more difficult to search systematically for occurrences of a specific verb. In this context, lemmatisation can offer a solution. In many cases, the manner in which words are to be lemmatised depends on their context. Certain homonyms may either be verbs or nouns, for instance, and, depending on their usage, they should be lemmatised to different forms. To solve issues such as these, the lemmatiser that is part of the nltk library can be used in combination with POS tags. The lemmatize() method of the WordNetLemmatizer class takes two parameters: the word to be lemmatised and the (simplified) POS tag of this word, given its linguistic context.
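
The effect of this second parameter can be seen in the minimal example below, which lemmatises the homonym 'leaves' first as a noun and then as a verb.

In [ ]:
from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()

# Lemmatised as a noun, 'leaves' becomes 'leaf';
# lemmatised as a verb, it becomes 'leave'
print( lemmatiser.lemmatize( 'leaves' , 'n' ) )
print( lemmatiser.lemmatize( 'leaves' , 'v' ) )

Listing 5.3 demonstrates how lemmatize() can be used in combination with the tags produced by pos_tag().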

In [ ]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

lemmatiser = WordNetLemmatizer()

quote = "It was the best of times, it was the worst of times"
words = word_tokenize(quote)
pos = nltk.pos_tag(words)

for i in range( 0 , len(words) ):
    # Convert the Penn Treebank tag into the simplified tag
    # expected by lemmatize(): 'v' for verbs, 'a' for adjectives
    posTag = ''
    if re.search( '^v' , pos[i][1] , re.IGNORECASE ):
        posTag = 'v'
    elif re.search( '^j' , pos[i][1] , re.IGNORECASE ):
        posTag = 'a'

    if posTag == 'v' or posTag == 'a':
        print( words[i] + ' => ' +
            lemmatiser.lemmatize( words[i] , posTag ) )
    else:
        # Without a POS tag, lemmatize() treats the word as a noun
        print( words[i] + ' => ' +
            lemmatiser.lemmatize( words[i] ) )
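
In this listing, Penn Treebank tags that begin with 'V' or 'J' are converted into the simplified tags 'v' (verb) and 'a' (adjective) expected by the WordNet lemmatiser. Words with any other tag are lemmatised without an explicit POS tag, which means that they are treated as nouns by default.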