Research projects that make use of Text and Data Mining often produce quantitative data about text corpora. The texts in such collections are generally stored as files on a disk. A basic initial operation in such studies, therefore, is to open a file and to read the contents of this file. In Python, files can be read using the 'open()' function. The result of this function is a new object which is called a file handler (or, more specifically, a TextIOWrapper object). Simply put, a file handler is an object which establishes a connection to the text file on your disk. You are free to name this object as you like.
Once the connection is established via the 'open()' fuction, you can access the contents of the file in a variety of ways. A first option is to read the contents on a line-by-line or a paragraph-by-paragraph basis. This first approach can be followed when units such as lines or paragraphs are dlineated using the hard return or the newline character. If this is the case, the file handler that is created for the file, using 'open()', also becomes iterable: the ‘for’ keyword can be used to iterate across the various units represented by this file handler. When you use the 'open()' function, you are also recommended to specify the character encoding scheme that has been used in the text file, using the ‘encoding’ parameter. This will help Python to process all the characters correctly.
The code below demonstrates how you can read and display the full contents of a text file, line by line.
text = open("BraveNewWorld.txt" , encoding = 'utf-8' )
for line in text:
print(line)
The code above of course assumes that there is a text file named "BraveNewWorld.txt" in the same folder that contains the code.
In the case of short texts, you can also make use of the read() function. When you do this, the entire text will not be divided into smaller units. The full contents of the text file will become available as one long string.
text = open("tweet.txt" , encoding = 'utf-8' )
fullText = text.read()
It is good practice to close the file handler when you are done working on it, using the 'close()' method.
text.close()
When you run a Python program using the Command Prompt, the full output will normally be printed on the Command Prompt as well. When the program has many lines to print, it can be very difficult to read the output. In such cases, it can useful to create a new text file which will receive all the output. The results of the program can then be inspected by opening this new file in a text editor.
The function open(), which can be used to read files, can also be invoked to create a new file. Instead of referencing a file which already exists on your disk, you need to enter a new file name. Secondly, you also need to supply a second parameter, the character “w”, which stands for “write”. This “w” character makes it clear to Python that you want to write to a file. The open() function used with the “w” parameter similarly creates a file handler. This handler has a write() method, which functions very similarly to the print function. The crucial difference, however, is that the output is not sent to the Command Prompt, but to the file associated with this file handler.
out = open('data.txt' , 'w')
out.write( "This text is in a file named 'data.txt' " )
out.close()
The 'open()' function can be used, as discussed, to read individual files. In Text & Data Mining projects, it is often necessary, however, to read full corpora or entire collections of files. If research data are scattered across multiple files in a specific folder, it can be practical to make use of the 'os' library. The two letters in the name of this library stand for 'operating system'. The library includes various functions that can help you to work with files and folders. One useful method is 'listdir()', which, expectedly, enables you to list all the files in a given directory.
To make use of 'os', this library needs to be imported first.
import os
dir = 'Corpus'
for fileName in os.listdir( dir ):
print( fileName )
The function 'join()', which is part of the 'path' module of 'os', can be used to create a string representing the path to a certain file. If you have one variable which records the base directory of a file, and a second variable which captures the filename, the join() function can concatenate the values of these two variables to create the full path to this file. The 'join()' function always follows the conventions that are in place on a given operating system for representing paths. There can often be certain differences. While Mac OS uses forward slashes, for instance, Windows uses back slashes. Working with 'join()' makes your code more platform-independent.
Another useful function in 'os' is 'isfile()'. As you list the files in a certain directory, using listdir(), you can apply this function to check whether you are dealing with a file or with something else ( e.g. a subdirectory).
The code below offers a demostration of these two functions. Note that the first line imports the two functions that have been discussed above. As a result of this, it is no longer necessary to use the period syntax for 'isfile()' and 'join()'.
from os.path import isfile , join
dir = 'Corpus'
for fileName in os.listdir( dir ):
if isfile( join( dir , fileName ) ):
print( fileName )