Data analyses with pandas

Pandas is a Python library which was developed for the manipulation and the analysis of data sets. Recent versions of the library require Python 3. The name of the library loosely stands for "Python Data Analysis Library". Pandas contains methods that enable programmers to read data from and write data to a wide range of file formats, including CSV, TSV, Excel spreadsheets and MySQL databases, and it offers numerous statistical methods that can be used to analyse these data.

If pandas has been installed successfully on your computer, this library can be imported into your program using the import keyword.

In [1]:
import pandas as pd

As can be seen, this import statement also assigns an alias, namely pd. This is an abbreviated reference to the pandas library. Without this alias, the full name of the library would have to be typed in each time a method from this library is needed.

Reading a CSV file

If your data set is stored in the CSV format, it can be accessed using the read_csv() function.

In [ ]:
data_set = pd.read_csv( 'data.csv' )

The data that are read from the CSV file are represented as a specific data structure which is known within the context of pandas as a data frame. A data frame can be compared to a regular table in a database. It is a data structure consisting of rows and columns. In the code above, the data frame is stored in a variable named 'data_set'.

As is indicated by the name of the file format, CSV files conventionally separate the individual values in the data set using commas. In some cases, however, such files may also work with other types of separators, such as tabs or semi-colons. In the read_csv() method, it is possible to indicate explicitly which character should function as a separator. When you use the parameter named sep, and assign it the value ';', the semi-colon, this character will be used to separate the rows into individual values.

For more information, see the guide on read_csv().

In [ ]:
data_set = pd.read_csv( 'data.csv' , sep = ';' )

If the data set includes floating point numbers, it can also be useful to specify the character that is used as a decimal point via the decimal parameter. Floating point numbers are typically represented using either the period or the comma.

In [ ]:
data_set = pd.read_csv( 'data.csv' , decimal = '.' )

Reading an Excel spreadsheet

The pandas library also offers a method named read_excel(). You can use this function, unsurprisingly, to process data stored in an Excel spreadsheet. The first parameter is simply the name of the Excel file that you want to work with. Via a second parameter named sheet_name, you can provide the name of the sheet within the spreadsheet that you want to read; if this parameter is omitted, the first sheet is used. Similarly to the way in which read_csv() processes data, read_excel() loads the data on the sheet that you mention into a data frame.

In [ ]:
import pandas as pd

df = pd.read_excel( 'Data.xlsx' , sheet_name = "Sheet1")

Basic information about the data frame

Once a data frame has been created, the data set can be examined using a number of methods that are available within the Pandas library. These methods will be explained using the simple data set in the CSV format that is shown below:

A,B,C,D,E
12,5,6,12,Class1
14,11,17,20,Class1
15,6,8,19,Class1
8,3,21,5,Class2
10,9,14,7,Class2
7,5,16,,Class2

As you can see, this is a data set consisting of five columns and six rows. The first line defines the names of the columns. The actual values are on the six lines that follow. The first four columns contain numbers, and the fifth column, named 'E', classifies the rows into two separate classes. Also note that, on the final line, the value for column D is missing.

If the values above are stored as a CSV file named ‘data.csv’, these values can be accessed by using the read_csv() method, as explained above.

In [2]:
df = pd.read_csv( 'data.csv' )

To inspect the contents of the data set, you may choose, simply, to print out the full data frame named df, as follows:

In [ ]:
print( df )

The output looks as follows:

    A   B   C     D       E
0  12   5   6  12.0  Class1
1  14  11  17  20.0  Class1
2  15   6   8  19.0  Class1
3   8   3  21   5.0  Class2
4  10   9  14   7.0  Class2
5   7   5  16   NaN  Class2

Using the output of this print command, we can get an impression of what pandas has done with the CSV file. The data frame that is created when you run the read_csv() method consists of a collection of rows and of columns. As can be seen in the output, the rows are all assigned numbers. These numbers are referred to as 'indices' (the plural form of 'index'). Additionally, the columns have all been given a name. These names have been copied from the header of the CSV file. The header is typically on the very first line of the CSV file. In the sample CSV file shown above, you can see that the first line consists of a number of strings, delimited by commas. Following the execution of the read_csv() method, these separate strings all become available as column names. Pandas represents the missing value in column D of the row with index 5 using the code NaN, which stands for 'Not a Number'. If a column contains a NaN value, all the other numbers in this column will be shown as float values.

Pandas stores the columns using a data structure that is known as a Series. Simply put, a Series is a variable that brings together multiple values. A series is similar to a regular list in Python.

In the case of this simple and relatively small set, the print command shown in the cell above will produce a small number of lines of output only. More realistic data science projects typically work with CSV files containing hundreds or thousands of rows, however. When we simply print the full data frame, this does not necessarily help us to develop a better understanding of the structure and the contents of the data set.

To print only the first few rows of the data frame, we can work with the head() method. If no integer is provided within the parentheses, Pandas will show the first five rows by default. The number of rows to be shown can be specified, however, within the parentheses.

In [ ]:
print( df.head(2) )
# This will print the first two rows of the data frame.

The shape property can also come in handy. It returns information about the number of rows and columns. Note that shape needs to be used WITHOUT parentheses, as it is a property and not a method.

In [ ]:
print( df.shape )
# In the current example, the shape property has the value (6, 5)

You can request an overview of all the column names that are available in a data frame using the property named columns. For the sake of readability, the so-called 'Index' object that is generated by this property is converted to a list using the tolist() method.

In [ ]:
print("Column headings:")
print( df.columns.tolist() )

Series

The values in a column of a data frame can be accessed separately using the name of the column. This assumes, obviously, that such column names have been defined. In order to obtain the data in a specific column, the name of the column must be appended to the name of the full data frame using square brackets, as follows.

In [ ]:
df['A']

The code above will print all the values in the column named 'A'. As was discussed earlier, this list of values is referred to as a Series. An important difference between regular Python lists and the Series used in pandas is that, in the latter data structure, each item consists of two parts: (1) the actual value and (2) the index of that value. When you print the Series using the code that was shown, you see both the indices and the values.
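To make this pairing of indices and values visible, you can inspect the two components of a Series separately. The sketch below builds a standalone Series mirroring column 'A' of the sample data set, so that it can be run without the CSV file.

```python
import pandas as pd

# A Series mirroring column 'A' of the sample data set
s = pd.Series([12, 14, 15, 8, 10, 7], name='A')

print(s.index.tolist())   # the indices: [0, 1, 2, 3, 4, 5]
print(s.values.tolist())  # the actual values: [12, 14, 15, 8, 10, 7]
```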

It can be useful, in some situations, to convert a given Series into a regular list. There is a simple method named tolist() which you can use for this purpose:

In [ ]:
print( df['A'].tolist() )

In Python, there are often multiple ways of doing the same thing. Instead of working with the tolist() method, you can also invoke the standard list() function, as follows.

In [ ]:
print( list(df['A']) )

Statistical functions

Within pandas, you can use methods such as max(), min() and mean() to obtain basic statistical information about the numerical values in your data set.

In [ ]:
print( df['A'].max() )
## max() identifies the highest number within the column that is mentioned within the square brackets.

print( df['B'].min() )
# min() identifies the lowest number

print( df['C'].mean() )
# mean() calculates the mean of all the values in a specific column

print( df['D'].sum() )
## sum() performs an addition of all the numbers in a column

If you would like to perform statistical analyses on all the columns of the dataframe, you can also append the name of the method you need directly to the data frame variable, without specifying a column.

In [ ]:
print( df.sum() )
## This will print the sum of the values of all the columns in the data set

Note that this particular method for calculating summations will only produce meaningful results in the case of numerical values. If the column contains strings, as in the case of column ‘E’ of our sample data set, this method will simply concatenate all the string values in the Series.
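If you want to restrict such frame-wide calculations to the numerical columns, recent versions of pandas let you pass numeric_only=True to methods such as sum(). A small sketch, using a fragment of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [12, 14, 15, 8, 10, 7],
    'E': ['Class1', 'Class1', 'Class1', 'Class2', 'Class2', 'Class2'],
})

# Sum only the numerical columns; the string column 'E' is skipped
totals = df.sum(numeric_only=True)
print(totals)
```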

To iterate across all the rows in a data frame (i.e. to visit them one by one), you can make use of the iterrows() method. This method processes the data frame row by row. When you use iterrows(), pandas will create a new Series for each individual row. This Series will contain all the individual values on that row. Additionally, the method also returns each row's index.

In the code below, the method iterrows() navigates across all the rows in the data frame. During each iteration, the compound variable named row will be assigned all the values that are available on a given row. When the variable row is used in combination with a column name, this variable will represent the data value in that particular column. The method iterrows() can also return the index of each row, but that index is not actually used in the fragment that follows.

In [ ]:
for index , row in df.iterrows():
    print( row['B'] )
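The index that iterrows() returns can be put to use as well, for instance to report which row a value comes from. A small self-contained sketch, using the values of column 'B' from the sample data:

```python
import pandas as pd

df = pd.DataFrame({'B': [5, 11, 6, 3, 9, 5]})

# iterrows() yields (index, row) pairs; here both parts are used
for index, row in df.iterrows():
    print(f"Row {index} has value {row['B']} in column B")
```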

Selecting subsets

In the pandas library, it is also possible to filter data sets. In other words, we can create new data sets containing only some of the rows or the columns of the original data sets. Such filtered data sets are also referred to as subsets. This process of ‘subsetting’ a data frame takes place on the basis of specific criteria. Different types of criteria are possible.

Filtering by index

As a first possibility, we can use the indices of rows and of columns as criteria. These indices can be supplied to the data frame via the iloc[] indexer.

When we want to select all the values on the third row of the data frame, this can be accomplished using the code below.

In [ ]:
# iloc[] selects rows by their numeric index values; plain square
# brackets, by contrast, would select columns by their names.
df.iloc[2]

The third row has index number '2', because pandas starts counting at zero.

We can also select a range of rows. To do this, we need to specify where the range should start and where it should end. These two numbers need to be separated by a colon. Note that the row with the second index that is mentioned will NOT be in the subset.

In [ ]:
df.iloc[0:2]
# This selects the first two rows; those with indices 0 and 1. 

If the first value is 0, as in the example above, it can also be omitted. The code that was given can be reformulated as df.iloc[:2]

It is also possible to filter the columns, on the basis of their indices. The next example selects the first and the second column of the second and the third row. As can be seen, the index ranges for the rows and the columns need to be separated by a comma.

In [ ]:
df.iloc[1:3, 0:2 ]

Filtering by values

A potential disadvantage of this iloc[] selector is that you need to know the exact values for the indices. In most situations, it is more useful to create subsets on the basis of criteria applied to the values in the data set. Imagine, for instance, that you are only interested in rows in which the column B contains a value of 5 or higher. To select those rows, you can work with the Boolean expression df['B'] >= 5. This Boolean expression can be added to the original data frame in square brackets. In the code that follows, the subset is assigned to a new data frame, one named new_df.

In [ ]:
new_df = df[ df['B'] >= 5 ]
print(new_df)

A similar syntax can be used to filter on the basis of string values. The code below selects all the rows that belong to Class1.

In [ ]:
new_df = df[ df['E'] == 'Class1' ]

The values of the columns may also be combined. The following example selects rows only if the sum of the values in columns ‘A’ and ‘B’ is higher than 20.

In [ ]:
new_df = df[ ( df['A'] + df['B'] ) > 20 ]
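Boolean conditions can also be combined, using & for 'and' and | for 'or'; each condition must then be wrapped in parentheses. The sketch below, built on a fragment of the sample data, selects the rows in which column 'B' is at least 5 and which belong to 'Class1':

```python
import pandas as pd

df = pd.DataFrame({
    'B': [5, 11, 6, 3, 9, 5],
    'E': ['Class1', 'Class1', 'Class1', 'Class2', 'Class2', 'Class2'],
})

# Both conditions must hold; note the parentheses around each one
subset = df[(df['B'] >= 5) & (df['E'] == 'Class1')]
print(subset)
```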

For some types of calculations, it can be problematic if there are rows with missing values. To avoid such issues, you may want to remove all the records containing such missing values. This can be done easily using the dropna() method. The code below removes the row with index 5 from the df data frame, as the column 'D' was left empty on this specific row.

In [ ]:
new_df = df.dropna()
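If you only want to drop rows with a missing value in particular columns, dropna() also accepts a subset parameter listing the columns to inspect. A sketch on a fragment of the sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'C': [6, 17, 8, 21, 14, 16],
    'D': [12.0, 20.0, 19.0, 5.0, 7.0, np.nan],
})

# Drop a row only when column 'D' is missing; other columns are ignored
new_df = df.dropna(subset=['D'])
print(new_df)
```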

Subsets of columns

Next to subsets of rows, it is also possible to create subsets of columns. More specifically, this means that we create a new data frame by copying only some of the columns that are available in another data frame. The easiest way to accomplish this is by supplying a list of the column names that you want to keep within the square brackets appended to the data frame that needs to be filtered. After running the code below, the data frame named new_df has the exact same number of rows as df, but it only contains three of its columns.

In [ ]:
columns = ['A', 'B', 'D']
new_df = df[ columns ]

When you want to delete a single column from a data frame, you can make use of the drop() method. The columns parameter of this method needs to mention the name of the column that needs to be removed.

In [ ]:
new_df = df.drop(columns="E")
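The columns parameter of drop() also accepts a list, so several columns can be removed in a single call. A self-contained sketch on a fragment of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [12, 14], 'B': [5, 11], 'C': [6, 17], 'E': ['Class1', 'Class1'],
})

# Remove columns 'C' and 'E' in one call
new_df = df.drop(columns=['C', 'E'])
print(new_df.columns.tolist())
```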

Calculations

One of the great strengths of Pandas is that it can make calculations using the values captured in data frames. Such calculations can be carried out using the regular mathematical operators in Python.

In [ ]:
# Add the values of columns A and B 
print( df['A'] + df['B'] )

# Multiply all values in the data frame by 2; the string column 'E'
# is dropped first, so that only numerical columns remain
new_df = df.drop(columns="E")
print( 2 * new_df )

The results of these calculations can also be stored within the data frame itself. In the code below, a new column named ‘F’ will be added to df, and it will receive the result of the calculation to the right of the equals sign.

In [ ]:
# Add the same number 3 to all the values in column D
df['F'] = df['D'] + 3
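A new column can likewise be derived from two existing columns. In the sketch below, the hypothetical column 'G' stores the sum of 'A' and 'B', calculated row by row:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [12, 14, 15],
    'B': [5, 11, 6],
})

# The addition is applied element-wise, row by row
df['G'] = df['A'] + df['B']
print(df['G'].tolist())
```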

Grouping items and aggregation

Pandas also offers a method named groupby() which can be very useful when your data set contains labels or categories for specific records. Using groupby(), the records which have been assigned the same category can all be placed in a single group. Subsequently, it becomes possible to perform calculations for each group that has been created. This process is demonstrated below.

In [3]:
df.groupby('E').max()
Out[3]:
         A   B   C     D
E                       
Class1  15  11  17  20.0
Class2  10   9  21   7.0

As was indicated above, column E of the data frame df contains labels which place the rows either in 'Class1' or in 'Class2'. The name of this column can be used as a parameter for groupby(). The method then creates the various groups, based on the values it finds in the column that is mentioned. Once the groups have been established, you can apply statistical functions to each of these. Examples of such statistical functions include mean(), sum() and max().

In [4]:
df.groupby('E').mean()
Out[4]:
                A         B          C     D
E                                           
Class1  13.666667  7.333333  10.333333  17.0
Class2   8.333333  5.666667  17.000000   6.0

In the output above, you can see that pandas performs calculations for all the columns in the data frame. If you would like to see the values for only one column, you can simply append the name of this column in square brackets, directly following the groupby() method.

In [5]:
df.groupby('E')['A'].min()
Out[5]:
E
Class1    12
Class2     7
Name: A, dtype: int64
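If you need different statistics for different columns, the agg() method can be chained to groupby(). The sketch below, built on the sample data, computes the mean of column 'A' and the maximum of column 'B' for each class:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [12, 14, 15, 8, 10, 7],
    'B': [5, 11, 6, 3, 9, 5],
    'E': ['Class1', 'Class1', 'Class1', 'Class2', 'Class2', 'Class2'],
})

# One aggregation per column: the mean for 'A', the maximum for 'B'
result = df.groupby('E').agg({'A': 'mean', 'B': 'max'})
print(result)
```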