Pandas
Overview
Teaching: 20 min
Exercises: 20 minQuestions
How can I do statistical analysis of tabular data?
Objectives
Load tabular data with Pandas to create a data frame.
Do statistical analysis on data stored in a data frame.
What is a dataframe?
A dataframe is conceptually similar to a spreadsheet. It is two dimensional with columns and rows. Columns have labels, similar to the first row of a spreadsheet, for accessing content and similarly rows are indexed by number. Dataframes can be very powerful and are great for data exploration and analysis.
Loading tabular data with pandas
Before we load our csv into a data frame, we need to import the Pandas library. A common alias for Pandas is pd
and we will use that for our examples.
import pandas as pd
To load your csv into a Pandas data frame, you can use the .read_csv() method. .read_csv has a lot of potential parameters but for example we will simply pass .read_csv() the file name, which is located in the data folder as our Jupyter Notebook.
temperature_data = pd.read_csv("./data/temperature.csv")
Spend a few minutes reviewing your data
It is a good idea to ensure your dataframe loaded correctly. Thankfully, Pandas has a few built in methods to help - shape, .head(), .tail(), and .describe(). The shape method will return a tuple that has the number of rows in the first position and the number of columns in the second position. Methods .head() and .tail() return the first 5 rows and the last 5 rows respectively. Finally, .describe() will return a variety of things depending on the data type. For numeric data types, like our example, it includes count, mean, standard deviation, minimum, quartiles, and the maximum. For other object types it will return count, unique, most common, and frequency of the most common. Additionally, timestamps will also include the first and last item.
To show the shape (number or rows, number of columns) use:
temperature_data.shape
(135, 3)
To show the head (first five rows) use:
temperature_data.head()
Year Annual_Mean 5-year_Mean
0 1880 -0.23 NaN
1 1881 -0.15 NaN
2 1882 -0.18 -0.21
3 1883 -0.21 -0.22
4 1884 -0.29 -0.24
To show the tail(last five rows) of the data, use:
temperature_data.tail()
Year Annual_Mean 5-year_Mean
130 2010 0.66 0.57
131 2011 0.55 0.59
132 2012 0.57 0.61
133 2013 0.60 NaN
134 2014 0.67 NaN
To show simple statistics of the data, use:
temperature_data.describe
A quick description of our dataframe:
Year Annual_Mean 5-year_Mean
count 135.000000 135.000000 131.000000
mean 1947.000000 -0.009556 -0.015649
std 39.115214 0.300638 0.284251
min 1880.000000 -0.480000 -0.450000
25% 1913.500000 -0.235000 -0.250000
50% 1947.000000 -0.070000 -0.050000
75% 1980.500000 0.135000 0.145000
max 2014.000000 0.670000 0.610000
We can combine those result above with print
.
Referencing Columns
Pandas offers two different ways to access data in a column. These syntactically similar ways can have their respective benefits. However, to help with readability it is generally a good idea to use the same method throughout your code.
Before we start selecting individual columns, let’s take a look at our column labels using columns.
temperature_data.columns
Index(['Year', 'Annual_Mean', '5-year_Mean'], dtype='object')
We can also look at our indexes with something very similar.
temperature_data.index
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
125, 126, 127, 128, 129, 130, 131, 132, 133, 134],
dtype='int64', length=135)
While not particularly useful in this example, there are cases when knowing the indexes can be very valuable.
Now that we know our indexes and columns we can start exploring our data a bit more. Let’s take a look at the years in our dataframe.
temperature_data["Year"]
0 1880
1 1881
2 1882
3 1883
4 1884
...
130 2010
131 2011
132 2012
133 2013
134 2014
Name: Year, dtype: int64
We can also use the following syntax.
temperature_data.Year
0 1880
1 1881
2 1882
3 1883
4 1884
...
130 2010
131 2011
132 2012
133 2013
134 2014
Name: Year, dtype: int64
While both techniques are valid, for this lesson we are going to use the temperature_data[“Year”] syntax.
Aggregate operations
There are a number of aggregate operations that can be very useful for doing some quick analysis and verification of your dataset. They are .count(), .max(), .min(), .std(), .mean(), and .sum(). These functions works as one might anticipate.
Let’s count the number of values in our year column.
temperature_data["Year"].count()
135
Let’s look for the max and min year.
print(temperature_data["Year"].max())
print(temperature_data["Year"].min())
2014
1880
Next the standard deviation, mean, and sum.
print(temperature_data["Annual_Mean"].std())
print(temperature_data["Annual_Mean"].mean())
print(temperature_data["Annual_Mean"].sum())
0.300638292778
-0.00955555555556
-1.29
While this might not be the most interesting data analysis, these examples do show the methods functionality and use.
Indexing
Pandas supports some nice indexing and slicing features for dataframes that are common to Python generally.
For example if you wanted to look at the first five years of data.
temperature_data[:5]
Year Annual_Mean 5-year_Mean
0 1880 -0.23 NaN
1 1881 -0.15 NaN
2 1882 -0.18 -0.21
3 1883 -0.21 -0.22
4 1884 -0.29 -0.24
Or the last five years.
temperature_data[-5:]
Year Annual_Mean 5-year_Mean
130 2010 0.66 0.57
131 2011 0.55 0.59
132 2012 0.57 0.61
133 2013 0.60 NaN
134 2014 0.67 NaN
You can even do your own odd mix.
temperature_data[34:58]
Year Annual_Mean 5-year_Mean
34 1914 -0.24 -0.32
35 1915 -0.17 -0.33
36 1916 -0.37 -0.31
37 1917 -0.45 -0.32
38 1918 -0.32 -0.34
39 1919 -0.30 -0.31
40 1920 -0.28 -0.28
41 1921 -0.21 -0.27
42 1922 -0.30 -0.26
43 1923 -0.27 -0.25
44 1924 -0.25 -0.23
45 1925 -0.23 -0.21
46 1926 -0.11 -0.19
47 1927 -0.20 -0.20
48 1928 -0.17 -0.18
49 1929 -0.32 -0.18
50 1930 -0.13 -0.16
51 1931 -0.08 -0.18
52 1932 -0.11 -0.13
53 1933 -0.26 -0.14
54 1934 -0.10 -0.15
55 1935 -0.16 -0.12
56 1936 -0.11 -0.06
57 1937 0.03 -0.04
Properly using indexes can help you select data and analysis in a faster, slightly more logical way. Remember the results from our command temperature_data.index? It was a list of numbers from 0-134. It would be nice if our years were actually our index labels. That would help us select data, say from 1882-1892 fairly easily (without having to think about the zero-index location for those years).
To change our index we can use the .set_index method.
temperature_data_idx = temperature_data.set_index("Year")
Now let’s look at our indexes in our new dataframe, temperature_data_idx.
temperature_data_idx.index
Int64Index([1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889,
...
2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014],
dtype='int64', name='Year', length=135)
A quick note about why we had to create a new dataframe. Many dataframe methods are not done “in place.” Which means if you run the method on the dataframe it won’t change your original dataframe unless you set it to a new dataframe. We could have set the set_index
option to True
or we could have set the dataframe to itself by using the same temperature_data
but we wanted to take a second to use this as a learning opportunity.
With our newly indexed dataframe we can access slices of data by the our index labels using .loc and .ix. This is an improvement to having to know the index number (and with .iloc we haven’t lost our ability to access our data via the index number). This all probably sounding a bit confusing, so let’s look at each method and some examples.
.loc is a label-based index selector. Here label-based is probably best understood as a string or text, instead of an index location. That means with our newly indexed dataframe we can quickly access data by year.
temperature_data_idx.loc[1889:1903]
Annual_Mean 5-year_Mean
Year
1889 -0.11 -0.25
1890 -0.34 -0.25
1891 -0.27 -0.28
1892 -0.32 -0.32
1893 -0.36 -0.31
1894 -0.33 -0.29
1895 -0.26 -0.27
1896 -0.19 -0.26
1897 -0.19 -0.23
1898 -0.32 -0.21
1899 -0.21 -0.22
1900 -0.16 -0.24
1901 -0.21 -0.25
1902 -0.31 -0.30
1903 -0.37 -0.33
We can also select specific columns.
temperature_data_idx.loc[1889:1903, "Annual_Mean"]
Year
1889 -0.11
1890 -0.34
1891 -0.27
1892 -0.32
1893 -0.36
1894 -0.33
1895 -0.26
1896 -0.19
1897 -0.19
1898 -0.32
1899 -0.21
1900 -0.16
1901 -0.21
1902 -0.31
1903 -0.37
Name: Annual_Mean, dtype: float64
And we can add on our aggregate methods too.
temperature_data_idx.loc[1889:1903, "Annual_Mean"].std()
0.080326713815398632
We can also use .iloc if we want slice data by index location (zero-based). For example if we wanted to look at the last 10 years in our data set.
temperature_data_idx.iloc[-10:]
Annual_Mean 5-year_Mean
Year
2005 0.65 0.59
2006 0.59 0.57
2007 0.62 0.59
2008 0.49 0.59
2009 0.59 0.58
2010 0.66 0.57
2011 0.55 0.59
2012 0.57 0.61
2013 0.60 NaN
2014 0.67 NaN
We can also select columns and add aggregate methods too. However, because we are selecting based on index location we will have to refer to our columns by their zero-indexed location too.
temperature_data_idx.iloc[-10:, 1].mean()
0.58624999999999994
Quick question, which column do you think we selected?
Lastly there is .ix. .ix is mix of .loc and .iloc - that is it first looks for a label-based selector and then checks for an integer location.
temperature_data_idx.ix[1921:1937]
Annual_Mean 5-year_Mean
Year
1921 -0.21 -0.27
1922 -0.30 -0.26
1923 -0.27 -0.25
1924 -0.25 -0.23
1925 -0.23 -0.21
1926 -0.11 -0.19
1927 -0.20 -0.20
1928 -0.17 -0.18
1929 -0.32 -0.18
1930 -0.13 -0.16
1931 -0.08 -0.18
1932 -0.11 -0.13
1933 -0.26 -0.14
1934 -0.10 -0.15
1935 -0.16 -0.12
1936 -0.11 -0.06
1937 0.03 -0.04
Now to show .ix index selection ability.
temperature_data_idx.ix[7:12]
Empty DataFrame
Columns: [Annual_Mean, 5-year_Mean]
Index: []
That’s odd, an empty dataframe? In a fun gotcha, when an index is integer based, like our current example, label based access is only supported for .ix because it is difficult for python to know if you’re trying to access your data via the label or the integer location. Our empty dataframe is because we don’t have any data for years 7 through 12. But, there is a workaround for our situation (using our previous dataframe). Which, reminds us to remember that if we obtain a result or an error that we didn’t anticipate it is always best to read the documentation! : )
temperature_data.ix[0:5]
Year Annual_Mean 5-year_Mean
0 1880 -0.23 NaN
1 1881 -0.15 NaN
2 1882 -0.18 -0.21
3 1883 -0.21 -0.22
4 1884 -0.29 -0.24
5 1885 -0.27 -0.27
Finally, we can count the number of NaN in data using Numpy provided function as the following,
np.count_nonzero(~np.isnan(data))
From the best practice, Pandas is suitable and intended for dataframe, whileNumpy is for matrix/array. If data has header (columns label), it is better to process with Pandas, otherwise Numpy’s array works in similar way of Matlab/Octave.
Exercises
- Analyze temperature statistics.
Key Points
Use Pandas data frames to store tabular data for statistical analysis.
Data frame operations make (most) loops unnecessary.