Lecture 8 - Data Manipulation with pandas¶
8.1 Introduction to pandas
¶
pandas
is a library designed for working with structured data, such as tabular data (e.g., from .csv files, Excel files) or relational databases (e.g., SQL).
The DataFrame object in
pandas
is a 2-dimensional tabular, column-oriented data structure. The DataFrame is similar to an Excel spreadsheet and can store data of different types (including text characters, integers, floating-point values, categorical data, and more).
Figure source: Reference [2].
The pandas
name is derived from the term panel data, which is a term from economics for multi-dimensional structured data.
8.2 Importing Data and Summary Statistics¶
Let’s begin by importing the pandas
package using the common abbreviation pd
. Loading pandas
as pd
is standard practice.
[1]:
import pandas as pd
A wide range of input/output formats are supported by pandas
:
CSV, text
SQL database
Excel
HDF5
json
html
pickle
sas, stata
…
For importing .csv files, the function read_csv()
in pandas
makes it easy to import data. By default, it assumes that the data is comma-separated, but we can also specify the delimiter used in the data (e.g., tab, semicolon, etc.). There are several parameters that can be specified in read_csv()
. See the documentation here.
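As a quick illustration of the sep parameter, the sketch below reads a small semicolon-separated sample from an in-memory string (the values are made up for illustration; read_csv accepts a file path, a URL, or any file-like object):

```python
import io
import pandas as pd

# A small semicolon-separated sample, standing in for a real file
# (the contents below are made up for illustration)
raw = "country;month;rate\nat;1993.01;4.5\nbe;1993.01;6.2"

# sep specifies the delimiter used in the data
df = pd.read_csv(io.StringIO(raw), sep=';')
print(df.shape)  # (2, 3)
```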
Let’s load the data in the file country_total.csv
located in the folder data
and save it under the name unemployment
. This file contains unemployment information for several countries over a time period.
The function read_csv()
returns a DataFrame
, as shown below. Note that the DataFrame has 20,796 rows; the cell output displays only the first 5 and last 5 rows (the exact number depends on display settings), since displaying all 20,796 rows is rarely useful.
[2]:
# Import data
unemployment = pd.read_csv('data/country_total.csv')
# Show the DataFrame
unemployment
[2]:
country | seasonality | month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
3 | at | nsa | 1993.04 | 157000 | 4.1 |
4 | at | nsa | 1993.05 | 147000 | 3.9 |
... | ... | ... | ... | ... | ... |
20791 | uk | trend | 2010.06 | 2429000 | 7.7 |
20792 | uk | trend | 2010.07 | 2422000 | 7.7 |
20793 | uk | trend | 2010.08 | 2429000 | 7.7 |
20794 | uk | trend | 2010.09 | 2447000 | 7.8 |
20795 | uk | trend | 2010.10 | 2455000 | 7.8 |
20796 rows × 5 columns
We could have also used print
to display the DataFrame.
[3]:
print(unemployment)
country seasonality month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4
3 at nsa 1993.04 157000 4.1
4 at nsa 1993.05 147000 3.9
... ... ... ... ... ...
20791 uk trend 2010.06 2429000 7.7
20792 uk trend 2010.07 2422000 7.7
20793 uk trend 2010.08 2429000 7.7
20794 uk trend 2010.09 2447000 7.8
20795 uk trend 2010.10 2455000 7.8
[20796 rows x 5 columns]
When a DataFrame has a large number of rows that take up much of the screen, we can inspect the data by using the .head()
method. By default, this shows the header (the names of the columns, commonly referred to as column labels) and the first five rows (with indices ranging from 0 to 4, shown in the first column of the table). The indices are also referred to as row labels.
[4]:
unemployment.head()
[4]:
country | seasonality | month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
3 | at | nsa | 1993.04 | 157000 | 4.1 |
4 | at | nsa | 1993.05 | 147000 | 3.9 |
Passing an integer \(n\) to .head(n)
returns the first \(n\) rows. To see the last \(n\) rows, use .tail(n)
.
[5]:
# show the last 8 rows
unemployment.tail(8)
[5]:
country | seasonality | month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
20788 | uk | trend | 2010.03 | 2437000 | 7.8 |
20789 | uk | trend | 2010.04 | 2419000 | 7.8 |
20790 | uk | trend | 2010.05 | 2419000 | 7.7 |
20791 | uk | trend | 2010.06 | 2429000 | 7.7 |
20792 | uk | trend | 2010.07 | 2422000 | 7.7 |
20793 | uk | trend | 2010.08 | 2429000 | 7.7 |
20794 | uk | trend | 2010.09 | 2447000 | 7.8 |
20795 | uk | trend | 2010.10 | 2455000 | 7.8 |
To find the number of rows in a DataFrame, you can use the len()
function, as with Python lists and other sequences.
[6]:
len(unemployment)
[6]:
20796
Alternatively, we can use the shape
attribute to find the number of rows and columns, as with NumPy arrays. The cell output is a tuple, showing that there are 20,796 rows and 5 columns. Note that the left-most column in the above table showing row indices is not part of the data.
[7]:
unemployment.shape
[7]:
(20796, 5)
A useful method that generates various summary statistics of a DataFrame is .describe()
, as shown below.
Notice in the cell below that the DataFrame has 5 columns, but the first 2 columns (country and seasonality) contain textual (string) data, and therefore the summary statistics are shown only for the columns with numeric data (month, unemployment, and unemployment_rate). If .describe()
is called on textual data only, it returns the count, the number of unique values, and the most frequent value along with its count.
[8]:
unemployment.describe()
[8]:
month | unemployment | unemployment_rate | |
---|---|---|---|
count | 20796.000000 | 2.079600e+04 | 19851.000000 |
mean | 1999.401290 | 7.900818e+05 | 8.179764 |
std | 7.483751 | 1.015280e+06 | 3.922533 |
min | 1983.010000 | 2.000000e+03 | 1.100000 |
25% | 1994.090000 | 1.400000e+05 | 5.200000 |
50% | 2001.010000 | 3.100000e+05 | 7.600000 |
75% | 2006.010000 | 1.262250e+06 | 10.000000 |
max | 2010.120000 | 4.773000e+06 | 20.900000 |
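To see how .describe() behaves on textual data, here is a minimal sketch using a made-up subset of the unemployment columns:

```python
import pandas as pd

df = pd.DataFrame({'country': ['at', 'at', 'be'],
                   'rate': [4.5, 4.6, 6.2]})

# describe() on a string column reports count, unique,
# top (most frequent value), and freq (its count)
summary = df['country'].describe()
print(summary)  # count 3, unique 2, top 'at', freq 2
```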
It is also possible to calculate individual statistics, such as .min()
.max()
, or .mean()
, instead of using summary statistics with .describe()
.
[9]:
unemployment.min()
[9]:
country at
seasonality nsa
month 1983.01
unemployment 2000
unemployment_rate 1.1
dtype: object
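Note that in newer pandas versions these reduction methods accept a numeric_only parameter, which restricts the calculation to numeric columns. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'country': ['at', 'be'],
                   'unemployment_rate': [4.5, 6.2]})

# numeric_only=True skips the string column entirely
print(df.min(numeric_only=True))  # unemployment_rate    4.5
```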
To view the data types for each column, use the dtypes
attribute of the unemployment DataFrame. The data types in this case are strings (object
type), floats (float64
type), and integers (int64
type).
[10]:
unemployment.dtypes
[10]:
country object
seasonality object
month float64
unemployment int64
unemployment_rate float64
dtype: object
One more way to get a summary of a DataFrame is with the .info()
method.
[11]:
unemployment.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20796 entries, 0 to 20795
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 20796 non-null object
1 seasonality 20796 non-null object
2 month 20796 non-null float64
3 unemployment 20796 non-null int64
4 unemployment_rate 19851 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 812.5+ KB
Import Data From a URL¶
Above, we imported the unemployment data using the function read_csv
and a relative file path to the data
directory. The function read_csv
is very flexible and also allows importing data by passing a URL as the file path.
A csv file with data on world countries and their abbreviations is located at https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv
Using read_csv
, we can import the country data and save it to the variable countries
. This DataFrame has 30 rows.
[12]:
countries = pd.read_csv('https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv')
countries
[12]:
country | google_country_code | country_group | name_en | name_fr | name_de | latitude | longitude | |
---|---|---|---|---|---|---|---|---|
0 | at | AT | eu | Austria | Autriche | Österreich | 47.696554 | 13.345980 |
1 | be | BE | eu | Belgium | Belgique | Belgien | 50.501045 | 4.476674 |
2 | bg | BG | eu | Bulgaria | Bulgarie | Bulgarien | 42.725674 | 25.482322 |
3 | hr | HR | non-eu | Croatia | Croatie | Kroatien | 44.746643 | 15.340844 |
4 | cy | CY | eu | Cyprus | Chypre | Zypern | 35.129141 | 33.428682 |
5 | cz | CZ | eu | Czech Republic | République tchèque | Tschechische Republik | 49.803531 | 15.474998 |
6 | dk | DK | eu | Denmark | Danemark | Dänemark | 55.939684 | 9.516689 |
7 | ee | EE | eu | Estonia | Estonie | Estland | 58.592469 | 25.806950 |
8 | fi | FI | eu | Finland | Finlande | Finnland | 64.950159 | 26.067564 |
9 | fr | FR | eu | France | France | Frankreich | 46.710994 | 1.718561 |
10 | de | DE | eu | Germany (including former GDR from 1991) | Allemagne (incluant l'ancienne RDA à partir de... | Deutschland (einschließlich der ehemaligen DDR... | 51.163825 | 10.454048 |
11 | gr | GR | eu | Greece | Grèce | Griechenland | 39.698467 | 21.577256 |
12 | hu | HU | eu | Hungary | Hongrie | Ungarn | 47.161163 | 19.504265 |
13 | ie | IE | eu | Ireland | Irlande | Irland | 53.415260 | -8.239122 |
14 | it | IT | eu | Italy | Italie | Italien | 42.504191 | 12.573787 |
15 | lv | LV | eu | Latvia | Lettonie | Lettland | 56.880117 | 24.606555 |
16 | lt | LT | eu | Lithuania | Lituanie | Litauen | 55.173687 | 23.943168 |
17 | lu | LU | eu | Luxembourg | Luxembourg | Luxemburg | 49.815319 | 6.133352 |
18 | mt | MT | eu | Malta | Malte | Malta | 35.902422 | 14.447461 |
19 | nl | NL | eu | Netherlands | Pays-Bas | Niederlande | 52.108118 | 5.330198 |
20 | no | NO | non-eu | Norway | Norvège | Norwegen | 64.556460 | 12.665766 |
21 | pl | PL | eu | Poland | Pologne | Polen | 51.918907 | 19.134334 |
22 | pt | PT | eu | Portugal | Portugal | Portugal | 39.558069 | -7.844941 |
23 | ro | RO | eu | Romania | Roumanie | Rumänien | 45.942611 | 24.990152 |
24 | sk | SK | eu | Slovakia | Slovaquie | Slowakei | 48.672644 | 19.700032 |
25 | si | SI | eu | Slovenia | Slovénie | Slowenien | 46.149259 | 14.986617 |
26 | es | ES | eu | Spain | Espagne | Spanien | 39.895013 | -2.988296 |
27 | se | SE | eu | Sweden | Suède | Schweden | 62.198467 | 14.896307 |
28 | tr | TR | non-eu | Turkey | Turquie | Türkei | 38.952942 | 35.439795 |
29 | uk | GB | eu | United Kingdom | Royaume-Uni | Vereinigtes Königreich | 54.315447 | -2.232612 |
Similar to the above example, we can use shape
and describe()
to understand the countries DataFrame. In this case .describe()
is not very useful, because only 2 of the columns have numeric values.
[13]:
countries.shape
[13]:
(30, 8)
[14]:
# explore the countries data
countries.describe()
[14]:
latitude | longitude | |
---|---|---|
count | 30.000000 | 30.000000 |
mean | 49.092609 | 14.324579 |
std | 7.956624 | 11.257010 |
min | 35.129141 | -8.239122 |
25% | 43.230916 | 6.979186 |
50% | 49.238087 | 14.941462 |
75% | 54.090400 | 23.351690 |
max | 64.950159 | 35.439795 |
Import Data from Excel File¶
In a similar way, we can import data from an Excel file using the function read_excel()
.
[15]:
titanic = pd.read_excel('data/titanic.xls')
titanic
[15]:
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | None | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | None | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | None | None | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | None | 135 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | None | None | Montreal, PQ / Chesterville, ON |
5 | 1 | 1 | Anderson, Mr. Harry | male | 48 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | None | New York, NY |
6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | None | Hudson, NY |
7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39 | 0 | 0 | 112050 | 0.0000 | A36 | S | None | None | Belfast, NI |
8 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53 | 2 | 0 | 11769 | 51.4792 | C101 | S | D | None | Bayside, Queens, NY |
9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71 | 0 | 0 | PC 17609 | 49.5042 | None | C | None | 22 | Montevideo, Uruguay |
10 | 1 | 0 | Astor, Col. John Jacob | male | 47 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | None | 124 | New York, NY |
11 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | 4 | None | New York, NY |
12 | 1 | 1 | Aubart, Mme. Leontine Pauline | female | 24 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | None | Paris, France |
13 | 1 | 1 | Barber, Miss. Ellen "Nellie" | female | 26 | 0 | 0 | 19877 | 78.8500 | None | S | 6 | None | None |
14 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80 | 0 | 0 | 27042 | 30.0000 | A23 | S | B | None | Hessle, Yorks |
15 | 1 | 0 | Baumann, Mr. John D | male | None | 0 | 0 | PC 17318 | 25.9250 | None | S | None | None | New York, NY |
16 | 1 | 0 | Baxter, Mr. Quigg Edmond | male | 24 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | None | None | Montreal, PQ |
17 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 6 | None | Montreal, PQ |
18 | 1 | 1 | Bazzani, Miss. Albina | female | 32 | 0 | 0 | 11813 | 76.2917 | D15 | C | 8 | None | None |
19 | 1 | 0 | Beattie, Mr. Thomson | male | 36 | 0 | 0 | 13050 | 75.2417 | C6 | C | A | None | Winnipeg, MN |
Note that all DataFrames that we loaded so far in this Jupyter notebook, i.e., unemployment
, countries
, and titanic
, are stored in memory, and we can access them whenever needed once they are loaded. For example, we can check the shape of the titanic
DataFrame.
[16]:
titanic.shape
[16]:
(20, 14)
Create a DataFrame¶
Alternatively, we can manually create DataFrames, instead of importing from a file. The following DataFrame called simple_table
contains information from the titanic
data. Notice that the DataFrame is created much like a dictionary, where the column headers are the keys, and the data in each column is a list of values.
[17]:
simple_table = pd.DataFrame({
"Name": ["Braund, Mr. Owen Harris",
"Allen, Mr. William Henry",
"Bonnell, Miss. Elizabeth"],
"Age": [22, 35, 58],
"Sex": ["male", "male", "female"]})
simple_table
[17]:
Name | Age | Sex | |
---|---|---|---|
0 | Braund, Mr. Owen Harris | 22 | male |
1 | Allen, Mr. William Henry | 35 | male |
2 | Bonnell, Miss. Elizabeth | 58 | female |
[18]:
simple_table.shape
[18]:
(3, 3)
8.3 Rename, Index, and Slice¶
Let’s look again at the unemployment DataFrame. You may have noticed that the month
column actually encodes both the year and the month as a single decimal number (e.g., 1993.01 represents year 1993, month 01).
[19]:
unemployment.head(3)
[19]:
country | seasonality | month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
Let’s rename the column to year_month. The .rename()
method allows modifying column and/or row names. As you can see in the cell below, we passed a dictionary to the columns
parameter, with the original name month
as the key and the new name year_month
as the value. Importantly, we also set the inplace
parameter to True
, which indicates that we want to modify the actual DataFrame, and not to create a new DataFrame.
[20]:
unemployment.rename(columns={'month' : 'year_month'}, inplace=True)
unemployment.head(3)
[20]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
To observe the effect of inplace=True
, let’s run in the next cell another .rename()
method to change the column country
to year
.
[21]:
unemployment.rename(columns={'country' : 'year'}).head(3)
[21]:
year | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
The above code didn’t change the actual unemployment
DataFrame, as we can verify in the following cell. Instead, it returned a modified copy of the unemployment
DataFrame in which it changed the name of the column country
.
[22]:
unemployment.head(3)
[22]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
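An alternative to inplace=True, often preferred in modern pandas code, is to assign the result of .rename() back to the variable. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'month': [1993.01], 'rate': [4.5]})

# rename returns a new DataFrame; assigning it back to df
# has the same net effect as inplace=True
df = df.rename(columns={'month': 'year_month'})
print(df.columns.tolist())  # ['year_month', 'rate']
```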
Selecting Columns¶
To select a single column of the DataFrame, we can either use the name of the column enclosed in square brackets []
or the dot notation .
(i.e., via attribute access). It is preferable to use the square brackets notation, since a column name might inadvertently have the same name as a built-in pandas
method.
[23]:
# access with square brackets
unemployment['year_month'].head()
[23]:
0 1993.01
1 1993.02
2 1993.03
3 1993.04
4 1993.05
Name: year_month, dtype: float64
[24]:
# access with dot notation
unemployment.year_month.head()
[24]:
0 1993.01
1 1993.02
2 1993.03
3 1993.04
4 1993.05
Name: year_month, dtype: float64
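To see why the bracket notation is safer, consider a hypothetical column named shape: attribute access returns the DataFrame's shape attribute, not the column, while bracket access always returns the column.

```python
import pandas as pd

df = pd.DataFrame({'shape': ['circle', 'square']})

print(df['shape'])   # the column, as a Series
print(df.shape)      # the built-in attribute: (2, 1)
```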
When selecting a single column, we obtain a pandas
Series object, which is a single vector of data with an associated array of index row labels shown in the left-most column.
A Series object in
pandas
represents a 1-dimensional vector of data (i.e., a column of data).
If we check the type of the unemployment
object and unemployment['year_month']
object, we can see that the first one is DataFrame and the second one is Series.
[25]:
type(unemployment)
[25]:
pandas.core.frame.DataFrame
[26]:
type(unemployment['year_month'])
[26]:
pandas.core.series.Series
pandas
provides many methods that can be applied to Series
objects. A few examples are shown below.
[27]:
print('minimum is ', unemployment['year_month'].min())
print('maximum is ', unemployment['year_month'].max())
print('mean value is ', unemployment['year_month'].mean())
minimum is 1983.01
maximum is 2010.12
mean value is 1999.4012896710906
To select multiple columns in pandas
, use a list of column names within the selection brackets []
.
[28]:
unemployment[['country','year_month']].head()
[28]:
country | year_month | |
---|---|---|
0 | at | 1993.01 |
1 | at | 1993.02 |
2 | at | 1993.03 |
3 | at | 1993.04 |
4 | at | 1993.05 |
Selecting Rows¶
One way to select rows is by using the []
operator, similar to indexing and slicing in Python lists and other sequence data.
[29]:
unemployment[0:4]
[29]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
3 | at | nsa | 1993.04 | 157000 | 4.1 |
Another graphical representation of a DataFrame is shown in the figure below.
Figure source: Reference [2].
The first column with the indices in pandas
DataFrames does not need to be a sequence of integers; it can also contain strings or other data (e.g., dates, years).
For instance, let’s create a DataFrame
called bacteria to see how indexing with string indices works. We again pass in a dictionary, with the keys corresponding to column names and the values to the data, and in addition we pass a list of strings called index
. (Compare to the simple_table above, which does not use index
for creating the DataFrame, and in that case, the indices were automatically set to integer numbers.)
[30]:
bacteria = pd.DataFrame({'bacteria_counts' : [632, 1638, 569, 115],
'other_feature' : [438, 833, 234, 298]},
index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
bacteria
[30]:
bacteria_counts | other_feature | |
---|---|---|
Firmicutes | 632 | 438 |
Proteobacteria | 1638 | 833 |
Actinobacteria | 569 | 234 |
Bacteroidetes | 115 | 298 |
For selecting rows and/or columns in pandas
, besides the use of square brackets, two other operators are used: .loc
and .iloc
.
The operator .loc
selects rows and columns by label, and .iloc
selects them by integer position. These two operators can accept either a single label, a list of labels, or a slice of labels (e.g., 'a':'f'
or 2:5
).
For instance, if we’re interested in the row Actinobacteria, we can use .loc
and the index name. This returns the column values for the specified row.
[31]:
bacteria.loc['Actinobacteria']
[31]:
bacteria_counts 569
other_feature 234
Name: Actinobacteria, dtype: int64
We could have also used “positional indexing” with square brackets []
, even though the indices are strings. The difference is that .loc
returns a Series
object because we selected a single label, while [2:3]
returns a DataFrame
because we selected a range of positions.
[32]:
bacteria[2:3]
[32]:
bacteria_counts | other_feature | |
---|---|---|
Actinobacteria | 569 | 234 |
Let’s return to the unemployment data to show how .iloc
is used, since unemployment has integer indices. To select specific rows, we can do the following.
[33]:
unemployment.iloc[0:4]
[33]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 166000 | 4.4 |
3 | at | nsa | 1993.04 | 157000 | 4.1 |
[34]:
unemployment.iloc[[1, 5, 6, 22]]
[34]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
1 | at | nsa | 1993.02 | 175000 | 4.6 |
5 | at | nsa | 1993.06 | 134000 | 3.5 |
6 | at | nsa | 1993.07 | 128000 | 3.4 |
22 | at | nsa | 1994.11 | 148000 | 3.9 |
We can also select a range of rows and specify the step value.
[35]:
unemployment.iloc[25:50:5]
[35]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
25 | at | nsa | 1995.02 | 174000 | 4.5 |
30 | at | nsa | 1995.07 | 123000 | 3.3 |
35 | at | nsa | 1995.12 | 175000 | 4.7 |
40 | at | nsa | 1996.05 | 159000 | 4.3 |
45 | at | nsa | 1996.10 | 146000 | 3.9 |
Selecting a Specific Value¶
Both .loc
and .iloc
can be used to select a particular value if they are given two arguments. The first argument is the row name (when using .loc
) or the row index number (when using .iloc
), while the second argument is the column name or index number.
Using loc
, we can select “Bacteroidetes” and “bacteria_counts” to get the count of Bacteroidetes, as in the next cell below.
[36]:
bacteria
[36]:
bacteria_counts | other_feature | |
---|---|---|
Firmicutes | 632 | 438 |
Proteobacteria | 1638 | 833 |
Actinobacteria | 569 | 234 |
Bacteroidetes | 115 | 298 |
[37]:
bacteria.loc['Bacteroidetes', 'bacteria_counts']
[37]:
115
[38]:
# The same as above; .iloc also accepts row and column positions directly
# (chained indexing such as .iloc[3][0] is deprecated in newer pandas)
bacteria.iloc[3, 0]
[38]:
115
[39]:
# This is the same as above
bacteria.iloc[3]['bacteria_counts']
[39]:
115
Or, for the unemployment data:
[40]:
# The year_month in the first row
unemployment.iloc[0,2]
[40]:
1993.01
[41]:
# This is the same as above, using chained indexing
unemployment.iloc[0][2]
[41]:
1993.01
Selecting Multiple Rows and Columns¶
Both .loc
and .iloc
can be used to select subsets of rows and columns at the same time if they are given lists as their two arguments, or slices for .iloc
.
[42]:
unemployment.iloc[2:6,0:2]
[42]:
country | seasonality | |
---|---|---|
2 | at | nsa |
3 | at | nsa |
4 | at | nsa |
5 | at | nsa |
Using .iloc
on the unemployment DataFrame, select the rows starting at row 2 and ending at row 5, and the 0th, 2nd, and 3rd columns.
[43]:
unemployment.iloc[2:6,[0,2,3]]
[43]:
country | year_month | unemployment | |
---|---|---|---|
2 | at | 1993.03 | 166000 |
3 | at | 1993.04 | 157000 |
4 | at | 1993.05 | 147000 |
5 | at | 1993.06 | 134000 |
A similar selection can be achieved by using the .loc
operator and listing the column names. Note, however, that a .loc slice is inclusive of its endpoint, so 2:6 also returns the row with label 6.
[44]:
# Similar to above, but .loc includes the endpoint label 6
unemployment.loc[2:6,['country', 'year_month', 'unemployment']]
[44]:
country | year_month | unemployment | |
---|---|---|---|
2 | at | 1993.03 | 166000 |
3 | at | 1993.04 | 157000 |
4 | at | 1993.05 | 147000 |
5 | at | 1993.06 | 134000 |
6 | at | 1993.07 | 128000 |
We can also display values from a DataFrame that satisfy certain criteria using conditional expressions, such as <
, >
, ==
, !=
, etc.
In the example below, only the rows with an unemployment rate greater than 15 are selected.
[45]:
unemployment[unemployment['unemployment_rate'] > 15].head(10)
[45]:
country | seasonality | year_month | unemployment | unemployment_rate | |
---|---|---|---|---|---|
1717 | bg | nsa | 2000.02 | 523000 | 15.4 |
1718 | bg | nsa | 2000.03 | 547000 | 16.0 |
1719 | bg | nsa | 2000.04 | 560000 | 16.3 |
1720 | bg | nsa | 2000.05 | 561000 | 16.3 |
1721 | bg | nsa | 2000.06 | 554000 | 16.2 |
1722 | bg | nsa | 2000.07 | 558000 | 16.3 |
1723 | bg | nsa | 2000.08 | 569000 | 16.7 |
1724 | bg | nsa | 2000.09 | 574000 | 16.8 |
1725 | bg | nsa | 2000.10 | 583000 | 17.1 |
1726 | bg | nsa | 2000.11 | 597000 | 17.5 |
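Multiple conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'country': ['at', 'bg', 'bg'],
                   'unemployment_rate': [4.5, 16.0, 14.0]})

# parentheses are required around each condition
high_bg = df[(df['country'] == 'bg') & (df['unemployment_rate'] > 15)]
print(len(high_bg))  # 1
```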
Differences between loc
and iloc
¶
To show the differences between loc
and iloc
, let’s consider the following example.
[46]:
df1 = pd.DataFrame({'x':[10, 20, 30, 40 ,50],
'y':[20, 30, 40, 50, 60],
'z':[30, 40, 50, 60, 70]},
index=[10, 11,12, 0, 1])
df1
[46]:
x | y | z | |
---|---|---|---|
10 | 10 | 20 | 30 |
11 | 20 | 30 | 40 |
12 | 30 | 40 | 50 |
0 | 40 | 50 | 60 |
1 | 50 | 60 | 70 |
Note in the following cells that iloc
selects the row with index location 0
, whereas loc
selects the row with index label 0
.
[47]:
# value at index location 0
df1.iloc[0]
[47]:
x 10
y 20
z 30
Name: 10, dtype: int64
[48]:
# value at index label 0
df1.loc[0]
[48]:
x 40
y 50
z 60
Name: 0, dtype: int64
Also, there is a difference in the selected rows when using slicing operations with iloc
and loc
. One must be careful when using these operators, and always check the output to ensure it is as expected.
[49]:
# rows at index location between 0 and 1 (exclusive)
df1.iloc[0:1]
[49]:
x | y | z | |
---|---|---|---|
10 | 10 | 20 | 30 |
[50]:
# rows at index labels between 0 and 1 (inclusive)
df1.loc[0:1]
[50]:
x | y | z | |
---|---|---|---|
0 | 40 | 50 | 60 |
1 | 50 | 60 | 70 |
8.4 Creating New Columns, Reordering¶
Since the year_month
column encodes two values in a single number, let’s split it into separate columns for years and months.
In the previous section we saw that the data type in this column is float64
. We’ll first extract the year using the .astype()
method. This allows for type casting, i.e., using .astype(int)
we will convert the floating point values into integer numbers (by truncating the decimals).
The new column year will be added on the right of the DataFrame.
[51]:
unemployment['year'] = unemployment['year_month'].astype(int)
unemployment.head()
[51]:
country | seasonality | year_month | unemployment | unemployment_rate | year | |
---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 | 1993 |
1 | at | nsa | 1993.02 | 175000 | 4.6 | 1993 |
2 | at | nsa | 1993.03 | 166000 | 4.4 | 1993 |
3 | at | nsa | 1993.04 | 157000 | 4.1 | 1993 |
4 | at | nsa | 1993.05 | 147000 | 3.9 | 1993 |
Next, let’s create a new column month. We will subtract the year value from year_month to get the decimal portion of the value, and multiply the result by 100 and convert to int
. Because of the truncating that occurs when casting to int
, we first need to round the values to the nearest whole number using round()
.
[52]:
unemployment['month'] = ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int)
unemployment.head(12)
[52]:
country | seasonality | year_month | unemployment | unemployment_rate | year | month | |
---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 171000 | 4.5 | 1993 | 1 |
1 | at | nsa | 1993.02 | 175000 | 4.6 | 1993 | 2 |
2 | at | nsa | 1993.03 | 166000 | 4.4 | 1993 | 3 |
3 | at | nsa | 1993.04 | 157000 | 4.1 | 1993 | 4 |
4 | at | nsa | 1993.05 | 147000 | 3.9 | 1993 | 5 |
5 | at | nsa | 1993.06 | 134000 | 3.5 | 1993 | 6 |
6 | at | nsa | 1993.07 | 128000 | 3.4 | 1993 | 7 |
7 | at | nsa | 1993.08 | 130000 | 3.4 | 1993 | 8 |
8 | at | nsa | 1993.09 | 132000 | 3.5 | 1993 | 9 |
9 | at | nsa | 1993.10 | 141000 | 3.7 | 1993 | 10 |
10 | at | nsa | 1993.11 | 156000 | 4.1 | 1993 | 11 |
11 | at | nsa | 1993.12 | 169000 | 4.4 | 1993 | 12 |
Now, let’s try to reorder the newly created year and month columns in the DataFrame. For this, we will use the square brackets notation again, passing in a list of column names in the order we would like to see them.
[53]:
unemployment = unemployment[['country', 'seasonality',
'year_month', 'year', 'month',
'unemployment', 'unemployment_rate']]
unemployment.head(10)
[53]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | |
---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 |
5 | at | nsa | 1993.06 | 1993 | 6 | 134000 | 3.5 |
6 | at | nsa | 1993.07 | 1993 | 7 | 128000 | 3.4 |
7 | at | nsa | 1993.08 | 1993 | 8 | 130000 | 3.4 |
8 | at | nsa | 1993.09 | 1993 | 9 | 132000 | 3.5 |
9 | at | nsa | 1993.10 | 1993 | 10 | 141000 | 3.7 |
8.5 Merging¶
If we examine the unemployment
DataFrame, we notice that it is not obvious what the values in the country
column refer to. We can fix that by getting the country names from the countries
DataFrame that we imported earlier.
We can see in the countries
data that at stands for Austria. This DataFrame even provides the country names in three different languages.
[54]:
countries.head()
[54]:
country | google_country_code | country_group | name_en | name_fr | name_de | latitude | longitude | |
---|---|---|---|---|---|---|---|---|
0 | at | AT | eu | Austria | Autriche | Österreich | 47.696554 | 13.345980 |
1 | be | BE | eu | Belgium | Belgique | Belgien | 50.501045 | 4.476674 |
2 | bg | BG | eu | Bulgaria | Bulgarie | Bulgarien | 42.725674 | 25.482322 |
3 | hr | HR | non-eu | Croatia | Croatie | Kroatien | 44.746643 | 15.340844 |
4 | cy | CY | eu | Cyprus | Chypre | Zypern | 35.129141 | 33.428682 |
Because the data we need is stored in two separate files, we will first merge the two DataFrames. The country
column appears in both DataFrames, so it is a good key for joining the data. However, we don’t need all columns in the countries
DataFrame, and therefore, we will create a new DataFrame. To select certain columns to retain, we can use the bracket notation that we used earlier to reorder the columns.
[55]:
country_names = countries[['country', 'country_group', 'name_en']]
[56]:
country_names.head(5)
[56]:
country | country_group | name_en | |
---|---|---|---|
0 | at | eu | Austria |
1 | be | eu | Belgium |
2 | bg | eu | Bulgaria |
3 | hr | non-eu | Croatia |
4 | cy | eu | Cyprus |
pandas
includes an easy-to-use merge
function, which has the following syntax.
pd.merge(first_df, second_df, on=['column_name'])
[57]:
unemployment = pd.merge(unemployment, country_names, on=['country'])
unemployment.head()
[57]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | |
---|---|---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 | eu | Austria |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 | eu | Austria |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 | eu | Austria |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 | eu | Austria |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 | eu | Austria |
If we want to merge two files using multiple columns that exist in both files, we can pass a list of column names to the on
parameter.
For more information on merging, check the pandas documentation.
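As a quick illustration of merging on multiple columns, here is a minimal sketch with two toy DataFrames (the column names mirror the lecture's data, but the values are made up):

```python
import pandas as pd

# Toy DataFrames for illustration only (hypothetical values)
rates = pd.DataFrame({
    'country': ['at', 'at', 'be'],
    'year': [1993, 1994, 1993],
    'rate': [4.5, 4.1, 7.2],
})
labels = pd.DataFrame({
    'country': ['at', 'at', 'be'],
    'year': [1993, 1994, 1993],
    'name_en': ['Austria', 'Austria', 'Belgium'],
})

# Passing a list to `on` merges rows only where BOTH key columns match
merged = pd.merge(rates, labels, on=['country', 'year'])
print(merged)
```

Each row of the result pairs a rate with its label only when both the country code and the year agree.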
8.6 Calculating Unique and Missing Values¶
In the unemployment DataFrame, we might want to know which countries we have data for. To extract this information, we can use the .unique()
method. Note that the country names are listed in the right-most column name_en
, so we will use it to find the unique elements in that column.
[58]:
unemployment.head()
[58]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | |
---|---|---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 | eu | Austria |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 | eu | Austria |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 | eu | Austria |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 | eu | Austria |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 | eu | Austria |
[59]:
unemployment['name_en'].unique()
# We can also use attribute (dot) access:
# unemployment.name_en.unique()
[59]:
array(['Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czech Republic',
'Germany (including former GDR from 1991)', 'Denmark', 'Estonia',
'Spain', 'Finland', 'France', 'Greece', 'Croatia', 'Hungary',
'Ireland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta',
'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Sweden',
'Slovenia', 'Slovakia', 'Turkey', 'United Kingdom'], dtype=object)
To get a count of the number of unique countries, we can use the .nunique()
method.
[60]:
unemployment['name_en'].nunique()
[60]:
30
Alternatively, we can use len()
to get the number of items in the above array.
[61]:
len(unemployment['name_en'].unique())
[61]:
30
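A small sketch with a toy Series shows that the two approaches agree, with one caveat worth knowing: .nunique() ignores missing values by default, while len(unique()) counts NaN/None as an extra element.

```python
import pandas as pd

s = pd.Series(['Austria', 'Belgium', 'Austria', 'Bulgaria'])

n_unique = s.nunique()       # counts distinct values
n_via_len = len(s.unique())  # same result here
print(n_unique, n_via_len)

# Caveat: missing values are handled differently
s2 = pd.Series(['Austria', None, 'Austria'])
print(s2.nunique())       # missing values are excluded
print(len(s2.unique()))   # None counts as one more element
```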
If we want to know how many observations there are per country, pandas
has a method called .value_counts()
that returns the counts of the unique values in a column.
[62]:
unemployment['name_en'].value_counts()
[62]:
France 1008
Sweden 1008
Portugal 1008
Netherlands 1008
Luxembourg 1008
Denmark 1008
Belgium 1008
Spain 1008
Ireland 1008
United Kingdom 1002
Italy 924
Finland 828
Norway 786
Austria 648
Hungary 576
Slovakia 576
Slovenia 576
Bulgaria 576
Malta 576
Poland 576
Germany (including former GDR from 1991) 504
Czech Republic 468
Latvia 459
Lithuania 459
Greece 450
Romania 423
Cyprus 396
Estonia 387
Croatia 324
Turkey 210
Name: name_en, dtype: int64
By default, the output is sorted by the counts in descending order. If we would like it sorted by index instead (here, alphabetically by country name), we can append the .sort_index()
method.
[63]:
unemployment['name_en'].value_counts().sort_index()
[63]:
Austria 648
Belgium 1008
Bulgaria 576
Croatia 324
Cyprus 396
Czech Republic 468
Denmark 1008
Estonia 387
Finland 828
France 1008
Germany (including former GDR from 1991) 504
Greece 450
Hungary 576
Ireland 1008
Italy 924
Latvia 459
Lithuania 459
Luxembourg 1008
Malta 576
Netherlands 1008
Norway 786
Poland 576
Portugal 1008
Romania 423
Slovakia 576
Slovenia 576
Spain 1008
Sweden 1008
Turkey 210
United Kingdom 1002
Name: name_en, dtype: int64
As we noticed earlier, there are missing values in the unemployment_rate
column. To find out how many unemployment rate values are missing, we will use the .isnull()
method, which returns a Boolean value for each entry in the unemployment_rate
column: True
where the value is missing. Since in Python True
is equivalent to 1 and False
is equivalent to 0, summing the result with .sum()
gives the total number of missing values.
[64]:
unemployment['unemployment_rate'].isnull().sum()
[64]:
945
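The Boolean-counting trick is easy to verify on a toy Series (made-up values):

```python
import pandas as pd
import numpy as np

rates = pd.Series([4.5, np.nan, 4.4, np.nan, 3.9])

# .isnull() marks missing entries as True; True sums as 1
mask = rates.isnull()
n_missing = mask.sum()
print(mask.tolist())  # [False, True, False, True, False]
print(n_missing)      # 2
```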
GroupBy¶
If we would like to know how many missing values exist at the country level, we can reuse the expression above to create a new column in the DataFrame
. It appears as the last column below, where False
means that the value is not missing.
[65]:
unemployment['unemployment_rate_null'] = unemployment['unemployment_rate'].isnull()
unemployment.head()
[65]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | unemployment_rate_null | |
---|---|---|---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 | eu | Austria | False |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 | eu | Austria | False |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 | eu | Austria | False |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 | eu | Austria | False |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 | eu | Austria | False |
To count the number of missing values for each country, we can use the .groupby()
method. It groups the data by the country name_en
column included in the parentheses, and the .sum()
operation is performed over the unemployment_rate_null
column.
[66]:
unemployment.groupby('name_en')['unemployment_rate_null'].sum()
[66]:
name_en
Austria 0
Belgium 0
Bulgaria 180
Croatia 216
Cyprus 0
Czech Republic 0
Denmark 0
Estonia 0
Finland 0
France 0
Germany (including former GDR from 1991) 0
Greece 0
Hungary 36
Ireland 0
Italy 0
Latvia 0
Lithuania 0
Luxembourg 0
Malta 180
Netherlands 0
Norway 0
Poland 72
Portugal 0
Romania 0
Slovakia 108
Slovenia 36
Spain 117
Sweden 0
Turkey 0
United Kingdom 0
Name: unemployment_rate_null, dtype: int64
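The same pattern works on a small example. The sketch below (toy, made-up values) repeats the count of missing values per group, and also shows that .agg() can compute several statistics per group in one call:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name_en': ['Austria', 'Austria', 'Bulgaria', 'Bulgaria'],
    'unemployment_rate': [4.5, 4.6, np.nan, 7.0],
})
df['rate_null'] = df['unemployment_rate'].isnull()

# Missing values per country: group, then sum the Boolean column
missing = df.groupby('name_en')['rate_null'].sum()

# .agg() computes several statistics at once; note that `count`
# only counts non-missing entries
stats = df.groupby('name_en')['unemployment_rate'].agg(['mean', 'count'])
print(missing)
print(stats)
```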
8.7 Exporting a DataFrame to CSV¶
If we wanted to save the last DataFrame as a .csv file, we can use the .to_csv()
method.
[67]:
unemployment.to_csv('data/unemployment_missing.csv')
The file will be saved in the data
directory.
By default, this method writes the index (i.e., the row labels) as the first column of the file. We probably don’t want that extra column, so we can set index
to False
. We can also specify the delimiter we want to use, such as commas (,
), pipes (|
), semicolons (;
), tabs (\t
), etc.
[68]:
unemployment.to_csv('data/unemployment_missing.csv', index=False, sep=',')
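A handy way to check these options without writing a file: when no path is given, .to_csv() returns the CSV text as a string. A minimal sketch with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'country': ['at', 'be'], 'rate': [4.5, 7.2]})

# No path: the CSV text is returned as a string instead of written to disk
csv_text = df.to_csv(index=False, sep=';')
print(csv_text)
```

Dropping `index=False` would add an extra leading column with the row labels 0 and 1.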
Now that we have the missing values saved, we can drop the last column unemployment_rate_null
that we added to unemployment
, as shown in the next cell.
[69]:
unemployment.head()
[69]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | unemployment_rate_null | |
---|---|---|---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 | eu | Austria | False |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 | eu | Austria | False |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 | eu | Austria | False |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 | eu | Austria | False |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 | eu | Austria | False |
[70]:
unemployment.drop('unemployment_rate_null', axis=1, inplace=True)
unemployment.head()
[70]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | |
---|---|---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 | eu | Austria |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 | eu | Austria |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 | eu | Austria |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 | eu | Austria |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 | eu | Austria |
It is important to specify the axis
parameter in the above code, where axis=1
refers to columns (axis=0
refers to rows). Note again that the parameter inplace=True
modifies the actual DataFrame rather than returning a new DataFrame.
8.8 Dealing With Missing Values: Boolean Indexing¶
There are two main options for dealing with missing values:
Fill the missing values with some other values.
Remove the observations with missing values.
Here we will adopt the second approach and exclude missing values from our primary analyses. Instead of just getting rid of that data, it might make sense to store it in a separate DataFrame. This way, we could answer questions such as, “do missing values occur during certain months (or years) more frequently?”
To do this, we will use boolean indexing for filtering data.
Recall that unemployment['unemployment_rate'].isnull()
produces an array of Boolean values. We used this previously when counting the number of missing values, and it is shown in the next cell.
[71]:
unemployment['unemployment_rate'].isnull()[:10]
[71]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: unemployment_rate, dtype: bool
To create unemployment_rate_missing
, we will index unemployment
with the Boolean array above. This returns only the rows where the value in the array is True
.
[72]:
unemployment_rate_missing = unemployment[unemployment['unemployment_rate'].isnull()]
unemployment_rate_missing.head()
[72]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | |
---|---|---|---|---|---|---|---|---|---|
1656 | bg | nsa | 1995.01 | 1995 | 1 | 391000 | NaN | eu | Bulgaria |
1657 | bg | nsa | 1995.02 | 1995 | 2 | 387000 | NaN | eu | Bulgaria |
1658 | bg | nsa | 1995.03 | 1995 | 3 | 378000 | NaN | eu | Bulgaria |
1659 | bg | nsa | 1995.04 | 1995 | 4 | 365000 | NaN | eu | Bulgaria |
1660 | bg | nsa | 1995.05 | 1995 | 5 | 346000 | NaN | eu | Bulgaria |
It is also possible to specify multiple conditions using the &
operator, but each condition needs to be inside parentheses.
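For example, here is a minimal sketch with a toy DataFrame (made-up values) that combines two conditions, selecting the rows where the country is Bulgaria and the unemployment rate is missing:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name_en': ['Austria', 'Bulgaria', 'Bulgaria', 'Spain'],
    'year': [1993, 1995, 2000, 2010],
    'unemployment_rate': [4.5, np.nan, 16.2, 20.6],
})

# Each condition must be wrapped in its own parentheses
subset = df[(df['name_en'] == 'Bulgaria') & (df['unemployment_rate'].isnull())]
print(subset)
```

Omitting the parentheses raises an error, because `&` binds more tightly than `==` in Python.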
Now, to remove the missing data from unemployment
, we can use the .dropna()
method. This drops all observations for which unemployment_rate
is NaN
.
[73]:
unemployment.dropna(subset=['unemployment_rate'], inplace=True)
unemployment.head()
[73]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | |
---|---|---|---|---|---|---|---|---|---|
0 | at | nsa | 1993.01 | 1993 | 1 | 171000 | 4.5 | eu | Austria |
1 | at | nsa | 1993.02 | 1993 | 2 | 175000 | 4.6 | eu | Austria |
2 | at | nsa | 1993.03 | 1993 | 3 | 166000 | 4.4 | eu | Austria |
3 | at | nsa | 1993.04 | 1993 | 4 | 157000 | 4.1 | eu | Austria |
4 | at | nsa | 1993.05 | 1993 | 5 | 147000 | 3.9 | eu | Austria |
[74]:
# Check the shape of the modified DataFrame
unemployment.shape
[74]:
(19851, 9)
Sorting Values¶
If we want to know what the highest unemployment rates were, we can use the .sort_values()
method to sort the data.
The next cell sorts the data in descending order and prints the first 10 rows.
[75]:
unemployment.sort_values('unemployment_rate', ascending=False)[:10]
[75]:
country | seasonality | year_month | year | month | unemployment | unemployment_rate | country_group | name_en | |
---|---|---|---|---|---|---|---|---|---|
15526 | pl | nsa | 2004.02 | 2004 | 2 | 3531000 | 20.9 | eu | Poland |
15525 | pl | nsa | 2004.01 | 2004 | 1 | 3520000 | 20.7 | eu | Poland |
15514 | pl | nsa | 2003.02 | 2003 | 2 | 3460000 | 20.7 | eu | Poland |
5663 | es | sa | 2010.09 | 2010 | 9 | 4773000 | 20.6 | eu | Spain |
15527 | pl | nsa | 2004.03 | 2004 | 3 | 3475000 | 20.6 | eu | Poland |
5999 | es | trend | 2010.09 | 2010 | 9 | 4769000 | 20.6 | eu | Spain |
6000 | es | trend | 2010.10 | 2010 | 10 | 4760000 | 20.6 | eu | Spain |
15513 | pl | nsa | 2003.01 | 2003 | 1 | 3466000 | 20.6 | eu | Poland |
5664 | es | sa | 2010.10 | 2010 | 10 | 4758000 | 20.6 | eu | Spain |
5998 | es | trend | 2010.08 | 2010 | 8 | 4742000 | 20.5 | eu | Spain |
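Like `on` in merging, `sort_values` also accepts a list of columns: the first column is the primary sort key and later columns break ties. A minimal sketch with toy values (the `ascending` list gives one flag per column):

```python
import pandas as pd

df = pd.DataFrame({
    'name_en': ['Poland', 'Spain', 'Poland'],
    'unemployment_rate': [20.9, 20.6, 20.6],
})

# Sort by rate (highest first), then break ties alphabetically by country
top = df.sort_values(['unemployment_rate', 'name_en'], ascending=[False, True])
print(top)
```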
Several additional functionalities of pandas
will be described in the next lectures.
References¶
Introduction to Pandas, Python Data Wrangling by D-Lab at UC Berkeley, available at: https://github.com/dlab-berkeley/introduction-to-pandas.
Pandas documentation, available at https://pandas.pydata.org/pandas-docs/stable/.
BACK TO TOP