Lecture 8 - Data Manipulation with pandas


8.1 Introduction to pandas

pandas is a library designed for working with structured data, such as tabular data (e.g., from .csv files, Excel files) or relational databases (e.g., SQL).

The DataFrame object in pandas is a 2-dimensional tabular, column-oriented data structure. The DataFrame is similar to an Excel spreadsheet and can store data of different types (including text characters, integers, floating-point values, categorical data, and more).

Figure source: Reference [2].

The pandas name is derived from the term panel data, which is a term from economics for multi-dimensional structured data.

8.2 Importing Data and Summary Statistics

Let’s begin by importing the pandas package using the common abbreviation pd. Loading pandas as pd is standard practice.

[1]:
import pandas as pd

A wide range of input/output formats are supported by pandas:

  • CSV, text

  • SQL database

  • Excel

  • HDF5

  • json

  • html

  • pickle

  • sas, stata

For importing .csv files, the function read_csv() in pandas makes it easy to import data. By default, it assumes that the data is comma-separated, but we can also specify the delimiter used in the data (e.g., tab, semicolon, etc.). There are several parameters that can be specified in read_csv(). See the documentation here.
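To see the delimiter handling in action without an external file, here is a minimal sketch that reads a semicolon-separated string via io.StringIO (the sample data below is hypothetical, not the course file):

```python
import io

import pandas as pd

# A small semicolon-separated sample (hypothetical data, not the course file)
raw = "country;month;rate\nat;1993.01;4.5\nbe;1993.02;6.2\n"

# sep sets the delimiter; read_csv accepts a path, URL, or any file-like object
df = pd.read_csv(io.StringIO(raw), sep=';')
```

The resulting df has 2 rows and 3 columns, parsed exactly as a comma-separated file would be with the default sep=','.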

Let’s load the data in the file country_total.csv located in the folder data and save it under the name unemployment. This file contains unemployment information for several countries over a time period.

The function read_csv() returns a DataFrame, as shown below. Note that the DataFrame has 20,796 rows, and the cell output displays only the first 5 and last 5 rows, since displaying all 20,796 rows is probably not what we want at this point anyway.

[2]:
# Import data
unemployment = pd.read_csv('data/country_total.csv')
# Show the DataFrame
unemployment
[2]:
country seasonality month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4
3 at nsa 1993.04 157000 4.1
4 at nsa 1993.05 147000 3.9
... ... ... ... ... ...
20791 uk trend 2010.06 2429000 7.7
20792 uk trend 2010.07 2422000 7.7
20793 uk trend 2010.08 2429000 7.7
20794 uk trend 2010.09 2447000 7.8
20795 uk trend 2010.10 2455000 7.8

20796 rows × 5 columns

We could have also used print to display the DataFrame.

[3]:
print(unemployment)
      country seasonality    month  unemployment  unemployment_rate
0          at         nsa  1993.01        171000                4.5
1          at         nsa  1993.02        175000                4.6
2          at         nsa  1993.03        166000                4.4
3          at         nsa  1993.04        157000                4.1
4          at         nsa  1993.05        147000                3.9
...       ...         ...      ...           ...                ...
20791      uk       trend  2010.06       2429000                7.7
20792      uk       trend  2010.07       2422000                7.7
20793      uk       trend  2010.08       2429000                7.7
20794      uk       trend  2010.09       2447000                7.8
20795      uk       trend  2010.10       2455000                7.8

[20796 rows x 5 columns]

When a DataFrame has a large number of rows that take up a large portion of the screen, we can inspect the data by using the .head() method. By default, this shows the header (the names of the columns, commonly referred to as column labels) and the first five rows (with indices ranging from 0 to 4 in the first column of the table). The indices are also referred to as row labels.

[4]:
unemployment.head()
[4]:
country seasonality month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4
3 at nsa 1993.04 157000 4.1
4 at nsa 1993.05 147000 3.9

Passing an integer \(n\) as an argument to .head(n) returns the first \(n\) rows. To see the last \(n\) rows, use .tail(n).

[5]:
# show the last 8 rows
unemployment.tail(8)
[5]:
country seasonality month unemployment unemployment_rate
20788 uk trend 2010.03 2437000 7.8
20789 uk trend 2010.04 2419000 7.8
20790 uk trend 2010.05 2419000 7.7
20791 uk trend 2010.06 2429000 7.7
20792 uk trend 2010.07 2422000 7.7
20793 uk trend 2010.08 2429000 7.7
20794 uk trend 2010.09 2447000 7.8
20795 uk trend 2010.10 2455000 7.8

To find the number of rows in a DataFrame, you can use the len() function, as with Python lists and other sequences.

[6]:
len(unemployment)
[6]:
20796

Alternatively, we can use the shape attribute to find the number of rows and columns, as with NumPy arrays. The cell output is a tuple, showing that there are 20,796 rows and 5 columns. Note that the left-most column in the above table showing row indices is not part of the data.

[7]:
unemployment.shape
[7]:
(20796, 5)

A useful method that generates various summary statistics of a DataFrame is .describe(), as shown below.

Notice in the cell below that although the DataFrame has 5 columns, the first 2 columns (country and seasonality) contain textual (string) data, and therefore the summary statistics are shown only for the columns with numeric data (month, unemployment, and unemployment_rate). If .describe() is called on textual data only, it returns the count, the number of unique values, and the most frequent value along with its count.

[8]:
unemployment.describe()
[8]:
month unemployment unemployment_rate
count 20796.000000 2.079600e+04 19851.000000
mean 1999.401290 7.900818e+05 8.179764
std 7.483751 1.015280e+06 3.922533
min 1983.010000 2.000000e+03 1.100000
25% 1994.090000 1.400000e+05 5.200000
50% 2001.010000 3.100000e+05 7.600000
75% 2006.010000 1.262250e+06 10.000000
max 2010.120000 4.773000e+06 20.900000
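As a quick sketch of the behavior on textual data mentioned above, here is .describe() applied to a small hypothetical string column:

```python
import pandas as pd

# A small example frame with one text column (hypothetical data)
df = pd.DataFrame({'seasonality': ['nsa', 'nsa', 'trend', 'sa']})

# For object (string) data, describe() reports count, unique,
# top (the most frequent value), and freq (its count)
stats = df['seasonality'].describe()
```

Here stats reports count 4, unique 3, top 'nsa', and freq 2. On a mixed DataFrame, describe(include='object') restricts the summary to the string columns.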

It is also possible to calculate individual statistics, such as .min(), .max(), or .mean(), instead of generating summary statistics with .describe().

[9]:
unemployment.min()
[9]:
country                   at
seasonality              nsa
month                1983.01
unemployment            2000
unemployment_rate        1.1
dtype: object

To view the data types of each column, use the dtypes attribute of the unemployment DataFrame. The data types in this case are strings (object type), floats (float64 type), and integers (int64 type).

[10]:
unemployment.dtypes
[10]:
country               object
seasonality           object
month                float64
unemployment           int64
unemployment_rate    float64
dtype: object

One more way to get a summary of a DataFrame is the .info() method.

[11]:
unemployment.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20796 entries, 0 to 20795
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   country            20796 non-null  object
 1   seasonality        20796 non-null  object
 2   month              20796 non-null  float64
 3   unemployment       20796 non-null  int64
 4   unemployment_rate  19851 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 812.5+ KB

Import Data From a URL

Above, we imported the unemployment data using the function read_csv and a relative file path to the data directory. The function read_csv is very flexible and it also allows importing data using a URL as the file path.

A csv file with data on world countries and their abbreviations is located at https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv

Using read_csv, we can import the country data and save it to the variable countries. This DataFrame has 30 rows.

[12]:
countries = pd.read_csv('https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv')
countries
[12]:
country google_country_code country_group name_en name_fr name_de latitude longitude
0 at AT eu Austria Autriche Österreich 47.696554 13.345980
1 be BE eu Belgium Belgique Belgien 50.501045 4.476674
2 bg BG eu Bulgaria Bulgarie Bulgarien 42.725674 25.482322
3 hr HR non-eu Croatia Croatie Kroatien 44.746643 15.340844
4 cy CY eu Cyprus Chypre Zypern 35.129141 33.428682
5 cz CZ eu Czech Republic République tchèque Tschechische Republik 49.803531 15.474998
6 dk DK eu Denmark Danemark Dänemark 55.939684 9.516689
7 ee EE eu Estonia Estonie Estland 58.592469 25.806950
8 fi FI eu Finland Finlande Finnland 64.950159 26.067564
9 fr FR eu France France Frankreich 46.710994 1.718561
10 de DE eu Germany (including former GDR from 1991) Allemagne (incluant l'ancienne RDA à partir de... Deutschland (einschließlich der ehemaligen DDR... 51.163825 10.454048
11 gr GR eu Greece Grèce Griechenland 39.698467 21.577256
12 hu HU eu Hungary Hongrie Ungarn 47.161163 19.504265
13 ie IE eu Ireland Irlande Irland 53.415260 -8.239122
14 it IT eu Italy Italie Italien 42.504191 12.573787
15 lv LV eu Latvia Lettonie Lettland 56.880117 24.606555
16 lt LT eu Lithuania Lituanie Litauen 55.173687 23.943168
17 lu LU eu Luxembourg Luxembourg Luxemburg 49.815319 6.133352
18 mt MT eu Malta Malte Malta 35.902422 14.447461
19 nl NL eu Netherlands Pays-Bas Niederlande 52.108118 5.330198
20 no NO non-eu Norway Norvège Norwegen 64.556460 12.665766
21 pl PL eu Poland Pologne Polen 51.918907 19.134334
22 pt PT eu Portugal Portugal Portugal 39.558069 -7.844941
23 ro RO eu Romania Roumanie Rumänien 45.942611 24.990152
24 sk SK eu Slovakia Slovaquie Slowakei 48.672644 19.700032
25 si SI eu Slovenia Slovénie Slowenien 46.149259 14.986617
26 es ES eu Spain Espagne Spanien 39.895013 -2.988296
27 se SE eu Sweden Suède Schweden 62.198467 14.896307
28 tr TR non-eu Turkey Turquie Türkei 38.952942 35.439795
29 uk GB eu United Kingdom Royaume-Uni Vereinigtes Königreich 54.315447 -2.232612

Similar to the above example, we can use shape and describe() to understand the countries DataFrame. In this case, .describe() is not very useful, because only 2 of the columns have numeric values.

[13]:
countries.shape
[13]:
(30, 8)
[14]:
# explore the countries data
countries.describe()
[14]:
latitude longitude
count 30.000000 30.000000
mean 49.092609 14.324579
std 7.956624 11.257010
min 35.129141 -8.239122
25% 43.230916 6.979186
50% 49.238087 14.941462
75% 54.090400 23.351690
max 64.950159 35.439795

Import Data from Excel File

In a similar way, we can import data from an Excel file using the function read_excel().

[15]:
titanic = pd.read_excel('data/titanic.xls')
titanic
[15]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S 2 None St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 None Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2 1 2 113781 151.5500 C22 C26 S None None Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781 151.5500 C22 C26 S None 135 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25 1 2 113781 151.5500 C22 C26 S None None Montreal, PQ / Chesterville, ON
5 1 1 Anderson, Mr. Harry male 48 0 0 19952 26.5500 E12 S 3 None New York, NY
6 1 1 Andrews, Miss. Kornelia Theodosia female 63 1 0 13502 77.9583 D7 S 10 None Hudson, NY
7 1 0 Andrews, Mr. Thomas Jr male 39 0 0 112050 0.0000 A36 S None None Belfast, NI
8 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53 2 0 11769 51.4792 C101 S D None Bayside, Queens, NY
9 1 0 Artagaveytia, Mr. Ramon male 71 0 0 PC 17609 49.5042 None C None 22 Montevideo, Uruguay
10 1 0 Astor, Col. John Jacob male 47 1 0 PC 17757 227.5250 C62 C64 C None 124 New York, NY
11 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18 1 0 PC 17757 227.5250 C62 C64 C 4 None New York, NY
12 1 1 Aubart, Mme. Leontine Pauline female 24 0 0 PC 17477 69.3000 B35 C 9 None Paris, France
13 1 1 Barber, Miss. Ellen "Nellie" female 26 0 0 19877 78.8500 None S 6 None None
14 1 1 Barkworth, Mr. Algernon Henry Wilson male 80 0 0 27042 30.0000 A23 S B None Hessle, Yorks
15 1 0 Baumann, Mr. John D male None 0 0 PC 17318 25.9250 None S None None New York, NY
16 1 0 Baxter, Mr. Quigg Edmond male 24 0 1 PC 17558 247.5208 B58 B60 C None None Montreal, PQ
17 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50 0 1 PC 17558 247.5208 B58 B60 C 6 None Montreal, PQ
18 1 1 Bazzani, Miss. Albina female 32 0 0 11813 76.2917 D15 C 8 None None
19 1 0 Beattie, Mr. Thomson male 36 0 0 13050 75.2417 C6 C A None Winnipeg, MN

Note that all DataFrames that we have loaded so far in this Jupyter notebook, i.e., unemployment, countries, and titanic, are stored in memory, and we can access them whenever needed once they are loaded. For example, we can check the shape of the titanic DataFrame.

[16]:
titanic.shape
[16]:
(20, 14)

Create a DataFrame

Alternatively, we can create DataFrames manually, instead of importing data from a file. The following DataFrame called simple_table contains information from the titanic data. Notice that the DataFrame is created similarly to a dictionary, where the column headers represent keys, and the data in each column is a list of values.

[17]:
simple_table = pd.DataFrame({
        "Name": ["Braund, Mr. Owen Harris",
                 "Allen, Mr. William Henry",
                 "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"]})
simple_table
[17]:
Name Age Sex
0 Braund, Mr. Owen Harris 22 male
1 Allen, Mr. William Henry 35 male
2 Bonnell, Miss. Elizabeth 58 female
[18]:
simple_table.shape
[18]:
(3, 3)
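The same table can also be built row by row from a list of dictionaries, where each dictionary represents one row; a minimal sketch:

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
rows = [
    {"Name": "Braund, Mr. Owen Harris", "Age": 22, "Sex": "male"},
    {"Name": "Allen, Mr. William Henry", "Age": 35, "Sex": "male"},
    {"Name": "Bonnell, Miss. Elizabeth", "Age": 58, "Sex": "female"},
]
simple_table = pd.DataFrame(rows)
```

This produces the same 3 × 3 table as the column-oriented dictionary above; which form is more convenient depends on how the data arrives (whole columns versus individual records).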

8.3 Rename, Index, and Slice

Let’s look again at the unemployment DataFrame. You may have noticed that the month column actually includes the year and the month added as decimals (e.g., 1993.01 should be year 1993 and month 01).

[19]:
unemployment.head(3)
[19]:
country seasonality month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4

Let’s rename the column into year_month. The .rename() method allows modifying column and/or row names. As you can see in the cell below, we passed a dictionary to the columns parameter, with the original name month as the key and the new name year_month as the value. Importantly, we also set the inplace parameter to True, which indicates that we want to modify the actual DataFrame, and not to create a new DataFrame.

[20]:
unemployment.rename(columns={'month' : 'year_month'}, inplace=True)
unemployment.head(3)
[20]:
country seasonality year_month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4

To observe the effect of inplace=True, let’s run in the next cell another .rename() method to change the column country to year.

[21]:
unemployment.rename(columns={'country' : 'year'}).head(3)
[21]:
year seasonality year_month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4

The above code didn’t change the actual unemployment DataFrame, as we can check that in the following cell. Instead, it created a copy of the unemployment DataFrame in which it changed the name of the column country.

[22]:
unemployment.head(3)
[22]:
country seasonality year_month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4
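An alternative to inplace=True, often preferred because it works well with method chaining, is to assign the returned copy back to the same variable name; a minimal sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'month': [1993.01], 'country': ['at']})

# Equivalent in effect to rename(..., inplace=True):
# rename returns a modified copy, which we assign back to df
df = df.rename(columns={'month': 'year_month'})
```

After the assignment, df has a year_month column and no month column, just as with inplace=True.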

Selecting Columns

To select a single column of a DataFrame, we can either use the name of the column enclosed in square brackets [] or the dot notation . (i.e., attribute access). It is preferable to use the square brackets notation, since a column name might inadvertently have the same name as a built-in pandas method.

[23]:
# access with square brackets
unemployment['year_month'].head()
[23]:
0    1993.01
1    1993.02
2    1993.03
3    1993.04
4    1993.05
Name: year_month, dtype: float64
[24]:
# access with dot notation
unemployment.year_month.head()
[24]:
0    1993.01
1    1993.02
2    1993.03
3    1993.04
4    1993.05
Name: year_month, dtype: float64

When selecting a single column, we obtain a pandas Series object, which is a single vector of data with an associated array of index row labels shown in the left-most column.

A Series object in pandas represents a 1-dimensional vector of data (i.e., a column of data).

If we check the type of the unemployment object and unemployment['year_month'] object, we can see that the first one is DataFrame and the second one is Series.

[25]:
type(unemployment)
[25]:
pandas.core.frame.DataFrame
[26]:
type(unemployment['year_month'])
[26]:
pandas.core.series.Series

pandas provides many methods that can be applied to Series objects. A few examples are shown below.

[27]:
print('minimum is ', unemployment['year_month'].min())
print('maximum is ', unemployment['year_month'].max())
print('mean value is ', unemployment['year_month'].mean())
minimum is  1983.01
maximum is  2010.12
mean value is  1999.4012896710906

To select multiple columns in pandas, use a list of column names within the selection brackets [].

[28]:
unemployment[['country','year_month']].head()
[28]:
country year_month
0 at 1993.01
1 at 1993.02
2 at 1993.03
3 at 1993.04
4 at 1993.05
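A related detail worth noting: selecting with a list containing a single column name returns a one-column DataFrame, not a Series; a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'country': ['at', 'be'], 'rate': [4.5, 6.2]})

s = df['country']     # a Series (1-dimensional)
d = df[['country']]   # a one-column DataFrame (note the inner list)
```

The double brackets are therefore useful when a DataFrame is required downstream, even for a single column.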

Selecting Rows

One way to select rows is by using the [] operator, similar to indexing and slicing in Python lists and other sequence data.

[29]:
unemployment[0:4]
[29]:
country seasonality year_month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4
3 at nsa 1993.04 157000 4.1

Another graphical representation of a DataFrame is shown in the figure below.

Figure source: Reference [2].

The first column with the indices in pandas DataFrames does not need to be a sequence of integers; it can also contain strings or other data (e.g., dates, years).

For instance, let’s create a DataFrame called bacteria to see how indexing with string indices works. We again pass in a dictionary, with the keys corresponding to column names and the values to the data, and in addition we pass a list of strings called index. (Compare to the simple_table above, which does not use index for creating the DataFrame, and in that case, the indices were automatically set to integer numbers.)

[30]:
bacteria = pd.DataFrame({'bacteria_counts' : [632, 1638, 569, 115],
                         'other_feature' : [438, 833, 234, 298]},
                         index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
bacteria
[30]:
bacteria_counts other_feature
Firmicutes 632 438
Proteobacteria 1638 833
Actinobacteria 569 234
Bacteroidetes 115 298

For selecting rows and/or columns in pandas, besides square brackets, two other operators are used: .loc and .iloc.

The operator .loc selects by label (e.g., row or column names), while .iloc selects by integer position. These two operators can accept either a single label, a list of labels, or a slice of labels (e.g., 'a':'f' or 2:5).

For instance, if we’re interested in the row Actinobacteria, we can use .loc and the index name. This returns the column values for the specified row.

[31]:
bacteria.loc['Actinobacteria']
[31]:
bacteria_counts    569
other_feature      234
Name: Actinobacteria, dtype: int64

We could have also used “positional indexing” with square brackets [], even though the indices are strings. The difference is that .loc returns a Series object because we selected a single label, while [2:3] returns a DataFrame because we selected a range of positions.

[32]:
bacteria[2:3]
[32]:
bacteria_counts other_feature
Actinobacteria 569 234

Let’s return to the unemployment data to show how .iloc is used, since unemployment has integer indices. To select specific rows, we can do the following.

[33]:
unemployment.iloc[0:4]
[33]:
country seasonality year_month unemployment unemployment_rate
0 at nsa 1993.01 171000 4.5
1 at nsa 1993.02 175000 4.6
2 at nsa 1993.03 166000 4.4
3 at nsa 1993.04 157000 4.1
[34]:
unemployment.iloc[[1, 5, 6, 22]]
[34]:
country seasonality year_month unemployment unemployment_rate
1 at nsa 1993.02 175000 4.6
5 at nsa 1993.06 134000 3.5
6 at nsa 1993.07 128000 3.4
22 at nsa 1994.11 148000 3.9

We can also select a range of rows and specify the step value.

[35]:
unemployment.iloc[25:50:5]
[35]:
country seasonality year_month unemployment unemployment_rate
25 at nsa 1995.02 174000 4.5
30 at nsa 1995.07 123000 3.3
35 at nsa 1995.12 175000 4.7
40 at nsa 1996.05 159000 4.3
45 at nsa 1996.10 146000 3.9

Selecting a Specific Value

Both .loc and .iloc can be used to select a particular value if they are given two arguments. The first argument is the row name (when using .loc) or the row index number (when using .iloc), while the second argument is the column name or index number.

Using loc, we can select “Bacteroidetes” and “bacteria_counts” to get the count of Bacteroidetes, as in the next cell below.

[36]:
bacteria
[36]:
bacteria_counts other_feature
Firmicutes 632 438
Proteobacteria 1638 833
Actinobacteria 569 234
Bacteroidetes 115 298
[37]:
bacteria.loc['Bacteroidetes']['bacteria_counts']
[37]:
115
[38]:
# This is the same as above
bacteria.iloc[3][0]
[38]:
115
[39]:
# This is the same as above
bacteria.iloc[3]['bacteria_counts']
[39]:
115

Or, for the unemployment data:

[40]:
# The year_month in the first row
unemployment.iloc[0,2]
[40]:
1993.01
[41]:
# This is the same as above
unemployment.iloc[0][2]
[41]:
1993.01
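The lookups above chain two selections one after another; both .loc and .iloc also accept the row and column in a single call, which is generally the preferred form. A sketch on the bacteria data:

```python
import pandas as pd

bacteria = pd.DataFrame(
    {'bacteria_counts': [632, 1638, 569, 115],
     'other_feature': [438, 833, 234, 298]},
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

# Row label and column label in a single call
count_loc = bacteria.loc['Bacteroidetes', 'bacteria_counts']

# Row position and column position in a single call
count_iloc = bacteria.iloc[3, 0]
```

Both return 115, the same value as the chained lookups, but a single call avoids the intermediate Series.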

Selecting Multiple Rows and Columns

Both .loc and .iloc can be used to select subsets of rows and columns at the same time if they are given lists as their two arguments, or slices for .iloc.

[42]:
unemployment.iloc[2:6,0:2]
[42]:
country seasonality
2 at nsa
3 at nsa
4 at nsa
5 at nsa

Using .iloc on the unemployment DataFrame, select the rows starting at row 2 and ending at row 5, and the 0th, 2nd, and 3rd columns.

[43]:
unemployment.iloc[2:6,[0,2,3]]
[43]:
country year_month unemployment
2 at 1993.03 166000
3 at 1993.04 157000
4 at 1993.05 147000
5 at 1993.06 134000

A similar selection can be achieved by using the .loc operator and listing the column names. Note, however, that .loc slices include the end label, so 2:6 returns rows 2 through 6 (one row more than iloc[2:6]).

[44]:
# Similar to above; note that .loc includes the end label 6
unemployment.loc[2:6,['country', 'year_month', 'unemployment']]
[44]:
country year_month unemployment
2 at 1993.03 166000
3 at 1993.04 157000
4 at 1993.05 147000
5 at 1993.06 134000
6 at 1993.07 128000

We can also display values from a DataFrame that satisfy certain criteria using conditional expressions, such as <, >, ==, !=, etc.

One example is shown below where only the rows that have an unemployment rate greater than 15 are shown.

[45]:
unemployment[unemployment['unemployment_rate'] > 15].head(10)
[45]:
country seasonality year_month unemployment unemployment_rate
1717 bg nsa 2000.02 523000 15.4
1718 bg nsa 2000.03 547000 16.0
1719 bg nsa 2000.04 560000 16.3
1720 bg nsa 2000.05 561000 16.3
1721 bg nsa 2000.06 554000 16.2
1722 bg nsa 2000.07 558000 16.3
1723 bg nsa 2000.08 569000 16.7
1724 bg nsa 2000.09 574000 16.8
1725 bg nsa 2000.10 583000 17.1
1726 bg nsa 2000.11 597000 17.5
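Conditions can also be combined with & (and) and | (or), wrapping each condition in parentheses; a small sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'country': ['at', 'bg', 'bg'],
                   'unemployment_rate': [4.5, 16.0, 14.0]})

# Combine conditions with & (and) or | (or);
# each condition must be wrapped in parentheses
high_bg = df[(df['country'] == 'bg') & (df['unemployment_rate'] > 15)]
```

Only the middle row satisfies both conditions, so high_bg contains a single row with rate 16.0.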

Differences between loc and iloc

To show the differences between loc and iloc let’s consider the following example.

[46]:
df1 = pd.DataFrame({'x':[10, 20, 30, 40 ,50],
                    'y':[20, 30, 40, 50, 60],
                    'z':[30, 40, 50, 60, 70]},
                      index=[10, 11,12, 0, 1])
df1
[46]:
x y z
10 10 20 30
11 20 30 40
12 30 40 50
0 40 50 60
1 50 60 70

Note in the following cells that iloc selects the row with index location 0, whereas loc selects the row with index label 0.

[47]:
 # value at index location 0
df1.iloc[0]
[47]:
x    10
y    20
z    30
Name: 10, dtype: int64
[48]:
 # value at index label 0
df1.loc[0]
[48]:
x    40
y    50
z    60
Name: 0, dtype: int64

Also, there is a difference in the selected rows when using slicing operations with iloc and loc. One must be careful when using these operators, and always check the output to ensure it is as expected.

[49]:
# rows at index location between 0 and 1 (exclusive)
df1.iloc[0:1]
[49]:
x y z
10 10 20 30
[50]:
# rows at index labels between 0 and 1 (inclusive)
df1.loc[0:1]
[50]:
x y z
0 40 50 60
1 50 60 70

8.4 Creating New Columns, Reordering

Since the year_month column is not shown correctly, let’s try to split it into two separate columns for years and months.

In the previous section we saw that the data type in this column is float64. We’ll first extract the year using the .astype() method. This allows for type casting, i.e., using .astype(int) we will convert the floating point values into integer numbers (by truncating the decimals).

The new column year will be added on the right of the DataFrame.

[51]:
unemployment['year'] = unemployment['year_month'].astype(int)
unemployment.head()
[51]:
country seasonality year_month unemployment unemployment_rate year
0 at nsa 1993.01 171000 4.5 1993
1 at nsa 1993.02 175000 4.6 1993
2 at nsa 1993.03 166000 4.4 1993
3 at nsa 1993.04 157000 4.1 1993
4 at nsa 1993.05 147000 3.9 1993

Next, let’s create a new column month. We will subtract the year value from year_month to get the decimal portion of the value, and multiply the result by 100 and convert to int. Because of the truncating that occurs when casting to int, we first need to round the values to the nearest whole number using round().

[52]:
unemployment['month'] = ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int)
unemployment.head(12)
[52]:
country seasonality year_month unemployment unemployment_rate year month
0 at nsa 1993.01 171000 4.5 1993 1
1 at nsa 1993.02 175000 4.6 1993 2
2 at nsa 1993.03 166000 4.4 1993 3
3 at nsa 1993.04 157000 4.1 1993 4
4 at nsa 1993.05 147000 3.9 1993 5
5 at nsa 1993.06 134000 3.5 1993 6
6 at nsa 1993.07 128000 3.4 1993 7
7 at nsa 1993.08 130000 3.4 1993 8
8 at nsa 1993.09 132000 3.5 1993 9
9 at nsa 1993.10 141000 3.7 1993 10
10 at nsa 1993.11 156000 4.1 1993 11
11 at nsa 1993.12 169000 4.4 1993 12
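The need for round() can be verified on a small sample: without it, floating-point representation can make the cast to int truncate some months (e.g., 1993.10) to the wrong value. A quick sketch:

```python
import pandas as pd

# A few hypothetical year_month values
ym = pd.Series([1993.01, 1993.10, 2010.12])

year = ym.astype(int)
# Rounding before the cast avoids truncation errors: (1993.10 - 1993) * 100
# is 9.999... in floating point, which would truncate to 9 without round()
month = ((ym - year) * 100).round(0).astype(int)
```

The recovered years are 1993, 1993, 2010 and the months are 1, 10, 12, as expected.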

Now, let’s try to reorder the newly created year and month columns in the DataFrame. For this, we will use the square brackets notation again, passing in a list of column names in the order we would like to see them.

[53]:
unemployment = unemployment[['country', 'seasonality',
                             'year_month', 'year', 'month',
                             'unemployment', 'unemployment_rate']]
unemployment.head(10)
[53]:
country seasonality year_month year month unemployment unemployment_rate
0 at nsa 1993.01 1993 1 171000 4.5
1 at nsa 1993.02 1993 2 175000 4.6
2 at nsa 1993.03 1993 3 166000 4.4
3 at nsa 1993.04 1993 4 157000 4.1
4 at nsa 1993.05 1993 5 147000 3.9
5 at nsa 1993.06 1993 6 134000 3.5
6 at nsa 1993.07 1993 7 128000 3.4
7 at nsa 1993.08 1993 8 130000 3.4
8 at nsa 1993.09 1993 9 132000 3.5
9 at nsa 1993.10 1993 10 141000 3.7

8.5 Merging

If we examine the unemployment DataFrame we can notice that we don’t exactly know what the values in the country column refer to. We can fix that by getting the country names from the countries DataFrame that we imported earlier.

We can see in the countries data that at stands for Austria. This DataFrame even provides the country names in three different languages.

[54]:
countries.head()
[54]:
country google_country_code country_group name_en name_fr name_de latitude longitude
0 at AT eu Austria Autriche Österreich 47.696554 13.345980
1 be BE eu Belgium Belgique Belgien 50.501045 4.476674
2 bg BG eu Bulgaria Bulgarie Bulgarien 42.725674 25.482322
3 hr HR non-eu Croatia Croatie Kroatien 44.746643 15.340844
4 cy CY eu Cyprus Chypre Zypern 35.129141 33.428682

Because the data we need is stored in two separate files, we will first merge the two DataFrames. The country column is shown in both DataFrames, so it is a good option for joining the data. However, we don’t need all columns in the countries DataFrame, and therefore, we will create a new DataFrame. To select certain columns to retain, we can use the bracket notation that we used earlier to reorder the columns.

[55]:
country_names = countries[['country', 'country_group', 'name_en']]
[56]:
country_names.head(5)
[56]:
country country_group name_en
0 at eu Austria
1 be eu Belgium
2 bg eu Bulgaria
3 hr non-eu Croatia
4 cy eu Cyprus

pandas includes an easy-to-use merge function, which has the following syntax:

pd.merge(first_file, second_file, on=['column_name'])
[57]:
unemployment = pd.merge(unemployment, country_names, on=['country'])
unemployment.head()
[57]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en
0 at nsa 1993.01 1993 1 171000 4.5 eu Austria
1 at nsa 1993.02 1993 2 175000 4.6 eu Austria
2 at nsa 1993.03 1993 3 166000 4.4 eu Austria
3 at nsa 1993.04 1993 4 157000 4.1 eu Austria
4 at nsa 1993.05 1993 5 147000 3.9 eu Austria

If we want to merge two files using multiple columns that exist in both files, we can pass a list of column names to the on parameter.

For more information on merging, check the pandas documentation.
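One parameter worth knowing is how, which controls the join type; by default, merge keeps only rows whose key appears in both frames (an inner join). A small sketch on hypothetical frames:

```python
import pandas as pd

# 'xx' is a made-up country code with no match in the right frame
left = pd.DataFrame({'country': ['at', 'xx'], 'rate': [4.5, 9.9]})
right = pd.DataFrame({'country': ['at'], 'name_en': ['Austria']})

# how='inner' (the default) keeps only matching keys
inner = pd.merge(left, right, on=['country'])

# how='left' keeps every row of the left frame,
# filling missing right-hand values with NaN
left_join = pd.merge(left, right, on=['country'], how='left')
```

Here inner has 1 row (only 'at' matches), while left_join keeps both rows, with NaN in name_en for 'xx'.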

8.6 Calculating Unique and Missing Values

In the unemployment DataFrame, we might want to know which countries we have data for. To extract this information, we can use the .unique() method. The countries are listed in the right-most column name_en, so we will use it to find the unique elements in that column.

[58]:
unemployment.head()
[58]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en
0 at nsa 1993.01 1993 1 171000 4.5 eu Austria
1 at nsa 1993.02 1993 2 175000 4.6 eu Austria
2 at nsa 1993.03 1993 3 166000 4.4 eu Austria
3 at nsa 1993.04 1993 4 157000 4.1 eu Austria
4 at nsa 1993.05 1993 5 147000 3.9 eu Austria
[59]:
unemployment['name_en'].unique()
# We could also use dot notation:
# unemployment.name_en.unique()
[59]:
array(['Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czech Republic',
       'Germany (including  former GDR from 1991)', 'Denmark', 'Estonia',
       'Spain', 'Finland', 'France', 'Greece', 'Croatia', 'Hungary',
       'Ireland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta',
       'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Sweden',
       'Slovenia', 'Slovakia', 'Turkey', 'United Kingdom'], dtype=object)

To get a count of the number of unique countries, we can use the .nunique() method.

[60]:
unemployment['name_en'].nunique()
[60]:
30

Or, we can also use len() to get the number of items in the above array.

[61]:
len(unemployment['name_en'].unique())
[61]:
30

If we are interested in how many observations there are per country, pandas has a method called .value_counts() that returns the counts of the unique values in a column.

[62]:
unemployment['name_en'].value_counts()
[62]:
France                                       1008
Sweden                                       1008
Portugal                                     1008
Netherlands                                  1008
Luxembourg                                   1008
Denmark                                      1008
Belgium                                      1008
Spain                                        1008
Ireland                                      1008
United Kingdom                               1002
Italy                                         924
Finland                                       828
Norway                                        786
Austria                                       648
Hungary                                       576
Slovakia                                      576
Slovenia                                      576
Bulgaria                                      576
Malta                                         576
Poland                                        576
Germany (including  former GDR from 1991)     504
Czech Republic                                468
Latvia                                        459
Lithuania                                     459
Greece                                        450
Romania                                       423
Cyprus                                        396
Estonia                                       387
Croatia                                       324
Turkey                                        210
Name: name_en, dtype: int64

By default, the output is sorted by the counts in descending order. To sort it by the index instead (that is, by country name here, since the country names form the index), we can append the .sort_index() method.

[63]:
unemployment['name_en'].value_counts().sort_index()
[63]:
Austria                                       648
Belgium                                      1008
Bulgaria                                      576
Croatia                                       324
Cyprus                                        396
Czech Republic                                468
Denmark                                      1008
Estonia                                       387
Finland                                       828
France                                       1008
Germany (including  former GDR from 1991)     504
Greece                                        450
Hungary                                       576
Ireland                                      1008
Italy                                         924
Latvia                                        459
Lithuania                                     459
Luxembourg                                   1008
Malta                                         576
Netherlands                                  1008
Norway                                        786
Poland                                        576
Portugal                                     1008
Romania                                       423
Slovakia                                      576
Slovenia                                      576
Spain                                        1008
Sweden                                       1008
Turkey                                        210
United Kingdom                               1002
Name: name_en, dtype: int64

As we noticed earlier, there are missing values in the unemployment_rate column. To find out how many values are missing, we will use the .isnull() method, which returns a boolean value for each entry in the unemployment_rate column (True where the value is missing). Since in Python True is equivalent to 1 and False is equivalent to 0, appending .sum() gives the total number of missing values.

[64]:
unemployment['unemployment_rate'].isnull().sum()
[64]:
945

GroupBy

If we would like to know how many missing values exist at the country level, we can take the expression we used above and store its result in a new column of the DataFrame. The new column appears last in the output below, where False means that the value is not missing.

[65]:
unemployment['unemployment_rate_null'] = unemployment['unemployment_rate'].isnull()
unemployment.head()
[65]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en unemployment_rate_null
0 at nsa 1993.01 1993 1 171000 4.5 eu Austria False
1 at nsa 1993.02 1993 2 175000 4.6 eu Austria False
2 at nsa 1993.03 1993 3 166000 4.4 eu Austria False
3 at nsa 1993.04 1993 4 157000 4.1 eu Austria False
4 at nsa 1993.05 1993 5 147000 3.9 eu Austria False

To count the number of missing values for each country, we can use the .groupby() method. It groups the rows by the column given in the parentheses (name_en), and the .sum() operation is then performed on the unemployment_rate_null column within each group.

[66]:
unemployment.groupby('name_en')['unemployment_rate_null'].sum()
[66]:
name_en
Austria                                        0
Belgium                                        0
Bulgaria                                     180
Croatia                                      216
Cyprus                                         0
Czech Republic                                 0
Denmark                                        0
Estonia                                        0
Finland                                        0
France                                         0
Germany (including  former GDR from 1991)      0
Greece                                         0
Hungary                                       36
Ireland                                        0
Italy                                          0
Latvia                                         0
Lithuania                                      0
Luxembourg                                     0
Malta                                        180
Netherlands                                    0
Norway                                         0
Poland                                        72
Portugal                                       0
Romania                                        0
Slovakia                                     108
Slovenia                                      36
Spain                                        117
Sweden                                         0
Turkey                                         0
United Kingdom                                 0
Name: unemployment_rate_null, dtype: int64
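Beyond summing booleans, other aggregations work the same way with .groupby(). A minimal sketch on a small made-up frame (the values below are hypothetical, not taken from the unemployment data), computing the average rate per country:

```python
import pandas as pd

# Small hypothetical frame with two countries
df = pd.DataFrame({
    'name_en': ['Austria', 'Austria', 'Spain'],
    'unemployment_rate': [4.5, 5.5, 20.6],
})

# Average unemployment rate per country
means = df.groupby('name_en')['unemployment_rate'].mean()
print(means.to_dict())  # {'Austria': 5.0, 'Spain': 20.6}
```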

8.7 Exporting a DataFrame to csv

If we want to save the last DataFrame as a .csv file, we can use the .to_csv() method.

[67]:
unemployment.to_csv('data/unemployment_missing.csv')

The file will be saved in the data directory.

By default, this method writes the DataFrame index (i.e., the row labels) as the first, unnamed column of the file. Since we probably don't want this extra column, we can set the index parameter to False. We can also specify the delimiter that we want to use, such as commas (,), pipes (|), semicolons (;), tabs (\t), etc.

[68]:
unemployment.to_csv('data/unemployment_missing.csv', index=False, sep=',')
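As a sketch of how the sep parameter pairs up between writing and reading, the round trip can be tested in memory with io.StringIO instead of a file (the small frame below is made up for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({'year': [1993, 1994], 'rate': [4.5, 4.6]})

# Write with a semicolon delimiter and no index column
buffer = io.StringIO()
df.to_csv(buffer, index=False, sep=';')

# Read it back, specifying the same delimiter
buffer.seek(0)
restored = pd.read_csv(buffer, sep=';')
print(restored.equals(df))  # True
```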

Now that we have the missing values saved, we can drop the unemployment_rate_null column that we added to unemployment, as shown in the next cell.

[69]:
unemployment.head()
[69]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en unemployment_rate_null
0 at nsa 1993.01 1993 1 171000 4.5 eu Austria False
1 at nsa 1993.02 1993 2 175000 4.6 eu Austria False
2 at nsa 1993.03 1993 3 166000 4.4 eu Austria False
3 at nsa 1993.04 1993 4 157000 4.1 eu Austria False
4 at nsa 1993.05 1993 5 147000 3.9 eu Austria False
[70]:
unemployment.drop('unemployment_rate_null', axis=1, inplace=True)
unemployment.head()
[70]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en
0 at nsa 1993.01 1993 1 171000 4.5 eu Austria
1 at nsa 1993.02 1993 2 175000 4.6 eu Austria
2 at nsa 1993.03 1993 3 166000 4.4 eu Austria
3 at nsa 1993.04 1993 4 157000 4.1 eu Austria
4 at nsa 1993.05 1993 5 147000 3.9 eu Austria

It is important to specify the axis parameter in the above code, where axis=1 refers to columns (axis=0 refers to rows). Note again that the parameter inplace=True modifies the actual DataFrame rather than returning a new DataFrame.

8.8 Dealing With Missing Values: Boolean Indexing

There are two main options for dealing with missing values:

  • Fill the missing values with some other values.

  • Remove the observations with missing values.
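The first option can be sketched with the .fillna() method, here on a small hypothetical series rather than the unemployment data:

```python
import numpy as np
import pandas as pd

# Small hypothetical series with one missing rate
rates = pd.Series([4.5, np.nan, 3.9], name='unemployment_rate')

# Fill missing values with a constant ...
filled_zero = rates.fillna(0)
print(filled_zero.tolist())  # [4.5, 0.0, 3.9]

# ... or with the mean of the observed values
filled_mean = rates.fillna(rates.mean())
```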

Here we will adopt the second approach and exclude missing values from our primary analyses. Instead of just getting rid of that data, it might make sense to store it in a separate DataFrame. This way, we could answer questions such as, “do missing values occur during certain months (or years) more frequently?”

To do this, we will use boolean indexing for filtering data.

Recall that unemployment['unemployment_rate'].isnull() produces an array of Boolean values. We used this previously when counting the number of missing values, and it is shown in the next cell.

[71]:
unemployment['unemployment_rate'].isnull()[:10]
[71]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: unemployment_rate, dtype: bool

To create unemployment_rate_missing, we will index unemployment with the Boolean array above. This returns only the rows where the value in the array is True.

[72]:
unemployment_rate_missing = unemployment[unemployment['unemployment_rate'].isnull()]
unemployment_rate_missing.head()
[72]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en
1656 bg nsa 1995.01 1995 1 391000 NaN eu Bulgaria
1657 bg nsa 1995.02 1995 2 387000 NaN eu Bulgaria
1658 bg nsa 1995.03 1995 3 378000 NaN eu Bulgaria
1659 bg nsa 1995.04 1995 4 365000 NaN eu Bulgaria
1660 bg nsa 1995.05 1995 5 346000 NaN eu Bulgaria
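As a sketch of how the stored rows could answer the question posed earlier about when missing values occur, we can count missing observations per year with .value_counts() (using a small made-up frame here in place of unemployment_rate_missing):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for unemployment_rate_missing
missing = pd.DataFrame({
    'year': [1995, 1995, 1996, 1995],
    'unemployment_rate': [np.nan] * 4,
})

# Count how many missing observations fall in each year
counts = missing['year'].value_counts()
print(counts.to_dict())  # {1995: 3, 1996: 1}
```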

It is also possible to combine multiple conditions using the & operator, but each condition needs to be wrapped in parentheses.
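For instance, a minimal sketch of combining two conditions (the column names mirror the unemployment data, but the frame here is a small made-up sample):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'name_en': ['Austria', 'Bulgaria', 'Bulgaria', 'Malta'],
    'unemployment_rate': [4.5, np.nan, np.nan, 6.1],
    'year': [1993, 1995, 2000, 1995],
})

# Rows where the rate is missing AND the year is 1995;
# note that each condition sits inside its own parentheses
missing_1995 = sample[(sample['unemployment_rate'].isnull()) & (sample['year'] == 1995)]
print(missing_1995['name_en'].tolist())  # ['Bulgaria']
```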

Now, to remove the missing data from unemployment, we can use the .dropna() method. This method drops all rows in which unemployment_rate is NaN.

[73]:
unemployment.dropna(subset=['unemployment_rate'], inplace=True)
unemployment.head()
[73]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en
0 at nsa 1993.01 1993 1 171000 4.5 eu Austria
1 at nsa 1993.02 1993 2 175000 4.6 eu Austria
2 at nsa 1993.03 1993 3 166000 4.4 eu Austria
3 at nsa 1993.04 1993 4 157000 4.1 eu Austria
4 at nsa 1993.05 1993 5 147000 3.9 eu Austria
[74]:
# Check the shape of the modified DataFrame
unemployment.shape
[74]:
(19851, 9)

Sorting Values

If we want to know what the highest unemployment rates were, we can use the .sort_values() method to sort the data.

The cell below sorts the data in descending order and displays the first 10 rows.

[75]:
unemployment.sort_values('unemployment_rate', ascending=False)[:10]
[75]:
country seasonality year_month year month unemployment unemployment_rate country_group name_en
15526 pl nsa 2004.02 2004 2 3531000 20.9 eu Poland
15525 pl nsa 2004.01 2004 1 3520000 20.7 eu Poland
15514 pl nsa 2003.02 2003 2 3460000 20.7 eu Poland
5663 es sa 2010.09 2010 9 4773000 20.6 eu Spain
15527 pl nsa 2004.03 2004 3 3475000 20.6 eu Poland
5999 es trend 2010.09 2010 9 4769000 20.6 eu Spain
6000 es trend 2010.10 2010 10 4760000 20.6 eu Spain
15513 pl nsa 2003.01 2003 1 3466000 20.6 eu Poland
5664 es sa 2010.10 2010 10 4758000 20.6 eu Spain
5998 es trend 2010.08 2010 8 4742000 20.5 eu Spain
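For the special case of retrieving the top rows, a shortcut worth knowing is .nlargest(), which is equivalent to sorting in descending order and slicing. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical sample of rates
df = pd.DataFrame({
    'name_en': ['Poland', 'Spain', 'Austria'],
    'unemployment_rate': [20.9, 20.6, 4.5],
})

# Same result as df.sort_values('unemployment_rate', ascending=False)[:2]
top2 = df.nlargest(2, 'unemployment_rate')
print(top2['name_en'].tolist())  # ['Poland', 'Spain']
```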

Several additional functionalities of pandas will be described in the next lectures.

References

  1. Introduction to Pandas, Python Data Wrangling by D-Lab at UC Berkeley, available at: https://github.com/dlab-berkeley/introduction-to-pandas.

  2. Pandas documentation, available at https://pandas.pydata.org/pandas-docs/stable/.
