Lecture 3: Data Analysis¶

Warning

This page is still building. Please be patient for the coming lectures.

In this lecture, we will learn data analysis as well as presentation in Python including follwoing four packages:

Fig. 15 The “Four Musketeers” for Python data analysis: numPy, matplotlib, pandas and seaborn.¶

Why you need Python data analysis

Basically, modern office suite like Microsoft Excel provides powerful features to process and present your data. However, they are lacking of flexibility and the features are hidden in their complex UI (user interface). On the other hand, Python gives us more efficient and flexible way to operate data. For example, Microsoft Excel has a maximum row number \(1048576=2^{20}\) in Office 2007 or later while pandas can easily exceed this limitation.

Fig. 16 A table has 100,000,000 rows x 1 columns.¶

It is also possible to custom a pipeline with help of these packages so you can analyze data automatically. Once you build one, you can just run it every time when you have new data rather than operating data again and again.

Preparation¶

Warning

This is an advanced lecture so it is recommended to learn lecture 1 and lecture 2 before it if you have no background in Python programming.

Pandas is a data analysis and manipulation tools based on Python. To learn Pandas, the best material is the official cookbook which can be found at the official website. The documentation for pandas is also helpful for your learning.

Fig. 17 A sample entry of pandas docs. DataFrame is the basic data container in pandas.¶

Before start everything, we need to install Pandas as well as some other packages via pip.

numPy: pandas is based on numPy, an efficient numerical computation package written in C++.
openpyxl: pandas requires it to read data from excel file.
matplotlib: a graphical package to draw figures.

!pip install -q pandas numpy openpyxl matplotlib

Note

Here -q means quiet installation which can be ignored if you don not understand.
In Jupyter Notebook, command starting with mark ! is NOT native Python code, but a wrapped shell command, i.e. you can run pip install numpy to install numPy in your Anaconda Prompt Shell.

Let us import these two packages with short names:

import numpy as np
import pandas as pd

Data container¶

Pandas provides two types of container Series and DataFrame which are used to store sequential and tabular data respectively.

Fig. 18 Series storing sequential data.¶

Fig. 19 DataFrame storing tabular data.¶

Run following code to create two data containers.

s = pd.Series([1,2,3,4,5,6])
d = pd.DataFrame(np.arange(1, 19).reshape(3, -1))

print(s)
print(d)

  1
  2
  3
  4
  5
  6
dtype: int64
 1   2   3   4   5
 1   2   3   4   5   6
 7   8   9  10  11  12
13  14  15  16  17  18

Note

Here np.arange(1, 19) will generate a arrary from \(1\) to \(18\), and reshape(3, -1) will change the shape of arrary with first dimension as \(3\). Parameter -1 means it will be derivated internally and here will be replaced with \(6\) because \(6=18\div 3\).

shape is a property of container and it is a tuple of dimensions.

print(s.shape)
print(d.shape)

(6,)
(3, 6)

For very large DataFrame, Jupyter notebook will only show a part of the content:

d = pd.DataFrame(np.random.randn(100, 30))
d

	0	1	2	3	4	5	6	7	8	9	...	20	21	22	23	24	25	26	27	28	29
0	-0.343684	0.298432	-0.346330	-1.360610	-0.780171	-0.174479	-2.073209	0.741540	0.862682	1.790920	...	1.427133	-0.330105	-0.833102	-1.314724	0.896119	-0.738813	1.374135	0.440371	1.041022	-0.121798
1	-0.193280	1.257865	0.658521	-0.556523	0.867679	1.392618	1.358895	1.235232	0.598424	0.407585	...	0.480264	0.024192	0.416871	0.503437	-1.003764	-0.444086	-1.318141	-0.002651	0.091065	0.267475
2	-0.091231	1.268993	0.825904	-0.386833	0.215122	0.575687	-1.589176	1.526072	-1.083706	0.360741	...	-0.404119	0.178404	-0.065295	0.571753	-1.697193	1.665987	-0.487148	0.983719	1.474843	1.010711
3	-1.040651	0.876857	0.090111	-0.774266	0.412296	0.903602	0.168427	-0.690537	0.620906	0.483258	...	-0.225915	0.139047	0.568339	-0.726861	-0.061449	1.187309	1.819904	0.746932	0.246057	-0.622705
4	-0.890623	-1.924294	-1.607911	-0.919474	-0.756320	-0.130709	0.181522	0.337458	0.750356	-0.552621	...	-1.348002	0.687834	-0.597506	-0.185708	0.681879	-1.264852	-1.432979	-0.262975	0.340943	1.115939
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	-0.790644	0.618316	-0.239128	0.223507	-1.972358	1.344895	-0.921793	1.239987	-0.727588	2.143194	...	0.173822	-0.209904	0.167663	1.023742	-1.487167	-0.950625	-0.448279	-0.726560	1.566739	-2.048781
96	-0.897666	-1.011664	-1.305373	0.292486	0.033395	1.184037	-0.224618	-0.603053	0.583313	1.238475	...	-1.162605	1.927186	0.189962	-0.048620	-0.499449	-1.396445	-0.187596	-0.866826	0.488415	-1.043417
97	-0.386214	-0.737502	-0.724941	2.207187	1.191978	-0.849805	1.267080	0.092453	0.524919	-0.903533	...	-1.127624	0.242240	0.553355	-1.119842	0.132852	-1.981549	1.617266	-1.088464	-1.790250	0.687706
98	0.256366	-1.496072	-0.901937	-0.612122	-0.097003	-0.011705	0.593026	-0.146939	-0.992527	-0.885829	...	1.609611	1.205030	-0.915260	0.181878	0.419530	2.393224	0.259172	0.533195	0.432964	0.874746
99	-0.000901	-1.145946	0.295250	-0.390829	1.592131	0.339830	1.496009	-0.121397	0.382855	-0.491859	...	-1.915573	-0.667241	1.049157	0.382023	-0.689314	0.989662	0.171298	-0.197036	-0.235648	0.647312

100 rows × 30 columns

Note

Here np.random.randn(n, m) will generate a random matrix with \(n\) rows and \(m\) columns. Each element in the matrix follows the standard normal distribution \( X\sim \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\)

Try head(n) and it will give you the first \(n\) rows of the DataFrame while tail(n) will give the last \(n\) rows.

d.head(5)

	0	1	2	3	4	5	6	7	8	9	...	20	21	22	23	24	25	26	27	28	29
0	-0.343684	0.298432	-0.346330	-1.360610	-0.780171	-0.174479	-2.073209	0.741540	0.862682	1.790920	...	1.427133	-0.330105	-0.833102	-1.314724	0.896119	-0.738813	1.374135	0.440371	1.041022	-0.121798
1	-0.193280	1.257865	0.658521	-0.556523	0.867679	1.392618	1.358895	1.235232	0.598424	0.407585	...	0.480264	0.024192	0.416871	0.503437	-1.003764	-0.444086	-1.318141	-0.002651	0.091065	0.267475
2	-0.091231	1.268993	0.825904	-0.386833	0.215122	0.575687	-1.589176	1.526072	-1.083706	0.360741	...	-0.404119	0.178404	-0.065295	0.571753	-1.697193	1.665987	-0.487148	0.983719	1.474843	1.010711
3	-1.040651	0.876857	0.090111	-0.774266	0.412296	0.903602	0.168427	-0.690537	0.620906	0.483258	...	-0.225915	0.139047	0.568339	-0.726861	-0.061449	1.187309	1.819904	0.746932	0.246057	-0.622705
4	-0.890623	-1.924294	-1.607911	-0.919474	-0.756320	-0.130709	0.181522	0.337458	0.750356	-0.552621	...	-1.348002	0.687834	-0.597506	-0.185708	0.681879	-1.264852	-1.432979	-0.262975	0.340943	1.115939

5 rows × 30 columns

In order to make our examples meaningful, we employ a real-world dataset as an example to show how pandas works. World Health Organization provides latest data about the COVID-19 such as vaccination and confirmed cases, which is keeped in CSV-format files. Here we refer to the vaccination data: https://covid19.who.int/who-data/vaccination-data.csv.

vac = pd.read_csv('https://covid19.who.int/who-data/vaccination-data.csv')
vac

	COUNTRY	ISO3	WHO_REGION	DATA_SOURCE	DATE_UPDATED	TOTAL_VACCINATIONS	PERSONS_VACCINATED_1PLUS_DOSE	TOTAL_VACCINATIONS_PER100	PERSONS_VACCINATED_1PLUS_DOSE_PER100	PERSONS_FULLY_VACCINATED	PERSONS_FULLY_VACCINATED_PER100	VACCINES_USED	FIRST_VACCINE_DATE	NUMBER_VACCINES_TYPES_USED
0	Afghanistan	AFG	EMRO	REPORTING	2022-03-06	5597130	4952744.0	14.378	12.723	4281934.0	11.000	Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2...	2021-02-22	4.0
1	Albania	ALB	EURO	REPORTING	2022-02-20	2707658	1284034.0	94.100	45.118	1192155.0	41.889	AstraZeneca - Vaxzevria,Gamaleya - Gam-Covid-V...	2021-01-13	5.0
2	Algeria	DZA	AFRO	REPORTING	2022-02-20	13631683	7456361.0	31.086	17.004	6076272.0	13.857	Beijing CNBG - BBIBP-CorV,Gamaleya - Gam-Covid...	2021-01-30	4.0
3	American Samoa	ASM	WPRO	REPORTING	2022-02-16	85050	42212.0	154.084	76.475	37805.0	68.491	Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...	2020-12-21	3.0
4	Andorra	AND	EURO	REPORTING	2022-02-13	142420	57797.0	184.300	75.872	53250.0	69.903	AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi...	2021-01-20	3.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
223	Viet Nam	VNM	WPRO	REPORTING	2022-02-24	192865986	79469390.0	198.139	81.642	75322542.0	77.382	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-03-08	6.0
224	Wallis and Futuna	WLF	WPRO	REPORTING	2022-02-18	15633	6450.0	139.009	57.354	6399.0	56.900	Moderna - Spikevax	2021-03-19	1.0
225	Yemen	YEM	EMRO	REPORTING	2022-03-06	784792	624837.0	2.631	2.095	384655.0	1.290	Janssen - Ad26.COV 2-S,SII - Covishield,Sinova...	2021-04-20	3.0
226	Zambia	ZMB	AFRO	REPORTING	2022-03-03	2858338	2510296.0	15.548	13.655	1883799.0	10.247	Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2...	2021-04-14	3.0
227	Zimbabwe	ZWE	AFRO	REPORTING	2022-03-04	7904719	4368633.0	53.184	29.393	3403501.0	22.899	Beijing CNBG - BBIBP-CorV,Bharat - Covaxin,Gam...	2021-02-18	4.0

228 rows × 14 columns

Note

read_csv(path) reads a CSV-format file from URL (Web link or local file) and returns a DataFrame. For example, you can use pd.read_csv('./data.csv') to load csv file from your current directory.

Properties index and columns will give the row/column labels:

vac.columns

Index(['COUNTRY', 'ISO3', 'WHO_REGION', 'DATA_SOURCE', 'DATE_UPDATED',
       'TOTAL_VACCINATIONS', 'PERSONS_VACCINATED_1PLUS_DOSE',
       'TOTAL_VACCINATIONS_PER100', 'PERSONS_VACCINATED_1PLUS_DOSE_PER100',
       'PERSONS_FULLY_VACCINATED', 'PERSONS_FULLY_VACCINATED_PER100',
       'VACCINES_USED', 'FIRST_VACCINE_DATE', 'NUMBER_VACCINES_TYPES_USED'],
      dtype='object')

In addition, you may want to sort by values, just like what we usually do in Excel.

vac.sort_values(by='TOTAL_VACCINATIONS', ascending=False)

	COUNTRY	ISO3	WHO_REGION	DATA_SOURCE	DATE_UPDATED	TOTAL_VACCINATIONS	PERSONS_VACCINATED_1PLUS_DOSE	TOTAL_VACCINATIONS_PER100	PERSONS_VACCINATED_1PLUS_DOSE_PER100	PERSONS_FULLY_VACCINATED	PERSONS_FULLY_VACCINATED_PER100	VACCINES_USED	FIRST_VACCINE_DATE	NUMBER_VACCINES_TYPES_USED
43	China	CHN	WPRO	REPORTING	2022-02-27	3138003103	1.276556e+09	213.283	86.765	1.235633e+09	83.983	Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co...	2020-07-22	8.0
94	India	IND	SEARO	REPORTING	2022-02-28	1776718549	9.642345e+08	128.747	69.872	7.923996e+08	57.420	Bharat - Covaxin,Gamaleya - Gam-Covid-Vac,Jans...	2021-01-16	6.0
218	United States of America	USA	AMRO	REPORTING	2022-02-25	537567013	2.531571e+08	162.406	76.482	2.101726e+08	63.496	Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...	2020-12-14	3.0
29	Brazil	BRA	AMRO	REPORTING	2022-02-25	369527744	1.723341e+08	173.847	81.076	1.487775e+08	69.993	AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S...	2021-01-17	5.0
95	Indonesia	IDN	SEARO	REPORTING	2022-03-01	345697245	1.909768e+08	126.387	69.821	1.445058e+08	52.831	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-01-13	7.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
172	Saba	XCA	AMRO	REPORTING	2022-02-25	3148	1.582000e+03	162.856	81.842	1.566000e+03	81.014	NaN	NaN	NaN
186	Sint Eustatius	XBA	AMRO	REPORTING	2022-02-25	3110	1.588000e+03	99.076	50.589	1.522000e+03	48.487	NaN	NaN	NaN
149	Niue	NIU	WPRO	REPORTING	2021-11-24	2628	1.340000e+03	162.423	82.818	1.288000e+03	79.604	Pfizer BioNTech - Comirnaty	2021-06-08	1.0
206	Tokelau	TKL	WPRO	REPORTING	2021-09-28	1936	9.680000e+02	143.407	71.704	9.680000e+02	71.704	Pfizer BioNTech - Comirnaty	2021-07-20	1.0
162	Pitcairn Islands	PCN	WPRO	REPORTING	2021-11-11	74	3.700000e+01	148.000	74.000	3.700000e+01	74.000	AstraZeneca - Vaxzevria	2021-05-17	1.0

228 rows × 14 columns

T property will return a transpose of DataFrame.

vac.T

	0	1	2	3	4	5	6	7	8	9	...	218	219	220	221	222	223	224	225	226	227
COUNTRY	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Anguilla	Antigua and Barbuda	Argentina	Armenia	...	United States of America	Uruguay	Uzbekistan	Vanuatu	Venezuela (Bolivarian Republic of)	Viet Nam	Wallis and Futuna	Yemen	Zambia	Zimbabwe
ISO3	AFG	ALB	DZA	ASM	AND	AGO	AIA	ATG	ARG	ARM	...	USA	URY	UZB	VUT	VEN	VNM	WLF	YEM	ZMB	ZWE
WHO_REGION	EMRO	EURO	AFRO	WPRO	EURO	AFRO	AMRO	AMRO	AMRO	EURO	...	AMRO	AMRO	EURO	WPRO	AMRO	WPRO	WPRO	EMRO	AFRO	AFRO
DATA_SOURCE	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	...	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING	REPORTING
DATE_UPDATED	2022-03-06	2022-02-20	2022-02-20	2022-02-16	2022-02-13	2022-03-02	2022-02-25	2022-02-25	2022-02-25	2022-02-13	...	2022-02-25	2022-02-25	2022-02-20	2022-02-20	2022-02-25	2022-02-24	2022-02-18	2022-03-06	2022-03-03	2022-03-04
TOTAL_VACCINATIONS	5597130	2707658	13631683	85050	142420	16633167	22165	124726	93008081	1971565	...	537567013	7612199	42121870	197335	37860994	192865986	15633	784792	2858338	7904719
PERSONS_VACCINATED_1PLUS_DOSE	4952744.0	1284034.0	7456361.0	42212.0	57797.0	10780927.0	10370.0	63582.0	40572052.0	1080404.0	...	253157102.0	2964083.0	18446210.0	114284.0	22157232.0	79469390.0	6450.0	624837.0	2510296.0	4368633.0
TOTAL_VACCINATIONS_PER100	14.378	94.1	31.086	154.084	184.3	50.609	147.747	127.364	205.789	66.5	...	162.406	219.136	125.9	64.248	133.145	198.139	139.009	2.631	15.548	53.184
PERSONS_VACCINATED_1PLUS_DOSE_PER100	12.723	45.118	17.004	76.475	75.872	32.802	69.124	64.927	89.77	36.46	...	76.482	85.329	55.114	37.208	77.92	81.642	57.354	2.095	13.655	29.393
PERSONS_FULLY_VACCINATED	4281934.0	1192155.0	6076272.0	37805.0	53250.0	5609044.0	9531.0	61144.0	36237826.0	872336.0	...	210172612.0	2728709.0	13160671.0	85180.0	14287370.0	75322542.0	6399.0	384655.0	1883799.0	3403501.0
PERSONS_FULLY_VACCINATED_PER100	11.0	41.889	13.857	68.491	69.903	17.066	63.532	62.437	80.18	29.439	...	63.496	78.553	39.322	27.733	50.244	77.382	56.9	1.29	10.247	22.899
VACCINES_USED	Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2...	AstraZeneca - Vaxzevria,Gamaleya - Gam-Covid-V...	Beijing CNBG - BBIBP-CorV,Gamaleya - Gam-Covid...	Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...	AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi...	SII - Covishield	AstraZeneca - Vaxzevria,Pfizer BioNTech - Comi...	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	...	Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...	AstraZeneca - Vaxzevria,Pfizer BioNTech - Comi...	Anhui ZL - Recombinant,Gamaleya - Gam-Covid-Va...	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	Beijing CNBG - BBIBP-CorV,CIGB - CIGB-66,Finla...	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	Moderna - Spikevax	Janssen - Ad26.COV 2-S,SII - Covishield,Sinova...	Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2...	Beijing CNBG - BBIBP-CorV,Bharat - Covaxin,Gam...
FIRST_VACCINE_DATE	2021-02-22	2021-01-13	2021-01-30	2020-12-21	2021-01-20	2021-03-10	2021-02-05	2021-03-01	2020-12-29	2021-06-16	...	2020-12-14	2021-02-27	2021-04-28	2021-06-02	2021-02-22	2021-03-08	2021-03-19	2021-04-20	2021-04-14	2021-02-18
NUMBER_VACCINES_TYPES_USED	4.0	5.0	4.0	3.0	3.0	1.0	2.0	6.0	7.0	5.0	...	3.0	3.0	7.0	3.0	6.0	6.0	1.0	3.0	3.0	4.0

14 rows × 228 columns

You can use describe() function to have a quick look on DataFrame:

vac.describe()

	TOTAL_VACCINATIONS	PERSONS_VACCINATED_1PLUS_DOSE	TOTAL_VACCINATIONS_PER100	PERSONS_VACCINATED_1PLUS_DOSE_PER100	PERSONS_FULLY_VACCINATED	PERSONS_FULLY_VACCINATED_PER100	NUMBER_VACCINES_TYPES_USED
count	2.280000e+02	2.270000e+02	228.000000	227.000000	2.270000e+02	227.000000	225.000000
mean	4.694756e+07	2.187060e+07	122.156500	55.765784	1.928435e+07	50.121868	3.928889
std	2.442809e+08	1.090964e+08	71.896713	27.044030	9.958222e+07	26.417687	1.928247
min	7.400000e+01	3.700000e+01	0.097000	0.082000	3.700000e+01	0.077000	1.000000
25%	3.534380e+05	1.733685e+05	62.841000	34.350500	1.458095e+05	29.097500	2.000000
50%	2.721102e+06	1.536112e+06	127.515000	63.069000	1.192155e+06	53.974000	4.000000
75%	1.670141e+07	7.843871e+06	178.022250	77.008000	6.861035e+06	72.275500	5.000000
max	3.138003e+09	1.276556e+09	335.811000	124.567000	1.235633e+09	121.448000	11.000000

A better way to present data is using plot() function to draw all columns:

import matplotlib.pyplot as plt

vac = vac.sort_values(by='TOTAL_VACCINATIONS', ascending=False)
top20 = vac[['TOTAL_VACCINATIONS']][:20]
countries = vac['COUNTRY'][:20]
top20.index = countries

top20.plot.pie(y='TOTAL_VACCINATIONS', figsize=(15,15))

<AxesSubplot:ylabel='TOTAL_VACCINATIONS'>

Tip

plot() is a convinient way to visualize data in pandas, but it is not perfect. We will learn more about the data visualization in following lectures.

Select¶

Pandas provides different ways to select elements in the DataFrame.

`iloc`¶

iloc[n] will give the \(n-1\) rows of the DataFrame, i.e., a Series.

vac.iloc[0]

COUNTRY                                                                             China
ISO3                                                                                  CHN
WHO_REGION                                                                           WPRO
DATA_SOURCE                                                                     REPORTING
DATE_UPDATED                                                                   2022-02-27
TOTAL_VACCINATIONS                                                             3138003103
PERSONS_VACCINATED_1PLUS_DOSE                                                1276556393.0
TOTAL_VACCINATIONS_PER100                                                         213.283
PERSONS_VACCINATED_1PLUS_DOSE_PER100                                               86.765
PERSONS_FULLY_VACCINATED                                                     1235633241.0
PERSONS_FULLY_VACCINATED_PER100                                                    83.983
VACCINES_USED                           Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co...
FIRST_VACCINE_DATE                                                             2020-07-22
NUMBER_VACCINES_TYPES_USED                                                            8.0
Name: 43, dtype: object

Tip

iloc index starts from \(0\), instead of \(1\). So the \(n\)-rows has a index \(n-1\).

In order to select multiple rows, just feed a list containing indices to the iloc:

vac.iloc[[0, 1, 2]]

	COUNTRY	ISO3	WHO_REGION	DATA_SOURCE	DATE_UPDATED	TOTAL_VACCINATIONS	PERSONS_VACCINATED_1PLUS_DOSE	TOTAL_VACCINATIONS_PER100	PERSONS_VACCINATED_1PLUS_DOSE_PER100	PERSONS_FULLY_VACCINATED	PERSONS_FULLY_VACCINATED_PER100	VACCINES_USED	FIRST_VACCINE_DATE	NUMBER_VACCINES_TYPES_USED
43	China	CHN	WPRO	REPORTING	2022-02-27	3138003103	1.276556e+09	213.283	86.765	1.235633e+09	83.983	Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co...	2020-07-22	8.0
94	India	IND	SEARO	REPORTING	2022-02-28	1776718549	9.642345e+08	128.747	69.872	7.923996e+08	57.420	Bharat - Covaxin,Gamaleya - Gam-Covid-Vac,Jans...	2021-01-16	6.0
218	United States of America	USA	AMRO	REPORTING	2022-02-25	537567013	2.531571e+08	162.406	76.482	2.101726e+08	63.496	Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...	2020-12-14	3.0

Also, you can select both rows and columns at the same time:

vac.iloc[[0, 1, 2], [0, 4, 5]]

	COUNTRY	DATE_UPDATED	TOTAL_VACCINATIONS
43	China	2022-02-27	3138003103
94	India	2022-02-28	1776718549
218	United States of America	2022-02-25	537567013

Or just access single element:

vac.iloc[0, 4]

'2022-02-27'

`loc`¶

As you can see, iloc may be confused to read because it uses numbers as index. However, loc provides another way to access by index and columns in DataFrame.

vac.loc[0:3, ["COUNTRY", "DATE_UPDATED", "TOTAL_VACCINATIONS"]]

	COUNTRY	DATE_UPDATED	TOTAL_VACCINATIONS
0	Afghanistan	2022-03-06	5597130
134	Mongolia	2022-02-25	5568712
114	Lebanon	2022-02-28	5452304
50	Croatia	2022-02-20	5179921
87	Guinea	2022-03-03	5159979
...	...	...	...
151	Northern Mariana Islands (Commonwealth of the)	2022-02-04	105295
68	Faroe Islands	2022-02-18	103894
132	Micronesia (Federated States of)	2022-02-16	97185
82	Grenada	2022-02-25	85891
3	American Samoa	2022-02-16	85050

103 rows × 3 columns

Important

Becareful with the index names in iloc and loc. They are different.

Filter¶

Both loc and iloc accept boolean parameters, which can be used to filter particular elements in the DataFrame.

vac[vac['TOTAL_VACCINATIONS'] > 100000000]

	COUNTRY	ISO3	WHO_REGION	DATA_SOURCE	DATE_UPDATED	TOTAL_VACCINATIONS	PERSONS_VACCINATED_1PLUS_DOSE	TOTAL_VACCINATIONS_PER100	PERSONS_VACCINATED_1PLUS_DOSE_PER100	PERSONS_FULLY_VACCINATED	PERSONS_FULLY_VACCINATED_PER100	VACCINES_USED	FIRST_VACCINE_DATE	NUMBER_VACCINES_TYPES_USED
43	China	CHN	WPRO	REPORTING	2022-02-27	3138003103	1.276556e+09	213.283	86.765	1.235633e+09	83.983	Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co...	2020-07-22	8.0
94	India	IND	SEARO	REPORTING	2022-02-28	1776718549	9.642345e+08	128.747	69.872	7.923996e+08	57.420	Bharat - Covaxin,Gamaleya - Gam-Covid-Vac,Jans...	2021-01-16	6.0
218	United States of America	USA	AMRO	REPORTING	2022-02-25	537567013	2.531571e+08	162.406	76.482	2.101726e+08	63.496	Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz...	2020-12-14	3.0
29	Brazil	BRA	AMRO	REPORTING	2022-02-25	369527744	1.723341e+08	173.847	81.076	1.487775e+08	69.993	AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S...	2021-01-17	5.0
95	Indonesia	IDN	SEARO	REPORTING	2022-03-01	345697245	1.909768e+08	126.387	69.821	1.445058e+08	52.831	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-01-13	7.0
103	Japan	JPN	WPRO	REPORTING	2022-02-24	223820819	1.016722e+08	176.966	80.388	1.001786e+08	79.207	AstraZeneca - Vaxzevria,Moderna - Spikevax,Nov...	2021-02-17	4.0
155	Pakistan	PAK	EMRO	REPORTING	2022-02-28	213532343	1.243097e+08	96.668	56.276	1.000965e+08	45.315	Beijing CNBG - BBIBP-CorV,Bharat - Covaxin,Can...	2021-02-03	8.0
16	Bangladesh	BGD	SEARO	REPORTING	2022-03-01	213176935	1.246919e+08	129.442	75.713	8.461175e+07	51.377	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-01-27	8.0
223	Viet Nam	VNM	WPRO	REPORTING	2022-02-24	192865986	7.946939e+07	198.139	81.642	7.532254e+07	77.382	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-03-08	6.0
131	Mexico	MEX	AMRO	REPORTING	2022-02-25	179274307	8.508365e+07	139.045	65.991	7.875005e+07	61.078	AstraZeneca - Vaxzevria,CanSino - Convidecia,G...	2020-12-24	7.0
77	Germany	DEU	EURO	REPORTING	2022-02-20	169071638	6.338051e+07	203.300	76.209	6.250088e+07	75.151	AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S...	2020-12-23	4.0
170	Russian Federation	RUS	EURO	OWID	2022-02-17	160816996	7.828428e+07	110.200	53.643	7.160191e+07	49.064	SRCVB - EpiVacCorona, Gamaleya - Sputnik V	NaN	2.0
71	France	FRA	EURO	REPORTING	2022-02-20	153013512	5.470059e+07	227.300	81.254	5.368728e+07	79.749	AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S...	2020-12-30	4.0
210	Turkey	TUR	EURO	OWID	2022-02-23	144850157	5.762345e+07	171.700	69.296	5.271558e+07	63.394	Pfizer BioNTech - Comirnaty, Sinovac - CoronaV...	NaN	3.0
96	Iran (Islamic Republic of)	IRN	EMRO	REPORTING	2022-03-06	142195819	6.311054e+07	169.295	75.138	5.571737e+07	66.336	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-02-09	6.0
203	The United Kingdom	GBR	EURO	REPORTING	2022-02-22	139482283	5.257359e+07	205.500	77.246	4.888589e+07	71.828	AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi...	2020-12-21	3.0
161	Philippines	PHL	WPRO	REPORTING	2022-02-24	135251295	6.866160e+07	123.426	62.658	6.301683e+07	57.507	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-03-01	11.0
101	Italy	ITA	EURO	REPORTING	2022-02-20	131126289	5.019838e+07	219.900	84.171	4.679558e+07	78.461	AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S...	2020-12-23	4.0
202	Thailand	THA	SEARO	REPORTING	2022-02-28	123738218	5.358496e+07	177.275	76.769	4.967325e+07	71.165	AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C...	2021-02-28	7.0
167	Republic of Korea	KOR	WPRO	REPORTING	2022-02-27	118895241	4.477671e+07	231.904	87.336	4.428542e+07	86.378	AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S...	2021-02-26	5.0

Python Lectures

Lecture 3: Data Analysis

Contents

Lecture 3: Data Analysis¶

Preparation¶

Data container¶

Select¶

`iloc`¶

`loc`¶

Filter¶

Modify¶

Python Lectures

Lecture 3: Data Analysis

Contents

Lecture 3: Data Analysis¶

Preparation¶

Data container¶

Select¶

iloc¶

loc¶

Filter¶

Modify¶

`iloc`¶

`loc`¶