Lecture 3: Data Analysis
Contents
Lecture 3: Data Analysis¶
Warning
This page is still building. Please be patient for the coming lectures.
In this lecture, we will learn data analysis as well as presentation in Python including follwoing four packages:
Why you need Python data analysis
Basically, modern office suite like Microsoft Excel
provides powerful features
to process and present your data. However, they are lacking of flexibility and
the features are hidden in their complex UI (user interface). On the other hand,
Python gives us more efficient and flexible way to operate data. For example,
Microsoft Excel
has a maximum row number \(1048576=2^{20}\) in Office 2007
or later while pandas
can easily exceed this limitation.
It is also possible to custom a pipeline with help of these packages so you can analyze data automatically. Once you build one, you can just run it every time when you have new data rather than operating data again and again.
Preparation¶
Warning
This is an advanced lecture so it is recommended to learn lecture 1 and lecture 2 before it if you have no background in Python programming.
Pandas is a data analysis and manipulation tools based on Python. To learn Pandas, the best material is the official cookbook which can be found at the official website. The documentation for pandas is also helpful for your learning.
Before start everything, we need to install Pandas as well as some other packages via pip.
numPy
:pandas
is based onnumPy
, an efficient numerical computation package written in C++.openpyxl
:pandas
requires it to read data from excel file.matplotlib
: a graphical package to draw figures.
!pip install -q pandas numpy openpyxl matplotlib
Note
Here
-q
means quiet installation which can be ignored if you don not understand.In Jupyter Notebook, command starting with mark
!
is NOT native Python code, but a wrapped shell command, i.e. you can runpip install numpy
to installnumPy
in your Anaconda Prompt Shell.
Let us import these two packages with short names:
import numpy as np
import pandas as pd
Data container¶
Pandas
provides two types of container Series
and DataFrame
which are used to store sequential
and tabular data respectively.
Run following code to create two data containers.
s = pd.Series([1,2,3,4,5,6])
d = pd.DataFrame(np.arange(1, 19).reshape(3, -1))
print(s)
print(d)
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
0 1 2 3 4 5
0 1 2 3 4 5 6
1 7 8 9 10 11 12
2 13 14 15 16 17 18
Note
Here np.arange(1, 19)
will generate a arrary from \(1\) to \(18\), and reshape(3, -1)
will change the shape of arrary with first dimension as \(3\). Parameter -1
means it will be
derivated internally and here will be replaced with \(6\) because \(6=18\div 3\).
shape
is a property of container and it is a tuple of dimensions.
print(s.shape)
print(d.shape)
(6,)
(3, 6)
For very large DataFrame
, Jupyter notebook
will only show a part of the content:
d = pd.DataFrame(np.random.randn(100, 30))
d
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.343684 | 0.298432 | -0.346330 | -1.360610 | -0.780171 | -0.174479 | -2.073209 | 0.741540 | 0.862682 | 1.790920 | ... | 1.427133 | -0.330105 | -0.833102 | -1.314724 | 0.896119 | -0.738813 | 1.374135 | 0.440371 | 1.041022 | -0.121798 |
1 | -0.193280 | 1.257865 | 0.658521 | -0.556523 | 0.867679 | 1.392618 | 1.358895 | 1.235232 | 0.598424 | 0.407585 | ... | 0.480264 | 0.024192 | 0.416871 | 0.503437 | -1.003764 | -0.444086 | -1.318141 | -0.002651 | 0.091065 | 0.267475 |
2 | -0.091231 | 1.268993 | 0.825904 | -0.386833 | 0.215122 | 0.575687 | -1.589176 | 1.526072 | -1.083706 | 0.360741 | ... | -0.404119 | 0.178404 | -0.065295 | 0.571753 | -1.697193 | 1.665987 | -0.487148 | 0.983719 | 1.474843 | 1.010711 |
3 | -1.040651 | 0.876857 | 0.090111 | -0.774266 | 0.412296 | 0.903602 | 0.168427 | -0.690537 | 0.620906 | 0.483258 | ... | -0.225915 | 0.139047 | 0.568339 | -0.726861 | -0.061449 | 1.187309 | 1.819904 | 0.746932 | 0.246057 | -0.622705 |
4 | -0.890623 | -1.924294 | -1.607911 | -0.919474 | -0.756320 | -0.130709 | 0.181522 | 0.337458 | 0.750356 | -0.552621 | ... | -1.348002 | 0.687834 | -0.597506 | -0.185708 | 0.681879 | -1.264852 | -1.432979 | -0.262975 | 0.340943 | 1.115939 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | -0.790644 | 0.618316 | -0.239128 | 0.223507 | -1.972358 | 1.344895 | -0.921793 | 1.239987 | -0.727588 | 2.143194 | ... | 0.173822 | -0.209904 | 0.167663 | 1.023742 | -1.487167 | -0.950625 | -0.448279 | -0.726560 | 1.566739 | -2.048781 |
96 | -0.897666 | -1.011664 | -1.305373 | 0.292486 | 0.033395 | 1.184037 | -0.224618 | -0.603053 | 0.583313 | 1.238475 | ... | -1.162605 | 1.927186 | 0.189962 | -0.048620 | -0.499449 | -1.396445 | -0.187596 | -0.866826 | 0.488415 | -1.043417 |
97 | -0.386214 | -0.737502 | -0.724941 | 2.207187 | 1.191978 | -0.849805 | 1.267080 | 0.092453 | 0.524919 | -0.903533 | ... | -1.127624 | 0.242240 | 0.553355 | -1.119842 | 0.132852 | -1.981549 | 1.617266 | -1.088464 | -1.790250 | 0.687706 |
98 | 0.256366 | -1.496072 | -0.901937 | -0.612122 | -0.097003 | -0.011705 | 0.593026 | -0.146939 | -0.992527 | -0.885829 | ... | 1.609611 | 1.205030 | -0.915260 | 0.181878 | 0.419530 | 2.393224 | 0.259172 | 0.533195 | 0.432964 | 0.874746 |
99 | -0.000901 | -1.145946 | 0.295250 | -0.390829 | 1.592131 | 0.339830 | 1.496009 | -0.121397 | 0.382855 | -0.491859 | ... | -1.915573 | -0.667241 | 1.049157 | 0.382023 | -0.689314 | 0.989662 | 0.171298 | -0.197036 | -0.235648 | 0.647312 |
100 rows × 30 columns
Note
Here np.random.randn(n, m)
will generate a random matrix with \(n\) rows and \(m\) columns.
Each element in the matrix follows the standard normal distribution
\( X\sim \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\)
Try head(n)
and it will give you the first \(n\) rows of the DataFrame
while tail(n)
will give the last \(n\) rows.
d.head(5)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.343684 | 0.298432 | -0.346330 | -1.360610 | -0.780171 | -0.174479 | -2.073209 | 0.741540 | 0.862682 | 1.790920 | ... | 1.427133 | -0.330105 | -0.833102 | -1.314724 | 0.896119 | -0.738813 | 1.374135 | 0.440371 | 1.041022 | -0.121798 |
1 | -0.193280 | 1.257865 | 0.658521 | -0.556523 | 0.867679 | 1.392618 | 1.358895 | 1.235232 | 0.598424 | 0.407585 | ... | 0.480264 | 0.024192 | 0.416871 | 0.503437 | -1.003764 | -0.444086 | -1.318141 | -0.002651 | 0.091065 | 0.267475 |
2 | -0.091231 | 1.268993 | 0.825904 | -0.386833 | 0.215122 | 0.575687 | -1.589176 | 1.526072 | -1.083706 | 0.360741 | ... | -0.404119 | 0.178404 | -0.065295 | 0.571753 | -1.697193 | 1.665987 | -0.487148 | 0.983719 | 1.474843 | 1.010711 |
3 | -1.040651 | 0.876857 | 0.090111 | -0.774266 | 0.412296 | 0.903602 | 0.168427 | -0.690537 | 0.620906 | 0.483258 | ... | -0.225915 | 0.139047 | 0.568339 | -0.726861 | -0.061449 | 1.187309 | 1.819904 | 0.746932 | 0.246057 | -0.622705 |
4 | -0.890623 | -1.924294 | -1.607911 | -0.919474 | -0.756320 | -0.130709 | 0.181522 | 0.337458 | 0.750356 | -0.552621 | ... | -1.348002 | 0.687834 | -0.597506 | -0.185708 | 0.681879 | -1.264852 | -1.432979 | -0.262975 | 0.340943 | 1.115939 |
5 rows × 30 columns
In order to make our examples meaningful,
we employ a real-world dataset as an example to show how pandas
works. World Health Organization
provides latest data about the COVID-19 such as vaccination and confirmed cases, which is keeped in CSV-format files. Here we
refer to the vaccination data: https://covid19.who.int/who-data/vaccination-data.csv.
vac = pd.read_csv('https://covid19.who.int/who-data/vaccination-data.csv')
vac
COUNTRY | ISO3 | WHO_REGION | DATA_SOURCE | DATE_UPDATED | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | AFG | EMRO | REPORTING | 2022-03-06 | 5597130 | 4952744.0 | 14.378 | 12.723 | 4281934.0 | 11.000 | Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2... | 2021-02-22 | 4.0 |
1 | Albania | ALB | EURO | REPORTING | 2022-02-20 | 2707658 | 1284034.0 | 94.100 | 45.118 | 1192155.0 | 41.889 | AstraZeneca - Vaxzevria,Gamaleya - Gam-Covid-V... | 2021-01-13 | 5.0 |
2 | Algeria | DZA | AFRO | REPORTING | 2022-02-20 | 13631683 | 7456361.0 | 31.086 | 17.004 | 6076272.0 | 13.857 | Beijing CNBG - BBIBP-CorV,Gamaleya - Gam-Covid... | 2021-01-30 | 4.0 |
3 | American Samoa | ASM | WPRO | REPORTING | 2022-02-16 | 85050 | 42212.0 | 154.084 | 76.475 | 37805.0 | 68.491 | Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz... | 2020-12-21 | 3.0 |
4 | Andorra | AND | EURO | REPORTING | 2022-02-13 | 142420 | 57797.0 | 184.300 | 75.872 | 53250.0 | 69.903 | AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi... | 2021-01-20 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
223 | Viet Nam | VNM | WPRO | REPORTING | 2022-02-24 | 192865986 | 79469390.0 | 198.139 | 81.642 | 75322542.0 | 77.382 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-03-08 | 6.0 |
224 | Wallis and Futuna | WLF | WPRO | REPORTING | 2022-02-18 | 15633 | 6450.0 | 139.009 | 57.354 | 6399.0 | 56.900 | Moderna - Spikevax | 2021-03-19 | 1.0 |
225 | Yemen | YEM | EMRO | REPORTING | 2022-03-06 | 784792 | 624837.0 | 2.631 | 2.095 | 384655.0 | 1.290 | Janssen - Ad26.COV 2-S,SII - Covishield,Sinova... | 2021-04-20 | 3.0 |
226 | Zambia | ZMB | AFRO | REPORTING | 2022-03-03 | 2858338 | 2510296.0 | 15.548 | 13.655 | 1883799.0 | 10.247 | Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2... | 2021-04-14 | 3.0 |
227 | Zimbabwe | ZWE | AFRO | REPORTING | 2022-03-04 | 7904719 | 4368633.0 | 53.184 | 29.393 | 3403501.0 | 22.899 | Beijing CNBG - BBIBP-CorV,Bharat - Covaxin,Gam... | 2021-02-18 | 4.0 |
228 rows × 14 columns
Note
read_csv(path)
reads a CSV-format file from URL (Web link or local file) and returns a DataFrame
.
For example, you can use pd.read_csv('./data.csv')
to load csv file from your current directory.
Properties index
and columns
will give the row/column labels:
vac.columns
Index(['COUNTRY', 'ISO3', 'WHO_REGION', 'DATA_SOURCE', 'DATE_UPDATED',
'TOTAL_VACCINATIONS', 'PERSONS_VACCINATED_1PLUS_DOSE',
'TOTAL_VACCINATIONS_PER100', 'PERSONS_VACCINATED_1PLUS_DOSE_PER100',
'PERSONS_FULLY_VACCINATED', 'PERSONS_FULLY_VACCINATED_PER100',
'VACCINES_USED', 'FIRST_VACCINE_DATE', 'NUMBER_VACCINES_TYPES_USED'],
dtype='object')
In addition, you may want to sort by values, just like what we usually do in Excel
.
vac.sort_values(by='TOTAL_VACCINATIONS', ascending=False)
COUNTRY | ISO3 | WHO_REGION | DATA_SOURCE | DATE_UPDATED | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | China | CHN | WPRO | REPORTING | 2022-02-27 | 3138003103 | 1.276556e+09 | 213.283 | 86.765 | 1.235633e+09 | 83.983 | Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co... | 2020-07-22 | 8.0 |
94 | India | IND | SEARO | REPORTING | 2022-02-28 | 1776718549 | 9.642345e+08 | 128.747 | 69.872 | 7.923996e+08 | 57.420 | Bharat - Covaxin,Gamaleya - Gam-Covid-Vac,Jans... | 2021-01-16 | 6.0 |
218 | United States of America | USA | AMRO | REPORTING | 2022-02-25 | 537567013 | 2.531571e+08 | 162.406 | 76.482 | 2.101726e+08 | 63.496 | Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz... | 2020-12-14 | 3.0 |
29 | Brazil | BRA | AMRO | REPORTING | 2022-02-25 | 369527744 | 1.723341e+08 | 173.847 | 81.076 | 1.487775e+08 | 69.993 | AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S... | 2021-01-17 | 5.0 |
95 | Indonesia | IDN | SEARO | REPORTING | 2022-03-01 | 345697245 | 1.909768e+08 | 126.387 | 69.821 | 1.445058e+08 | 52.831 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-01-13 | 7.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
172 | Saba | XCA | AMRO | REPORTING | 2022-02-25 | 3148 | 1.582000e+03 | 162.856 | 81.842 | 1.566000e+03 | 81.014 | NaN | NaN | NaN |
186 | Sint Eustatius | XBA | AMRO | REPORTING | 2022-02-25 | 3110 | 1.588000e+03 | 99.076 | 50.589 | 1.522000e+03 | 48.487 | NaN | NaN | NaN |
149 | Niue | NIU | WPRO | REPORTING | 2021-11-24 | 2628 | 1.340000e+03 | 162.423 | 82.818 | 1.288000e+03 | 79.604 | Pfizer BioNTech - Comirnaty | 2021-06-08 | 1.0 |
206 | Tokelau | TKL | WPRO | REPORTING | 2021-09-28 | 1936 | 9.680000e+02 | 143.407 | 71.704 | 9.680000e+02 | 71.704 | Pfizer BioNTech - Comirnaty | 2021-07-20 | 1.0 |
162 | Pitcairn Islands | PCN | WPRO | REPORTING | 2021-11-11 | 74 | 3.700000e+01 | 148.000 | 74.000 | 3.700000e+01 | 74.000 | AstraZeneca - Vaxzevria | 2021-05-17 | 1.0 |
228 rows × 14 columns
T
property will return a transpose of DataFrame
.
vac.T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
COUNTRY | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | United States of America | Uruguay | Uzbekistan | Vanuatu | Venezuela (Bolivarian Republic of) | Viet Nam | Wallis and Futuna | Yemen | Zambia | Zimbabwe |
ISO3 | AFG | ALB | DZA | ASM | AND | AGO | AIA | ATG | ARG | ARM | ... | USA | URY | UZB | VUT | VEN | VNM | WLF | YEM | ZMB | ZWE |
WHO_REGION | EMRO | EURO | AFRO | WPRO | EURO | AFRO | AMRO | AMRO | AMRO | EURO | ... | AMRO | AMRO | EURO | WPRO | AMRO | WPRO | WPRO | EMRO | AFRO | AFRO |
DATA_SOURCE | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | ... | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING | REPORTING |
DATE_UPDATED | 2022-03-06 | 2022-02-20 | 2022-02-20 | 2022-02-16 | 2022-02-13 | 2022-03-02 | 2022-02-25 | 2022-02-25 | 2022-02-25 | 2022-02-13 | ... | 2022-02-25 | 2022-02-25 | 2022-02-20 | 2022-02-20 | 2022-02-25 | 2022-02-24 | 2022-02-18 | 2022-03-06 | 2022-03-03 | 2022-03-04 |
TOTAL_VACCINATIONS | 5597130 | 2707658 | 13631683 | 85050 | 142420 | 16633167 | 22165 | 124726 | 93008081 | 1971565 | ... | 537567013 | 7612199 | 42121870 | 197335 | 37860994 | 192865986 | 15633 | 784792 | 2858338 | 7904719 |
PERSONS_VACCINATED_1PLUS_DOSE | 4952744.0 | 1284034.0 | 7456361.0 | 42212.0 | 57797.0 | 10780927.0 | 10370.0 | 63582.0 | 40572052.0 | 1080404.0 | ... | 253157102.0 | 2964083.0 | 18446210.0 | 114284.0 | 22157232.0 | 79469390.0 | 6450.0 | 624837.0 | 2510296.0 | 4368633.0 |
TOTAL_VACCINATIONS_PER100 | 14.378 | 94.1 | 31.086 | 154.084 | 184.3 | 50.609 | 147.747 | 127.364 | 205.789 | 66.5 | ... | 162.406 | 219.136 | 125.9 | 64.248 | 133.145 | 198.139 | 139.009 | 2.631 | 15.548 | 53.184 |
PERSONS_VACCINATED_1PLUS_DOSE_PER100 | 12.723 | 45.118 | 17.004 | 76.475 | 75.872 | 32.802 | 69.124 | 64.927 | 89.77 | 36.46 | ... | 76.482 | 85.329 | 55.114 | 37.208 | 77.92 | 81.642 | 57.354 | 2.095 | 13.655 | 29.393 |
PERSONS_FULLY_VACCINATED | 4281934.0 | 1192155.0 | 6076272.0 | 37805.0 | 53250.0 | 5609044.0 | 9531.0 | 61144.0 | 36237826.0 | 872336.0 | ... | 210172612.0 | 2728709.0 | 13160671.0 | 85180.0 | 14287370.0 | 75322542.0 | 6399.0 | 384655.0 | 1883799.0 | 3403501.0 |
PERSONS_FULLY_VACCINATED_PER100 | 11.0 | 41.889 | 13.857 | 68.491 | 69.903 | 17.066 | 63.532 | 62.437 | 80.18 | 29.439 | ... | 63.496 | 78.553 | 39.322 | 27.733 | 50.244 | 77.382 | 56.9 | 1.29 | 10.247 | 22.899 |
VACCINES_USED | Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2... | AstraZeneca - Vaxzevria,Gamaleya - Gam-Covid-V... | Beijing CNBG - BBIBP-CorV,Gamaleya - Gam-Covid... | Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz... | AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi... | SII - Covishield | AstraZeneca - Vaxzevria,Pfizer BioNTech - Comi... | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | ... | Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz... | AstraZeneca - Vaxzevria,Pfizer BioNTech - Comi... | Anhui ZL - Recombinant,Gamaleya - Gam-Covid-Va... | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | Beijing CNBG - BBIBP-CorV,CIGB - CIGB-66,Finla... | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | Moderna - Spikevax | Janssen - Ad26.COV 2-S,SII - Covishield,Sinova... | Beijing CNBG - BBIBP-CorV,Janssen - Ad26.COV 2... | Beijing CNBG - BBIBP-CorV,Bharat - Covaxin,Gam... |
FIRST_VACCINE_DATE | 2021-02-22 | 2021-01-13 | 2021-01-30 | 2020-12-21 | 2021-01-20 | 2021-03-10 | 2021-02-05 | 2021-03-01 | 2020-12-29 | 2021-06-16 | ... | 2020-12-14 | 2021-02-27 | 2021-04-28 | 2021-06-02 | 2021-02-22 | 2021-03-08 | 2021-03-19 | 2021-04-20 | 2021-04-14 | 2021-02-18 |
NUMBER_VACCINES_TYPES_USED | 4.0 | 5.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 6.0 | 7.0 | 5.0 | ... | 3.0 | 3.0 | 7.0 | 3.0 | 6.0 | 6.0 | 1.0 | 3.0 | 3.0 | 4.0 |
14 rows × 228 columns
You can use describe()
function to have a quick look on DataFrame
:
vac.describe()
TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | NUMBER_VACCINES_TYPES_USED | |
---|---|---|---|---|---|---|---|
count | 2.280000e+02 | 2.270000e+02 | 228.000000 | 227.000000 | 2.270000e+02 | 227.000000 | 225.000000 |
mean | 4.694756e+07 | 2.187060e+07 | 122.156500 | 55.765784 | 1.928435e+07 | 50.121868 | 3.928889 |
std | 2.442809e+08 | 1.090964e+08 | 71.896713 | 27.044030 | 9.958222e+07 | 26.417687 | 1.928247 |
min | 7.400000e+01 | 3.700000e+01 | 0.097000 | 0.082000 | 3.700000e+01 | 0.077000 | 1.000000 |
25% | 3.534380e+05 | 1.733685e+05 | 62.841000 | 34.350500 | 1.458095e+05 | 29.097500 | 2.000000 |
50% | 2.721102e+06 | 1.536112e+06 | 127.515000 | 63.069000 | 1.192155e+06 | 53.974000 | 4.000000 |
75% | 1.670141e+07 | 7.843871e+06 | 178.022250 | 77.008000 | 6.861035e+06 | 72.275500 | 5.000000 |
max | 3.138003e+09 | 1.276556e+09 | 335.811000 | 124.567000 | 1.235633e+09 | 121.448000 | 11.000000 |
A better way to present data is using plot()
function to draw all columns:
import matplotlib.pyplot as plt
vac = vac.sort_values(by='TOTAL_VACCINATIONS', ascending=False)
top20 = vac[['TOTAL_VACCINATIONS']][:20]
countries = vac['COUNTRY'][:20]
top20.index = countries
top20.plot.pie(y='TOTAL_VACCINATIONS', figsize=(15,15))
<AxesSubplot:ylabel='TOTAL_VACCINATIONS'>
Tip
plot()
is a convinient way to visualize data in pandas
, but it is
not perfect. We will learn more about the data visualization in following lectures.
Select¶
Pandas
provides different ways to select elements in the DataFrame
.
iloc
¶
iloc[n]
will give the \(n-1\) rows of the DataFrame
, i.e., a Series
.
vac.iloc[0]
COUNTRY China
ISO3 CHN
WHO_REGION WPRO
DATA_SOURCE REPORTING
DATE_UPDATED 2022-02-27
TOTAL_VACCINATIONS 3138003103
PERSONS_VACCINATED_1PLUS_DOSE 1276556393.0
TOTAL_VACCINATIONS_PER100 213.283
PERSONS_VACCINATED_1PLUS_DOSE_PER100 86.765
PERSONS_FULLY_VACCINATED 1235633241.0
PERSONS_FULLY_VACCINATED_PER100 83.983
VACCINES_USED Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co...
FIRST_VACCINE_DATE 2020-07-22
NUMBER_VACCINES_TYPES_USED 8.0
Name: 43, dtype: object
Tip
iloc
index starts from \(0\), instead of \(1\). So the \(n\)-rows has a index \(n-1\).
In order to select multiple rows, just feed a list containing indices to the iloc
:
vac.iloc[[0, 1, 2]]
COUNTRY | ISO3 | WHO_REGION | DATA_SOURCE | DATE_UPDATED | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | China | CHN | WPRO | REPORTING | 2022-02-27 | 3138003103 | 1.276556e+09 | 213.283 | 86.765 | 1.235633e+09 | 83.983 | Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co... | 2020-07-22 | 8.0 |
94 | India | IND | SEARO | REPORTING | 2022-02-28 | 1776718549 | 9.642345e+08 | 128.747 | 69.872 | 7.923996e+08 | 57.420 | Bharat - Covaxin,Gamaleya - Gam-Covid-Vac,Jans... | 2021-01-16 | 6.0 |
218 | United States of America | USA | AMRO | REPORTING | 2022-02-25 | 537567013 | 2.531571e+08 | 162.406 | 76.482 | 2.101726e+08 | 63.496 | Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz... | 2020-12-14 | 3.0 |
Also, you can select both rows and columns at the same time:
vac.iloc[[0, 1, 2], [0, 4, 5]]
COUNTRY | DATE_UPDATED | TOTAL_VACCINATIONS | |
---|---|---|---|
43 | China | 2022-02-27 | 3138003103 |
94 | India | 2022-02-28 | 1776718549 |
218 | United States of America | 2022-02-25 | 537567013 |
Or just access single element:
vac.iloc[0, 4]
'2022-02-27'
See also
You can also use slice operations to access elements, try vac.iloc[0:2, 0:3]
loc
¶
As you can see, iloc
may be confused to read because it uses numbers as index. However,
loc
provides another way to access by index and columns in DataFrame
.
vac.loc[0:3, ["COUNTRY", "DATE_UPDATED", "TOTAL_VACCINATIONS"]]
COUNTRY | DATE_UPDATED | TOTAL_VACCINATIONS | |
---|---|---|---|
0 | Afghanistan | 2022-03-06 | 5597130 |
134 | Mongolia | 2022-02-25 | 5568712 |
114 | Lebanon | 2022-02-28 | 5452304 |
50 | Croatia | 2022-02-20 | 5179921 |
87 | Guinea | 2022-03-03 | 5159979 |
... | ... | ... | ... |
151 | Northern Mariana Islands (Commonwealth of the) | 2022-02-04 | 105295 |
68 | Faroe Islands | 2022-02-18 | 103894 |
132 | Micronesia (Federated States of) | 2022-02-16 | 97185 |
82 | Grenada | 2022-02-25 | 85891 |
3 | American Samoa | 2022-02-16 | 85050 |
103 rows × 3 columns
Important
Becareful with the index names in iloc
and loc
. They are different.
Filter¶
Both loc
and iloc
accept boolean parameters, which can be used to filter
particular elements in the DataFrame
.
vac[vac['TOTAL_VACCINATIONS'] > 100000000]
COUNTRY | ISO3 | WHO_REGION | DATA_SOURCE | DATE_UPDATED | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | China | CHN | WPRO | REPORTING | 2022-02-27 | 3138003103 | 1.276556e+09 | 213.283 | 86.765 | 1.235633e+09 | 83.983 | Anhui ZL - Recombinant,Beijing CNBG - BBIBP-Co... | 2020-07-22 | 8.0 |
94 | India | IND | SEARO | REPORTING | 2022-02-28 | 1776718549 | 9.642345e+08 | 128.747 | 69.872 | 7.923996e+08 | 57.420 | Bharat - Covaxin,Gamaleya - Gam-Covid-Vac,Jans... | 2021-01-16 | 6.0 |
218 | United States of America | USA | AMRO | REPORTING | 2022-02-25 | 537567013 | 2.531571e+08 | 162.406 | 76.482 | 2.101726e+08 | 63.496 | Janssen - Ad26.COV 2-S,Moderna - Spikevax,Pfiz... | 2020-12-14 | 3.0 |
29 | Brazil | BRA | AMRO | REPORTING | 2022-02-25 | 369527744 | 1.723341e+08 | 173.847 | 81.076 | 1.487775e+08 | 69.993 | AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S... | 2021-01-17 | 5.0 |
95 | Indonesia | IDN | SEARO | REPORTING | 2022-03-01 | 345697245 | 1.909768e+08 | 126.387 | 69.821 | 1.445058e+08 | 52.831 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-01-13 | 7.0 |
103 | Japan | JPN | WPRO | REPORTING | 2022-02-24 | 223820819 | 1.016722e+08 | 176.966 | 80.388 | 1.001786e+08 | 79.207 | AstraZeneca - Vaxzevria,Moderna - Spikevax,Nov... | 2021-02-17 | 4.0 |
155 | Pakistan | PAK | EMRO | REPORTING | 2022-02-28 | 213532343 | 1.243097e+08 | 96.668 | 56.276 | 1.000965e+08 | 45.315 | Beijing CNBG - BBIBP-CorV,Bharat - Covaxin,Can... | 2021-02-03 | 8.0 |
16 | Bangladesh | BGD | SEARO | REPORTING | 2022-03-01 | 213176935 | 1.246919e+08 | 129.442 | 75.713 | 8.461175e+07 | 51.377 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-01-27 | 8.0 |
223 | Viet Nam | VNM | WPRO | REPORTING | 2022-02-24 | 192865986 | 7.946939e+07 | 198.139 | 81.642 | 7.532254e+07 | 77.382 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-03-08 | 6.0 |
131 | Mexico | MEX | AMRO | REPORTING | 2022-02-25 | 179274307 | 8.508365e+07 | 139.045 | 65.991 | 7.875005e+07 | 61.078 | AstraZeneca - Vaxzevria,CanSino - Convidecia,G... | 2020-12-24 | 7.0 |
77 | Germany | DEU | EURO | REPORTING | 2022-02-20 | 169071638 | 6.338051e+07 | 203.300 | 76.209 | 6.250088e+07 | 75.151 | AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S... | 2020-12-23 | 4.0 |
170 | Russian Federation | RUS | EURO | OWID | 2022-02-17 | 160816996 | 7.828428e+07 | 110.200 | 53.643 | 7.160191e+07 | 49.064 | SRCVB - EpiVacCorona, Gamaleya - Sputnik V | NaN | 2.0 |
71 | France | FRA | EURO | REPORTING | 2022-02-20 | 153013512 | 5.470059e+07 | 227.300 | 81.254 | 5.368728e+07 | 79.749 | AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S... | 2020-12-30 | 4.0 |
210 | Turkey | TUR | EURO | OWID | 2022-02-23 | 144850157 | 5.762345e+07 | 171.700 | 69.296 | 5.271558e+07 | 63.394 | Pfizer BioNTech - Comirnaty, Sinovac - CoronaV... | NaN | 3.0 |
96 | Iran (Islamic Republic of) | IRN | EMRO | REPORTING | 2022-03-06 | 142195819 | 6.311054e+07 | 169.295 | 75.138 | 5.571737e+07 | 66.336 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-02-09 | 6.0 |
203 | The United Kingdom | GBR | EURO | REPORTING | 2022-02-22 | 139482283 | 5.257359e+07 | 205.500 | 77.246 | 4.888589e+07 | 71.828 | AstraZeneca - Vaxzevria,Moderna - Spikevax,Pfi... | 2020-12-21 | 3.0 |
161 | Philippines | PHL | WPRO | REPORTING | 2022-02-24 | 135251295 | 6.866160e+07 | 123.426 | 62.658 | 6.301683e+07 | 57.507 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-03-01 | 11.0 |
101 | Italy | ITA | EURO | REPORTING | 2022-02-20 | 131126289 | 5.019838e+07 | 219.900 | 84.171 | 4.679558e+07 | 78.461 | AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S... | 2020-12-23 | 4.0 |
202 | Thailand | THA | SEARO | REPORTING | 2022-02-28 | 123738218 | 5.358496e+07 | 177.275 | 76.769 | 4.967325e+07 | 71.165 | AstraZeneca - Vaxzevria,Beijing CNBG - BBIBP-C... | 2021-02-28 | 7.0 |
167 | Republic of Korea | KOR | WPRO | REPORTING | 2022-02-27 | 118895241 | 4.477671e+07 | 231.904 | 87.336 | 4.428542e+07 | 86.378 | AstraZeneca - Vaxzevria,Janssen - Ad26.COV 2-S... | 2021-02-26 | 5.0 |