Downloading data from World Bank Data (Python version)
This section explains how to use the functions in the worldbank module of the epigraphhub package to download data hosted on the World Bank Data platform.
All the functions in this module are based on the implementation of the wbgapi package.
Function search_in_database()
This function lets the user search, by keyword, for the name of a database hosted on the World Bank Data platform. It searches over all the databases and returns the matches as a pandas DataFrame with some information about each database found.
The most important columns of the returned DataFrame are:
The column name, which is matched against the keyword in the search;
The column id, which is used to refer to the database in other functions;
The column lastupdated, which gives the date on which the data in the database was last updated.
This function has a single parameter, keyword, which must be a string.
For example, you can search over all the databases for those with the keyword global in the name:
from epigraphhub.data.worldbank import search_in_database
df_db = search_in_database('global')
df_db
If you use keyword = 'all', all the available databases are returned.
After selecting a database, we can use the function search_in_indicators() to see which indicators are available in it.
Function search_in_indicators()
This function returns a DataFrame with the indicators whose names partially match a keyword. It accepts two parameters. The first, keyword, is a string that is matched against the indicator names in a specific database. The second, db, selects the database; it only accepts int values and must be the id number of the database, which can be obtained with the function search_in_database().
If the db parameter is not given, the function uses the default db = 2, which returns the list of indicators from the World Development Indicators database.
For example, to get the names of the indicators related to air pollution in the database db = 2, just type search_in_indicators('air pollution', db = 2). The returned DataFrame will be:
from epigraphhub.data.worldbank import search_in_indicators
df_ind = search_in_indicators('air pollution', db = 2)
df_ind
| | id | value |
---|---|---|
0 | EN.ATM.PM25.MC.M3 | PM2.5 air pollution, mean annual exposure (mic... |
1 | EN.ATM.PM25.MC.ZS | PM2.5 air pollution, population exposed to lev... |
2 | SH.STA.AIRP.FE.P5 | Mortality rate attributed to household and amb... |
3 | SH.STA.AIRP.MA.P5 | Mortality rate attributed to household and amb... |
4 | SH.STA.AIRP.P5 | Mortality rate attributed to household and amb... |
Each value in the id column identifies the indicator described in the value column of the same row. We will pass these id values to the function get_worldbank_data() to get the data for the indicators.
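As a small illustration of picking id values out of a search result, the sketch below works on a plain list of (id, value) pairs copied from the table above (indicator names truncated as displayed) rather than on the real DataFrame, so it runs without the package or a network connection. The variable names are illustrative.

```python
# Rows copied from the search result shown above
# (indicator names are truncated exactly as displayed there).
rows = [
    ("EN.ATM.PM25.MC.M3", "PM2.5 air pollution, mean annual exposure (mic..."),
    ("EN.ATM.PM25.MC.ZS", "PM2.5 air pollution, population exposed to lev..."),
    ("SH.STA.AIRP.FE.P5", "Mortality rate attributed to household and amb..."),
    ("SH.STA.AIRP.MA.P5", "Mortality rate attributed to household and amb..."),
    ("SH.STA.AIRP.P5", "Mortality rate attributed to household and amb..."),
]

# Keep only the ids whose description starts with "PM2.5".
ind = [ind_id for ind_id, value in rows if value.startswith("PM2.5")]
print(ind)
```

With the DataFrame returned by search_in_indicators(), assuming it has the id and value columns shown above, the same selection could be written with pandas as df_ind.loc[df_ind['value'].str.startswith('PM2.5'), 'id'].tolist().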
Function get_worldbank_data()
The function get_worldbank_data() returns a DataFrame with the data for indicators available in one of the World Bank databases.
This function has the following parameters:
ind: A list of strings, where each value is an indicator's id. An indicator's id can be obtained with the function search_in_indicators().
country: A list of strings, where each value is the ISO code of a country of interest.
db: An int value identifying the database from which the data is retrieved. You can obtain this value with the function search_in_database().
time: If time = 'all', the function returns all the data available. You can also specify a range of years. For example, to get the data for the period between 2010 and 2020, use time = range(2010,2021).
columns: A list of strings used to rename the columns of the returned DataFrame. By default, the indicator columns are named after the ind values. The list must have the same length as the list in the parameter ind, and the columns are renamed in order: the first value in columns is used as the new name of the first value in ind.
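Two details of these parameters are easy to get wrong: Python's range() excludes its end value, and columns is matched to ind by position. The following sketch uses only built-in Python to show both. The names new_names and rename_map are illustrative and are not part of epigraphhub.

```python
# range() stops before its second argument, so 2021 is needed to include 2020.
time = range(2010, 2021)
years = list(time)
print(years[0], years[-1], len(years))

ind = ["EN.ATM.PM25.MC.M3", "EN.ATM.PM25.MC.ZS"]
new_names = ["air_1", "air_2"]  # illustrative names, as would be passed in `columns`

# Both lists must have the same length; names are paired by position.
if len(new_names) != len(ind):
    raise ValueError("columns must have the same length as ind")
rename_map = dict(zip(ind, new_names))
print(rename_map)
```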
For example, we can get the data for the first two indicators obtained in the last section for Brazil and Switzerland. In this case, ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS'], country = ['BRA', 'CHE'], and db = 2 (the indicators referred to in ind are available in the database with id number 2).
Using these parameters, the result will be:
from epigraphhub.data.worldbank import get_worldbank_data
ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']
country = ['BRA', 'CHE']
df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021))
df = df.sort_index()
df.head()
date | country | en_atm_pm25_mc_m3 | en_atm_pm25_mc_zs | frequency |
---|---|---|---|---|
2010-01-01 | CHE | 12.922220 | 93.000705 | yearly |
2010-01-01 | BRA | 15.955285 | 90.938123 | yearly |
2011-01-01 | BRA | 15.912798 | 91.928375 | yearly |
2011-01-01 | CHE | 13.049221 | 94.785235 | yearly |
2012-01-01 | CHE | 12.261388 | 91.820914 | yearly |
By default, the function converts the column names to lower case and replaces '.' with '_'. To rename the columns, for example to 'air_1' and 'air_2', just add the parameter columns = ['air_1', 'air_2'], and the result will be:
from epigraphhub.data.worldbank import get_worldbank_data
ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']
country = ['BRA', 'CHE']
df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021), columns = ['air_1', 'air_2'])
df = df.sort_index()
df.head()
date | country | air_1 | air_2 | frequency |
---|---|---|---|---|
2010-01-01 | CHE | 12.922220 | 93.000705 | yearly |
2010-01-01 | BRA | 15.955285 | 90.938123 | yearly |
2011-01-01 | BRA | 15.912798 | 91.928375 | yearly |
2011-01-01 | CHE | 13.049221 | 94.785235 | yearly |
2012-01-01 | CHE | 12.261388 | 91.820914 | yearly |
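The default column names in the first example of this section follow from lower-casing each indicator id and replacing '.' with '_'. A minimal sketch of that rule in plain Python, where default_column_name is a hypothetical helper written for this example and not a function of the package:

```python
def default_column_name(indicator_id: str) -> str:
    """Lower-case an indicator id and replace '.' with '_'."""
    return indicator_id.lower().replace(".", "_")


ind = ["EN.ATM.PM25.MC.M3", "EN.ATM.PM25.MC.ZS"]
default_names = [default_column_name(i) for i in ind]
print(default_names)
```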
Function get_pop_data()
This function gets the population data, stratified by age and sex, from the World Development Indicators database (id number 2). It has three parameters:
country: A string with the ISO code of the country for which you want to get the data.
time: If time = 'all', the function returns all the data available. You can also specify a range of years. For example, to get the data for the period between 2010 and 2020, use time = range(2010,2021).
fx_et: Selects the stratification type of the population data. There are three possibilities:
If fx_et == '5Y', the population is returned by 5-year age groups.
If fx_et == 'IN', the population is returned divided into 3 age groups.
If fx_et == 'TOTL', the total population is returned, without considering age groups.
The function returns a pandas DataFrame.
In the cell below, you can see an example of how to get the population data for Switzerland divided into three age groups.
from epigraphhub.data.worldbank import get_pop_data
country = 'CHE'
time = range(2016,2022)
df_pop = get_pop_data(country, time , fx_et = 'IN')
df_pop
| | pop_0014_fe | pop_0014_ma | pop_1564_fe | pop_1564_ma | pop_65up_fe | pop_65up_ma | pop_totl_fe | pop_totl_ma | frequency | country |
---|---|---|---|---|---|---|---|---|---|---|
2016-01-01 | 604573.0 | 637409.0 | 2775453.0 | 2829540.0 | 846466.0 | 679897.0 | 4226492.0 | 4146846.0 | yearly | CHE |
2017-01-01 | 612125.0 | 644879.0 | 2790916.0 | 2846187.0 | 860828.0 | 696905.0 | 4263869.0 | 4187971.0 | yearly | CHE |
2018-01-01 | 618492.0 | 651077.0 | 2801611.0 | 2857509.0 | 873428.0 | 712212.0 | 4293531.0 | 4220798.0 | yearly | CHE |
2019-01-01 | 624324.0 | 656882.0 | 2811358.0 | 2867308.0 | 886912.0 | 728496.0 | 4322594.0 | 4252686.0 | yearly | CHE |
2020-01-01 | 629609.0 | 662354.0 | 2819974.0 | 2875524.0 | 902457.0 | 746978.0 | 4352040.0 | 4284856.0 | yearly | CHE |
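As a quick consistency check on this output, the three age groups should add up to the total for each sex. The sketch below uses the 2016 values copied from the table above as a plain dictionary, so it does not call the World Bank API. The variable names are illustrative.

```python
# 2016 values for Switzerland, copied from the table above.
row_2016 = {
    "pop_0014_fe": 604573.0, "pop_0014_ma": 637409.0,
    "pop_1564_fe": 2775453.0, "pop_1564_ma": 2829540.0,
    "pop_65up_fe": 846466.0, "pop_65up_ma": 679897.0,
    "pop_totl_fe": 4226492.0, "pop_totl_ma": 4146846.0,
}

# For each sex, compare the sum of the three age groups with the reported total.
sums_match = {}
for sex in ("fe", "ma"):
    groups = sum(row_2016[f"pop_{age}_{sex}"] for age in ("0014", "1564", "65up"))
    sums_match[sex] = groups == row_2016[f"pop_totl_{sex}"]
print(sums_match)

# Share of the whole population (both sexes) aged 65 and over.
total = row_2016["pop_totl_fe"] + row_2016["pop_totl_ma"]
share_65up = (row_2016["pop_65up_fe"] + row_2016["pop_65up_ma"]) / total
print(f"Share of population aged 65 and over: {share_65up:.1%}")
```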