Downloading data from World Bank Data (Python version)

This section explains how to use the functions in the worldbank module of the epigraphhub package to download data hosted on the World Bank Data platform.

All the functions in this module are built on top of the wbgapi package.

Function `search_in_database()`

This function lets the user search, by keyword, for the name of a database hosted on the World Bank Data platform. It searches over all the databases and returns the matches as a pandas DataFrame with some information about each database found.

The most important columns of the returned DataFrame are:

name, which is matched against the keyword in the search;
id, which is used to refer to the database in other functions;
lastupdated, which gives the date the data in the database was last updated.

This function has a single parameter, keyword, which must be a string.

For example, to search for all the databases with the keyword global in their name:

from epigraphhub.data.worldbank import search_in_database

df_db = search_in_database('global')

df_db

If you use keyword = 'all', all the available databases are returned.
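The matching rule described above can be sketched in plain Python. This is only an illustration, not the library's code: it assumes a case-insensitive substring match on the name column (the text does not state how case is handled), search_names is a hypothetical helper, and every row except id 2 is invented for the example.

```python
# Toy stand-in for the table of databases. Only id 2 and its name come from
# this page; the other rows are invented for illustration.
DATABASES = [
    {"id": 2, "name": "World Development Indicators"},
    {"id": 901, "name": "Example Global Survey"},
    {"id": 902, "name": "Example Regional Statistics"},
]


def search_names(keyword, rows=DATABASES):
    """Return the rows whose name contains keyword; 'all' returns every row.

    Hypothetical helper: the case-insensitive match is an assumption.
    """
    if keyword == "all":
        return list(rows)
    needle = keyword.lower()
    return [row for row in rows if needle in row["name"].lower()]


matches = search_names("global")
everything = search_names("all")
print([row["id"] for row in matches])
print(len(everything))
```

With the real function, the id of the database you want would be read from the id column of the returned DataFrame in the same way.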

After selecting a database, we can use the function search_in_indicators() to see what indicators we can get from this database.

Function `search_in_indicators()`

This function returns a DataFrame with the indicators whose names partially match a keyword. It accepts two parameters. The first, keyword, is a string that is matched against the indicators’ names in a specific database. The second, db, selects that database and only accepts int values: it must be the id number of the database, which can be obtained with the function search_in_database().

If the db parameter is not set, the function uses the default db = 2, which returns the list of indicators from the World Development Indicators database.

For example, to get the names of the indicators related to air pollution in db = 2, call search_in_indicators('air pollution', db = 2). The returned DataFrame will be:

from epigraphhub.data.worldbank import search_in_indicators 

df_ind = search_in_indicators('air pollution', db = 2)

df_ind 

	id	value
0	EN.ATM.PM25.MC.M3	PM2.5 air pollution, mean annual exposure (mic...
1	EN.ATM.PM25.MC.ZS	PM2.5 air pollution, population exposed to lev...
2	SH.STA.AIRP.FE.P5	Mortality rate attributed to household and amb...
3	SH.STA.AIRP.MA.P5	Mortality rate attributed to household and amb...
4	SH.STA.AIRP.P5	Mortality rate attributed to household and amb...

The id column holds the values used to request the data for the indicator described in the corresponding value cell. To get this data, we will use the function get_worldbank_data().
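Since the id values are plain strings, selecting a subset of them needs nothing beyond list handling. The sketch below copies the five ids from the output above into a list by hand, so it runs without the package. With the real DataFrame, df_ind['id'].tolist() would give an equivalent list.

```python
# Indicator ids copied by hand from the id column shown above.
indicator_ids = [
    "EN.ATM.PM25.MC.M3",
    "EN.ATM.PM25.MC.ZS",
    "SH.STA.AIRP.FE.P5",
    "SH.STA.AIRP.MA.P5",
    "SH.STA.AIRP.P5",
]

# Keep only the PM2.5 indicators, which share the "EN.ATM." prefix.
ind = [code for code in indicator_ids if code.startswith("EN.ATM.")]
print(ind)
```

The resulting list has the shape expected by the ind parameter described in the next section.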

Function `get_worldbank_data()`

The function get_worldbank_data() returns a DataFrame with the indicators available in one of the World Bank Data databases.

This function has the following parameters:

ind: A list of strings, where each value is an indicator’s id. An indicator’s id can be obtained with the function search_in_indicators().
country: A list of strings, where each value is the ISO code of a country of interest.
db: An int value identifying the database the data is taken from. You can obtain this value with the function search_in_database().
time: If time = 'all', the function returns all the data available. You can also specify a range of years. For example, to get the data for the years 2010 to 2020, set time = range(2010, 2021).
columns: Used to rename the columns of the returned DataFrame. By default, the indicator columns are named after the values in ind. To rename them, provide a list of strings with the same length as the list in the parameter ind. The columns are renamed in order, so the first value in columns becomes the new name of the first value in ind.
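The positional pairing between ind and columns can be checked before making a request. The following is a minimal sketch, not the library's implementation: pair_columns is a hypothetical helper, and raising ValueError on a length mismatch is an assumption about what a reasonable check would do.

```python
def pair_columns(ind, columns):
    """Pair each indicator id with its new column name, by position.

    Hypothetical helper, not part of epigraphhub.
    """
    if len(columns) != len(ind):
        raise ValueError(
            f"columns has {len(columns)} names but ind has {len(ind)} indicators"
        )
    return dict(zip(ind, columns))


mapping = pair_columns(
    ["EN.ATM.PM25.MC.M3", "EN.ATM.PM25.MC.ZS"],
    ["air_1", "air_2"],
)
print(mapping)
```

Running such a check first makes it easy to confirm which name ends up on which indicator before the data is downloaded.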

For example, we can get the data for the first two indicators obtained in the last section for the countries Brazil and Switzerland. In this case ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS'], country = ['BRA', 'CHE'], and db = 2 (the indicators referred to in ind are available in the database with id number 2).

Using these parameters the result will be:

from epigraphhub.data.worldbank import get_worldbank_data

ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']
country = ['BRA', 'CHE']

df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021))

df = df.sort_index()

df.head()

	country	en_atm_pm25_mc_m3	en_atm_pm25_mc_zs	frequency
date
2010-01-01	CHE	12.922220	93.000705	yearly
2010-01-01	BRA	15.955285	90.938123	yearly
2011-01-01	BRA	15.912798	91.928375	yearly
2011-01-01	CHE	13.049221	94.785235	yearly
2012-01-01	CHE	12.261388	91.820914	yearly

By default, the function converts the column names to lower case and replaces ‘.’ with ‘_’. To rename the columns to ‘air_1’ and ‘air_2’, for example, add the parameter columns = ['air_1', 'air_2']. The result will be:

from epigraphhub.data.worldbank import get_worldbank_data

ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']
country = ['BRA', 'CHE']

df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021), columns = ['air_1', 'air_2'])

df = df.sort_index()

df.head()

	country	air_1	air_2	frequency
date
2010-01-01	CHE	12.922220	93.000705	yearly
2010-01-01	BRA	15.955285	90.938123	yearly
2011-01-01	BRA	15.912798	91.928375	yearly
2011-01-01	CHE	13.049221	94.785235	yearly
2012-01-01	CHE	12.261388	91.820914	yearly
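For comparison, the default column names seen in the first output (en_atm_pm25_mc_m3 and en_atm_pm25_mc_zs) can be reproduced from the indicator ids with two string operations. This is a plain-Python sketch of that rule; default_column_name is a hypothetical helper, not a function of the package.

```python
def default_column_name(indicator_id):
    """Lower-case an indicator id and replace '.' with '_'.

    Hypothetical helper that mirrors the default naming described above.
    """
    return indicator_id.lower().replace(".", "_")


names = [
    default_column_name(code)
    for code in ["EN.ATM.PM25.MC.M3", "EN.ATM.PM25.MC.ZS"]
]
print(names)
```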

Function `get_pop_data()`

This function gets the population data, stratified by age and sex, from the database with id number 2, the World Development Indicators. It has three parameters:

country: A string with the ISO code of the country you want to get the data from.
time: If time = 'all', the function returns all the data available. You can also specify a range of years. For example, to get the data for the years 2010 to 2020, set time = range(2010, 2021).
fx_et: This parameter selects the stratification type of the population data. There are three possibilities:

If fx_et == '5Y', the population is returned by 5-year age groups.
If fx_et == 'IN', the population is returned divided into 3 age groups.
If fx_et == 'TOTL', the total population is returned without considering the age groups.
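Two details of these parameters are easy to get wrong: fx_et has only three valid values, and a Python range excludes its end value, so range(2010, 2021) stops at 2020. The sketch below checks both before any request is made. check_pop_args is a hypothetical helper, and rejecting unknown fx_et values with ValueError is an assumption, not documented behaviour of get_pop_data().

```python
VALID_FX_ET = {"5Y", "IN", "TOTL"}


def check_pop_args(time, fx_et):
    """Validate fx_et and expand time into an explicit list of years.

    Hypothetical pre-flight check, not part of epigraphhub.
    """
    if fx_et not in VALID_FX_ET:
        raise ValueError(f"fx_et must be one of {sorted(VALID_FX_ET)}, got {fx_et!r}")
    if time == "all":
        return "all"
    return list(time)


years = check_pop_args(range(2010, 2021), "IN")
print(years[0], years[-1], len(years))
```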

The return of the function is a pandas DataFrame.

In the cell below, you can see an example of how to get the population data divided into three age groups in Switzerland.

from epigraphhub.data.worldbank import get_pop_data

country = 'CHE'
time = range(2016,2022)
df_pop = get_pop_data(country, time , fx_et = 'IN')

df_pop

	pop_0014_fe	pop_0014_ma	pop_1564_fe	pop_1564_ma	pop_65up_fe	pop_65up_ma	pop_totl_fe	pop_totl_ma	frequency	country
2016-01-01	604573.0	637409.0	2775453.0	2829540.0	846466.0	679897.0	4226492.0	4146846.0	yearly	CHE
2017-01-01	612125.0	644879.0	2790916.0	2846187.0	860828.0	696905.0	4263869.0	4187971.0	yearly	CHE
2018-01-01	618492.0	651077.0	2801611.0	2857509.0	873428.0	712212.0	4293531.0	4220798.0	yearly	CHE
2019-01-01	624324.0	656882.0	2811358.0	2867308.0	886912.0	728496.0	4322594.0	4252686.0	yearly	CHE
2020-01-01	629609.0	662354.0	2819974.0	2875524.0	902457.0	746978.0	4352040.0	4284856.0	yearly	CHE
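As a quick sanity check on this output, the three age groups should add up to the total for each sex, and the share of each group follows directly. The sketch below copies the 2020 row by hand into a dictionary so it runs without the package. group_total is a hypothetical helper written for this example.

```python
# Values for 2020 copied by hand from the last row of the output above.
row_2020 = {
    "pop_0014_fe": 629609.0, "pop_0014_ma": 662354.0,
    "pop_1564_fe": 2819974.0, "pop_1564_ma": 2875524.0,
    "pop_65up_fe": 902457.0, "pop_65up_ma": 746978.0,
    "pop_totl_fe": 4352040.0, "pop_totl_ma": 4284856.0,
}


def group_total(row, group):
    """Add the female and male columns of one age group."""
    return row[f"pop_{group}_fe"] + row[f"pop_{group}_ma"]


total = group_total(row_2020, "totl")
parts = sum(group_total(row_2020, g) for g in ("0014", "1564", "65up"))
share_65up = group_total(row_2020, "65up") / total

print(f"age groups sum to total: {parts == total}")
print(f"share aged 65 and over: {share_65up:.1%}")
```

With the real DataFrame, the same sums can be computed column-wise for every year at once.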
