Downloading data from World Bank Data (Python version)

This section explains how to use the functions in the worldbank module of the epigraphhub package to download data hosted on the World Bank Data platform.

All the functions in this module are built on top of the wbgapi package.

Function search_in_database()

This function lets the user search, by keyword, for the name of a database hosted on World Bank Data. It searches over all the databases and returns the matches as a pandas DataFrame with some information about each database found.

The most important columns of the DataFrame returned are:

  • The column name, which is matched against the keyword in the search;

  • The column id, which we will use to refer to the database in other functions;

  • The column lastupdated, which gives the date when the data in the database was last updated.

This function has a single parameter, keyword, which must be a string.

For example, you can search for all the databases with the keyword global in their name:

from epigraphhub.data.worldbank import search_in_database

df_db = search_in_database('global')

df_db

If you use keyword = 'all', all the available databases will be returned.

After selecting a database, we can use the function search_in_indicators() to see what indicators we can get from this database.

Function search_in_indicators()

This function returns a DataFrame with the indicators matched by partial name. It accepts two parameters. The first, keyword, is a string that is matched against the names of the indicators in a specific database. The second, db, selects the database; it only accepts int values and must be filled with the id number of the database, which can be obtained with the function search_in_database().

If the db parameter is not filled, the function uses the default db = 2, which returns the list of indicators from the World Development Indicators database.

For example, to get the names of the indicators related to air pollution in db = 2, just type search_in_indicators('air pollution', db = 2); the returned DataFrame will be:

from epigraphhub.data.worldbank import search_in_indicators 

df_ind = search_in_indicators('air pollution', db = 2)

df_ind 
   id                 value
0  EN.ATM.PM25.MC.M3  PM2.5 air pollution, mean annual exposure (mic...
1  EN.ATM.PM25.MC.ZS  PM2.5 air pollution, population exposed to lev...
2  SH.STA.AIRP.FE.P5  Mortality rate attributed to household and amb...
3  SH.STA.AIRP.MA.P5  Mortality rate attributed to household and amb...
4  SH.STA.AIRP.P5     Mortality rate attributed to household and amb...

We will use the values in the id column to get the data for the indicators described in the corresponding value column. To get this data, we will use the function get_worldbank_data().

Function get_worldbank_data()

The function get_worldbank_data() returns a DataFrame with the indicators available in a given World Bank Data database.

This function has the following parameters:

  • ind: A list of strings, where each value is an indicator’s id. An indicator’s id can be obtained with the function search_in_indicators().

  • country: A list of strings, where each value is the ISO code of a country of interest.

  • db: An int value identifying the database from which the data is retrieved. You can obtain this value with the function search_in_database().

  • time: If time = 'all', the function returns all the data available. You can also specify a range of years. For example, to get the data for the years 2010 to 2020, use time = range(2010,2021).

  • columns: This parameter is used to rename the columns in the returned DataFrame. By default, the indicator columns are named after the values in ind. To rename them, provide a list of strings with the same length as the list in the parameter ind. The columns are renamed in order, so the first value in columns becomes the new name of the first value in ind.

For example, we can get the data for the first two indicators that we obtained in the last section for the countries Brazil and Switzerland. In this case ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS'], country = ['BRA', 'CHE'], db = 2 (the indicators referred to in ind are available in the database with id number 2).
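
Two details of the parameters above are easy to get wrong: Python's range() excludes its stop value, and the names in columns are paired with the values in ind by position. The sketch below checks both in plain Python, without calling the World Bank API. The names air_1 and air_2 and the variable rename_map are only illustrative; this is not how get_worldbank_data() is implemented internally.

```python
# Indicator ids as returned by search_in_indicators('air pollution', db = 2).
ind = ["EN.ATM.PM25.MC.M3", "EN.ATM.PM25.MC.ZS"]

# Illustrative new names, one per indicator, in the same order as ind.
columns = ["air_1", "air_2"]

# range() excludes its stop value, so the stop must be 2021 to include 2020.
time = range(2010, 2021)
years = list(time)

# The two lists must have the same length because names are paired by position.
if len(columns) != len(ind):
    raise ValueError("columns must have the same length as ind")

# Hypothetical mapping from each indicator id to its new column name.
rename_map = dict(zip(ind, columns))

print(years[0], years[-1], len(years))
print(rename_map)
```

If the stop value were 2020 instead of 2021, the last year requested would be 2019.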

Using these parameters the result will be:

from epigraphhub.data.worldbank import get_worldbank_data

ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']
country = ['BRA', 'CHE']

df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021))

df = df.sort_index()

df.head()
            country  en_atm_pm25_mc_m3  en_atm_pm25_mc_zs  frequency
date
2010-01-01      CHE          12.922220          93.000705     yearly
2010-01-01      BRA          15.955285          90.938123     yearly
2011-01-01      BRA          15.912798          91.928375     yearly
2011-01-01      CHE          13.049221          94.785235     yearly
2012-01-01      CHE          12.261388          91.820914     yearly

By default, the function converts the column names to lower case and replaces ‘.’ with ‘_’. If you would like to rename the columns to ‘air_1’ and ‘air_2’, for example, just add the parameter columns = ['air_1', 'air_2'], and the result will be:

from epigraphhub.data.worldbank import get_worldbank_data

ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']
country = ['BRA', 'CHE']

df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021), columns = ['air_1', 'air_2'])

df = df.sort_index()

df.head()
            country      air_1      air_2  frequency
date
2010-01-01      CHE  12.922220  93.000705     yearly
2010-01-01      BRA  15.955285  90.938123     yearly
2011-01-01      BRA  15.912798  91.928375     yearly
2011-01-01      CHE  13.049221  94.785235     yearly
2012-01-01      CHE  12.261388  91.820914     yearly
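
The default naming rule described above (lower case, with ‘.’ replaced by ‘_’) can be reproduced in plain Python to predict the column names before downloading anything. The helper default_column_name below is hypothetical and is not part of epigraphhub; it is only a sketch of the rule as stated in this section.

```python
def default_column_name(indicator_id: str) -> str:
    """Lower-case an indicator id and replace every '.' with '_'."""
    return indicator_id.lower().replace(".", "_")


ind = ["EN.ATM.PM25.MC.M3", "EN.ATM.PM25.MC.ZS"]

# Predicted default column names, in the same order as ind.
default_names = [default_column_name(i) for i in ind]

print(default_names)
```

The predicted names match the indicator columns shown in the first get_worldbank_data() example.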

Function get_pop_data()

This function gets the population data, stratified by age and sex, from the database with id number 2, called World Development Indicators. It has three parameters:

  • country: A string with the ISO code of the country you want to get the data for.

  • time: If time = 'all', the function returns all the data available. You can also specify a range of years. For example, to get the data for the years 2010 to 2020, use time = range(2010,2021).

  • fx_et: This parameter selects the stratification type of the population data. There are three possibilities:

      • If fx_et == '5Y', the population is returned by 5-year age groups.

      • If fx_et == 'IN', the population is returned divided into 3 age groups.

      • If fx_et == 'TOTL', the total population is returned, without considering age groups.

The function returns a pandas DataFrame.

In the cell below, you can see an example of how to get the population data divided into three age groups in Switzerland.

from epigraphhub.data.worldbank import get_pop_data

country = 'CHE'
time = range(2016,2022)
df_pop = get_pop_data(country, time , fx_et = 'IN')

df_pop
            pop_0014_fe  pop_0014_ma  pop_1564_fe  pop_1564_ma  pop_65up_fe  pop_65up_ma  pop_totl_fe  pop_totl_ma  frequency  country
2016-01-01     604573.0     637409.0    2775453.0    2829540.0     846466.0     679897.0    4226492.0    4146846.0     yearly      CHE
2017-01-01     612125.0     644879.0    2790916.0    2846187.0     860828.0     696905.0    4263869.0    4187971.0     yearly      CHE
2018-01-01     618492.0     651077.0    2801611.0    2857509.0     873428.0     712212.0    4293531.0    4220798.0     yearly      CHE
2019-01-01     624324.0     656882.0    2811358.0    2867308.0     886912.0     728496.0    4322594.0    4252686.0     yearly      CHE
2020-01-01     629609.0     662354.0    2819974.0    2875524.0     902457.0     746978.0    4352040.0    4284856.0     yearly      CHE
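
As a quick sanity check on the table above, the three age groups for each sex should add up to the corresponding pop_totl column, and the two pop_totl columns together give the total population. The sketch below does this in plain Python with the 2016 and 2020 rows copied by hand from the output; the helper names total_population and age_groups_match_total are hypothetical and are not part of epigraphhub.

```python
# Values copied from the 2016 and 2020 rows of the output above (country = 'CHE').
rows = {
    2016: {
        "pop_0014_fe": 604573.0, "pop_0014_ma": 637409.0,
        "pop_1564_fe": 2775453.0, "pop_1564_ma": 2829540.0,
        "pop_65up_fe": 846466.0, "pop_65up_ma": 679897.0,
        "pop_totl_fe": 4226492.0, "pop_totl_ma": 4146846.0,
    },
    2020: {
        "pop_0014_fe": 629609.0, "pop_0014_ma": 662354.0,
        "pop_1564_fe": 2819974.0, "pop_1564_ma": 2875524.0,
        "pop_65up_fe": 902457.0, "pop_65up_ma": 746978.0,
        "pop_totl_fe": 4352040.0, "pop_totl_ma": 4284856.0,
    },
}

# Age-group codes used in the column names when fx_et = 'IN'.
AGE_GROUPS = ("0014", "1564", "65up")


def total_population(row):
    """Add the female and male totals of one row."""
    return row["pop_totl_fe"] + row["pop_totl_ma"]


def age_groups_match_total(row, sex):
    """Check that the three age groups of one sex add up to its total."""
    group_sum = sum(row[f"pop_{group}_{sex}"] for group in AGE_GROUPS)
    return group_sum == row[f"pop_totl_{sex}"]


for year, row in rows.items():
    print(
        year,
        total_population(row),
        age_groups_match_total(row, "fe"),
        age_groups_match_total(row, "ma"),
    )
```

With a real DataFrame returned by get_pop_data(), the same check could be written with column arithmetic instead of dictionaries.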