{ "cells": [ { "cell_type": "markdown", "id": "6770f85f", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Downloading data from World Bank Data (Python version)\n", "\n", "This section will explain how to use the functions in the `worldbank` module from the `epigraphhub` package to download the data hosted in the world bank data platform. \n", "\n", "All the functions created in this file were created based on the implementation of the package [**wbgapi**](https://pypi.org/project/wbgapi/). \n", "\n", "### Function `search_in_database()`\n", "\n", "This function allows the user to search, using a keyword the name of a database hosted in the world bank data. The function will search over all the databases and return the matched values. The return of this function is a pandas DataFrame with some information about the databases found in the search. \n", " \n", "The most important columns of the DataFrame returned are:\n", "\n", "* The column `name`, that is used in the search to match with the keyword; \n", "\n", "* The column `id` that we will use to refer to the database in other functions;\n", "\n", "* The column `lastupdated` that returns when was the last time that the data in the database was updated. \n", "\n", "This function has only one parameter named `keyword` and must be a string. \n", "\n", "For example, you can search over all the databases with the keyword `global` in the name. In this case, the return will be: \n" ] }, { "cell_type": "code", "execution_count": 1, "id": "b08de47c", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idlastupdatednamecodedatabiddescriptionurldataavailabilitymetadataavailabilityconcepts
0152020-07-27Global Economic MonitorGEM1179YY3
1272022-01-11Global Economic ProspectsGEP999YN3
2282018-10-15Global Financial InclusionFDX1228YY3
3322021-12-03Global Financial DevelopmentGFD1250YN3
4342013-04-12Global Partnership for EducationGPE1247YY3
5692019-06-27Global Financial Inclusion and Consumer Protec...RFA3703YY3
6732019-06-27Global Financial Inclusion and Consumer Protec...RFI3704YY3
7822021-03-24Global Public ProcurementGPP3724YN3
8862021-09-24Global Jobs Indicators Database (JOIN)JON3735YY3
\n", "
" ], "text/plain": [ " id lastupdated name code \\\n", "0 15 2020-07-27 Global Economic Monitor GEM \n", "1 27 2022-01-11 Global Economic Prospects GEP \n", "2 28 2018-10-15 Global Financial Inclusion FDX \n", "3 32 2021-12-03 Global Financial Development GFD \n", "4 34 2013-04-12 Global Partnership for Education GPE \n", "5 69 2019-06-27 Global Financial Inclusion and Consumer Protec... RFA \n", "6 73 2019-06-27 Global Financial Inclusion and Consumer Protec... RFI \n", "7 82 2021-03-24 Global Public Procurement GPP \n", "8 86 2021-09-24 Global Jobs Indicators Database (JOIN) JON \n", "\n", " databid description url dataavailability metadataavailability concepts \n", "0 1179 Y Y 3 \n", "1 999 Y N 3 \n", "2 1228 Y Y 3 \n", "3 1250 Y N 3 \n", "4 1247 Y Y 3 \n", "5 3703 Y Y 3 \n", "6 3704 Y Y 3 \n", "7 3724 Y N 3 \n", "8 3735 Y Y 3 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from epigraphhub.data.worldbank import search_in_database\n", "\n", "df_db = search_in_database('global')\n", "\n", "df_db" ] }, { "cell_type": "markdown", "id": "13cbd132", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "If you use the `keyword = all` all the available databases will be returned. \n", "\n", "After selecting a database, we can use the function `search_in_indicators()` to see what indicators we can get from this database. \n", "\n", "### Function `search_in_indicators()`\n", "\n", "This function returns a DataFrame with the indicators matched by partial name. Accept two parameters: the first is `keyword`, which should be a string used to search combinations between the keyword and the indicator's name in a specific database; the second parameter is related with the database, it's called `db`. This parameter only accepts **int** values as input. It must be filled with the **id** number of the database, which can be obtained with the function `search_in_database`. \n", "\n", "If the `db` parameter is not filled, the function assumes as default `db = 2`. In this configuration, the list of indicators from the database **World Development Indicators** is returned.\n", "\n", "For example, to get the name of the indicators related to `air pollution` in the `db = 2`, just type `search_in_indicators('air pollution', db = 2)` and the returned data frame will be: " ] }, { "cell_type": "code", "execution_count": 2, "id": "cb544c55", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idvalue
0EN.ATM.PM25.MC.M3PM2.5 air pollution, mean annual exposure (mic...
1EN.ATM.PM25.MC.ZSPM2.5 air pollution, population exposed to lev...
2SH.STA.AIRP.FE.P5Mortality rate attributed to household and amb...
3SH.STA.AIRP.MA.P5Mortality rate attributed to household and amb...
4SH.STA.AIRP.P5Mortality rate attributed to household and amb...
\n", "
" ], "text/plain": [ " id value\n", "0 EN.ATM.PM25.MC.M3 PM2.5 air pollution, mean annual exposure (mic...\n", "1 EN.ATM.PM25.MC.ZS PM2.5 air pollution, population exposed to lev...\n", "2 SH.STA.AIRP.FE.P5 Mortality rate attributed to household and amb...\n", "3 SH.STA.AIRP.MA.P5 Mortality rate attributed to household and amb...\n", "4 SH.STA.AIRP.P5 Mortality rate attributed to household and amb..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from epigraphhub.data.worldbank import search_in_indicators \n", "\n", "df_ind = search_in_indicators('air pollution', db = 2)\n", "\n", "df_ind " ] }, { "cell_type": "markdown", "id": "9c9142d0", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "We will use the `id` column values to get the data for the indicators described in the `value` cell associated with the `id` column. To get this data we will use the function `get_worldbank_data()`. \n" ] }, { "cell_type": "markdown", "id": "42b01c33", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Function `get_worldbank_data()`\n", "\n", "The function `get_worldbank_data` returns a DataFrame with indicators available in some database of the world bank data. \n", " \n", " This function has the following parameters: \n", "\n", " * `ind` : This parameter must be filled with a list of strings where each value in the list should be filled with an indicator's id value. An indicator's `id` value can be obtained with the function `search_in_indicators()`. \n", "\n", " * `country`: This parameter must be filled with a list of strings where each value in the list should be filled with the ISO-CODE of each interest country. \n", "\n", " * `db`: This parameter should be filled with an int value representing the database where the data is being captured. You can obtain this value with the function `search_in_database()`.\n", "\n", " * `time`: If filled ` time = 'all'`, the function will return all the data available. You can also specify a range of years. For example, if you want to get the data for the period between the years 2010 and 2020, you can fill this parameter with `time = range(2010,2021)`.\n", "\n", " * `columns`: This parameter will be used to rename the columns in the DataFrame returned. By default, the columns of the indicators will be named using the `ind` name. To rename the columns, you should provide a list of strings with the same length of the list int the parameter `ind`. Also, observe that the columns will be renamed respecting the order of the list. So, the first value in `columns` will be used as the new name of the first value in `ind`. \n", "\n", "For example, we can get the data for the two first indicators that we obtained in the last section for the countries Brazil and Switzerland. In this case `ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']`, `country = ['BRA', 'CHE']`, `db = 2` (The indicators refered in `ind` are available in the dabatase refered by the number 2).\n", "\n", "Using these parameters the result will be: \n" ] }, { "cell_type": "code", "execution_count": 10, "id": "e7ce0305", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/eduardoaraujo/mambaforge/envs/swiss_covid19/lib/python3.10/site-packages/wbgapi/data.py:327: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.\n", " dummy = pd.Series() # empty series - never assigned actual values\n", "/Users/eduardoaraujo/Documents/GitHub/epigraphhub_py/epigraphhub/data/worldbank.py:241: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", " df.columns = df.columns.str.lower().str.replace(\".\", \"_\")\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryen_atm_pm25_mc_m3en_atm_pm25_mc_zsfrequency
date
2010-01-01CHE12.92222093.000705yearly
2010-01-01BRA15.95528590.938123yearly
2011-01-01BRA15.91279891.928375yearly
2011-01-01CHE13.04922194.785235yearly
2012-01-01CHE12.26138891.820914yearly
\n", "
" ], "text/plain": [ " country en_atm_pm25_mc_m3 en_atm_pm25_mc_zs frequency\n", "date \n", "2010-01-01 CHE 12.922220 93.000705 yearly\n", "2010-01-01 BRA 15.955285 90.938123 yearly\n", "2011-01-01 BRA 15.912798 91.928375 yearly\n", "2011-01-01 CHE 13.049221 94.785235 yearly\n", "2012-01-01 CHE 12.261388 91.820914 yearly" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from epigraphhub.data.worldbank import get_worldbank_data\n", "\n", "ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']\n", "country = ['BRA', 'CHE']\n", "\n", "df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021))\n", "\n", "df = df.sort_index()\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5623c1f7", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "By default, the function will transform all the upper cases in the column's name to lower case and replace '.' with '_'. If you would like to rename the columns with the names 'air_1' and 'air_2', for example, just add the parameter `columns = ['air_1', 'air_2'] `, and the result will be: " ] }, { "cell_type": "code", "execution_count": 11, "id": "41a82f54", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/eduardoaraujo/mambaforge/envs/swiss_covid19/lib/python3.10/site-packages/wbgapi/data.py:327: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.\n", " dummy = pd.Series() # empty series - never assigned actual values\n", "/Users/eduardoaraujo/Documents/GitHub/epigraphhub_py/epigraphhub/data/worldbank.py:232: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", " df.columns = df.columns.str.lower().str.replace(\".\", \"_\")\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryair_1air_2frequency
date
2010-01-01CHE12.92222093.000705yearly
2010-01-01BRA15.95528590.938123yearly
2011-01-01BRA15.91279891.928375yearly
2011-01-01CHE13.04922194.785235yearly
2012-01-01CHE12.26138891.820914yearly
\n", "
" ], "text/plain": [ " country air_1 air_2 frequency\n", "date \n", "2010-01-01 CHE 12.922220 93.000705 yearly\n", "2010-01-01 BRA 15.955285 90.938123 yearly\n", "2011-01-01 BRA 15.912798 91.928375 yearly\n", "2011-01-01 CHE 13.049221 94.785235 yearly\n", "2012-01-01 CHE 12.261388 91.820914 yearly" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from epigraphhub.data.worldbank import get_worldbank_data\n", "\n", "ind = ['EN.ATM.PM25.MC.M3', 'EN.ATM.PM25.MC.ZS']\n", "country = ['BRA', 'CHE']\n", "\n", "df = get_worldbank_data(ind, country, db= 2, time = range(2010, 2021), columns = ['air_1', 'air_2'])\n", "\n", "df = df.sort_index()\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "id": "86a41235", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Function `get_pop_data()`\n", "\n", "This function get the population data, stratified by age and sex, from the database with the id number equal to two. This database is called **World Development Indicators**. This function has three parameters: \n", "\n", "\n", "* `country`: It must be filled with a string with the ISO-CODE of the country which you want to get the data from. \n", "\n", "* `time`: If filled `time = 'all'`, the function will return all the data available. You can also specify a range of years. For example, if you want to get the data for the period between the years 2010 and 2020, you can fill this parameter with `time = range(2010,2021)`.\n", "\n", "* `fx_et`: This parameter selects the stratification type in the population data. There are three different possibilities: \n", "\n", "- If `fx_et == '5Y'`, it will be returned the population by 5-year age groups.\n", "\n", "- If `fx_et == 'IN'`, it will be return the population divided in 3 age groups.\n", "\n", "- If `fx_et == 'TOTL'`, it will be returned the total population without considering the age groups.\n", "\n", "The return of the function is a pandas DataFrame. \n", "\n", "In the cell below, you can see an example of how to get the population data divided into three age groups in Switzerland. \n" ] }, { "cell_type": "code", "execution_count": 12, "id": "50d1348c", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/eduardoaraujo/mambaforge/envs/swiss_covid19/lib/python3.10/site-packages/wbgapi/data.py:327: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.\n", " dummy = pd.Series() # empty series - never assigned actual values\n", "/Users/eduardoaraujo/Documents/GitHub/epigraphhub_py/epigraphhub/data/worldbank.py:89: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", " df.columns = ((df.columns.str.lower()).str.replace(\".\", \"_\")).str[3:-3]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pop_0014_fepop_0014_mapop_1564_fepop_1564_mapop_65up_fepop_65up_mapop_totl_fepop_totl_mafrequencycountry
2016-01-01604573.0637409.02775453.02829540.0846466.0679897.04226492.04146846.0yearlyCHE
2017-01-01612125.0644879.02790916.02846187.0860828.0696905.04263869.04187971.0yearlyCHE
2018-01-01618492.0651077.02801611.02857509.0873428.0712212.04293531.04220798.0yearlyCHE
2019-01-01624324.0656882.02811358.02867308.0886912.0728496.04322594.04252686.0yearlyCHE
2020-01-01629609.0662354.02819974.02875524.0902457.0746978.04352040.04284856.0yearlyCHE
\n", "
" ], "text/plain": [ " pop_0014_fe pop_0014_ma pop_1564_fe pop_1564_ma pop_65up_fe \\\n", "2016-01-01 604573.0 637409.0 2775453.0 2829540.0 846466.0 \n", "2017-01-01 612125.0 644879.0 2790916.0 2846187.0 860828.0 \n", "2018-01-01 618492.0 651077.0 2801611.0 2857509.0 873428.0 \n", "2019-01-01 624324.0 656882.0 2811358.0 2867308.0 886912.0 \n", "2020-01-01 629609.0 662354.0 2819974.0 2875524.0 902457.0 \n", "\n", " pop_65up_ma pop_totl_fe pop_totl_ma frequency country \n", "2016-01-01 679897.0 4226492.0 4146846.0 yearly CHE \n", "2017-01-01 696905.0 4263869.0 4187971.0 yearly CHE \n", "2018-01-01 712212.0 4293531.0 4220798.0 yearly CHE \n", "2019-01-01 728496.0 4322594.0 4252686.0 yearly CHE \n", "2020-01-01 746978.0 4352040.0 4284856.0 yearly CHE " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from epigraphhub.data.worldbank import get_pop_data\n", "\n", "country = 'CHE'\n", "time = range(2016,2022)\n", "df_pop = get_pop_data(country, time , fx_et = 'IN')\n", "\n", "df_pop" ] }, { "cell_type": "code", "execution_count": null, "id": "c4e24f97", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "f9e3a44f2f7108c4b7beba943bd42895a37b8963dbda2768ecd4cf1430c6d52e" } } }, "nbformat": 4, "nbformat_minor": 5 }