{ "cells": [ { "cell_type": "markdown", "id": "d9d7701f-4828-45fe-aeef-27db97bd3b8a", "metadata": {}, "source": [ "## Preprocessing data (Python version)\n", "\n", "This notebook provides some examples of how the functions in the `preprocessing.py` module can be used. " ] }, { "cell_type": "code", "execution_count": 1, "id": "f9f19ba1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from epigraphhub.analysis.preprocessing import *" ] }, { "cell_type": "markdown", "id": "3c8d511a", "metadata": {}, "source": [ "The functions in the `preprocessing.py` module transform tabular data into the formats accepted by ML models (tabular data with lagged values) and by neural network models (3D arrays with multiple outputs).\n", "\n", "This tutorial uses the data stored at `./data/data_GE.csv`. This dataset represents the number of COVID-19 tests, cases, and hospitalizations reported in some cantons of Switzerland." ] }, { "cell_type": "code", "execution_count": 2, "id": "3307e375", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " test_FR diff_test_FR diff_2_test_FR test_NE diff_test_NE \\\n", "datum \n", "2020-03-01 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-02 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-03 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-04 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-05 0.0 0.0 0.0 0.0 0.0 \n", "\n", " diff_2_test_NE test_TI diff_test_TI diff_2_test_TI test_VD \\\n", "datum \n", "2020-03-01 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-02 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-03 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-04 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-05 0.0 0.0 0.0 0.0 0.0 \n", "\n", " ... hosp_NE diff_hosp_NE diff_2_hosp_NE hosp_FR \\\n", "datum ... \n", "2020-03-01 ... 0.142857 0.000000 0.000000 0.428571 \n", "2020-03-02 ... 0.285714 0.142857 0.142857 0.857143 \n", "2020-03-03 ... 0.428571 0.142857 0.285714 0.857143 \n", "2020-03-04 ... 0.285714 -0.142857 0.000000 0.714286 \n", "2020-03-05 ... 0.428571 0.142857 0.000000 1.000000 \n", "\n", " diff_hosp_FR diff_2_hosp_FR hosp_GE diff_hosp_GE \\\n", "datum \n", "2020-03-01 0.142857 0.285714 0.428571 0.000000 \n", "2020-03-02 0.428571 0.571429 0.428571 0.000000 \n", "2020-03-03 0.000000 0.428571 0.428571 0.000000 \n", "2020-03-04 -0.142857 -0.142857 0.571429 0.142857 \n", "2020-03-05 0.285714 0.142857 0.857143 0.285714 \n", "\n", " diff_2_hosp_GE vac_all \n", "datum \n", "2020-03-01 0.000000 0.0 \n", "2020-03-02 0.142857 0.0 \n", "2020-03-03 0.000000 0.0 \n", "2020-03-04 0.142857 0.0 \n", "2020-03-05 0.428571 0.0 \n", "\n", "[5 rows x 64 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./data/data_GE.csv')\n", "df.set_index('datum', inplace = True)\n", "df.index = pd.to_datetime(df.index)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "812c8484", "metadata": {}, "source": [ "The functions below were created to allow the application of a machine learning regressor model as a forecasting model, using past information as lagged columns (the features) and training multiple 
models, each specialized in predicting one specific day in the future. " ] }, { "cell_type": "markdown", "id": "56e4fa16", "metadata": {}, "source": [ "### Function `build_lagged_features()`\n", "\n", "This function takes as input a DataFrame and a number of lags (`maxlag`), and computes the lagged values of each column in a new DataFrame." ] }, { "cell_type": "code", "execution_count": 3, "id": "fef0e715", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " test_FR test_FR_lag1 test_FR_lag2 test_FR_lag3 diff_test_FR \\\n", "datum \n", "2020-03-04 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-05 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-06 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-07 0.0 0.0 0.0 0.0 0.0 \n", "2020-03-08 0.0 0.0 0.0 0.0 0.0 \n", "\n", " diff_test_FR_lag1 diff_test_FR_lag2 diff_test_FR_lag3 \\\n", "datum \n", "2020-03-04 0.0 0.0 0.0 \n", "2020-03-05 0.0 0.0 0.0 \n", "2020-03-06 0.0 0.0 0.0 \n", "2020-03-07 0.0 0.0 0.0 \n", "2020-03-08 0.0 0.0 0.0 \n", "\n", " diff_2_test_FR diff_2_test_FR_lag1 ... diff_hosp_GE_lag2 \\\n", "datum ... \n", "2020-03-04 0.0 0.0 ... 0.000000 \n", "2020-03-05 0.0 0.0 ... 0.000000 \n", "2020-03-06 0.0 0.0 ... 0.142857 \n", "2020-03-07 0.0 0.0 ... 0.285714 \n", "2020-03-08 0.0 0.0 ... 0.142857 \n", "\n", " diff_hosp_GE_lag3 diff_2_hosp_GE diff_2_hosp_GE_lag1 \\\n", "datum \n", "2020-03-04 0.000000 0.142857 0.000000 \n", "2020-03-05 0.000000 0.428571 0.142857 \n", "2020-03-06 0.000000 0.428571 0.428571 \n", "2020-03-07 0.142857 0.285714 0.428571 \n", "2020-03-08 0.285714 0.000000 0.285714 \n", "\n", " diff_2_hosp_GE_lag2 diff_2_hosp_GE_lag3 vac_all vac_all_lag1 \\\n", "datum \n", "2020-03-04 0.142857 0.000000 0.0 0.0 \n", "2020-03-05 0.000000 0.142857 0.0 0.0 \n", "2020-03-06 0.142857 0.000000 0.0 0.0 \n", "2020-03-07 0.428571 0.142857 0.0 0.0 \n", "2020-03-08 0.428571 0.428571 0.0 0.0 \n", "\n", " vac_all_lag2 vac_all_lag3 \n", "datum \n", "2020-03-04 0.0 0.0 \n", "2020-03-05 0.0 0.0 \n", "2020-03-06 0.0 0.0 \n", "2020-03-07 0.0 0.0 \n", "2020-03-08 0.0 0.0 \n", "\n", "[5 rows x 256 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lag = build_lagged_features(dt = df, maxlag = 3)\n", "\n", "df_lag.head()" ] }, { "cell_type": "markdown", "id": "a29c40e4", "metadata": {}, "source": [ "### Function `preprocess_data()`:\n", "\n", "This function made the same that `build_lagged_features()`. 
The difference is that this function allow the user to subset the dataframe returned by an initial and end date. " ] }, { "cell_type": "code", "execution_count": 4, "id": "0a2f9300", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " test_FR test_FR_lag1 test_FR_lag2 test_FR_lag3 \\\n", "datum \n", "2021-01-01 542.285714 550.857143 586.000000 650.428571 \n", "2021-01-02 536.571429 542.285714 550.857143 586.000000 \n", "2021-01-03 530.000000 536.571429 542.285714 550.857143 \n", "2021-01-04 552.285714 530.000000 536.571429 542.285714 \n", "2021-01-05 552.857143 552.285714 530.000000 536.571429 \n", "\n", " diff_test_FR diff_test_FR_lag1 diff_test_FR_lag2 \\\n", "datum \n", "2021-01-01 -8.571429 -35.142857 -64.428571 \n", "2021-01-02 -5.714286 -8.571429 -35.142857 \n", "2021-01-03 -6.571429 -5.714286 -8.571429 \n", "2021-01-04 22.285714 -6.571429 -5.714286 \n", "2021-01-05 0.571429 22.285714 -6.571429 \n", "\n", " diff_test_FR_lag3 diff_2_test_FR diff_2_test_FR_lag1 ... \\\n", "datum ... \n", "2021-01-01 -32.714286 -43.714286 -99.571429 ... \n", "2021-01-02 -64.428571 -14.285714 -43.714286 ... \n", "2021-01-03 -35.142857 -12.285714 -14.285714 ... \n", "2021-01-04 -8.571429 15.714286 -12.285714 ... \n", "2021-01-05 -5.714286 22.857143 15.714286 ... 
\n", "\n", " diff_hosp_GE_lag2 diff_hosp_GE_lag3 diff_2_hosp_GE \\\n", "datum \n", "2021-01-01 0.000000 0.285714 -0.142857 \n", "2021-01-02 -0.285714 0.000000 0.571429 \n", "2021-01-03 0.142857 -0.285714 0.571429 \n", "2021-01-04 0.428571 0.142857 0.000000 \n", "2021-01-05 0.142857 0.428571 -0.428571 \n", "\n", " diff_2_hosp_GE_lag1 diff_2_hosp_GE_lag2 diff_2_hosp_GE_lag3 \\\n", "datum \n", "2021-01-01 -0.285714 0.285714 0.714286 \n", "2021-01-02 -0.142857 -0.285714 0.285714 \n", "2021-01-03 0.571429 -0.142857 -0.285714 \n", "2021-01-04 0.571429 0.571429 -0.142857 \n", "2021-01-05 0.000000 0.571429 0.571429 \n", "\n", " vac_all vac_all_lag1 vac_all_lag2 vac_all_lag3 \n", "datum \n", "2021-01-01 0.044286 0.035714 0.027143 0.017143 \n", "2021-01-02 0.052857 0.044286 0.035714 0.027143 \n", "2021-01-03 0.061429 0.052857 0.044286 0.035714 \n", "2021-01-04 0.074286 0.061429 0.052857 0.044286 \n", "2021-01-05 0.095714 0.074286 0.061429 0.052857 \n", "\n", "[5 rows x 256 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lag = preprocess_data(data = df, maxlag = 3, ini_date = '2021-01-01' , end_date = '2021-05-01')\n", "\n", "df_lag.head()" ] }, { "cell_type": "markdown", "id": "2f076371", "metadata": {}, "source": [ "### Function `get_targets()`\n", "\n", "This function transforms a target series (`pd.Series`) into a dictionary of shifted copies of that series. Each key `k`, from 1 to `predict_n`, maps to the series shifted so that every date holds the value observed `k` days later. This makes it possible, for example, to train multiple ML regression models, each one forecasting the target a different number of days ahead. " ] }, { "cell_type": "code", "execution_count": 5, "id": "6e691f0b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{1: datum\n", " 2020-03-01 0.428571\n", " 2020-03-02 0.428571\n", " 2020-03-03 0.571429\n", " 2020-03-04 0.857143\n", " 2020-03-05 1.000000\n", " ... 
\n", " 2022-08-25 2.571429\n", " 2022-08-26 2.571429\n", " 2022-08-27 2.142857\n", " 2022-08-28 1.857143\n", " 2022-08-29 1.428571\n", " Name: hosp_GE, Length: 912, dtype: float64,\n", " 2: datum\n", " 2020-03-01 0.428571\n", " 2020-03-02 0.571429\n", " 2020-03-03 0.857143\n", " 2020-03-04 1.000000\n", " 2020-03-05 1.142857\n", " ... \n", " 2022-08-24 2.571429\n", " 2022-08-25 2.571429\n", " 2022-08-26 2.142857\n", " 2022-08-27 1.857143\n", " 2022-08-28 1.428571\n", " Name: hosp_GE, Length: 911, dtype: float64,\n", " 3: datum\n", " 2020-03-01 0.571429\n", " 2020-03-02 0.857143\n", " 2020-03-03 1.000000\n", " 2020-03-04 1.142857\n", " 2020-03-05 1.000000\n", " ... \n", " 2022-08-23 2.571429\n", " 2022-08-24 2.571429\n", " 2022-08-25 2.142857\n", " 2022-08-26 1.857143\n", " 2022-08-27 1.428571\n", " Name: hosp_GE, Length: 910, dtype: float64,\n", " 4: datum\n", " 2020-03-01 0.857143\n", " 2020-03-02 1.000000\n", " 2020-03-03 1.142857\n", " 2020-03-04 1.000000\n", " 2020-03-05 1.571429\n", " ... \n", " 2022-08-22 2.571429\n", " 2022-08-23 2.571429\n", " 2022-08-24 2.142857\n", " 2022-08-25 1.857143\n", " 2022-08-26 1.428571\n", " Name: hosp_GE, Length: 909, dtype: float64}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict_target = get_targets(target = df['hosp_GE'],predict_n = 4)\n", "\n", "dict_target" ] }, { "cell_type": "markdown", "id": "81ebe3ca", "metadata": {}, "source": [ "### Function `get_next_n_days()`: \n", "\n", "This function takes as input a date string (`ini_date`) and a number of days (`next_days`), and returns a list with that many dates following the given date. 
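
As a rough illustration of the behaviour described above, the sketch below builds the same kind of list with the standard library only. The helper name `next_n_days_sketch` is hypothetical and is not part of `epigraphhub`. It assumes that the returned dates are consecutive calendar days starting on the day after `ini_date`, which matches the output shown in the next cell, but the library implementation may differ.

```python
from datetime import datetime, timedelta


def next_n_days_sketch(ini_date, next_days):
    # Hypothetical stand-in for illustration only, not the library implementation.
    # Parse the date string (format YYYY-MM-DD) into a datetime object.
    start = datetime.strptime(ini_date, '%Y-%m-%d')
    # Collect the next_days consecutive dates that follow the start date.
    return [start + timedelta(days=i) for i in range(1, next_days + 1)]


dates = next_n_days_sketch('2021-01-01', 4)
print(dates[0], dates[-1])
```

A list like this is convenient as the date index for a multi-day forecast that starts right after the last observed date.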
" ] }, { "cell_type": "code", "execution_count": 6, "id": "97700803", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[datetime.datetime(2021, 1, 2, 0, 0),\n", " datetime.datetime(2021, 1, 3, 0, 0),\n", " datetime.datetime(2021, 1, 4, 0, 0),\n", " datetime.datetime(2021, 1, 5, 0, 0)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next_dates = get_next_n_days(ini_date='2021-01-01', next_days = 4)\n", "next_dates" ] }, { "cell_type": "markdown", "id": "b1ddcb99", "metadata": {}, "source": [ "The functions below were created to allow the application of neural network models as forecasting models. The function transform tabular data into array data, which this class of models accepts. " ] }, { "cell_type": "markdown", "id": "248ecaf7", "metadata": {}, "source": [ "### Function `lstm_split_data()`: \n", "\n", "This function split the data into training and test sets. It takes as inputs a DataFrame, the number of days that it will be predicted, the number of past days used in the prediction, the position of the target column, and the fraction of the data used as train and test datasets." ] }, { "cell_type": "code", "execution_count": 7, "id": "3a64154a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(723, 4, 64)\n", "(723, 4)\n", "(183, 4, 64)\n", "(183, 4)\n" ] } ], "source": [ "\n", "X_train, Y_train, X_test, Y_test = lstm_split_data(df = df,\n", " look_back = 4,\n", " predict_n = 4, \n", " ratio = 0.8,\n", " Y_column = df.columns.get_loc(\"hosp_GE\"))\n", "\n", "print(X_train.shape)\n", "print(Y_train.shape)\n", "print(X_test.shape)\n", "print(Y_test.shape)" ] }, { "cell_type": "markdown", "id": "5fdb3be4", "metadata": {}, "source": [ "### Function `normalize_data()`: \n", "\n", "This function takes as input a DataFrame and normalizes the columns based on the maximum value. 
" ] }, { "cell_type": "code", "execution_count": 8, "id": "4265931d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " test_FR diff_test_FR diff_2_test_FR test_NE diff_test_NE \\\n", "0 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "1 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "3 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "4 0.000000 0.000000 0.000000 0.000000 0.000000 \n", ".. ... ... ... ... ... \n", "908 0.046436 -0.002678 -0.025352 0.053610 0.005452 \n", "909 0.046236 -0.002678 -0.004024 0.053666 0.000779 \n", "910 0.044516 -0.023032 -0.019316 0.052940 -0.010125 \n", "911 0.042517 -0.026781 -0.037425 0.050874 -0.028816 \n", "912 0.034677 -0.025174 -0.079276 0.041883 -0.034268 \n", "\n", " diff_2_test_NE test_TI diff_test_TI diff_2_test_TI test_VD ... \\\n", "0 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "1 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "2 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "3 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "4 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n", ".. ... ... ... ... ... ... \n", "908 0.013990 0.096719 -0.017641 -0.023144 0.069339 ... \n", "909 0.004145 0.096447 -0.002614 -0.014947 0.069339 ... \n", "910 -0.006218 0.096991 0.005227 0.001929 0.068820 ... \n", "911 -0.025907 0.085841 -0.107155 -0.075217 0.065681 ... \n", "912 -0.082902 0.069862 -0.095394 -0.138862 0.053947 ... \n", "\n", " hosp_NE diff_hosp_NE diff_2_hosp_NE hosp_FR diff_hosp_FR \\\n", "0 0.009009 0.0000 0.00 0.024793 0.066667 \n", "1 0.018018 0.0625 0.05 0.049587 0.200000 \n", "2 0.027027 0.0625 0.10 0.049587 0.000000 \n", "3 0.018018 -0.0625 0.00 0.041322 -0.066667 \n", "4 0.027027 0.0625 0.00 0.057851 0.133333 \n", ".. ... ... ... ... ... 
\n", "908 0.000000 0.0000 0.00 0.008264 -0.200000 \n", "909 0.000000 0.0000 0.00 0.008264 0.000000 \n", "910 0.000000 0.0000 0.00 0.008264 0.000000 \n", "911 0.000000 0.0000 0.00 0.008264 0.000000 \n", "912 0.000000 0.0000 0.00 0.008264 0.000000 \n", "\n", " diff_2_hosp_FR hosp_GE diff_hosp_GE diff_2_hosp_GE vac_all \n", "0 0.090909 0.014634 0.000000 0.000000 0.0 \n", "1 0.181818 0.014634 0.000000 0.032258 0.0 \n", "2 0.136364 0.014634 0.000000 0.000000 0.0 \n", "3 -0.045455 0.019512 0.043478 0.032258 0.0 \n", "4 0.045455 0.029268 0.086957 0.096774 0.0 \n", ".. ... ... ... ... ... \n", "908 -0.136364 0.087805 -0.043478 0.032258 1.0 \n", "909 -0.136364 0.087805 0.000000 -0.032258 1.0 \n", "910 0.000000 0.073171 -0.130435 -0.096774 1.0 \n", "911 0.000000 0.063415 -0.086957 -0.161290 1.0 \n", "912 0.000000 0.048780 -0.130435 -0.161290 1.0 \n", "\n", "[913 rows x 64 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_n, max_values = normalize_data(df)\n", "\n", "df_n" ] }, { "cell_type": "code", "execution_count": 9, "id": "4767c9b2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "test_FR 3571.714286\n", "diff_test_FR 266.714286\n", "diff_2_test_FR 355.000000\n", "test_NE 2558.142857\n", "diff_test_NE 183.428571\n", " ... 
\n", "diff_2_hosp_FR 3.142857\n", "hosp_GE 29.285714\n", "diff_hosp_GE 3.285714\n", "diff_2_hosp_GE 3.571429\n", "vac_all 182.800000\n", "Length: 64, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max_values" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }