Preprocessing data (Python version)

This notebook provides some examples of how the functions in the preprocessing.py module can be used.

import pandas as pd
from epigraphhub.analysis.preprocessing import *

The functions in the preprocessing.py module transform tabular data into formats accepted by ML models (tabular data with lagged values) and by neural network models (3D arrays with multiple outputs).

This tutorial uses the data saved at ./data/data_GE.csv. The dataset contains the number of COVID-19 tests, cases, and hospitalizations reported in some cantons of Switzerland.

df = pd.read_csv('./data/data_GE.csv')
df.set_index('datum', inplace = True)
df.index = pd.to_datetime(df.index)
df.head()
test_FR diff_test_FR diff_2_test_FR test_NE diff_test_NE diff_2_test_NE test_TI diff_test_TI diff_2_test_TI test_VD ... hosp_NE diff_hosp_NE diff_2_hosp_NE hosp_FR diff_hosp_FR diff_2_hosp_FR hosp_GE diff_hosp_GE diff_2_hosp_GE vac_all
datum
2020-03-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.142857 0.000000 0.000000 0.428571 0.142857 0.285714 0.428571 0.000000 0.000000 0.0
2020-03-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.285714 0.142857 0.142857 0.857143 0.428571 0.571429 0.428571 0.000000 0.142857 0.0
2020-03-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.428571 0.142857 0.285714 0.857143 0.000000 0.428571 0.428571 0.000000 0.000000 0.0
2020-03-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.285714 -0.142857 0.000000 0.714286 -0.142857 -0.142857 0.571429 0.142857 0.142857 0.0
2020-03-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.428571 0.142857 0.000000 1.000000 0.285714 0.142857 0.857143 0.285714 0.428571 0.0

5 rows × 64 columns

The functions below make it possible to use a machine learning regression model for forecasting: past observations become lagged columns (the features), and multiple models are trained, each one specialized in predicting one specific day in the future.

Function build_lagged_features():

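This function takes as input a DataFrame and a maximum number of lags, and computes the lagged values of each column in a new DataFrame. With maxlag = 3, each of the 64 original columns gains three lagged copies (suffixes _lag1 to _lag3), giving 256 columns, and the output starts three days later, on 2020-03-04.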

df_lag = build_lagged_features(dt = df, maxlag = 3)

df_lag.head()
test_FR test_FR_lag1 test_FR_lag2 test_FR_lag3 diff_test_FR diff_test_FR_lag1 diff_test_FR_lag2 diff_test_FR_lag3 diff_2_test_FR diff_2_test_FR_lag1 ... diff_hosp_GE_lag2 diff_hosp_GE_lag3 diff_2_hosp_GE diff_2_hosp_GE_lag1 diff_2_hosp_GE_lag2 diff_2_hosp_GE_lag3 vac_all vac_all_lag1 vac_all_lag2 vac_all_lag3
datum
2020-03-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.142857 0.000000 0.142857 0.000000 0.0 0.0 0.0 0.0
2020-03-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.428571 0.142857 0.000000 0.142857 0.0 0.0 0.0 0.0
2020-03-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.142857 0.000000 0.428571 0.428571 0.142857 0.000000 0.0 0.0 0.0 0.0
2020-03-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.285714 0.142857 0.285714 0.428571 0.428571 0.142857 0.0 0.0 0.0 0.0
2020-03-08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.142857 0.285714 0.000000 0.285714 0.428571 0.428571 0.0 0.0 0.0 0.0

5 rows × 256 columns
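
To make the lagging step concrete, here is a minimal sketch in plain pandas. It is not the epigraphhub implementation: the helper name lagged_features_sketch and the toy column "cases" are hypothetical, and the sketch assumes that rows without a complete history are dropped, which is consistent with the output above starting on 2020-03-04.

```python
import pandas as pd


def lagged_features_sketch(data: pd.DataFrame, maxlag: int) -> pd.DataFrame:
    """Illustrative only: add lag-1..maxlag copies of every column."""
    parts = []
    for col in data.columns:
        parts.append(data[col])
        for lag in range(1, maxlag + 1):
            # shift(lag) moves values forward in time, so the row for day t
            # holds the value observed on day t - lag.
            parts.append(data[col].shift(lag).rename(f"{col}_lag{lag}"))
    # The first `maxlag` rows contain NaN in the lagged columns; drop them.
    return pd.concat(parts, axis=1).dropna()


toy = pd.DataFrame(
    {"cases": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
)
toy_lag = lagged_features_sketch(toy, maxlag=3)
print(toy_lag)
```

Each row of toy_lag holds the current value followed by the values from one, two, and three days earlier, which is the layout a tabular regressor expects.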

Function preprocess_data():

This function does the same as build_lagged_features(), but it also lets the user restrict the returned DataFrame to the period between an initial date and an end date.

df_lag = preprocess_data(data = df, maxlag = 3, ini_date = '2021-01-01' , end_date = '2021-05-01')

df_lag.head()
test_FR test_FR_lag1 test_FR_lag2 test_FR_lag3 diff_test_FR diff_test_FR_lag1 diff_test_FR_lag2 diff_test_FR_lag3 diff_2_test_FR diff_2_test_FR_lag1 ... diff_hosp_GE_lag2 diff_hosp_GE_lag3 diff_2_hosp_GE diff_2_hosp_GE_lag1 diff_2_hosp_GE_lag2 diff_2_hosp_GE_lag3 vac_all vac_all_lag1 vac_all_lag2 vac_all_lag3
datum
2021-01-01 542.285714 550.857143 586.000000 650.428571 -8.571429 -35.142857 -64.428571 -32.714286 -43.714286 -99.571429 ... 0.000000 0.285714 -0.142857 -0.285714 0.285714 0.714286 0.044286 0.035714 0.027143 0.017143
2021-01-02 536.571429 542.285714 550.857143 586.000000 -5.714286 -8.571429 -35.142857 -64.428571 -14.285714 -43.714286 ... -0.285714 0.000000 0.571429 -0.142857 -0.285714 0.285714 0.052857 0.044286 0.035714 0.027143
2021-01-03 530.000000 536.571429 542.285714 550.857143 -6.571429 -5.714286 -8.571429 -35.142857 -12.285714 -14.285714 ... 0.142857 -0.285714 0.571429 0.571429 -0.142857 -0.285714 0.061429 0.052857 0.044286 0.035714
2021-01-04 552.285714 530.000000 536.571429 542.285714 22.285714 -6.571429 -5.714286 -8.571429 15.714286 -12.285714 ... 0.428571 0.142857 0.000000 0.571429 0.571429 -0.142857 0.074286 0.061429 0.052857 0.044286
2021-01-05 552.857143 552.285714 530.000000 536.571429 0.571429 22.285714 -6.571429 -5.714286 22.857143 15.714286 ... 0.142857 0.428571 -0.428571 0.000000 0.571429 0.571429 0.095714 0.074286 0.061429 0.052857

5 rows × 256 columns
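
The output above starts on 2021-01-01 with all lag columns already populated, which suggests that the lags are computed on the full series before the date filter is applied. A minimal sketch of that order of operations, using plain pandas on a hypothetical toy frame (this is an assumption about the behaviour, not the library's code):

```python
import pandas as pd

idx = pd.date_range("2020-12-30", periods=6, freq="D")
toy = pd.DataFrame({"cases": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

# Step 1: build the lagged column on the full history.
toy["cases_lag1"] = toy["cases"].shift(1)

# Step 2: drop incomplete rows, then slice by date labels.
# Label slicing on a DatetimeIndex includes both end points.
subset = toy.dropna().loc["2021-01-01":"2021-01-03"]
print(subset)
```

Filtering first and lagging afterwards would instead lose the first rows of the selected period, because their lagged values would fall outside the subset.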

Function get_targets():

This function transforms a target series (pd.Series) into a dictionary of shifted copies of that series. The entry with key d holds, for each date, the value of the target d days later, so one ML regression model can be trained per forecast horizon and together the models forecast the curve of the target series.

dict_target = get_targets(target = df['hosp_GE'],predict_n = 4)

dict_target
{1: datum
 2020-03-01    0.428571
 2020-03-02    0.428571
 2020-03-03    0.571429
 2020-03-04    0.857143
 2020-03-05    1.000000
                 ...   
 2022-08-25    2.571429
 2022-08-26    2.571429
 2022-08-27    2.142857
 2022-08-28    1.857143
 2022-08-29    1.428571
 Name: hosp_GE, Length: 912, dtype: float64,
 2: datum
 2020-03-01    0.428571
 2020-03-02    0.571429
 2020-03-03    0.857143
 2020-03-04    1.000000
 2020-03-05    1.142857
                 ...   
 2022-08-24    2.571429
 2022-08-25    2.571429
 2022-08-26    2.142857
 2022-08-27    1.857143
 2022-08-28    1.428571
 Name: hosp_GE, Length: 911, dtype: float64,
 3: datum
 2020-03-01    0.571429
 2020-03-02    0.857143
 2020-03-03    1.000000
 2020-03-04    1.142857
 2020-03-05    1.000000
                 ...   
 2022-08-23    2.571429
 2022-08-24    2.571429
 2022-08-25    2.142857
 2022-08-26    1.857143
 2022-08-27    1.428571
 Name: hosp_GE, Length: 910, dtype: float64,
 4: datum
 2020-03-01    0.857143
 2020-03-02    1.000000
 2020-03-03    1.142857
 2020-03-04    1.000000
 2020-03-05    1.571429
                 ...   
 2022-08-22    2.571429
 2022-08-23    2.571429
 2022-08-24    2.142857
 2022-08-25    1.857143
 2022-08-26    1.428571
 Name: hosp_GE, Length: 909, dtype: float64}
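
The shifting shown in the output can be sketched with pandas as follows. The helper name shifted_targets_sketch is hypothetical and the code is an illustration under the assumption that the trailing rows without a future value are dropped, which matches the decreasing lengths (912, 911, 910, 909) printed above.

```python
import pandas as pd


def shifted_targets_sketch(target: pd.Series, predict_n: int) -> dict:
    """Illustrative only: key d maps each date to the value d days later."""
    # shift(-d) pulls future values back by d rows; the last d rows have no
    # future value, become NaN, and are removed by dropna().
    return {d: target.shift(-d).dropna() for d in range(1, predict_n + 1)}


series = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.date_range("2020-03-01", periods=6, freq="D"),
    name="hosp",
)
targets = shifted_targets_sketch(series, predict_n=2)
print(targets[1])
print(targets[2])
```

A model trained on the features of day t with targets[d] as the label learns to predict the value at day t + d.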

Function get_next_n_days():

This function takes as input a date string and a number of days, and returns a list with the dates of the days that follow the given date.

next_dates = get_next_n_days(ini_date='2021-01-01', next_days = 4)
next_dates
[datetime.datetime(2021, 1, 2, 0, 0),
 datetime.datetime(2021, 1, 3, 0, 0),
 datetime.datetime(2021, 1, 4, 0, 0),
 datetime.datetime(2021, 1, 5, 0, 0)]
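
An equivalent result can be produced with the standard library alone. The sketch below uses a hypothetical helper name and assumes the date string is in YYYY-MM-DD format, as in the call above.

```python
from datetime import datetime, timedelta


def next_n_days_sketch(ini_date: str, next_days: int) -> list:
    """Illustrative only: the `next_days` dates that follow `ini_date`."""
    start = datetime.strptime(ini_date, "%Y-%m-%d")
    # Start at 1 so that the initial date itself is not included.
    return [start + timedelta(days=i) for i in range(1, next_days + 1)]


dates = next_n_days_sketch("2021-01-01", 4)
print(dates)
```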

The functions below were created to allow the application of neural network models as forecasting models. They transform tabular data into the array data that this class of models accepts.

Function lstm_split_data():

This function splits the data into training and test sets. It takes as inputs a DataFrame, the number of past days used in the prediction (look_back), the number of days to predict (predict_n), the position of the target column (Y_column), and the fraction of the data used for training (ratio); the remainder is used for testing.

X_train, Y_train, X_test, Y_test = lstm_split_data(df = df,
                                                   look_back = 4,
                                                   predict_n = 4, 
                                                   ratio = 0.8,
                                                   Y_column = df.columns.get_loc("hosp_GE"))

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(723, 4, 64)
(723, 4)
(183, 4, 64)
(183, 4)
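
The shapes above follow from a sliding-window layout: 913 rows with look_back = 4 and predict_n = 4 give 913 - 4 - 4 + 1 = 906 complete windows, and 723 + 183 = 906. The sketch below shows one way to build such windows with NumPy. The helper name make_windows_sketch is hypothetical, and the exact train/test boundary used by lstm_split_data may differ from the simple rule assumed here.

```python
import numpy as np


def make_windows_sketch(values, look_back, predict_n, y_column):
    """Illustrative only: cut a 2D array (time, series) into windows."""
    n_windows = values.shape[0] - look_back - predict_n + 1
    # X: the `look_back` past rows of every series for each window.
    X = np.stack([values[i : i + look_back] for i in range(n_windows)])
    # Y: the next `predict_n` values of the target column only.
    Y = np.stack(
        [
            values[i + look_back : i + look_back + predict_n, y_column]
            for i in range(n_windows)
        ]
    )
    return X, Y


values = np.arange(40, dtype="float64").reshape(20, 2)  # 20 days, 2 series
X, Y = make_windows_sketch(values, look_back=4, predict_n=3, y_column=1)

# Chronological split without shuffling: earlier windows train, later ones test.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
```

The X arrays have shape (samples, look_back, number of series) and the Y arrays have shape (samples, predict_n), the same layout as the output of lstm_split_data() above.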

Function normalize_data():

This function takes as input a DataFrame and normalizes each column by dividing it by its maximum value. It returns the normalized DataFrame together with the maximum value of each column.

df_n, max_values = normalize_data(df)

df_n
test_FR diff_test_FR diff_2_test_FR test_NE diff_test_NE diff_2_test_NE test_TI diff_test_TI diff_2_test_TI test_VD ... hosp_NE diff_hosp_NE diff_2_hosp_NE hosp_FR diff_hosp_FR diff_2_hosp_FR hosp_GE diff_hosp_GE diff_2_hosp_GE vac_all
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.009009 0.0000 0.00 0.024793 0.066667 0.090909 0.014634 0.000000 0.000000 0.0
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.018018 0.0625 0.05 0.049587 0.200000 0.181818 0.014634 0.000000 0.032258 0.0
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.027027 0.0625 0.10 0.049587 0.000000 0.136364 0.014634 0.000000 0.000000 0.0
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.018018 -0.0625 0.00 0.041322 -0.066667 -0.045455 0.019512 0.043478 0.032258 0.0
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.027027 0.0625 0.00 0.057851 0.133333 0.045455 0.029268 0.086957 0.096774 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
908 0.046436 -0.002678 -0.025352 0.053610 0.005452 0.013990 0.096719 -0.017641 -0.023144 0.069339 ... 0.000000 0.0000 0.00 0.008264 -0.200000 -0.136364 0.087805 -0.043478 0.032258 1.0
909 0.046236 -0.002678 -0.004024 0.053666 0.000779 0.004145 0.096447 -0.002614 -0.014947 0.069339 ... 0.000000 0.0000 0.00 0.008264 0.000000 -0.136364 0.087805 0.000000 -0.032258 1.0
910 0.044516 -0.023032 -0.019316 0.052940 -0.010125 -0.006218 0.096991 0.005227 0.001929 0.068820 ... 0.000000 0.0000 0.00 0.008264 0.000000 0.000000 0.073171 -0.130435 -0.096774 1.0
911 0.042517 -0.026781 -0.037425 0.050874 -0.028816 -0.025907 0.085841 -0.107155 -0.075217 0.065681 ... 0.000000 0.0000 0.00 0.008264 0.000000 0.000000 0.063415 -0.086957 -0.161290 1.0
912 0.034677 -0.025174 -0.079276 0.041883 -0.034268 -0.082902 0.069862 -0.095394 -0.138862 0.053947 ... 0.000000 0.0000 0.00 0.008264 0.000000 0.000000 0.048780 -0.130435 -0.161290 1.0

913 rows × 64 columns

max_values
test_FR           3571.714286
diff_test_FR       266.714286
diff_2_test_FR     355.000000
test_NE           2558.142857
diff_test_NE       183.428571
                     ...     
diff_2_hosp_FR       3.142857
hosp_GE             29.285714
diff_hosp_GE         3.285714
diff_2_hosp_GE       3.571429
vac_all            182.800000
Length: 64, dtype: float64
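
Max-based scaling and its inverse can be sketched in a few lines of pandas. The toy frame and its column names are hypothetical; the point is that keeping max_values allows model predictions to be mapped back to the original scale.

```python
import pandas as pd

toy = pd.DataFrame({"tests": [0.0, 50.0, 100.0], "hosp": [1.0, 2.0, 4.0]})

# Column-wise maxima, aligned to the columns by name during the division.
max_values = toy.max()
toy_n = toy / max_values

# Multiplying by the stored maxima undoes the scaling.
restored = toy_n * max_values
print(toy_n)
print(restored)
```

Note that in this sketch a column whose maximum is 0 would produce NaN values (0 divided by 0), so such columns need separate handling.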