Preprocessing data (Python version)
This notebook provides some examples of how the functions in the preprocessing.py
module can be used.
import pandas as pd
from epigraphhub.analysis.preprocessing import *
The functions in the preprocessing.py module transform tabular data into the formats accepted by ML models (tabular data with lagged values) and by neural network models (3D arrays with multiple outputs).
In this tutorial, we will use the data saved in the path: ./data/data_GE.csv. This dataset represents the number of tests, cases, and hospitalizations of COVID-19 reported in some cantons of Switzerland.
df = pd.read_csv('./data/data_GE.csv')
df.set_index('datum', inplace = True)
df.index = pd.to_datetime(df.index)
df.head()
datum | test_FR | diff_test_FR | diff_2_test_FR | test_NE | diff_test_NE | diff_2_test_NE | test_TI | diff_test_TI | diff_2_test_TI | test_VD | ... | hosp_NE | diff_hosp_NE | diff_2_hosp_NE | hosp_FR | diff_hosp_FR | diff_2_hosp_FR | hosp_GE | diff_hosp_GE | diff_2_hosp_GE | vac_all |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-03-01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.142857 | 0.000000 | 0.000000 | 0.428571 | 0.142857 | 0.285714 | 0.428571 | 0.000000 | 0.000000 | 0.0 |
2020-03-02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.285714 | 0.142857 | 0.142857 | 0.857143 | 0.428571 | 0.571429 | 0.428571 | 0.000000 | 0.142857 | 0.0 |
2020-03-03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.428571 | 0.142857 | 0.285714 | 0.857143 | 0.000000 | 0.428571 | 0.428571 | 0.000000 | 0.000000 | 0.0 |
2020-03-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.285714 | -0.142857 | 0.000000 | 0.714286 | -0.142857 | -0.142857 | 0.571429 | 0.142857 | 0.142857 | 0.0 |
2020-03-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.428571 | 0.142857 | 0.000000 | 1.000000 | 0.285714 | 0.142857 | 0.857143 | 0.285714 | 0.428571 | 0.0 |
5 rows × 64 columns
The functions below were created to allow the application of a machine learning regressor model as a forecasting model, using past information as lagged columns (the features) and training multiple models, each one specialized in predicting one day in the future.
Function build_lagged_features()
This function takes as input a DataFrame and a number of lags, computes the lagged values of each column, and returns them in a new DataFrame.
df_lag = build_lagged_features(dt = df, maxlag = 3)
df_lag.head()
datum | test_FR | test_FR_lag1 | test_FR_lag2 | test_FR_lag3 | diff_test_FR | diff_test_FR_lag1 | diff_test_FR_lag2 | diff_test_FR_lag3 | diff_2_test_FR | diff_2_test_FR_lag1 | ... | diff_hosp_GE_lag2 | diff_hosp_GE_lag3 | diff_2_hosp_GE | diff_2_hosp_GE_lag1 | diff_2_hosp_GE_lag2 | diff_2_hosp_GE_lag3 | vac_all | vac_all_lag1 | vac_all_lag2 | vac_all_lag3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-03-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.142857 | 0.000000 | 0.142857 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-03-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.428571 | 0.142857 | 0.000000 | 0.142857 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-03-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.142857 | 0.000000 | 0.428571 | 0.428571 | 0.142857 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-03-07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.285714 | 0.142857 | 0.285714 | 0.428571 | 0.428571 | 0.142857 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-03-08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.142857 | 0.285714 | 0.000000 | 0.285714 | 0.428571 | 0.428571 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 256 columns
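To make the lagging step concrete, here is a minimal sketch of how such features could be built with pandas alone. The helper name lagged_features_sketch and the toy data are hypothetical, and the code is not taken from the epigraphhub source; it only reproduces the column naming (col, col_lag1, ...) and the dropped leading rows seen in the output above.

```python
import pandas as pd


def lagged_features_sketch(data: pd.DataFrame, maxlag: int) -> pd.DataFrame:
    """Hypothetical helper: keep each column and append its lag1..lag{maxlag} copies."""
    pieces = []
    for col in data.columns:
        pieces.append(data[col])
        for lag in range(1, maxlag + 1):
            # shift(lag) moves values forward in time, so each row sees the value from `lag` days before
            pieces.append(data[col].shift(lag).rename(f"{col}_lag{lag}"))
    # The first `maxlag` rows have incomplete lags (NaN), so they are dropped
    return pd.concat(pieces, axis=1).dropna()


# Toy data (an assumption, not the Swiss dataset)
toy = pd.DataFrame(
    {"cases": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2020-03-01", periods=5, name="datum"),
)
toy_lag = lagged_features_sketch(toy, maxlag=2)
print(toy_lag)
```

With maxlag = 2 the toy frame loses its first two rows and gains two lag columns, in the same way that the 64 original columns become 256 with maxlag = 3.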
Function preprocess_data()
This function does the same as build_lagged_features(). The difference is that it also allows the user to subset the returned DataFrame between an initial date and an end date.
df_lag = preprocess_data(data = df, maxlag = 3, ini_date = '2021-01-01' , end_date = '2021-05-01')
df_lag.head()
datum | test_FR | test_FR_lag1 | test_FR_lag2 | test_FR_lag3 | diff_test_FR | diff_test_FR_lag1 | diff_test_FR_lag2 | diff_test_FR_lag3 | diff_2_test_FR | diff_2_test_FR_lag1 | ... | diff_hosp_GE_lag2 | diff_hosp_GE_lag3 | diff_2_hosp_GE | diff_2_hosp_GE_lag1 | diff_2_hosp_GE_lag2 | diff_2_hosp_GE_lag3 | vac_all | vac_all_lag1 | vac_all_lag2 | vac_all_lag3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-01-01 | 542.285714 | 550.857143 | 586.000000 | 650.428571 | -8.571429 | -35.142857 | -64.428571 | -32.714286 | -43.714286 | -99.571429 | ... | 0.000000 | 0.285714 | -0.142857 | -0.285714 | 0.285714 | 0.714286 | 0.044286 | 0.035714 | 0.027143 | 0.017143 |
2021-01-02 | 536.571429 | 542.285714 | 550.857143 | 586.000000 | -5.714286 | -8.571429 | -35.142857 | -64.428571 | -14.285714 | -43.714286 | ... | -0.285714 | 0.000000 | 0.571429 | -0.142857 | -0.285714 | 0.285714 | 0.052857 | 0.044286 | 0.035714 | 0.027143 |
2021-01-03 | 530.000000 | 536.571429 | 542.285714 | 550.857143 | -6.571429 | -5.714286 | -8.571429 | -35.142857 | -12.285714 | -14.285714 | ... | 0.142857 | -0.285714 | 0.571429 | 0.571429 | -0.142857 | -0.285714 | 0.061429 | 0.052857 | 0.044286 | 0.035714 |
2021-01-04 | 552.285714 | 530.000000 | 536.571429 | 542.285714 | 22.285714 | -6.571429 | -5.714286 | -8.571429 | 15.714286 | -12.285714 | ... | 0.428571 | 0.142857 | 0.000000 | 0.571429 | 0.571429 | -0.142857 | 0.074286 | 0.061429 | 0.052857 | 0.044286 |
2021-01-05 | 552.857143 | 552.285714 | 530.000000 | 536.571429 | 0.571429 | 22.285714 | -6.571429 | -5.714286 | 22.857143 | 15.714286 | ... | 0.142857 | 0.428571 | -0.428571 | 0.000000 | 0.571429 | 0.571429 | 0.095714 | 0.074286 | 0.061429 | 0.052857 |
5 rows × 256 columns
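In the output above, the row for 2021-01-01 already carries lag values from December 2020, which suggests the lags are computed on the full series before the date filter is applied. The sketch below illustrates that order of operations on toy data; the variable names are hypothetical and this is not the library's implementation.

```python
import pandas as pd

# Toy daily series that starts before the window of interest (an assumption)
toy = pd.DataFrame(
    {"cases": [float(i) for i in range(10)]},
    index=pd.date_range("2020-12-28", periods=10, name="datum"),
)

# 1) Compute the lag on the whole series first
lagged = toy.assign(cases_lag1=toy["cases"].shift(1)).dropna()

# 2) Then subset by date; string slices on a DatetimeIndex include both endpoints
subset = lagged.loc["2021-01-01":"2021-01-03"]
print(subset)
```

Subsetting first and lagging afterwards would instead discard the first rows of the window, because their lagged values would be missing.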
Function get_targets()
This function transforms a series of targets (pd.Series) into a dictionary in which the target values are shifted as many times as necessary. This makes it possible, for example, to train multiple ML regression models, each one forecasting the target series a different number of days ahead.
dict_target = get_targets(target = df['hosp_GE'],predict_n = 4)
dict_target
{1: datum
2020-03-01 0.428571
2020-03-02 0.428571
2020-03-03 0.571429
2020-03-04 0.857143
2020-03-05 1.000000
...
2022-08-25 2.571429
2022-08-26 2.571429
2022-08-27 2.142857
2022-08-28 1.857143
2022-08-29 1.428571
Name: hosp_GE, Length: 912, dtype: float64,
2: datum
2020-03-01 0.428571
2020-03-02 0.571429
2020-03-03 0.857143
2020-03-04 1.000000
2020-03-05 1.142857
...
2022-08-24 2.571429
2022-08-25 2.571429
2022-08-26 2.142857
2022-08-27 1.857143
2022-08-28 1.428571
Name: hosp_GE, Length: 911, dtype: float64,
3: datum
2020-03-01 0.571429
2020-03-02 0.857143
2020-03-03 1.000000
2020-03-04 1.142857
2020-03-05 1.000000
...
2022-08-23 2.571429
2022-08-24 2.571429
2022-08-25 2.142857
2022-08-26 1.857143
2022-08-27 1.428571
Name: hosp_GE, Length: 910, dtype: float64,
4: datum
2020-03-01 0.857143
2020-03-02 1.000000
2020-03-03 1.142857
2020-03-04 1.000000
2020-03-05 1.571429
...
2022-08-22 2.571429
2022-08-23 2.571429
2022-08-24 2.142857
2022-08-25 1.857143
2022-08-26 1.428571
Name: hosp_GE, Length: 909, dtype: float64}
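The dictionary above is consistent with shifting the series backwards by d days for each key d and dropping the trailing missing values, which is why each entry is one row shorter than the previous one. A minimal sketch of that idea follows; shifted_targets_sketch is a hypothetical helper written for illustration, not the library function.

```python
import pandas as pd


def shifted_targets_sketch(target: pd.Series, predict_n: int) -> dict:
    """Hypothetical helper: key d holds, for each date, the value observed d days later."""
    return {d: target.shift(-d).dropna() for d in range(1, predict_n + 1)}


# Toy target series (an assumption, not the hosp_GE data)
toy_target = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range("2020-03-01", periods=5, name="datum"),
    name="hosp",
)
toy_dict = shifted_targets_sketch(toy_target, predict_n=2)
print(toy_dict[1])
print(toy_dict[2])
```

Each entry can then serve as the target of a separate regressor trained on the same lagged features, after aligning the two on their common dates.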
Function get_next_n_days()
This function takes as input a date string and a number of days, and returns a list with the dates that follow the given date.
next_dates = get_next_n_days(ini_date='2021-01-01', next_days = 4)
next_dates
[datetime.datetime(2021, 1, 2, 0, 0),
datetime.datetime(2021, 1, 3, 0, 0),
datetime.datetime(2021, 1, 4, 0, 0),
datetime.datetime(2021, 1, 5, 0, 0)]
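The same list can be produced with the standard library alone. The sketch below uses a hypothetical helper, next_n_days_sketch, and assumes the date string is in YYYY-MM-DD format; it is an illustration, not the library code.

```python
from datetime import datetime, timedelta


def next_n_days_sketch(ini_date: str, next_days: int) -> list:
    """Hypothetical helper: dates from the day after ini_date up to next_days later."""
    start = datetime.strptime(ini_date, "%Y-%m-%d")  # assumes YYYY-MM-DD input
    return [start + timedelta(days=i) for i in range(1, next_days + 1)]


toy_dates = next_n_days_sketch("2021-01-01", 4)
print(toy_dates)
```

Note that the starting date itself is not part of the result, which matches the output shown above.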
The functions below were created to allow the application of neural network models as forecasting models. They transform tabular data into arrays, the input format that this class of models accepts.
Function lstm_split_data()
This function splits the data into training and test sets. It takes as inputs a DataFrame, the number of days to predict, the number of past days used in the prediction, the position of the target column, and the fraction of the data used for training (the remainder is used for testing).
X_train, Y_train, X_test, Y_test = lstm_split_data(df = df,
look_back = 4,
predict_n = 4,
ratio = 0.8,
Y_column = df.columns.get_loc("hosp_GE"))
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(723, 4, 64)
(723, 4)
(183, 4, 64)
(183, 4)
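The shapes above follow from a sliding-window construction: each sample is a block of look_back consecutive rows with all features, and its target is the next predict_n values of the target column. With 913 rows, look_back = 4 and predict_n = 4, this gives 913 - 4 - 4 + 1 = 906 windows, which equals 723 + 183. The sketch below shows one way to build such windows with NumPy; sliding_windows_sketch is a hypothetical helper, and the exact train/test boundary chosen by the library may differ from the simple int(ratio * n) split used here.

```python
import numpy as np


def sliding_windows_sketch(values: np.ndarray, look_back: int, predict_n: int, y_column: int):
    """Hypothetical helper: build (samples, look_back, features) inputs and (samples, predict_n) targets."""
    n_samples = values.shape[0] - look_back - predict_n + 1
    # Each input window holds `look_back` consecutive rows with every feature
    X = np.stack([values[i : i + look_back] for i in range(n_samples)])
    # Each target holds the next `predict_n` values of the target column
    Y = np.stack(
        [values[i + look_back : i + look_back + predict_n, y_column] for i in range(n_samples)]
    )
    return X, Y


# Toy array with 10 time steps and 3 features (an assumption)
toy_values = np.arange(30, dtype=float).reshape(10, 3)
X, Y = sliding_windows_sketch(toy_values, look_back=4, predict_n=2, y_column=0)

# Chronological split: the first 80% of the windows for training, the rest for testing
split = int(0.8 * X.shape[0])
X_tr, X_te = X[:split], X[split:]
Y_tr, Y_te = Y[:split], Y[split:]
print(X_tr.shape, Y_tr.shape, X_te.shape, Y_te.shape)
```

The split is chronological rather than random, so the test windows always come after the training windows in time.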
Function normalize_data()
This function takes as input a DataFrame and normalizes each column by dividing it by its maximum value. It returns the normalized DataFrame together with the maximum values of each column.
df_n, max_values = normalize_data(df)
df_n
| test_FR | diff_test_FR | diff_2_test_FR | test_NE | diff_test_NE | diff_2_test_NE | test_TI | diff_test_TI | diff_2_test_TI | test_VD | ... | hosp_NE | diff_hosp_NE | diff_2_hosp_NE | hosp_FR | diff_hosp_FR | diff_2_hosp_FR | hosp_GE | diff_hosp_GE | diff_2_hosp_GE | vac_all |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.009009 | 0.0000 | 0.00 | 0.024793 | 0.066667 | 0.090909 | 0.014634 | 0.000000 | 0.000000 | 0.0 |
1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.018018 | 0.0625 | 0.05 | 0.049587 | 0.200000 | 0.181818 | 0.014634 | 0.000000 | 0.032258 | 0.0 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.027027 | 0.0625 | 0.10 | 0.049587 | 0.000000 | 0.136364 | 0.014634 | 0.000000 | 0.000000 | 0.0 |
3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.018018 | -0.0625 | 0.00 | 0.041322 | -0.066667 | -0.045455 | 0.019512 | 0.043478 | 0.032258 | 0.0 |
4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.027027 | 0.0625 | 0.00 | 0.057851 | 0.133333 | 0.045455 | 0.029268 | 0.086957 | 0.096774 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
908 | 0.046436 | -0.002678 | -0.025352 | 0.053610 | 0.005452 | 0.013990 | 0.096719 | -0.017641 | -0.023144 | 0.069339 | ... | 0.000000 | 0.0000 | 0.00 | 0.008264 | -0.200000 | -0.136364 | 0.087805 | -0.043478 | 0.032258 | 1.0 |
909 | 0.046236 | -0.002678 | -0.004024 | 0.053666 | 0.000779 | 0.004145 | 0.096447 | -0.002614 | -0.014947 | 0.069339 | ... | 0.000000 | 0.0000 | 0.00 | 0.008264 | 0.000000 | -0.136364 | 0.087805 | 0.000000 | -0.032258 | 1.0 |
910 | 0.044516 | -0.023032 | -0.019316 | 0.052940 | -0.010125 | -0.006218 | 0.096991 | 0.005227 | 0.001929 | 0.068820 | ... | 0.000000 | 0.0000 | 0.00 | 0.008264 | 0.000000 | 0.000000 | 0.073171 | -0.130435 | -0.096774 | 1.0 |
911 | 0.042517 | -0.026781 | -0.037425 | 0.050874 | -0.028816 | -0.025907 | 0.085841 | -0.107155 | -0.075217 | 0.065681 | ... | 0.000000 | 0.0000 | 0.00 | 0.008264 | 0.000000 | 0.000000 | 0.063415 | -0.086957 | -0.161290 | 1.0 |
912 | 0.034677 | -0.025174 | -0.079276 | 0.041883 | -0.034268 | -0.082902 | 0.069862 | -0.095394 | -0.138862 | 0.053947 | ... | 0.000000 | 0.0000 | 0.00 | 0.008264 | 0.000000 | 0.000000 | 0.048780 | -0.130435 | -0.161290 | 1.0 |
913 rows × 64 columns
max_values
test_FR 3571.714286
diff_test_FR 266.714286
diff_2_test_FR 355.000000
test_NE 2558.142857
diff_test_NE 183.428571
...
diff_2_hosp_FR 3.142857
hosp_GE 29.285714
diff_hosp_GE 3.285714
diff_2_hosp_GE 3.571429
vac_all 182.800000
Length: 64, dtype: float64
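Keeping max_values is useful because predictions made on the normalized scale can be converted back to the original units by multiplying by the corresponding maximum. The sketch below illustrates the round trip with plain pandas; max_normalize_sketch is a hypothetical helper and assumes every column has a nonzero maximum.

```python
import pandas as pd


def max_normalize_sketch(data: pd.DataFrame):
    """Hypothetical helper: divide each column by its maximum (assumed nonzero)."""
    max_values = data.max()
    return data / max_values, max_values


# Toy data (an assumption); negative values stay negative, as in the diff_* columns above
toy = pd.DataFrame({"hosp": [1.0, 2.0, 4.0], "diff_hosp": [1.0, -1.0, 2.0]})
toy_n, toy_max = max_normalize_sketch(toy)

# Multiplying by the stored maxima recovers the original scale
restored = toy_n * toy_max
print(toy_n)
print(restored)
```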