Preprocessing data (Python version)

This notebook provides some examples of how the functions in the preprocessing.py module can be used.

import pandas as pd
from epigraphhub.analysis.preprocessing import *

The functions in the preprocessing.py module transform tabular data into the formats expected by two families of models: ML regression models (tabular data with lagged values as features) and neural network models (3D arrays with multiple outputs).

In this tutorial, we will use the data saved in the path: ./data/data_GE.csv. This dataset represents the number of tests, cases, and hospitalizations of COVID-19 reported in some cantons of Switzerland.

df = pd.read_csv('./data/data_GE.csv')
df.set_index('datum', inplace = True)
df.index = pd.to_datetime(df.index)
df.head()

	test_FR	diff_test_FR	diff_2_test_FR	test_NE	diff_test_NE	diff_2_test_NE	test_TI	diff_test_TI	diff_2_test_TI	test_VD	...	hosp_NE	diff_hosp_NE	diff_2_hosp_NE	hosp_FR	diff_hosp_FR	diff_2_hosp_FR	hosp_GE	diff_hosp_GE	diff_2_hosp_GE	vac_all
datum
2020-03-01	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.142857	0.000000	0.000000	0.428571	0.142857	0.285714	0.428571	0.000000	0.000000	0.0
2020-03-02	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.285714	0.142857	0.142857	0.857143	0.428571	0.571429	0.428571	0.000000	0.142857	0.0
2020-03-03	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.428571	0.142857	0.285714	0.857143	0.000000	0.428571	0.428571	0.000000	0.000000	0.0
2020-03-04	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.285714	-0.142857	0.000000	0.714286	-0.142857	-0.142857	0.571429	0.142857	0.142857	0.0
2020-03-05	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.428571	0.142857	0.000000	1.000000	0.285714	0.142857	0.857143	0.285714	0.428571	0.0

5 rows × 64 columns

The functions below make it possible to use a machine learning regressor as a forecasting model. Past information enters as lagged columns (the features), and multiple models are trained, each one specialized in predicting a single day of the forecast horizon.
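
As a rough illustration of this strategy (one model per forecast day), the sketch below fits three ordinary least squares models on a toy series with NumPy. The toy data, the variable names, and the choice of least squares are assumptions made for the example; they are not part of the preprocessing.py module.

```python
import numpy as np
import pandas as pd

# Toy daily series (illustrative values only, not the Swiss data).
idx = pd.date_range("2021-01-01", periods=30, freq="D")
y = pd.Series(np.arange(30, dtype=float), index=idx, name="y")

# Features: the current value plus two lagged values; rows with missing lags are dropped.
X = pd.concat({"y": y, "y_lag1": y.shift(1), "y_lag2": y.shift(2)}, axis=1).dropna()

models = {}
for horizon in (1, 2, 3):
    # Target for this model: the series `horizon` days ahead.
    target = y.shift(-horizon).dropna()
    # Keep only the dates where both features and target are available.
    common = X.index.intersection(target.index)
    # Ordinary least squares with an intercept column.
    A = np.column_stack([X.loc[common].to_numpy(), np.ones(len(common))])
    coef, *_ = np.linalg.lstsq(A, target.loc[common].to_numpy(), rcond=None)
    models[horizon] = coef

# Each model forecasts its own day ahead from the most recent feature row.
last_row = np.append(X.iloc[-1].to_numpy(), 1.0)
forecast = {h: float(last_row @ coef) for h, coef in models.items()}
print(forecast)
```

Any regressor could take the place of the least squares fit; the point is only that each horizon gets its own model trained on the same lagged features.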

Function `build_lagged_features()`

This function takes as input a DataFrame and a maximum number of lags, and computes the lagged values of each column in a new DataFrame. Each lagged column is named after the original column with a _lag1, _lag2, ... suffix.

df_lag = build_lagged_features(dt = df, maxlag = 3)

df_lag.head()

	test_FR	test_FR_lag1	test_FR_lag2	test_FR_lag3	diff_test_FR	diff_test_FR_lag1	diff_test_FR_lag2	diff_test_FR_lag3	diff_2_test_FR	diff_2_test_FR_lag1	...	diff_hosp_GE_lag2	diff_hosp_GE_lag3	diff_2_hosp_GE	diff_2_hosp_GE_lag1	diff_2_hosp_GE_lag2	diff_2_hosp_GE_lag3	vac_all	vac_all_lag1	vac_all_lag2	vac_all_lag3
datum
2020-03-04	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.000000	0.000000	0.142857	0.000000	0.142857	0.000000	0.0	0.0	0.0	0.0
2020-03-05	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.000000	0.000000	0.428571	0.142857	0.000000	0.142857	0.0	0.0	0.0	0.0
2020-03-06	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.142857	0.000000	0.428571	0.428571	0.142857	0.000000	0.0	0.0	0.0	0.0
2020-03-07	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.285714	0.142857	0.285714	0.428571	0.428571	0.142857	0.0	0.0	0.0	0.0
2020-03-08	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.142857	0.285714	0.000000	0.285714	0.428571	0.428571	0.0	0.0	0.0	0.0

5 rows × 256 columns
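
The transformation can be approximated with plain pandas. The sketch below is an assumption about what the function does, inferred from the output above (column naming, column order, and the missing first rows); `lagged_features_sketch` is a hypothetical name, not the library's implementation.

```python
import pandas as pd

def lagged_features_sketch(df: pd.DataFrame, maxlag: int) -> pd.DataFrame:
    """Return each column followed by its lag-1 .. lag-maxlag copies."""
    cols = {}
    for name in df.columns:
        cols[name] = df[name]
        for lag in range(1, maxlag + 1):
            # shift(lag) moves values down, so row t holds the value from t - lag.
            cols[f"{name}_lag{lag}"] = df[name].shift(lag)
    # The first `maxlag` rows have no complete history and are dropped.
    return pd.DataFrame(cols).dropna()

toy = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2020-03-01", periods=4, freq="D"),
)
toy_lag = lagged_features_sketch(toy, maxlag=2)
print(toy_lag)
```

With 64 input columns and maxlag = 3, this layout gives 64 × 4 = 256 columns, which matches the shape reported above.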

Function `preprocess_data()`

This function does the same as build_lagged_features(). The difference is that it also lets the user subset the returned DataFrame by an initial and an end date.

df_lag = preprocess_data(data = df, maxlag = 3, ini_date = '2021-01-01' , end_date = '2021-05-01')

df_lag.head()

	test_FR	test_FR_lag1	test_FR_lag2	test_FR_lag3	diff_test_FR	diff_test_FR_lag1	diff_test_FR_lag2	diff_test_FR_lag3	diff_2_test_FR	diff_2_test_FR_lag1	...	diff_hosp_GE_lag2	diff_hosp_GE_lag3	diff_2_hosp_GE	diff_2_hosp_GE_lag1	diff_2_hosp_GE_lag2	diff_2_hosp_GE_lag3	vac_all	vac_all_lag1	vac_all_lag2	vac_all_lag3
datum
2021-01-01	542.285714	550.857143	586.000000	650.428571	-8.571429	-35.142857	-64.428571	-32.714286	-43.714286	-99.571429	...	0.000000	0.285714	-0.142857	-0.285714	0.285714	0.714286	0.044286	0.035714	0.027143	0.017143
2021-01-02	536.571429	542.285714	550.857143	586.000000	-5.714286	-8.571429	-35.142857	-64.428571	-14.285714	-43.714286	...	-0.285714	0.000000	0.571429	-0.142857	-0.285714	0.285714	0.052857	0.044286	0.035714	0.027143
2021-01-03	530.000000	536.571429	542.285714	550.857143	-6.571429	-5.714286	-8.571429	-35.142857	-12.285714	-14.285714	...	0.142857	-0.285714	0.571429	0.571429	-0.142857	-0.285714	0.061429	0.052857	0.044286	0.035714
2021-01-04	552.285714	530.000000	536.571429	542.285714	22.285714	-6.571429	-5.714286	-8.571429	15.714286	-12.285714	...	0.428571	0.142857	0.000000	0.571429	0.571429	-0.142857	0.074286	0.061429	0.052857	0.044286
2021-01-05	552.857143	552.285714	530.000000	536.571429	0.571429	22.285714	-6.571429	-5.714286	22.857143	15.714286	...	0.142857	0.428571	-0.428571	0.000000	0.571429	0.571429	0.095714	0.074286	0.061429	0.052857

5 rows × 256 columns
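
A minimal pandas sketch of the same idea: compute the lags on the full history first, then slice by date, so the first selected row already has its lagged values (as in the output above). The function name `preprocess_sketch` is hypothetical, and this version groups columns by lag rather than by variable, so the column order differs from the library's.

```python
import pandas as pd

def preprocess_sketch(data: pd.DataFrame, maxlag: int,
                      ini_date=None, end_date=None) -> pd.DataFrame:
    # Build the lagged copies on the whole series before subsetting.
    lagged = pd.concat(
        [data] + [data.shift(lag).add_suffix(f"_lag{lag}")
                  for lag in range(1, maxlag + 1)],
        axis=1,
    ).dropna()
    # Label-based slicing on a DatetimeIndex includes both endpoints;
    # a bound left as None leaves that side open.
    return lagged.loc[ini_date:end_date]

toy = pd.DataFrame(
    {"a": [float(i) for i in range(14)]},
    index=pd.date_range("2020-12-25", periods=14, freq="D"),
)
out = preprocess_sketch(toy, maxlag=3, ini_date="2021-01-01", end_date="2021-01-05")
print(out)
```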

Function `get_targets()`

This function transforms a series (pd.Series) of targets into a dictionary whose keys are the number of days ahead and whose values are the target series shifted by that many days. The shifted series can be used, for example, to train multiple ML regression models, each one forecasting a different point of the target curve.

dict_target = get_targets(target = df['hosp_GE'],predict_n = 4)

dict_target

{1: datum
 2020-03-01    0.428571
 2020-03-02    0.428571
 2020-03-03    0.571429
 2020-03-04    0.857143
 2020-03-05    1.000000
                 ...   
 2022-08-25    2.571429
 2022-08-26    2.571429
 2022-08-27    2.142857
 2022-08-28    1.857143
 2022-08-29    1.428571
 Name: hosp_GE, Length: 912, dtype: float64,
 2: datum
 2020-03-01    0.428571
 2020-03-02    0.571429
 2020-03-03    0.857143
 2020-03-04    1.000000
 2020-03-05    1.142857
                 ...   
 2022-08-24    2.571429
 2022-08-25    2.571429
 2022-08-26    2.142857
 2022-08-27    1.857143
 2022-08-28    1.428571
 Name: hosp_GE, Length: 911, dtype: float64,
 3: datum
 2020-03-01    0.571429
 2020-03-02    0.857143
 2020-03-03    1.000000
 2020-03-04    1.142857
 2020-03-05    1.000000
                 ...   
 2022-08-23    2.571429
 2022-08-24    2.571429
 2022-08-25    2.142857
 2022-08-26    1.857143
 2022-08-27    1.428571
 Name: hosp_GE, Length: 910, dtype: float64,
 4: datum
 2020-03-01    0.857143
 2020-03-02    1.000000
 2020-03-03    1.142857
 2020-03-04    1.000000
 2020-03-05    1.571429
                 ...   
 2022-08-22    2.571429
 2022-08-23    2.571429
 2022-08-24    2.142857
 2022-08-25    1.857143
 2022-08-26    1.428571
 Name: hosp_GE, Length: 909, dtype: float64}
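
The structure of this dictionary can be reproduced with `Series.shift`. The sketch below is inferred from the output above (key d holds the series moved d days back in time, with the last d entries removed); `targets_sketch` is a hypothetical name and not the library's code.

```python
import pandas as pd

def targets_sketch(target: pd.Series, predict_n: int) -> dict:
    """Map each horizon d to the series of values observed d days later."""
    # shift(-d) puts the value of day t + d on row t; the last d rows
    # have no future value, so they are cut off.
    return {d: target.shift(-d).iloc[:-d] for d in range(1, predict_n + 1)}

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
    name="y",
)
toy_targets = targets_sketch(s, predict_n=2)
print(toy_targets[1])
print(toy_targets[2])
```

Each entry loses one more row than the previous one, which is why the lengths above go from 912 down to 909.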

Function `get_next_n_days()`

This function takes as input a date string and a number of days, and returns a list with that many dates following the given date.

next_dates = get_next_n_days(ini_date='2021-01-01', next_days = 4)
next_dates

[datetime.datetime(2021, 1, 2, 0, 0),
 datetime.datetime(2021, 1, 3, 0, 0),
 datetime.datetime(2021, 1, 4, 0, 0),
 datetime.datetime(2021, 1, 5, 0, 0)]
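
An equivalent list can be built with the standard library alone. The sketch assumes the date string uses the YYYY-MM-DD format shown above; `next_n_days_sketch` is a hypothetical name.

```python
from datetime import datetime, timedelta

def next_n_days_sketch(ini_date: str, next_days: int) -> list:
    """Return the `next_days` dates that follow `ini_date` (exclusive)."""
    start = datetime.strptime(ini_date, "%Y-%m-%d")
    # Start at 1 so the initial date itself is not part of the result.
    return [start + timedelta(days=i) for i in range(1, next_days + 1)]

dates = next_n_days_sketch("2021-01-01", 4)
print(dates)
```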

The functions below make it possible to use neural network models as forecasting models. They transform tabular data into array data, which is the format this class of models accepts.

Function `lstm_split_data()`

This function splits the data into training and test sets. It takes as inputs a DataFrame, the number of days to be predicted (predict_n), the number of past days used in the prediction (look_back), the position of the target column (Y_column), and the fraction of the data used for training (ratio); the remainder forms the test set.

X_train, Y_train, X_test, Y_test = lstm_split_data(df = df,
                                                   look_back = 4,
                                                   predict_n = 4, 
                                                   ratio = 0.8,
                                                   Y_column = df.columns.get_loc("hosp_GE"))

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(723, 4, 64)
(723, 4)
(183, 4, 64)
(183, 4)
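
The shapes above follow from a sliding-window layout: every sample pairs look_back consecutive rows of all columns (the input) with the next predict_n values of the target column (the output). The NumPy sketch below illustrates that layout under stated assumptions: `split_windows_sketch` is a hypothetical name, and its split rule (the first `ratio` share of the windows goes to training) is a guess, so the train/test counts need not match the library's.

```python
import numpy as np

def split_windows_sketch(values, look_back, predict_n, ratio, y_column):
    values = np.asarray(values, dtype=float)
    n_windows = values.shape[0] - look_back - predict_n + 1
    # X[i]: rows i .. i + look_back - 1, all columns -> (look_back, n_columns)
    X = np.stack([values[i:i + look_back] for i in range(n_windows)])
    # Y[i]: the next predict_n values of the target column -> (predict_n,)
    Y = np.stack([
        values[i + look_back:i + look_back + predict_n, y_column]
        for i in range(n_windows)
    ])
    # Chronological split: earlier windows train, later windows test.
    train_size = int(n_windows * ratio)
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]

toy = np.arange(40.0).reshape(20, 2)  # 20 time steps, 2 columns
X_tr, Y_tr, X_te, Y_te = split_windows_sketch(
    toy, look_back=4, predict_n=3, ratio=0.5, y_column=1
)
print(X_tr.shape, Y_tr.shape, X_te.shape, Y_te.shape)
```

The split is chronological rather than shuffled, so the test windows always come after the training windows in time.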

Function `normalize_data()`

This function takes as input a DataFrame and normalizes each column by its maximum value. It returns the normalized DataFrame and the maximum value of each column.

df_n, max_values = normalize_data(df)

df_n

	test_FR	diff_test_FR	diff_2_test_FR	test_NE	diff_test_NE	diff_2_test_NE	test_TI	diff_test_TI	diff_2_test_TI	test_VD	...	hosp_NE	diff_hosp_NE	diff_2_hosp_NE	hosp_FR	diff_hosp_FR	diff_2_hosp_FR	hosp_GE	diff_hosp_GE	diff_2_hosp_GE	vac_all
0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.009009	0.0000	0.00	0.024793	0.066667	0.090909	0.014634	0.000000	0.000000	0.0
1	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.018018	0.0625	0.05	0.049587	0.200000	0.181818	0.014634	0.000000	0.032258	0.0
2	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.027027	0.0625	0.10	0.049587	0.000000	0.136364	0.014634	0.000000	0.000000	0.0
3	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.018018	-0.0625	0.00	0.041322	-0.066667	-0.045455	0.019512	0.043478	0.032258	0.0
4	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.027027	0.0625	0.00	0.057851	0.133333	0.045455	0.029268	0.086957	0.096774	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
908	0.046436	-0.002678	-0.025352	0.053610	0.005452	0.013990	0.096719	-0.017641	-0.023144	0.069339	...	0.000000	0.0000	0.00	0.008264	-0.200000	-0.136364	0.087805	-0.043478	0.032258	1.0
909	0.046236	-0.002678	-0.004024	0.053666	0.000779	0.004145	0.096447	-0.002614	-0.014947	0.069339	...	0.000000	0.0000	0.00	0.008264	0.000000	-0.136364	0.087805	0.000000	-0.032258	1.0
910	0.044516	-0.023032	-0.019316	0.052940	-0.010125	-0.006218	0.096991	0.005227	0.001929	0.068820	...	0.000000	0.0000	0.00	0.008264	0.000000	0.000000	0.073171	-0.130435	-0.096774	1.0
911	0.042517	-0.026781	-0.037425	0.050874	-0.028816	-0.025907	0.085841	-0.107155	-0.075217	0.065681	...	0.000000	0.0000	0.00	0.008264	0.000000	0.000000	0.063415	-0.086957	-0.161290	1.0
912	0.034677	-0.025174	-0.079276	0.041883	-0.034268	-0.082902	0.069862	-0.095394	-0.138862	0.053947	...	0.000000	0.0000	0.00	0.008264	0.000000	0.000000	0.048780	-0.130435	-0.161290	1.0

913 rows × 64 columns

max_values

test_FR           3571.714286
diff_test_FR       266.714286
diff_2_test_FR     355.000000
test_NE           2558.142857
diff_test_NE       183.428571
                     ...     
diff_2_hosp_FR       3.142857
hosp_GE             29.285714
diff_hosp_GE         3.285714
diff_2_hosp_GE       3.571429
vac_all            182.800000
Length: 64, dtype: float64
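
Dividing by the column maximum is a one-liner in pandas, and keeping the maxima makes it possible to map model outputs back to the original scale. The sketch below is an assumption about the behavior, inferred from the output above (negative values keep their sign and the largest value of each column becomes 1.0). `normalize_sketch` is a hypothetical name, and unlike the output above it keeps the original index.

```python
import pandas as pd

def normalize_sketch(df: pd.DataFrame):
    """Divide every column by its maximum and return both pieces."""
    max_values = df.max()
    # Division aligns max_values with the columns of df.
    # A column whose maximum is 0 would produce NaN or inf here.
    return df / max_values, max_values

toy = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [-5.0, 0.0, 10.0]})
toy_n, toy_max = normalize_sketch(toy)

# Multiplying by the stored maxima recovers the original values.
restored = toy_n * toy_max
print(toy_n)
print(toy_max)
```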
