Forecasting time series with the NGBoost regressor (Python version)

This notebook shows how to use the functions in the ngboost_models.py module, which apply the NGBoost regressor model to time series. The module provides separate methods to train and evaluate a model (splitting the data into train and test datasets), to train a model on all the available data, and to make forecasts.

import pandas as pd
from epigraphhub.analysis.forecast_models.plots import * 
from epigraphhub.analysis.preprocessing import * 
from epigraphhub.analysis.forecast_models.ngboost_models import * 

In this tutorial, we will use the data saved at ./data/data_GE.csv. This table contains the number of tests, cases, and hospitalizations (the daily values and their first- and second-order differences, stored in the diff_ and diff_2_ columns) for several cantons in Switzerland.

df = pd.read_csv('./data/data_GE.csv')
df.set_index('datum', inplace = True)
df.index = pd.to_datetime(df.index)

df
test_FR diff_test_FR diff_2_test_FR test_NE diff_test_NE diff_2_test_NE test_TI diff_test_TI diff_2_test_TI test_VD ... hosp_NE diff_hosp_NE diff_2_hosp_NE hosp_FR diff_hosp_FR diff_2_hosp_FR hosp_GE diff_hosp_GE diff_2_hosp_GE vac_all
datum
2020-03-01 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.142857 0.000000 0.000000 0.428571 0.142857 0.285714 0.428571 0.000000 0.000000 0.0
2020-03-02 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.285714 0.142857 0.142857 0.857143 0.428571 0.571429 0.428571 0.000000 0.142857 0.0
2020-03-03 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.428571 0.142857 0.285714 0.857143 0.000000 0.428571 0.428571 0.000000 0.000000 0.0
2020-03-04 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.285714 -0.142857 0.000000 0.714286 -0.142857 -0.142857 0.571429 0.142857 0.142857 0.0
2020-03-05 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.428571 0.142857 0.000000 1.000000 0.285714 0.142857 0.857143 0.285714 0.428571 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-08-26 165.857143 -0.714286 -9.000000 137.142857 1.000000 3.857143 406.428571 -7.714286 -13.714286 650.000000 ... 0.000000 0.000000 0.000000 0.142857 -0.428571 -0.428571 2.571429 -0.142857 0.142857 182.8
2022-08-27 165.142857 -0.714286 -1.428571 137.285714 0.142857 1.142857 405.285714 -1.142857 -8.857143 650.000000 ... 0.000000 0.000000 0.000000 0.142857 0.000000 -0.428571 2.571429 0.000000 -0.142857 182.8
2022-08-28 159.000000 -6.142857 -6.857143 135.428571 -1.857143 -1.714286 407.571429 2.285714 1.142857 645.142857 ... 0.000000 0.000000 0.000000 0.142857 0.000000 0.000000 2.142857 -0.428571 -0.428571 182.8
2022-08-29 151.857143 -7.142857 -13.285714 130.142857 -5.285714 -7.142857 360.714286 -46.857143 -44.571429 615.714286 ... 0.000000 0.000000 0.000000 0.142857 0.000000 0.000000 1.857143 -0.285714 -0.714286 182.8
2022-08-30 123.857143 -6.714286 -28.142857 107.142857 -6.285714 -22.857143 293.571429 -41.714286 -82.285714 505.714286 ... 0.000000 0.000000 0.000000 0.142857 0.000000 0.000000 1.428571 -0.428571 -0.714286 182.8

913 rows × 64 columns
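
The rows displayed above are consistent with the diff_ columns holding the day-over-day change and the diff_2_ columns holding the change over two days (a lag-2 difference rather than a difference of differences). This is an inference from the printed values, not a statement taken from the module. A minimal sketch with pandas, using the first five hosp_NE values from the table:

```python
import pandas as pd

# First five hosp_NE values from the table above (2020-03-01 to 2020-03-05).
s = pd.Series(
    [0.142857, 0.285714, 0.428571, 0.285714, 0.428571],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
    name="hosp_NE",
)

features = pd.DataFrame(
    {
        "hosp_NE": s,
        "diff_hosp_NE": s.diff(1),    # value minus the previous day's value
        "diff_2_hosp_NE": s.diff(2),  # value minus the value two days earlier
    }
)
print(features.round(6))
```

The leading rows are NaN here only because the toy series has no earlier dates; from 2020-03-03 onwards the computed values agree with the diff_hosp_NE and diff_2_hosp_NE columns shown above.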

Class NGBModel()

This class instantiates an NGBoost regressor model. It accepts the parameters of an NGBoost model (defined in the NGBoost documentation) plus four others: look_back, the number of most recent observations that the model uses as input; predict_n, the number of days that the model predicts; validation_split, the fraction of the training data used for validation; and early_stop, which controls the early stopping of the training. The methods in this class allow the user to train and evaluate the model, to train and save the model, and to make forecasts using saved models.

This class trains multiple NGBoost models, each one specialized in forecasting a single day of the prediction horizon.
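
To make the roles of look_back and predict_n concrete, here is a minimal sketch of how a series can be cut into input windows and per-day targets. It is an illustration under simplifying assumptions (a single univariate series, a hypothetical helper named make_windows), not the module's own implementation, which may build its inputs differently.

```python
import numpy as np


def make_windows(series, look_back, predict_n):
    """Return (X, Y): each row of X holds `look_back` consecutive values and
    the matching row of Y holds the `predict_n` values that follow them."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - look_back - predict_n + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    X = np.stack([series[i : i + look_back] for i in range(n_samples)])
    Y = np.stack(
        [series[i + look_back : i + look_back + predict_n] for i in range(n_samples)]
    )
    return X, Y


# Toy series 0, 1, ..., 39 so the window contents are easy to read.
X, Y = make_windows(np.arange(40), look_back=14, predict_n=14)

# Column h of Y is the target for the model specialised in day h + 1 of the horizon.
targets_per_horizon = {h + 1: Y[:, h] for h in range(Y.shape[1])}
print(X.shape, Y.shape)
```

With one target vector per horizon, one regressor can be fitted per forecast day, which is the "one model per day" idea described above.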

m = NGBModel(look_back = 14,
            predict_n = 14, 
            validation_split = 0.15, 
            early_stop = 10)

def remove_zeros(tgt):
    """Replace zeros in the target array with 0.01 (modifies tgt in place)."""
    tgt[tgt == 0] = 0.01
    return tgt

Method train_eval()

This method trains and evaluates the NGBModel() model. It splits the data into train and test datasets and returns a DataFrame with the target and the predicted lower, median, and upper values; in the output below, these predictions span both the train and test periods.
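
As a rough illustration of what ratio controls, the sketch below splits a time-indexed frame chronologically, assuming the split keeps the earliest rows for training (an assumption; the module's exact splitting logic is not shown in this notebook). The helper chronological_split and the toy frame are hypothetical.

```python
import pandas as pd


def chronological_split(data, ratio):
    """Split a time-indexed frame into an earlier train part and a later test part."""
    n_train = int(len(data) * ratio)
    return data.iloc[:n_train], data.iloc[n_train:]


# Toy frame covering the same date range as ini_date and end_date below.
dates = pd.date_range("2020-05-01", "2022-04-30", freq="D")
toy = pd.DataFrame({"y": range(len(dates))}, index=dates)

train, test = chronological_split(toy, ratio=0.8)
print(len(dates), len(train), len(test))
```

For this date range the split yields 584 training rows out of 730 days, which agrees with the train_size column in the output shown further below.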

df['hosp_GE'] = remove_zeros(df['hosp_GE'].values)

df_p = m.train_eval(target_name = 'hosp_GE',
                    data = df,
                    ini_date = '2020-05-01',
                    end_date = '2022-04-30',
                    ratio = 0.8, save = False)

df_p
target lower median upper train_size
date
2020-05-02 0.285714 0.171583 0.300695 0.526959 584
2020-05-03 0.285714 0.188583 0.363243 0.699668 584
2020-05-04 0.142857 0.077129 0.199622 0.516653 584
2020-05-05 0.142857 0.074032 0.222506 0.668752 584
2020-05-06 0.142857 0.083643 0.205801 0.506366 584
... ... ... ... ... ...
2022-04-26 7.714286 1.810433 3.324867 6.106128 584
2022-04-27 6.857143 2.234695 3.160372 4.469492 584
2022-04-28 5.714286 2.208807 3.287218 4.892146 584
2022-04-29 5.000000 1.947474 3.324971 5.676806 584
2022-04-30 5.714286 3.261146 5.104967 7.991267 584

729 rows × 5 columns
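
One simple way to read this output is to check how often the target falls inside the [lower, upper] band and how far the median is from the target. The sketch below does this for the ten rows displayed above; the helper names interval_coverage and median_mae are hypothetical and not part of the module.

```python
import pandas as pd

# The ten rows displayed above (first five and last five rows of df_p).
rows = pd.DataFrame(
    {
        "target": [0.285714, 0.285714, 0.142857, 0.142857, 0.142857,
                   7.714286, 6.857143, 5.714286, 5.000000, 5.714286],
        "lower": [0.171583, 0.188583, 0.077129, 0.074032, 0.083643,
                  1.810433, 2.234695, 2.208807, 1.947474, 3.261146],
        "median": [0.300695, 0.363243, 0.199622, 0.222506, 0.205801,
                   3.324867, 3.160372, 3.287218, 3.324971, 5.104967],
        "upper": [0.526959, 0.699668, 0.516653, 0.668752, 0.506366,
                  6.106128, 4.469492, 4.892146, 5.676806, 7.991267],
    },
    index=pd.to_datetime(
        ["2020-05-02", "2020-05-03", "2020-05-04", "2020-05-05", "2020-05-06",
         "2022-04-26", "2022-04-27", "2022-04-28", "2022-04-29", "2022-04-30"]
    ),
)


def interval_coverage(df):
    """Fraction of rows whose target lies within [lower, upper]."""
    inside = (df["target"] >= df["lower"]) & (df["target"] <= df["upper"])
    return float(inside.mean())


def median_mae(df):
    """Mean absolute error of the median prediction."""
    return float((df["target"] - df["median"]).abs().mean())


early, late = rows.iloc[:5], rows.iloc[5:]
print("coverage:", interval_coverage(early), interval_coverage(late))
print("MAE of median:", median_mae(early), median_mae(late))
```

On these ten rows, all five targets from May 2020 fall inside the band, while three of the five targets from late April 2022 lie above upper. Ten rows are too few to draw conclusions; the same functions can be applied to the full df_p.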

Function plot_val()

This function is defined in the plots.py module. Given the output of the train_eval() method, it plots the model’s behavior on the train and test samples.

plot_val(df_p, title = 'Hosp in GE')
[Figure: output of plot_val(), titled 'Hosp in GE']

Method train()

This method trains multiple NGBoost models on all the available data and, with save = True, saves the models that will be used to make forecasts.

%%time
models = m.train(target_name = 'hosp_GE',
                 data = df,
                 ini_date = '2020-05-01',
                 end_date = '2022-04-30',
                 save = True,
                 path = './saved_models',
                 name = 'hosp_GE')
CPU times: user 3min 37s, sys: 653 ms, total: 3min 37s
Wall time: 3min 38s

Method forecast()

This method uses the models trained and saved by the train() method and applies them to the last date available (the last row in df, or the date given in end_date) to make the forecast.

df_f = m.forecast(df, end_date = '2022-04-30',  path = './saved_models', name='hosp_GE')

df_f.head()
lower median upper
date
2022-05-01 3.659357 4.477327 5.478137
2022-05-02 3.678122 4.535233 5.592075
2022-05-03 3.674208 4.571385 5.687637
2022-05-04 3.513954 4.615518 6.062404
2022-05-05 3.800476 4.725615 5.875958
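
The dates in this output follow directly from end_date. The sketch below shows, on a toy frame, how the input window and the forecast dates line up, assuming the model reads the last look_back observations up to end_date and predicts the next predict_n days (an assumption based on the parameter descriptions above, not on the module's source).

```python
import pandas as pd

look_back, predict_n = 14, 14
end_date = pd.Timestamp("2022-04-30")

# Toy frame standing in for df; only the index matters here.
dates = pd.date_range("2022-03-01", "2022-05-31", freq="D")
toy = pd.DataFrame({"hosp_GE": range(len(dates))}, index=dates)

# Last `look_back` observations up to and including end_date.
window = toy.loc[:end_date].iloc[-look_back:]

# One forecast date per predicted day, starting the day after end_date.
forecast_dates = pd.date_range(
    end_date + pd.Timedelta(days=1), periods=predict_n, freq="D"
)

print(window.index[0].date(), window.index[-1].date())
print(forecast_dates[0].date(), forecast_dates[-1].date())
```

With end_date = '2022-04-30', the first forecast date is 2022-05-01, as in the df_f output above.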

Function plot_forecast()

This function uses the data used to train the model and the output of the forecast() method to plot the forecast.

plot_forecast(
    df.loc[:'2022-04-30']['hosp_GE'][-90:],
    df_f,
    title = 'Forecast of hosp in GE',
    xlabel="Date",
    ylabel="Incidence", 
    save=False
    )
[Figure: output of plot_forecast(), titled 'Forecast of hosp in GE', with x-axis 'Date' and y-axis 'Incidence']