Train-test splitting with Calendars

Because the Lilio calendar system was designed with machine learning in mind, it includes a train-test module that helps generate train/test splits.

Currently this feature is only supported for xarray data.

Let’s start by generating some dummy data:

[1]:

import numpy as np
import pandas as pd
import xarray as xr
import lilio

# Hide the full data when displaying a dataset in the notebook
xr.set_options(display_expand_data=False)

n = 50
time_index = pd.date_range("20151020", periods=n, freq="60d")
time_coord = {"time": time_index}
x1 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor1")
x2 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor2")
y = xr.DataArray(np.random.randn(n), coords=time_coord, name="target")
print(x1)

<xarray.DataArray 'precursor1' (time: 50)>
0.202 -1.004 0.1281 -1.529 0.8706 1.895 ... 1.055 1.808 0.6757 -0.818 0.9551
Coordinates:
  * time     (time) datetime64[ns] 2015-10-20 2015-12-19 ... 2023-11-07

Next we need a calendar, which we use to resample the dummy data:

[2]:

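# Calendar anchored at October 15, with intervals of 180 days
calendar = lilio.daily_calendar(anchor="10-15", length="180d")
# Map the calendar to the period covered by the data
calendar.map_to_data(x1)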
x1 = lilio.resample(calendar, x1)
x2 = lilio.resample(calendar, x2)
y = lilio.resample(calendar, y)

print(x1)

<xarray.DataArray 'precursor1' (anchor_year: 7, i_interval: 2)>
0.7507 -0.09163 -0.6898 0.2024 0.4187 ... -0.1501 -0.7546 0.4656 0.1512 0.5571
Coordinates:
  * anchor_year  (anchor_year) int64 2016 2017 2018 2019 2020 2021 2022
  * i_interval   (i_interval) int64 -1 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 2016-04-18 ... 2022...
    right_bound  (anchor_year, i_interval) datetime64[ns] 2016-10-15 ... 2023...
    is_target    (i_interval) bool False True
Attributes:
    lilio_version:               0.4.2
    lilio_calendar_anchor_date:  10-15
    lilio_calendar_code:         Calendar(\n    anchor='10-15',\n    allow_ov...
    history:                     2024-01-19 13:44:22 UTC - Resampled with a L...
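
The interval bounds and anchor years in the output above can be reproduced by hand. The following is a minimal sketch using only the standard library, not Lilio's implementation. It assumes that the precursor interval (i_interval -1) ends at the anchor date, that the target interval (i_interval 1) starts there, and that only anchor years whose intervals fall entirely within the data's time range are kept. The names interval_bounds and covered_years are hypothetical helpers introduced for illustration.

```python
from datetime import date, timedelta

LENGTH = timedelta(days=180)  # interval length, matching length="180d"
DATA_START = date(2015, 10, 20)  # first timestamp of the dummy data
DATA_END = DATA_START + timedelta(days=60 * 49)  # 50 samples, 60 days apart


def interval_bounds(anchor_year):
    """Return ((left, right) of precursor, (left, right) of target).

    Hypothetical helper: assumes an anchor of October 15.
    """
    anchor = date(anchor_year, 10, 15)
    precursor = (anchor - LENGTH, anchor)
    target = (anchor, anchor + LENGTH)
    return precursor, target


def covered_years(first_year, last_year):
    """Keep anchor years whose intervals lie fully inside the data range.

    Assumption: this mirrors how the calendar is mapped to the data.
    """
    years = []
    for year in range(first_year, last_year + 1):
        precursor, target = interval_bounds(year)
        if precursor[0] >= DATA_START and target[1] <= DATA_END:
            years.append(year)
    return years


print(interval_bounds(2016))
print(covered_years(2015, 2023))
```

Under these assumptions the first precursor interval starts on 2016-04-18, and the anchor years 2015 and 2023 are dropped because the data does not fully cover their intervals.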

Now we are ready to create train and test splits of our data. We set up a splitting strategy (here scikit-learn's KFold) and pass it to lilio.traintest.TrainTestSplit.

We can use the resulting cross-validator to split our precursor datasets x1 and x2, as well as the target data y:

[3]:

# Cross-validation
from sklearn.model_selection import KFold
import lilio.traintest

kfold = KFold(n_splits=3)
cv = lilio.traintest.TrainTestSplit(kfold)
for (x1_train, x2_train), (x1_test, x2_test), y_train, y_test in cv.split([x1, x2], y=y):
    print("Train:", x1_train.anchor_year.values)
    print("Test:", x1_test.anchor_year.values)

# After the loop, x1_train holds the training data of the last fold
print(x1_train)

Train: [2019 2020 2021 2022]
Test: [2016 2017 2018]
Train: [2016 2017 2018 2021 2022]
Test: [2019 2020]
Train: [2016 2017 2018 2019 2020]
Test: [2021 2022]
<xarray.DataArray 'precursor1' (anchor_year: 5, i_interval: 2)>
0.7507 -0.09163 -0.6898 0.2024 0.4187 0.2311 -0.1424 -0.593 0.1041 -0.1501
Coordinates:
  * anchor_year  (anchor_year) int64 2016 2017 2018 2019 2020
  * i_interval   (i_interval) int64 -1 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 2016-04-18 ... 2020...
    right_bound  (anchor_year, i_interval) datetime64[ns] 2016-10-15 ... 2021...
    is_target    (i_interval) bool False True
Attributes:
    lilio_version:               0.4.2
    lilio_calendar_anchor_date:  10-15
    lilio_calendar_code:         Calendar(\n    anchor='10-15',\n    allow_ov...
    history:                     2024-01-19 13:44:22 UTC - Resampled with a L...
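
The fold assignments printed above are consistent with how k-fold splitting partitions samples when shuffling is off: the samples are cut into contiguous test folds, and the first n_samples % n_splits folds receive one extra sample. The following plain-Python sketch illustrates that partitioning for the seven anchor years. It does not call scikit-learn or Lilio, and kfold_test_folds is a hypothetical helper written for this illustration.

```python
def kfold_test_folds(samples, n_splits):
    """Cut samples into contiguous test folds, as unshuffled k-fold does."""
    base, extra = divmod(len(samples), n_splits)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if i < extra else 0)
        folds.append(samples[start:start + size])
        start += size
    return folds


anchor_years = list(range(2016, 2023))
for test in kfold_test_folds(anchor_years, n_splits=3):
    # Everything outside the test fold is used for training
    train = [year for year in anchor_years if year not in test]
    print("Train:", train, "Test:", test)
```

With seven anchor years and three splits, the test folds contain three, two, and two years, which matches the output of the cross-validator above.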

Now you are ready to train your models!