Train-test splitting with Calendars

The Lilio calendar system was designed with machine learning in mind, so it includes a train-test module (lilio.traintest) that helps generate train/test splits.

Currently, this feature only supports xarray data.

Let’s start by generating some dummy data:

[1]:
import numpy as np
import pandas as pd
import xarray as xr
import lilio

# Hide the full data when displaying a dataset in the notebook
xr.set_options(display_expand_data=False)

n = 50
time_index = pd.date_range("20151020", periods=n, freq="60d")
time_coord = {"time": time_index}
x1 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor1")
x2 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor2")
y = xr.DataArray(np.random.randn(n), coords=time_coord, name="target")
print(x1)
<xarray.DataArray 'precursor1' (time: 50)> Size: 400B
1.41 1.024 0.1293 -0.4608 -1.343 -1.688 ... -0.9787 0.1206 0.5908 0.7536 -0.2013
Coordinates:
  * time     (time) datetime64[ns] 400B 2015-10-20 2015-12-19 ... 2023-11-07

Next we need a calendar, which we use to resample the dummy data:

[2]:
calendar = lilio.daily_calendar(anchor="10-15", length="180d")
calendar.map_to_data(x1)
x1 = lilio.resample(calendar, x1)
x2 = lilio.resample(calendar, x2)
y = lilio.resample(calendar, y)

print(x1)
<xarray.DataArray 'precursor1' (anchor_year: 7, i_interval: 2)> Size: 112B
-1.333 -0.1537 -0.6088 1.236 0.5243 ... 0.2404 1.118 -0.2905 0.5384 0.05353
Coordinates:
  * anchor_year  (anchor_year) int64 56B 2016 2017 2018 2019 2020 2021 2022
  * i_interval   (i_interval) int64 16B -1 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 112B 2016-04-18 ......
    right_bound  (anchor_year, i_interval) datetime64[ns] 112B 2016-10-15 ......
    is_target    (i_interval) bool 2B False True
Attributes:
    lilio_version:               0.5.0
    lilio_calendar_anchor_date:  10-15
    lilio_calendar_code:         Calendar(\n    anchor='10-15',\n    allow_ov...
    history:                     2024-06-13 11:57:42 UTC - Resampled with a L...
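
To see where the interval bounds and the seven anchor years in the output above come from, the following minimal sketch redoes the date arithmetic with the standard library only. It is not Lilio's implementation: the helper names (interval_bounds, is_covered) are hypothetical, and the coverage rule (both intervals must fall inside the time span of the data) is an assumption chosen to reproduce the result shown.

```python
from datetime import date, timedelta

# Values taken from the example above: anchor "10-15", interval length "180d".
ANCHOR_MONTH, ANCHOR_DAY = 10, 15
LENGTH = timedelta(days=180)

# First and last timestamps of the dummy data, as printed earlier.
DATA_START = date(2015, 10, 20)
DATA_END = date(2023, 11, 7)


def interval_bounds(anchor_year):
    """Return ((left, right), (left, right)) for the precursor and target intervals."""
    anchor = date(anchor_year, ANCHOR_MONTH, ANCHOR_DAY)
    precursor = (anchor - LENGTH, anchor)  # i_interval = -1
    target = (anchor, anchor + LENGTH)  # i_interval = 1
    return precursor, target


def is_covered(anchor_year):
    """Assumed rule: both intervals must lie within the data's time span."""
    precursor, target = interval_bounds(anchor_year)
    return precursor[0] >= DATA_START and target[1] <= DATA_END


# Check a generous range of candidate years and keep the covered ones.
covered_years = [year for year in range(2014, 2026) if is_covered(year)]
print("Precursor interval for 2016:", interval_bounds(2016)[0])
print("Covered anchor years:", covered_years)
```

Under these assumptions, the precursor interval for 2016 runs from 2016-04-18 to 2016-10-15, and the covered anchor years are 2016 through 2022, which agrees with the left_bound, right_bound, and anchor_year coordinates of the resampled data.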

Now we are ready to create train and test splits of our data. We set up a splitting strategy (here scikit-learn's KFold) and pass it to lilio.traintest.TrainTestSplit.

We can use the resulting cross-validator to split the precursor data x1 and x2, as well as the target data y:

[3]:
# Cross-validation
from sklearn.model_selection import KFold
import lilio.traintest

kfold = KFold(n_splits=3)
cv = lilio.traintest.TrainTestSplit(kfold)
for (x1_train, x2_train), (x1_test, x2_test), y_train, y_test in cv.split([x1, x2], y=y):
    print("Train:", x1_train.anchor_year.values)
    print("Test:", x1_test.anchor_year.values)

# After the loop, x1_train still holds the training data of the last fold
print(x1_train)
Train: [2019 2020 2021 2022]
Test: [2016 2017 2018]
Train: [2016 2017 2018 2021 2022]
Test: [2019 2020]
Train: [2016 2017 2018 2019 2020]
Test: [2021 2022]
<xarray.DataArray 'precursor1' (anchor_year: 5, i_interval: 2)> Size: 80B
-1.333 -0.1537 -0.6088 1.236 0.5243 -0.9893 0.6597 0.4788 0.5878 0.2404
Coordinates:
  * anchor_year  (anchor_year) int64 40B 2016 2017 2018 2019 2020
  * i_interval   (i_interval) int64 16B -1 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 80B 2016-04-18 ... ...
    right_bound  (anchor_year, i_interval) datetime64[ns] 80B 2016-10-15 ... ...
    is_target    (i_interval) bool 2B False True
Attributes:
    lilio_version:               0.5.0
    lilio_calendar_anchor_date:  10-15
    lilio_calendar_code:         Calendar(\n    anchor='10-15',\n    allow_ov...
    history:                     2024-06-13 11:57:42 UTC - Resampled with a L...
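
The fold sizes in the output above (three test years, then two, then two) follow from how an unshuffled K-fold split divides 7 anchor years into 3 consecutive groups: the first n_samples % n_splits folds get one extra sample. The sketch below reproduces that grouping in plain Python. It only illustrates the splitting rule; the function name kfold_test_groups is hypothetical and it is not how scikit-learn or Lilio implement the split.

```python
def kfold_test_groups(labels, n_splits):
    """Split labels into consecutive test groups, without shuffling."""
    base, extra = divmod(len(labels), n_splits)
    groups = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        groups.append(labels[start:start + size])
        start += size
    return groups


anchor_years = list(range(2016, 2023))  # the 7 anchor years from the example
for test_years in kfold_test_groups(anchor_years, n_splits=3):
    # Every anchor year that is not in the test group is used for training.
    train_years = [year for year in anchor_years if year not in test_years]
    print("Train:", train_years)
    print("Test:", test_years)
```

Because the split is made along anchor years, all intervals belonging to one anchor year end up together in either the training set or the test set.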

Now you are ready to train your models!