Train-test splitting with Calendars
Because the Lilio calendar system was designed with machine learning in mind, it includes a train-test module that helps generate train/test splits.
Currently this feature only supports xarray data.
Let’s start by generating some dummy data:
[1]:
import numpy as np
import pandas as pd
import xarray as xr
import lilio
# Hide the full data when displaying a dataset in the notebook
xr.set_options(display_expand_data=False)
n = 50
time_index = pd.date_range("20151020", periods=n, freq="60d")
time_coord = {"time": time_index}
x1 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor1")
x2 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor2")
y = xr.DataArray(np.random.randn(n), coords=time_coord, name="target")
print(x1)
<xarray.DataArray 'precursor1' (time: 50)> Size: 400B
1.41 1.024 0.1293 -0.4608 -1.343 -1.688 ... -0.9787 0.1206 0.5908 0.7536 -0.2013
Coordinates:
* time (time) datetime64[ns] 400B 2015-10-20 2015-12-19 ... 2023-11-07
Next, we create a calendar and use it to resample the dummy data:
[2]:
calendar = lilio.daily_calendar(anchor="10-15", length="180d")
calendar.map_to_data(x1)
x1 = lilio.resample(calendar, x1)
x2 = lilio.resample(calendar, x2)
y = lilio.resample(calendar, y)
print(x1)
<xarray.DataArray 'precursor1' (anchor_year: 7, i_interval: 2)> Size: 112B
-1.333 -0.1537 -0.6088 1.236 0.5243 ... 0.2404 1.118 -0.2905 0.5384 0.05353
Coordinates:
* anchor_year (anchor_year) int64 56B 2016 2017 2018 2019 2020 2021 2022
* i_interval (i_interval) int64 16B -1 1
left_bound (anchor_year, i_interval) datetime64[ns] 112B 2016-04-18 ......
right_bound (anchor_year, i_interval) datetime64[ns] 112B 2016-10-15 ......
is_target (i_interval) bool 2B False True
Attributes:
lilio_version: 0.5.0
lilio_calendar_anchor_date: 10-15
lilio_calendar_code: Calendar(\n anchor='10-15',\n allow_ov...
history: 2024-06-13 11:57:42 UTC - Resampled with a L...
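The interval bounds and anchor years in the output above can be reasoned about with plain date arithmetic. The following is a minimal sketch using only the standard library, not Lilio's implementation. It assumes that interval -1 covers the 180 days before the anchor date, that interval 1 covers the 180 days after it, and that an anchor year is kept only when both intervals fall entirely inside the time span of the data. The names `interval_bounds` and `covered_years` are hypothetical and used only for illustration.

```python
from datetime import date, timedelta

LENGTH = timedelta(days=180)  # same length as passed to daily_calendar


def interval_bounds(anchor_year):
    """Return (left, right) bounds of intervals -1 and 1 for one anchor year."""
    anchor = date(anchor_year, 10, 15)      # anchor="10-15"
    precursor = (anchor - LENGTH, anchor)   # i_interval = -1
    target = (anchor, anchor + LENGTH)      # i_interval = 1
    return precursor, target


# Time span of the dummy data: 50 timestamps, 60 days apart
data_start = date(2015, 10, 20)
data_end = data_start + 49 * timedelta(days=60)

# Assumption: keep an anchor year only if both intervals lie within the data
covered_years = []
for year in range(2015, 2024):
    (left, _), (_, right) = interval_bounds(year)
    if data_start <= left and right <= data_end:
        covered_years.append(year)

print(interval_bounds(2016)[0])
print(covered_years)
```

Under these assumptions, the precursor interval of anchor year 2016 runs from 2016-04-18 to 2016-10-15, matching the first `left_bound` and `right_bound` values shown above, and the covered anchor years are 2016 through 2022.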
Now we are ready to create train and test splits of our data. We set up a splitting strategy (KFold) and pass it to lilio.traintest.TrainTestSplit.
We can use this cross-validator to split our datasets x1 and x2, as well as the target data y:
[3]:
# Cross-validation
from sklearn.model_selection import KFold
import lilio.traintest
kfold = KFold(n_splits=3)
cv = lilio.traintest.TrainTestSplit(kfold)
for (x1_train, x2_train), (x1_test, x2_test), y_train, y_test in cv.split([x1, x2], y=y):
    print("Train:", x1_train.anchor_year.values)
    print("Test:", x1_test.anchor_year.values)
# After the loop, x1_train holds the training data of the last fold
print(x1_train)
Train: [2019 2020 2021 2022]
Test: [2016 2017 2018]
Train: [2016 2017 2018 2021 2022]
Test: [2019 2020]
Train: [2016 2017 2018 2019 2020]
Test: [2021 2022]
<xarray.DataArray 'precursor1' (anchor_year: 5, i_interval: 2)> Size: 80B
-1.333 -0.1537 -0.6088 1.236 0.5243 -0.9893 0.6597 0.4788 0.5878 0.2404
Coordinates:
* anchor_year (anchor_year) int64 40B 2016 2017 2018 2019 2020
* i_interval (i_interval) int64 16B -1 1
left_bound (anchor_year, i_interval) datetime64[ns] 80B 2016-04-18 ... ...
right_bound (anchor_year, i_interval) datetime64[ns] 80B 2016-10-15 ... ...
is_target (i_interval) bool 2B False True
Attributes:
lilio_version: 0.5.0
lilio_calendar_anchor_date: 10-15
lilio_calendar_code: Calendar(\n anchor='10-15',\n allow_ov...
history: 2024-06-13 11:57:42 UTC - Resampled with a L...
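The fold membership printed above follows from how an unshuffled K-fold split partitions the anchor years into contiguous blocks. The sketch below reproduces that partitioning in pure Python to make the logic explicit. It is an illustration, not the code Lilio or scikit-learn runs, and the function name `contiguous_kfold` is hypothetical. It relies on the documented KFold behavior that the first `n_samples % n_splits` folds get one extra sample.

```python
def contiguous_kfold(items, n_splits):
    """Split items into n_splits contiguous test folds, without shuffling."""
    base, extra = divmod(len(items), n_splits)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds receive one additional item
        size = base + (1 if i < extra else 0)
        test = items[start:start + size]
        train = items[:start] + items[start + size:]
        folds.append((train, test))
        start += size
    return folds


anchor_years = [2016, 2017, 2018, 2019, 2020, 2021, 2022]
folds = contiguous_kfold(anchor_years, n_splits=3)
for train, test in folds:
    print("Train:", train)
    print("Test:", test)
```

With seven anchor years and three splits, the test folds have sizes 3, 2 and 2. Because the split is made along the anchor years, all intervals belonging to one anchor year end up on the same side of the split.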
Now you are ready to train your models!