Data Wrangling
One of the main focuses of FRAME-FM is to simplify the pre-processing of large spatio-temporal datasets so that users can focus on configuring and running Machine Learning workflows.
Standard approach for setting up data workflows
The recommended approach for setting up data pre-processing and loading within the framework is to write _recipes_ that consist of a sequence of operations to be performed on the data.
All _datasets_ are sub-classes of torch.utils.data.Dataset. They all use the standard
torch model of exposing a _dataset_ that can be directly used by a DataLoader object.
The common interface is:
- construction: __init__()
- length: __len__()
- get item by index: __getitem__(idx)
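The three-method interface can be sketched in plain Python as follows. This is only an illustration: the class name `RangeDataset` and its contents are hypothetical, and the class does not subclass torch.utils.data.Dataset (as the real FRAME-FM datasets do) so that the sketch runs without torch installed.

```python
class RangeDataset:
    """Hypothetical stand-in for a dataset class.

    Assumption: a real FRAME-FM dataset would subclass
    torch.utils.data.Dataset; that base class is omitted here
    to keep the sketch dependency-free.
    """

    def __init__(self, n_samples):
        # construction: build or load the underlying data once
        self.data = [float(i) for i in range(n_samples)]

    def __len__(self):
        # length: number of samples available
        return len(self.data)

    def __getitem__(self, idx):
        # get item by index: return a single sample
        return self.data[idx]


dataset = RangeDataset(4)
size = len(dataset)   # calls __len__
sample = dataset[2]   # calls __getitem__(2)
```

Anything that needs to iterate over the data only relies on `len(dataset)` and `dataset[idx]`, which is why every dataset exposes the same three methods.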
Modifying datasets using preprocessors and transforms
Each Dataset class can have two types of _operations_ defined by arguments
sent to the constructor:
preprocessors: A list of operations that get run when the Dataset instance is created.
- These get run once only.
- They operate sequentially, with the first taking in an xr.Dataset.
- The final object is saved in self.data.
- The resulting output should be ready for use by the standard methods __len__(self) and __getitem__(self, idx).
transforms: A list of operations that get run at training time, within the __getitem__(idx) call.
- These are run whenever a DataLoader needs to access single items or batches of items with a Dataset object.
- These are typically run like this:
    for transform in transforms:
        sample = transform(sample)
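The difference between the two lists can be sketched as below. This is a minimal illustration under stated assumptions, not the FRAME-FM implementation: `SketchDataset`, `drop_missing` and `double` are hypothetical names, and a plain Python list stands in for the xr.Dataset that the first preprocessor would normally receive.

```python
class SketchDataset:
    """Hypothetical dataset showing when each kind of operation runs."""

    def __init__(self, raw, preprocessors=(), transforms=()):
        data = raw
        # Preprocessors run once, in order, at construction time;
        # each one receives the output of the previous one.
        for preprocess in preprocessors:
            data = preprocess(data)
        self.data = data  # final preprocessed object
        self.transforms = list(transforms)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        # Transforms run on every access, one sample at a time.
        for transform in self.transforms:
            sample = transform(sample)
        return sample


def drop_missing(values):
    # Hypothetical preprocessor: remove missing entries from the whole dataset.
    return [v for v in values if v is not None]


def double(sample):
    # Hypothetical transform: applied to a single sample.
    return sample * 2


dataset = SketchDataset(
    raw=[1.0, None, 3.0],
    preprocessors=[drop_missing],
    transforms=[double],
)
```

In this sketch the stored `dataset.data` holds only the preprocessed values; the transform output is recomputed on each `dataset[idx]` call and is never written back.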
Note that the FRAME_FM.transforms.transforms module contains all the transform
classes that can be used in either or both of the preprocessors and transforms lists.
See the examples in the unit tests: tests/transforms/test_transforms.py
See the Dataset unit tests for examples: tests/datasets/test_*.py