Data Wrangling

A main focus of FRAME-FM is simplifying the pre-processing of large spatio-temporal datasets so that users can concentrate on configuring and running machine learning workflows.

Standard approach for setting up data workflows

The recommended approach for setting up data pre-processing and loading within the framework is to write _recipes_ that consist of a sequence of operations to be performed on the data.

All _datasets_ are subclasses of torch.utils.data.Dataset. They follow the standard PyTorch model, in which a _dataset_ can be passed directly to a DataLoader object. The common interface is:

  • construction: __init__()

  • length: __len__()

  • get item by index: __getitem__(idx)
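
As a minimal sketch of this interface, the hypothetical class below implements the three methods over an in-memory list. The class name and the data are illustrative assumptions, not part of FRAME-FM. A real dataset would subclass torch.utils.data.Dataset; the base class is omitted here so that the sketch runs with the standard library only.

```python
class ListDataset:
    """Hypothetical map-style dataset exposing the three-method interface."""

    def __init__(self, values):
        # Construction: store the data that items will be drawn from.
        self.data = list(values)

    def __len__(self):
        # Length: the number of items available.
        return len(self.data)

    def __getitem__(self, idx):
        # Get item by index: return a single sample.
        return self.data[idx]


dataset = ListDataset([10.0, 20.0, 30.0])
n_items = len(dataset)
first_item = dataset[0]
```

A DataLoader relies on __len__() and __getitem__(idx) for map-style datasets, which is why these methods form the common interface.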

Modifying datasets using preprocessors and transforms

Each Dataset class accepts two types of _operations_, passed as arguments to the constructor:

  • preprocessors:
    • A list of operations that get run when the Dataset instance is created.

    • These get run once only.

    • They run sequentially: the first takes in an xr.Dataset, and each later one takes the output of the one before it.

    • The final output is stored in self.data.

    • The resulting output should be ready for use by the standard methods:
      • def __len__(self):

      • def __getitem__(self, idx):

  • transforms:
    • A list of operations that get run at training time, within the __getitem__(idx) call.

    • These are run whenever a DataLoader fetches single items or batches of items from a Dataset object.

    • These are typically run like this:

      for transform in transforms:
          sample = transform(sample)
      

Note that the FRAME_FM.transforms.transforms module contains all the transform classes. Any of them can be used in the preprocessors list, the transforms list, or both.
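
The split between the two lists can be sketched as follows. This is an illustration under stated assumptions, not the FRAME-FM implementation: SketchDataset, Scale, and AddOffset are hypothetical names, a plain list stands in for the xr.Dataset, and the torch.utils.data.Dataset base class is omitted so that the sketch has no dependencies.

```python
class Scale:
    """Hypothetical operation: multiply every value in a list by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, values):
        return [v * self.factor for v in values]


class AddOffset:
    """Hypothetical operation: add a constant to a single sample."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, sample):
        return sample + self.offset


class SketchDataset:
    """Hypothetical dataset showing when each list of operations runs."""

    def __init__(self, raw, preprocessors=(), transforms=()):
        data = raw
        # Preprocessors run once, in order, when the instance is created.
        for preprocessor in preprocessors:
            data = preprocessor(data)
        self.data = data
        self.transforms = list(transforms)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        # Transforms run on every item access.
        for transform in self.transforms:
            sample = transform(sample)
        return sample


dataset = SketchDataset(
    raw=[1.0, 2.0, 3.0],
    preprocessors=[Scale(10.0)],
    transforms=[AddOffset(0.5)],
)
stored = dataset.data
item = dataset[1]
```

In this sketch, dataset.data holds the scaled values as soon as the instance is constructed, while the offset is applied only when an item is fetched.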

For examples of transforms, see the unit tests: tests/transforms/test_transforms.py

For examples of datasets, see the Dataset unit tests: tests/datasets/test_*.py