Data Wrangling

One of the main focuses of FRAME-FM is to simplify the pre-processing of large spatio-temporal datasets so that users can focus on configuring and running Machine Learning workflows.

Standard approach for setting up data workflows

The recommended approach for setting up data pre-processing and loading within the framework is to write _recipes_ that consist of a sequence of operations to be performed on the data.

All _datasets_ are subclasses of torch.utils.data.Dataset. They follow the standard torch model: each one exposes a _dataset_ that a DataLoader object can use directly. The common interface is:

construction: __init__()
length: __len__()
get item by index: __getitem__(idx)
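The three-method interface can be sketched as follows. This is a minimal illustration, not FRAME-FM code: the class name `ToySeriesDataset` and its contents are hypothetical, and a plain Python class stands in for a torch.utils.data.Dataset subclass so that the sketch runs without any dependencies.

```python
class ToySeriesDataset:
    """Hypothetical dataset exposing the three-method interface.

    In FRAME-FM this would subclass torch.utils.data.Dataset; a plain
    class is used here only to keep the sketch dependency-free.
    """

    def __init__(self, values):
        # construction: store the data that the other methods read from
        self.data = list(values)

    def __len__(self):
        # length: number of samples available
        return len(self.data)

    def __getitem__(self, idx):
        # get item by index: return a single sample
        return self.data[idx]


dataset = ToySeriesDataset([0.5, 1.5, 2.5])
n_samples = len(dataset)
first_sample = dataset[0]
```

A DataLoader relies on the same two calls, `len(dataset)` and `dataset[idx]`, when it assembles batches.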

Modifying datasets using `preprocessors` and `transforms`

Each Dataset class accepts two types of _operations_, passed as arguments to the constructor:

preprocessors:
- A list of operations that run when the Dataset instance is created.
- These run once only.
- They run sequentially, with the first taking an xr.Dataset as input.
- The final object is saved in self.data
- The resulting output should be ready for use by the standard methods __len__(self) and __getitem__(self, idx).

transforms:
- A list of operations that run at training time, within the __getitem__(idx) call.
- These run whenever a DataLoader accesses single items or batches of items from a Dataset object.
- They are typically applied like this:
  for transform in transforms: sample = transform(sample)
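
The split between the two lists can be sketched as below. This is an illustrative assumption rather than the FRAME-FM implementation: `PipelineDataset`, `drop_missing` and `scale_by_ten` are hypothetical names, a plain list stands in for the xr.Dataset, and a plain class stands in for a torch.utils.data.Dataset subclass.

```python
class PipelineDataset:
    """Hypothetical dataset showing when each kind of operation runs."""

    def __init__(self, raw, preprocessors=(), transforms=()):
        # preprocessors: run once, in order, when the instance is created;
        # each operation receives the output of the previous one
        data = raw
        for op in preprocessors:
            data = op(data)
        self.data = data  # the final preprocessed object
        self.transforms = list(transforms)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # transforms: run on every access to a single sample
        sample = self.data[idx]
        for transform in self.transforms:
            sample = transform(sample)
        return sample


def drop_missing(values):
    # hypothetical preprocessor: operates on the whole collection
    return [v for v in values if v is not None]


def scale_by_ten(sample):
    # hypothetical transform: operates on one sample
    return sample * 10


dataset = PipelineDataset(
    [1, None, 2, 3],
    preprocessors=[drop_missing],
    transforms=[scale_by_ten],
)
```

In this sketch `dataset.data` holds the preprocessed values unchanged, and scaling is applied only to the sample returned by each `dataset[idx]` call, so expensive one-off work belongs in the preprocessors and cheap per-sample work in the transforms.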

Note that the FRAME_FM.transforms.transforms module contains all the transform classes. Any of them can be used in the preprocessors list, the transforms list, or both.

See the examples in the unit tests: tests/transforms/test_transforms.py

See the Dataset unit tests for examples: tests/datasets/test_*.py
