Data Wrangling
==============

A main goal of FRAME-FM is to simplify the pre-processing of large
spatio-temporal datasets so that users can focus on configuring and running
machine learning workflows.

Standard approach for setting up data workflows
-----------------------------------------------

The recommended approach for setting up data pre-processing and loading within the framework
is to write *recipes* that consist of a sequence of operations to be performed on the data.

All *datasets* are subclasses of ``torch.utils.data.Dataset``. They follow the standard
``torch`` pattern, so each one can be passed directly to a ``DataLoader`` object.
The common interface is:

- construction: ``__init__()``
- length: ``__len__()``
- get item by index: ``__getitem__(idx)``
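
As a rough illustration of this interface, the sketch below defines a hypothetical
``SquaresDataset`` (not part of FRAME-FM). It is written as a plain Python class so that
it runs without ``torch`` installed; a dataset in the framework would instead subclass
``torch.utils.data.Dataset`` so that it can be handed to a ``DataLoader``.

.. code-block:: python

    class SquaresDataset:
        """Hypothetical map-style dataset holding the squares of 0..n-1."""

        def __init__(self, n):
            # Construction: build (or load) the data once.
            self.data = [i * i for i in range(n)]

        def __len__(self):
            # Length: the number of samples available.
            return len(self.data)

        def __getitem__(self, idx):
            # Get item by index: return a single sample.
            return self.data[idx]


    dataset = SquaresDataset(5)
    # Iterate by index, which is how a map-style dataset is typically read.
    samples = [dataset[i] for i in range(len(dataset))]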

Modifying datasets using ``preprocessors`` and ``transforms``
-------------------------------------------------------------

Each ``Dataset`` class accepts two types of *operations*, passed as arguments
to the constructor:

- ``preprocessors``:

  - A list of operations that run when the ``Dataset`` instance is created.
  - These run once only.
  - They run sequentially, with the first taking in an ``xr.Dataset``.
  - The final object is saved in ``self.data``.
  - The resulting output should be ready for use by the standard methods:

    - ``def __len__(self):``
    - ``def __getitem__(self, idx):``

- ``transforms``:

  - A list of operations that run at training time, within the ``__getitem__(idx)``
    call.
  - These run whenever a ``DataLoader`` accesses single items or batches
    of items from a ``Dataset`` object.
  - These are typically run like this::

          for transform in transforms:
              sample = transform(sample)
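
The split between the two lists can be sketched as follows. This is a minimal
illustration under stated assumptions, not the FRAME-FM implementation: ``RecipeDataset``,
``drop_missing`` and ``scale`` are hypothetical names, and a plain Python list stands in
for the ``xr.Dataset`` that the first preprocessor would normally receive.

.. code-block:: python

    class RecipeDataset:
        """Hypothetical dataset: preprocessors run once, transforms run per item."""

        def __init__(self, raw, preprocessors=(), transforms=()):
            data = raw
            # Preprocessors run once, in order, when the instance is created.
            for preprocess in preprocessors:
                data = preprocess(data)
            # The final preprocessed object is stored in ``self.data``.
            self.data = data
            self.transforms = list(transforms)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            sample = self.data[idx]
            # Transforms run on every access, in order.
            for transform in self.transforms:
                sample = transform(sample)
            return sample


    def drop_missing(values):
        # Hypothetical preprocessor: remove missing entries from the whole dataset.
        return [v for v in values if v is not None]


    def scale(value):
        # Hypothetical transform: rescale a single sample.
        return value * 10.0


    dataset = RecipeDataset(
        [1.0, None, 2.0, 3.0],
        preprocessors=[drop_missing],
        transforms=[scale],
    )

Here ``drop_missing`` changes the stored data once at construction, while ``scale`` is
applied each time an item is requested and leaves ``self.data`` unchanged.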

Note that the ``FRAME_FM.transforms.transforms`` module contains all the transform
classes, and each can be used in either or both of the ``preprocessors`` and
``transforms`` lists.

For transform examples, see the unit tests in ``tests/transforms/test_transforms.py``.

For ``Dataset`` examples, see the unit tests in ``tests/datasets/test_*.py``.