src.FRAME_FM.utils.data_utils

Functions

safely_remove_dir(path)

Safely remove a directory and its contents if it exists.

get_main_vars(→ list)

Get the main variable names from an xarray Dataset, excluding coordinate variables.

get_xr_kwargs(→ dict)

Determine the appropriate xarray loading engine and any additional kwargs based on the URI format or file extension.

convert_subset_selectors_to_slices(→ dict)

Convert a dictionary of subset selectors with (low, high) tuples to a dictionary of slice objects.

handle_special_uri_case(→ str | pathlib.Path | list | tuple | io.BytesIO)

Handle special cases for certain URI formats and engines, such as loading refs for kerchunk.

load_data_from_uri(→ xarray.Dataset | xarray.DataArray)

Load data from a URI with optional subset selection.

unify_transforms(→ list)

Unify the list of transforms by combining user-specified transforms with the default (class) transforms.

create_zarr_name(→ str)

Create a Zarr file name based on the data URI.

create_cache_path(→ pathlib.Path)

Create cache path from URI.

hash_preprocessors(→ str)

Hash the list of preprocessors for use in cache keys.

cache_data_to_zarr(→ xarray.Dataset)

Cache data to Zarr format based on the provided preprocessors and cache directory.

write_zarr(→ pathlib.Path | str)

Write the dataset to Zarr with the specified chunking and return the output path.

Module Contents

src.FRAME_FM.utils.data_utils.safely_remove_dir(path: pathlib.Path | str)[source]

Safely remove a directory and its contents if it exists.

Parameters:

path (Path | str) – The path to the directory to be removed.

src.FRAME_FM.utils.data_utils.get_main_vars(dset: xarray.Dataset) list[source]

Get the main variable names from an xarray Dataset, excluding coordinate variables. Match only variables that have the maximum size (i.e., the main data variables) to avoid including ancillary variables that may be present in the dataset.

Parameters:

dset (xarray.Dataset) – The xarray Dataset from which to extract variable names.

Returns:

A list of variable names that are not coordinates.

Return type:

  • list
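The size-based filtering described above can be sketched in plain Python. Here the dataset is stood in for by a mapping of variable names to element counts (a simplification of iterating `dset.data_vars` and comparing `.size`, which the real function would do via xarray):

```python
def main_var_names(var_sizes: dict[str, int]) -> list[str]:
    """Return names of variables whose size equals the maximum size,
    mirroring how smaller ancillary variables are excluded."""
    if not var_sizes:
        return []
    max_size = max(var_sizes.values())
    return [name for name, size in var_sizes.items() if size == max_size]
```

For example, `main_var_names({"t2m": 1000, "u10": 1000, "quality_flag": 10})` keeps only the two full-size variables and drops the ancillary flag.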

src.FRAME_FM.utils.data_utils.get_xr_kwargs(uri: str | pathlib.Path | list | tuple) dict[source]

Determine the appropriate xarray loading engine and any additional kwargs based on the URI format or file extension.

Parameters:

uri (str | Path | list | tuple) – The URI of the data source, which can be a string, a Path object, or a list/tuple of URIs.

Returns:

A dictionary of kwargs to pass to xarray loading functions, including the ‘engine’ key.

Return type:

  • dict
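A hedged sketch of suffix-based engine dispatch. The suffix-to-engine mapping below is illustrative only (the actual engines and URI heuristics this function uses may differ); for a list/tuple of URIs, the first entry is inspected:

```python
from pathlib import Path


def guess_xr_kwargs(uri) -> dict:
    """Pick an xarray engine from the URI suffix (illustrative mapping)."""
    first = uri[0] if isinstance(uri, (list, tuple)) else uri
    suffix = Path(str(first)).suffix.lower()
    engine_by_suffix = {
        ".nc": "netcdf4",     # assumption: netCDF files via the netcdf4 engine
        ".zarr": "zarr",
        ".json": "kerchunk",  # assumption: kerchunk reference files
        ".grib": "cfgrib",
    }
    return {"engine": engine_by_suffix.get(suffix, "netcdf4")}
```

Returning a kwargs dict (rather than just an engine string) leaves room to attach engine-specific options alongside the `engine` key, as the docstring describes.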

src.FRAME_FM.utils.data_utils.convert_subset_selectors_to_slices(selector: dict) dict[source]

Convert a dictionary of subset selectors with (low, high) tuples to a dictionary of slice objects.

Parameters:

selector (dict) – A dictionary where keys are dimension names and values are tuples of (low, high) bounds.

Returns:

A new dictionary where the values are slice objects created from the (low, high) tuples.

Return type:

  • dict
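A minimal sketch of this conversion (the real function may additionally validate bounds or pass through values that are already slices):

```python
def convert_subset_selectors_to_slices(selector: dict) -> dict:
    """Turn {dim: (low, high)} bounds into {dim: slice(low, high)}."""
    return {dim: slice(low, high) for dim, (low, high) in selector.items()}
```

For example, `{"time": (0, 10), "lat": (-60, 60)}` becomes a dict of slices suitable for passing to xarray selection methods.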

src.FRAME_FM.utils.data_utils.handle_special_uri_case(uri: str | pathlib.Path | list | tuple, engine: str) str | pathlib.Path | list | tuple | io.BytesIO[source]

Handle special cases for certain URI formats and engines, such as loading refs for kerchunk.

Parameters:

  • uri (str | Path | list | tuple) – The original URI of the data source.

  • engine (str) – The engine determined for loading the data.

Returns:

The modified URI if special handling was applied, otherwise the original URI.

Return type:

str | Path | list | tuple | io.BytesIO

src.FRAME_FM.utils.data_utils.load_data_from_uri(uri: str | pathlib.Path | list | tuple, chunks: dict | None = None, subset_selection: dict | None = None, **kwargs) xarray.Dataset | xarray.DataArray[source]

Load data from a URI with optional subset selection.

Parameters:

  • uri (str | Path | list | tuple) – The URI of the data source, a glob pattern, or a list of URIs.

  • chunks (dict | None) – Optional dictionary specifying the chunking strategy for Dask.

  • subset_selection (dict | None) – A dictionary specifying the subset selection criteria.

  • **kwargs – Additional keyword arguments to pass to the xarray loading function.

Returns:

The loaded dataset with applied subset selection.

Return type:

xr.Dataset | xr.DataArray

src.FRAME_FM.utils.data_utils.unify_transforms(transforms: list | None, class_transforms: list, override_transforms: bool) list[source]

Unify the list of transforms by combining user-specified transforms with the default (class) transforms. If override_transforms is True, only the user-specified transforms will be used. If False, the user-specified transforms will be combined with the default transforms, ensuring that there are no duplicates based on the “type” key of each transform.
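The merge rule described above can be sketched directly, assuming each transform is a dict carrying a "type" key and that a user-specified transform wins on a type clash:

```python
def unify_transforms(transforms, class_transforms, override_transforms):
    """Combine user transforms with class defaults, deduplicating on 'type'."""
    transforms = list(transforms or [])
    if override_transforms:
        return transforms  # user transforms replace the defaults entirely
    seen = {t["type"] for t in transforms}
    merged = list(transforms)
    for t in class_transforms:
        if t["type"] not in seen:  # keep the user's version on a type clash
            merged.append(t)
    return merged
```

So a user-supplied `{"type": "normalize", ...}` shadows the class default of the same type, while defaults of other types are appended after the user's list.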

src.FRAME_FM.utils.data_utils.create_zarr_name(data_uri: str) str[source]

Create a Zarr file name based on the data URI.

Parameters:

data_uri (str) – The URI of the data source.

Returns:

A string representing the Zarr file name.

Return type:

str
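One plausible way to derive a stable store name from a URI is to combine its stem with a short content hash; this is a hedged sketch, not necessarily the naming scheme the function actually uses:

```python
import hashlib
from pathlib import Path


def create_zarr_name(data_uri: str) -> str:
    """Build a deterministic `.zarr` name from the URI (illustrative scheme)."""
    digest = hashlib.sha256(data_uri.encode()).hexdigest()[:8]
    stem = Path(data_uri).stem or "data"
    return f"{stem}_{digest}.zarr"
```

Hashing the full URI keeps the name deterministic across runs while distinguishing sources that share a file stem.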

src.FRAME_FM.utils.data_utils.create_cache_path(data_uri: str, cache_dir: pathlib.Path | str) pathlib.Path[source]

Create the cache path for a data URI inside the given cache directory.

src.FRAME_FM.utils.data_utils.hash_preprocessors(preprocessors: list | None) str[source]

Hash the list of preprocessors for use in cache keys.

src.FRAME_FM.utils.data_utils.cache_data_to_zarr(dataset: xarray.Dataset, preprocessors: list | None, chunks: dict | None, cache_path: str | pathlib.Path, generate_stats: bool = True) xarray.Dataset[source]

Cache data to Zarr format based on the provided preprocessors and cache directory.

Parameters:

  • dataset (xr.Dataset) – The xarray Dataset to be cached.

  • preprocessors (list | None) – A list of preprocessors (used for generating a hash only).

  • chunks (dict | None) – A dictionary specifying the chunking strategy for Dask.

  • cache_path (str | Path) – The path where the cached Zarr store will be written.

  • generate_stats (bool) – Whether to generate statistics during caching.

Returns:

The cached dataset loaded from the Zarr file.

Return type:

xr.Dataset

src.FRAME_FM.utils.data_utils.write_zarr(ds: xarray.Dataset, output_path: pathlib.Path | str, chunks: dict[str, int] | None = None) pathlib.Path | str[source]

Write the dataset to Zarr with the specified chunking and return the output path.