preprocessing module

TSStandardScaler class

class tscluster.preprocessing.TSStandardScaler(per_time: bool = True, **kwargs)

Uses sklearn’s StandardScaler to scale time series data.

Parameters:
per_time: bool, default=True

If True, compute the z-score per time step. If False, compute the z-score per feature across all time steps.

**kwargs: keyword arguments to be passed to sklearn’s StandardScaler.
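The difference between the two per_time settings can be sketched in plain NumPy. This is only an illustration of the arithmetic, assuming that "per time step" means statistics are taken over the entities of each time step, and "across all time steps" means statistics are pooled over time steps and entities. The helper zscore_tnf is hypothetical and is not part of tscluster; the library itself delegates to sklearn’s StandardScaler.

```python
import numpy as np

def zscore_tnf(X, per_time=True):
    """Z-score a (T, N, F) array. Hypothetical helper, not part of tscluster."""
    # per_time=True: mean/std over entities (axis 1), separately for each time step.
    # per_time=False: mean/std pooled over time steps and entities (axes 0 and 1).
    axes = 1 if per_time else (0, 1)
    mean = X.mean(axis=axes, keepdims=True)
    std = X.std(axis=axes, keepdims=True)  # population std (ddof=0), as in StandardScaler
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(4, 10, 3))  # T=4, N=10, F=3

Z_time = zscore_tnf(X, per_time=True)   # each (time step, feature) slice is standardized
Z_all = zscore_tnf(X, per_time=False)   # each feature is standardized over all time steps
```

With per_time=True every time step ends up with zero mean and unit variance per feature, which removes any trend in the level of a feature over time. With per_time=False such trends are preserved.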

Methods

fit(X, **kwargs)

Fit method of transformer.

fit_transform(X, **kwargs)

fit and transform the data

inverse_transform(X, **kwargs)

inverse transform method for transformer.

transform(X, **kwargs)

transform method for transformer.

fit(X: ndarray[Any, dtype[float64]], **kwargs) → TSScaler

Fit method of transformer.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments to be passed to fit method.
Returns:
self

the fitted transformer object

fit_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

Fit and transform the data.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments.
Returns:
numpy array

the transformed data in TNF format

inverse_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

Inverse transform method for transformer.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments.
Returns:
numpy array

the inverse-transform of the data in TNF format

transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

Transform method for transformer.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments.
Returns:
numpy array

the transformed data in TNF format

TSMinMaxScaler class

class tscluster.preprocessing.TSMinMaxScaler(per_time: bool = True, **kwargs)

Uses sklearn’s MinMaxScaler to scale time series data.

Parameters:
per_time: bool, default=True

If True, apply min-max scaling per time step. If False, apply min-max scaling per feature across all time steps.

**kwargs: keyword arguments to be passed to sklearn’s MinMaxScaler.
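As a rough illustration of what min-max scaling and its inverse do on a TNF array, the following NumPy sketch scales to the range [0, 1], which is the default feature_range of sklearn’s MinMaxScaler. The three helper functions are hypothetical and not part of tscluster, and the axis choice for per_time is an assumption about how "per time step" is interpreted.

```python
import numpy as np

def minmax_fit(X, per_time=True):
    """Return (min, max) of a (T, N, F) array. Hypothetical helper."""
    # per_time=True: extremes over entities within each time step.
    # per_time=False: extremes pooled over all time steps and entities.
    axes = 1 if per_time else (0, 1)
    lo = X.min(axis=axes, keepdims=True)
    hi = X.max(axis=axes, keepdims=True)
    return lo, hi

def minmax_transform(X, lo, hi):
    # Map [lo, hi] onto [0, 1].
    return (X - lo) / (hi - lo)

def minmax_inverse(X_scaled, lo, hi):
    # Undo minmax_transform using the same fitted extremes.
    return X_scaled * (hi - lo) + lo

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 7.0, size=(3, 8, 2))  # T=3, N=8, F=2

lo, hi = minmax_fit(X, per_time=True)
X_scaled = minmax_transform(X, lo, hi)
X_back = minmax_inverse(X_scaled, lo, hi)
```

The round trip through minmax_inverse mirrors the role of inverse_transform: it recovers the original values only when the same fitted minima and maxima are reused.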

Methods

fit(X, **kwargs)

Fit method of transformer.

fit_transform(X, **kwargs)

fit and transform the data

inverse_transform(X, **kwargs)

inverse transform method for transformer.

transform(X, **kwargs)

transform method for transformer.

fit(X: ndarray[Any, dtype[float64]], **kwargs) → TSScaler

Fit method of transformer.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments to be passed to fit method.
Returns:
self

the fitted transformer object

fit_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

Fit and transform the data.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments.
Returns:
numpy array

the transformed data in TNF format

inverse_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

Inverse transform method for transformer.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments.
Returns:
numpy array

the inverse-transform of the data in TNF format

transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

Transform method for transformer.

Parameters:
X: ndarray

Input time series data. Should be a 3-dimensional array in TNF format.

**kwargs: keyword arguments.
Returns:
numpy array

the transformed data in TNF format

utils module

tscluster.preprocessing.utils.broadcast_data(T: int, *, cluster_centers: npt.NDArray[np.float64] | None = None, labels: npt.NDArray[np.int] | None = None) → Tuple[npt.NDArray[np.float64], npt.NDArray[np.int64]]

function to make cluster_centers of shape T x N x F and labels of shape N x F

tscluster.preprocessing.utils.load_data(X: npt.NDArray[np.float64] | str | List, *, arr_format: str = 'TNF', suffix_sep: str = '_', file_reader: str = 'infer', read_file_args: dict | None = None, use_suffix_as_label: bool = False, output_arr_format: str = 'TNF') → Tuple[np.float64, dict]

function to load data

Parameters:
X: numpy array, string or list

Input time series data. If an ndarray, it should be a 3-dimensional array; use arr_format to specify its format. If a str naming a file, numpy is used to load the file. If a str naming a directory, all the files in the directory are loaded in ascending order of the suffixes of their filenames. Use the suffix_sep keyword argument to indicate the suffix separator (default “_”), so file_0.csv is read before file_1.csv, and so on. Supported files in the directory are any files that can be read using np.load, pd.read_csv, pd.read_json, or pd.read_excel. If a list, it is assumed to be a list of files or file paths. If files, each element should be a numpy array or pandas DataFrame holding the data for one time step. If file paths, the data is read in list order using any of np.load, pd.read_csv, pd.read_json, and pd.read_excel.

arr_format: str, default=’TNF’

Format of the input data. ‘TNF’ means the data dimension is Time x Number of observations x Features. ‘NTF’ means the data dimension is Number of observations x Time x Features.

suffix_sep: str, default=’_’

Separator between the filename and its suffix. The suffixes should be numbers, but they need not start from 1 or be spaced at intervals of 1. As long as the suffixes can be sorted and the suffix separator is consistent, the directory can be parsed by the load_data function.

file_reader: str, default=’infer’

File loader to use. Can be any of np.load, pd.read_csv, pd.read_json, and pd.read_excel. If ‘infer’, the file type is inferred from the file name and the appropriate loader is used.

read_file_args: dict, default=None

Parameters to be passed to the data loader. Keys should be parameter names (as str); values should be the corresponding parameter values.

use_suffix_as_label: bool, default=False

If True, use the suffixes of the file names as labels in label_dict. If arr_format = ‘TNF’, the suffixes are used as labels of time steps; if arr_format = ‘NTF’, they are used as labels of entities. If False, a linear range over the number of files is used as labels, i.e. range(n_files), where n_files is the number of files.

output_arr_format: str, default=’TNF’

The format of the output array. Can be any of {‘TNF’, ‘NTF’}.

Returns:
np.array

a numpy array of the data in ‘TNF’ or ‘NTF’ format (depending on the value of output_arr_format)

dict

a dictionary whose keys are ‘T’, ‘N’, and ‘F’, and whose values are lists of the labels of each key.
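The directory-loading behaviour described for X and suffix_sep can be pictured with a small sketch. It assumes the suffixes are compared as integers, so that file_10.csv sorts after file_2.csv; whether load_data sorts numerically in exactly this way is an assumption. The helper suffix_of and the in-memory arrays standing in for file contents are hypothetical and not part of tscluster.

```python
from pathlib import Path
import numpy as np

def suffix_of(filename, suffix_sep="_"):
    """Return the integer suffix of a name such as 'file_10.csv'. Hypothetical helper."""
    # Drop the extension, then take whatever follows the last separator.
    return int(Path(filename).stem.rsplit(suffix_sep, 1)[1])

filenames = ["file_10.csv", "file_2.csv", "file_0.csv"]
ordered = sorted(filenames, key=suffix_of)  # ascending order of numeric suffix

# Stand-ins for the N x F table that would be read from each file (N=5, F=2).
tables = {name: np.full((5, 2), float(suffix_of(name))) for name in filenames}

# Stack one table per time step into a T x N x F array.
X = np.stack([tables[name] for name in ordered], axis=0)

# Analogue of use_suffix_as_label=True: the suffixes become the time-step labels.
labels_T = [suffix_of(name) for name in ordered]
```

Note that the suffixes here neither start at 1 nor have a constant interval, which matches the relaxed requirement stated for suffix_sep.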

tscluster.preprocessing.utils.ntf_to_tnf(X: ndarray[Any, dtype[float64]]) → ndarray[Any, dtype[float64]]

Utility function to convert an array from Number of observations x Time x Features format to Time x Number of observations x Features format.

tscluster.preprocessing.utils.tnf_to_ntf(X: ndarray[Any, dtype[float64]]) → ndarray[Any, dtype[float64]]

Utility function to convert an array from Time x Number of observations x Features format to Number of observations x Time x Features format.
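Both conversions amount to swapping the first two axes of a 3-dimensional array. The NumPy sketch below shows the equivalent operation; it illustrates the layout change only and is not a claim about how the library implements these functions.

```python
import numpy as np

# An NTF array: N=2 observations, T=3 time steps, F=4 features.
X_ntf = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)

# NTF -> TNF: swap the observation and time axes, keep the feature axis last.
X_tnf = np.transpose(X_ntf, (1, 0, 2))

# TNF -> NTF is the same swap applied again.
X_roundtrip = np.transpose(X_tnf, (1, 0, 2))
```

After the swap, X_tnf[t, n, f] holds the same value as X_ntf[n, t, f].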

tscluster.preprocessing.utils.to_dfs(X: npt.NDArray[np.float64], label_dict: dict | None = None, arr_format: str = 'TNF', output_df_format: str = 'TNF')

Function to convert from (numpy, label_dict) to dataframes

Parameters:
X: numpy array

The array to be converted to list of dataframes. Should be a 3D array.

label_dict: dict, default=None

a dictionary whose keys are ‘T’, ‘N’, and ‘F’ (which stand for time steps, entities, and features respectively). The value of each key is a list such that the value of key:

- ‘T’ is a list of names/labels of each time step to be used as index of each dataframe. If None, range(0, T) is used, where T is the number of time steps in the fitted data.
- ‘N’ is a list of names/labels of each entity to be used as index of the dataframe. If None, range(0, N) is used, where N is the number of entities/observations in the fitted data.
- ‘F’ is a list of names/labels of each feature to be used as column of each dataframe. If None, range(0, F) is used, where F is the number of features in the fitted data.

If label_dict is None, a linear range of the dimensions of the array is used.

arr_format: str, default=’TNF’

Format of the array. ‘TNF’ means the data dimension is Time x Number of observations x Features. ‘NTF’ means the data dimension is Number of observations x Time x Features.

output_df_format: str, default=’TNF’

The format of the output dataframes. Can be any of {‘TNF’, ‘NTF’}. If ‘TNF’, output is a list of T dataframes each of shape (N, F). If ‘NTF’, output is a list of N dataframes each of shape (T, F).

Returns:
list[pd.DataFrame]

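A list of pandas DataFrames. If output_df_format is ‘TNF’, the list contains T dataframes, where T is the number of time steps, and the t-th dataframe is an N x F dataframe of the values of all entities at the t-th time step. If output_df_format is ‘NTF’, the list contains N dataframes, each of shape (T, F).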
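For the ‘TNF’ output case, the conversion can be sketched with pandas as follows. The helper array_to_dfs is hypothetical and not part of tscluster. It covers only a TNF input with TNF output, uses the ‘N’ labels as the row index and the ‘F’ labels as the columns, and ignores the ‘T’ labels. All of these are simplifying assumptions.

```python
import numpy as np
import pandas as pd

def array_to_dfs(X, label_dict=None):
    """Split a (T, N, F) array into T DataFrames of shape (N, F). Hypothetical helper."""
    T, N, F = X.shape
    label_dict = label_dict or {}
    # Fall back to a linear range when a label list is missing or None.
    n_labels = label_dict.get("N") or list(range(N))
    f_labels = label_dict.get("F") or list(range(F))
    return [pd.DataFrame(X[t], index=n_labels, columns=f_labels) for t in range(T)]

X = np.arange(2 * 3 * 2, dtype=np.float64).reshape(2, 3, 2)  # T=2, N=3, F=2
label_dict = {"T": ["2020", "2021"], "N": ["a", "b", "c"], "F": ["x", "y"]}

dfs = array_to_dfs(X, label_dict)   # one dataframe per time step
dfs_default = array_to_dfs(X)       # integer ranges used as labels
```

Here dfs[1].loc["b", "y"] reads the value of feature "y" for entity "b" at the second time step, which is X[1, 1, 1].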