preprocessing module

TSStandardScaler class

class tscluster.preprocessing.TSStandardScaler(per_time: bool = True, **kwargs)

Uses sklearn’s StandardScaler to scale time series data.

Parameters:

per_time: bool, default=True: If True, compute the z-score per time step. If False, compute the z-score per feature across all time steps.
**kwargs: keyword arguments to be passed to sklearn’s StandardScaler.
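
The effect of per_time can be illustrated with plain NumPy. The sketch below is not the library's implementation. It assumes that per_time=True standardizes each (time step, feature) pair across the N observations, that per_time=False pools all time steps and observations for each feature, and that the population standard deviation is used, as in sklearn’s StandardScaler. The helper name zscore_tnf is hypothetical.

```python
import numpy as np

def zscore_tnf(X, per_time=True):
    """Z-score a TNF array (hypothetical helper, not part of tscluster)."""
    # per_time=True: statistics over the N axis only, one mean/std per (t, f).
    # per_time=False: statistics over T and N jointly, one mean/std per feature.
    axes = (1,) if per_time else (0, 1)
    mean = X.mean(axis=axes, keepdims=True)
    std = X.std(axis=axes, keepdims=True)  # ddof=0, i.e. population std
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(4, 10, 3))  # T=4, N=10, F=3

Z_per_time = zscore_tnf(X, per_time=True)
Z_pooled = zscore_tnf(X, per_time=False)
```

Under these assumptions, Z_per_time has zero mean and unit variance within every time step, while Z_pooled only has them over the whole series, so level differences between time steps survive the scaling.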

Methods

`fit`(X, **kwargs)	Fit method of transformer.
`fit_transform`(X, **kwargs)	fit and transform the data
`inverse_transform`(X, **kwargs)	inverse transform method for transformer.
`transform`(X, **kwargs)	transform method for transformer.

fit(X: ndarray[Any, dtype[float64]], **kwargs) → TSScaler

Fit method of transformer.

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments to be passed to the fit method.

Returns:

self: the fitted transformer object

fit_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

fit and transform the data

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments.

Returns:

numpy array: the transformed data in TNF format

inverse_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

inverse transform method for transformer.

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments.

Returns:

numpy array: the inverse-transform of the data in TNF format
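
The relationship between transform and inverse_transform can be sketched with NumPy alone. This is an illustration rather than the library's code: it assumes per-time-step statistics (the per_time=True case) and mimics the fit, transform, and inverse_transform steps with local variables.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 5, 2))  # T=3, N=5, F=2

# "fit": store one mean and std per (time step, feature) pair.
mean = X.mean(axis=1, keepdims=True)
std = X.std(axis=1, keepdims=True)

Z = (X - mean) / std      # "transform"
X_back = Z * std + mean   # "inverse_transform" recovers the original data
```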

transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

transform method for transformer.

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments.

Returns:

numpy array: the transformed data in TNF format

TSMinMaxScaler class

class tscluster.preprocessing.TSMinMaxScaler(per_time: bool = True, **kwargs)

Uses sklearn’s MinMaxScaler to scale time series data.

Parameters:

per_time: bool, default=True: If True, compute the min-max scaling per time step. If False, compute it per feature across all time steps.
**kwargs: keyword arguments to be passed to sklearn’s MinMaxScaler.
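
A NumPy sketch of the two per_time modes for min-max scaling follows. It is not the library's implementation. It assumes the default [0, 1] feature range of sklearn’s MinMaxScaler, minima and maxima taken over the N axis when per_time=True, and over both T and N when per_time=False. The helper name minmax_tnf is hypothetical.

```python
import numpy as np

def minmax_tnf(X, per_time=True):
    """Scale a TNF array to [0, 1] (hypothetical helper, not part of tscluster)."""
    # per_time=True: one min/max per (t, f); per_time=False: one min/max per feature.
    axes = (1,) if per_time else (0, 1)
    lo = X.min(axis=axes, keepdims=True)
    hi = X.max(axis=axes, keepdims=True)
    return (X - lo) / (hi - lo)

X = np.arange(24, dtype=float).reshape(2, 4, 3)  # T=2, N=4, F=3

S_per_time = minmax_tnf(X, per_time=True)
S_pooled = minmax_tnf(X, per_time=False)
```

In this sketch every time step of S_per_time spans the full [0, 1] range, whereas S_pooled reaches 0 and 1 only once per feature over the whole series, which keeps the upward trend across time steps visible.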

Methods

`fit`(X, **kwargs)	Fit method of transformer.
`fit_transform`(X, **kwargs)	fit and transform the data
`inverse_transform`(X, **kwargs)	inverse transform method for transformer.
`transform`(X, **kwargs)	transform method for transformer.

fit(X: ndarray[Any, dtype[float64]], **kwargs) → TSScaler

Fit method of transformer.

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments to be passed to the fit method.

Returns:

self: the fitted transformer object

fit_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

fit and transform the data

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments.

Returns:

numpy array: the transformed data in TNF format

inverse_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

inverse transform method for transformer.

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments.

Returns:

numpy array: the inverse-transform of the data in TNF format

transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]

transform method for transformer.

Parameters:

X: ndarray: Input time series data. Should be a 3-dimensional array in TNF format.
**kwargs: keyword arguments.

Returns:

numpy array: the transformed data in TNF format

utils module

tscluster.preprocessing.utils.broadcast_data(T: int, *, cluster_centers: npt.NDArray[np.float64] | None = None, labels: npt.NDArray[np.int] | None = None) → Tuple[npt.NDArray[np.float64], npt.NDArray[np.int64]]: function to make cluster_centers of shape T x N x F and labels of shape N x F

tscluster.preprocessing.utils.load_data(X: npt.NDArray[np.float64] | str | List, *, arr_format: str = 'TNF', suffix_sep: str = '_', file_reader: str = 'infer', read_file_args: dict | None = None, use_suffix_as_label: bool = False, output_arr_format: str = 'TNF') → Tuple[np.float64, dict]

Function to load time series data from an array, a file, a directory, or a list.

Parameters:

X: numpy array, string or list: Input time series data.
- If ndarray, should be a 3-dimensional array; use arr_format to specify its format.
- If str and a file name, the file is loaded with numpy.
- If str and a directory name, all the files in the directory are loaded in ascending order of the suffix of the filenames. Use suffix_sep as a keyword argument to indicate the suffix separator. Default is “_”, so file_0.csv is read before file_1.csv and so on. Supported files in the directory are any files that can be read using np.load, pd.read_csv, pd.read_json, or pd.read_excel.
- If list, the list is assumed to hold either data objects or filepaths. If it holds data objects, each should be a numpy array or pandas DataFrame of data for the different time steps. If it holds filepaths, data is read in the order of the list using np.load, pd.read_csv, pd.read_json, or pd.read_excel.
arr_format: str, default=‘TNF’: format of the input data. ‘TNF’ means the data dimension is Time x Number of observations x Features. ‘NTF’ means the data dimension is Number of observations x Time x Features.
suffix_sep: str, default=‘_’: separator between the filename and its suffix. The suffixes should be numbers; they need not start from 1 or have an interval of 1. As long as the suffixes can be sorted and the suffix separator is consistent, the directory can be parsed by load_data.
file_reader: str, default=‘infer’: file loader to use. Can be any of np.load, pd.read_csv, pd.read_json, and pd.read_excel. If ‘infer’, the file type is inferred from the file name and the appropriate loader is used.
read_file_args: dict, default=None: parameters to be passed to the data loader. Keys should be parameter names (as str); values should be the corresponding parameter values.
use_suffix_as_label: bool, default=False: If True, use the suffixes of the file names as labels in label_dict. If arr_format = ‘TNF’, the suffixes are used as labels of time steps; if arr_format = ‘NTF’, they are used as labels of entities. If False, a linear range over the number of files is used as labels, i.e. range(n_files), where n_files is the number of files.
output_arr_format: str, default=‘TNF’: The format of the output array. Can be any of {‘TNF’, ‘NTF’}.

Returns:

np.array: a numpy array of the data in ‘TNF’ or ‘NTF’ format (depending on the value of output_arr_format)
dict: a dictionary whose keys are ‘T’, ‘N’, and ‘F’, and whose values are lists of the labels of each key.
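
The suffix-based ordering described for directories can be sketched with the standard library. The helper below is hypothetical and not part of tscluster; it assumes the suffix is the text after the last suffix_sep in the file stem and that suffixes are compared as integers rather than as strings.

```python
from pathlib import Path

def sort_by_suffix(filenames, suffix_sep="_"):
    """Order file names by numeric suffix (hypothetical helper, not part of tscluster)."""
    def suffix(name):
        stem = Path(name).stem                      # "file_10.csv" -> "file_10"
        return int(stem.rsplit(suffix_sep, 1)[1])   # "file_10" -> 10
    return sorted(filenames, key=suffix)

names = ["file_10.csv", "file_2.csv", "file_0.csv"]
ordered = sort_by_suffix(names)

# With use_suffix_as_label=True, labels along these lines would be expected.
labels = [int(Path(n).stem.rsplit("_", 1)[1]) for n in ordered]
```

Comparing the suffixes as integers places file_2.csv before file_10.csv, which a plain string sort would not, and the suffixes need not be contiguous.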

tscluster.preprocessing.utils.ntf_to_tnf(X: ndarray[Any, dtype[float64]]) → ndarray[Any, dtype[float64]]: Utility function to convert an array from Number of observations x Time x Features format to Time x Number of observations x Features format.

tscluster.preprocessing.utils.tnf_to_ntf(X: ndarray[Any, dtype[float64]]) → ndarray[Any, dtype[float64]]: Utility function to convert an array from Time x Number of observations x Features format to Number of observations x Time x Features format.
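
Both conversions amount to swapping the first two axes of the array. A minimal NumPy sketch of that operation, using np.transpose directly rather than the library functions, looks like this:

```python
import numpy as np

X_ntf = np.arange(24, dtype=float).reshape(2, 3, 4)  # N=2, T=3, F=4

# NTF -> TNF: swap the observation and time axes, keep the feature axis last.
X_tnf = np.transpose(X_ntf, (1, 0, 2))

# TNF -> NTF is the same swap, so applying it twice restores the input.
X_back = np.transpose(X_tnf, (1, 0, 2))
```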

tscluster.preprocessing.utils.to_dfs(X: npt.NDArray[np.float64], label_dict: dict | None = None, arr_format: str = 'TNF', output_df_format: str = 'TNF')

Function to convert a numpy array and its label_dict to a list of dataframes.

Parameters:

X: numpy array: The array to be converted to a list of dataframes. Should be a 3D array.
label_dict: dict, default=None: a dictionary whose keys are ‘T’, ‘N’, and ‘F’ (the time step, entity, and feature axes respectively). The value of each key is a list such that:
- ‘T’ is a list of names/labels of each time step, to be used as the index of each dataframe. If None, range(0, T) is used, where T is the number of time steps in the data.
- ‘N’ is a list of names/labels of each entity, to be used as the index of the dataframe. If None, range(0, N) is used, where N is the number of entities/observations in the data.
- ‘F’ is a list of names/labels of each feature, to be used as the columns of each dataframe. If None, range(0, F) is used, where F is the number of features in the data.
If label_dict is None, a linear range over each dimension of the array is used.
arr_format: str, default=‘TNF’: format of the array. ‘TNF’ means the data dimension is Time x Number of observations x Features. ‘NTF’ means the data dimension is Number of observations x Time x Features.
output_df_format: str, default=‘TNF’: The format of the output dataframes. Can be any of {‘TNF’, ‘NTF’}. If ‘TNF’, output is a list of T dataframes each of shape (N, F). If ‘NTF’, output is a list of N dataframes each of shape (T, F).

Returns:

list[pd.DataFrame]: A list of pandas DataFrames. If output_df_format is ‘TNF’, the list has T dataframes, where T is the number of time steps, and the t-th dataframe is an N x F dataframe of the values of all entities at the t-th time step. If output_df_format is ‘NTF’, the list has N dataframes, each of shape (T, F).
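
The ‘TNF’ output layout can be sketched with pandas directly. This is an illustration of the described result, not the library's implementation; the label values are made up, and in this sketch only the ‘N’ and ‘F’ labels are applied to each dataframe.

```python
import numpy as np
import pandas as pd

X = np.arange(24, dtype=float).reshape(2, 3, 4)  # T=2, N=3, F=4
label_dict = {
    "T": ["2020", "2021"],          # illustrative time step labels
    "N": ["a", "b", "c"],           # illustrative entity labels
    "F": ["f0", "f1", "f2", "f3"],  # illustrative feature labels
}

# One (N, F) dataframe per time step: entities as index, features as columns.
dfs = [
    pd.DataFrame(X[t], index=label_dict["N"], columns=label_dict["F"])
    for t in range(X.shape[0])
]
```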