preprocessing module
TSStandardScaler class
- class tscluster.preprocessing.TSStandardScaler(per_time: bool = True, **kwargs)
Uses sklearn’s StandardScaler to scale time series data.
- Parameters:
- per_time: bool, default=True
If True, compute the z-score per time step. If False, compute the z-score per feature across all time steps.
- **kwargs: keyword arguments to be passed to sklearn’s StandardScaler.
Methods
fit(X, **kwargs): Fit method of transformer.
fit_transform(X, **kwargs): fit and transform the data
inverse_transform(X, **kwargs): inverse transform method for transformer.
transform(X, **kwargs): transform method for transformer.
- fit(X: ndarray[Any, dtype[float64]], **kwargs) → TSScaler
Fit method of transformer.
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments to be passed to the fit method.
- Returns:
- self
the fitted transformer object
- fit_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]
fit and transform the data
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments.
- Returns:
- numpy array
the transformed data in TNF format
- inverse_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]
inverse transform method for transformer.
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments.
- Returns:
- numpy array
the inverse-transform of the data in TNF format
- transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]
transform method for transformer.
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments.
- Returns:
- numpy array
the transformed data in TNF format
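The effect of per_time can be illustrated in plain NumPy. This is only a sketch of one reading of the parameter description (statistics taken across entities at each time step when per_time=True, and across all time steps and entities when per_time=False); it is not tscluster's implementation, and the variable names are made up for the example.

```python
import numpy as np

# Toy data in TNF format: T=3 time steps, N=4 entities, F=2 features.
X = np.arange(24, dtype=np.float64).reshape(3, 4, 2)

# Assumed meaning of per_time=True: one mean/std per (time step, feature),
# computed across the N entities.
mean_t = X.mean(axis=1, keepdims=True)  # shape (3, 1, 2)
std_t = X.std(axis=1, keepdims=True)    # population std (ddof=0), as StandardScaler uses
Z_per_time = (X - mean_t) / std_t

# Assumed meaning of per_time=False: one mean/std per feature,
# computed across all time steps and entities.
mean_f = X.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 2)
std_f = X.std(axis=(0, 1), keepdims=True)
Z_global = (X - mean_f) / std_f
```

With the class itself, the corresponding calls would be TSStandardScaler(per_time=True).fit_transform(X) and TSStandardScaler(per_time=False).fit_transform(X).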
TSMinMaxScaler class
- class tscluster.preprocessing.TSMinMaxScaler(per_time: bool = True, **kwargs)
Uses sklearn’s MinMaxScaler to scale time series data.
- Parameters:
- per_time: bool, default=True
If True, scale per time step. If False, scale per feature across all time steps.
- **kwargs: keyword arguments to be passed to sklearn’s MinMaxScaler.
Methods
fit(X, **kwargs): Fit method of transformer.
fit_transform(X, **kwargs): fit and transform the data
inverse_transform(X, **kwargs): inverse transform method for transformer.
transform(X, **kwargs): transform method for transformer.
- fit(X: ndarray[Any, dtype[float64]], **kwargs) → TSScaler
Fit method of transformer.
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments to be passed to the fit method.
- Returns:
- self
the fitted transformer object
- fit_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]
fit and transform the data
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments.
- Returns:
- numpy array
the transformed data in TNF format
- inverse_transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]
inverse transform method for transformer.
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments.
- Returns:
- numpy array
the inverse-transform of the data in TNF format
- transform(X: ndarray[Any, dtype[float64]], **kwargs) → ndarray[Any, dtype[float64]]
transform method for transformer.
- Parameters:
- X: ndarray
Input time series data. Should be a 3-dimensional array in TNF format.
- **kwargs: keyword arguments.
- Returns:
- numpy array
the transformed data in TNF format
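As a rough illustration of what transform and inverse_transform do for min-max scaling, the NumPy sketch below scales each feature to [0, 1] at every time step and then undoes the scaling. It assumes MinMaxScaler's default feature range of (0, 1) and the per-time-step reading of per_time=True; it is not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5, 2))  # TNF: 3 time steps, 5 entities, 2 features

# Assumed per_time=True behaviour: min and max per (time step, feature),
# taken across the entities.
x_min = X.min(axis=1, keepdims=True)
x_max = X.max(axis=1, keepdims=True)

# Forward scaling to the [0, 1] range.
X_scaled = (X - x_min) / (x_max - x_min)

# Inverse scaling recovers the original values from the stored min and max.
X_restored = X_scaled * (x_max - x_min) + x_min
```

The round trip works because the fitted minimum and maximum are kept, which is why inverse_transform is called on a scaler that has already been fitted.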
utils module
- tscluster.preprocessing.utils.broadcast_data(T: int, *, cluster_centers: npt.NDArray[np.float64] | None = None, labels: npt.NDArray[np.int] | None = None) → Tuple[npt.NDArray[np.float64], npt.NDArray[np.int64]]
function to make cluster_centers of shape T x N x F and labels of shape N x F
- tscluster.preprocessing.utils.load_data(X: npt.NDArray[np.float64] | str | List, *, arr_format: str = 'TNF', suffix_sep: str = '_', file_reader: str = 'infer', read_file_args: dict | None = None, use_suffix_as_label: bool = False, output_arr_format: str = 'TNF') → Tuple[np.float64, dict]
function to load data
- Parameters:
- X: numpy array, string or list
Input time series data.
If ndarray, it should be a 3-dimensional array; use arr_format to specify its format.
If str and a file name, the file is loaded with numpy.
If str and a directory name, all files in the directory are loaded in ascending order of their filename suffixes, so file_0.csv is read before file_1.csv, and so on. Use the suffix_sep keyword argument to indicate the suffix separator (default “_”). Supported files are those that can be read with np.load, pd.read_csv, pd.read_json, or pd.read_excel.
If list, it is assumed to be a list of files or file paths. If a list of files, each should be a numpy array or pandas DataFrame holding the data for the different time steps. If a list of file paths, the data is read in list order using np.load, pd.read_csv, pd.read_json, or pd.read_excel.
- arr_format: str, default=’TNF’
Format of the input data. ‘TNF’ means the data dimension is Time x Number of observations x Features. ‘NTF’ means the data dimension is Number of observations x Time x Features.
- suffix_sep: str, default=’_’
Separator between the filename and its suffix. The suffixes should be numbers; they need not start from 1 or increase in steps of 1. As long as the suffixes can be sorted and the suffix separator is consistent, the directory can be parsed by the load_data function.
- file_reader: str, default=’infer’
File loader to use. Can be any of np.load, pd.read_csv, pd.read_json, and pd.read_excel. If ‘infer’, the file type is inferred from the file name and the appropriate loader is used.
- read_file_args: dict, default=None
Parameters to be passed to the data loader. Keys should be parameter names (as str) and values should be the corresponding parameter values.
- use_suffix_as_label: bool, default=False
If True, use the suffixes of the file names as labels in label_dict. If arr_format = ‘TNF’, the suffixes are used as labels of time steps; if arr_format = ‘NTF’, they are used as labels of entities. If False, a linear range over the number of files is used as labels, i.e. range(n_files), where n_files is the number of files.
- output_arr_format: str, default=’TNF’
The format of the output array. Can be any of {‘TNF’, ‘NTF’}.
- Returns:
- np.array
a numpy array of the data in ‘TNF’ or ‘NTF’ format (depending on the value of output_arr_format)
- dict
a dictionary whose keys are ‘T’, ‘N’, and ‘F’, and whose values are lists of the labels of each key.
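The directory-loading order described for load_data can be sketched as follows. The helper sort_by_suffix is hypothetical (it is not part of tscluster), and it assumes that suffixes are compared numerically rather than as strings. In-memory arrays stand in for the files so the example runs without touching disk.

```python
import os
import numpy as np

def sort_by_suffix(filenames, suffix_sep="_"):
    """Hypothetical helper: sort names by the number after the last suffix_sep."""
    def suffix(name):
        stem, _ext = os.path.splitext(os.path.basename(name))
        # Assumption: suffixes are numeric, so "10" sorts after "2".
        return float(stem.rsplit(suffix_sep, 1)[-1])
    return sorted(filenames, key=suffix)

# Stand-ins for per-time-step files: each name maps to an (N, F) array.
frames = {
    "file_10.csv": np.full((4, 2), 10.0),
    "file_0.csv": np.full((4, 2), 0.0),
    "file_2.csv": np.full((4, 2), 2.0),
}

ordered = sort_by_suffix(frames)

# Stack the per-time-step arrays along a new leading axis to get TNF layout.
X = np.stack([frames[name] for name in ordered], axis=0)
```

Note that the suffixes 0, 2, and 10 neither start at 1 nor increase in steps of 1, which matches the description of suffix_sep.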
- tscluster.preprocessing.utils.ntf_to_tnf(X: ndarray[Any, dtype[float64]]) → ndarray[Any, dtype[float64]]
Utility function to convert an array from Number of observations x Time x Features format to Time x Number of observations x Features format.
- tscluster.preprocessing.utils.tnf_to_ntf(X: ndarray[Any, dtype[float64]]) → ndarray[Any, dtype[float64]]
Utility function to convert an array from Time x Number of observations x Features format to Number of observations x Time x Features format.
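Both conversions amount to swapping the first two axes of the array. A minimal NumPy sketch of that idea is shown below; the function names carry a _sketch suffix to make clear they are illustrations, not the library functions.

```python
import numpy as np

def ntf_to_tnf_sketch(X):
    """Swap the entity and time axes: (N, T, F) -> (T, N, F)."""
    return np.swapaxes(X, 0, 1)

def tnf_to_ntf_sketch(X):
    """Swap the time and entity axes: (T, N, F) -> (N, T, F)."""
    return np.swapaxes(X, 0, 1)

# 5 entities, 3 time steps, 2 features in NTF layout.
X_ntf = np.arange(30, dtype=np.float64).reshape(5, 3, 2)
X_tnf = ntf_to_tnf_sketch(X_ntf)
X_back = tnf_to_ntf_sketch(X_tnf)
```

Because the operation is its own inverse, applying one conversion after the other returns the original array.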
- tscluster.preprocessing.utils.to_dfs(X: npt.NDArray[np.float64], label_dict: dict | None = None, arr_format: str = 'TNF', output_df_format: str = 'TNF')
Function to convert from (numpy, label_dict) to dataframes
- Parameters:
- X: numpy array
The array to be converted to list of dataframes. Should be a 3D array.
- label_dict: dict, default=None
A dictionary whose keys are ‘T’, ‘N’, and ‘F’ (which stand for time steps, entities, and features respectively). The value of each key is a list such that the value of key:
  - ‘T’ is a list of names/labels of each time step to be used as index of each dataframe. If None, range(0, T) is used, where T is the number of time steps in the fitted data.
  - ‘N’ is a list of names/labels of each entity to be used as index of the dataframe. If None, range(0, N) is used, where N is the number of entities/observations in the fitted data.
  - ‘F’ is a list of names/labels of each feature to be used as column of each dataframe. If None, range(0, F) is used, where F is the number of features in the fitted data.
If label_dict is None, a linear range of the dimensions of the array is used.
- arr_format: str, default=’TNF’
Format of the array. ‘TNF’ means the data dimension is Time x Number of observations x Features. ‘NTF’ means the data dimension is Number of observations x Time x Features.
- output_df_format: str, default=’TNF’
The format of the output dataframes. Can be any of {‘TNF’, ‘NTF’}. If ‘TNF’, output is a list of T dataframes each of shape (N, F). If ‘NTF’, output is a list of N dataframes each of shape (T, F).
- Returns:
- list[pd.DataFrame]
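A list of pandas DataFrames. If output_df_format is ‘TNF’, the list holds T dataframes, where T is the number of time steps; the t-th dataframe is an N x F dataframe of the values of all entities at the t-th time step. If output_df_format is ‘NTF’, the list holds N dataframes, each of shape (T, F).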
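The two output layouts can be pictured with a small pandas sketch. It builds the dataframes by hand from a TNF array and a label_dict, following the shapes given for output_df_format; the labels are invented for the example and the code is an illustration, not what to_dfs does internally.

```python
import numpy as np
import pandas as pd

T, N, F = 2, 3, 2
X = np.arange(T * N * F, dtype=np.float64).reshape(T, N, F)  # TNF layout

# Example labels (made up for this sketch).
label_dict = {"T": ["2020", "2021"], "N": ["a", "b", "c"], "F": ["x", "y"]}

# 'TNF' output: one (N, F) dataframe per time step.
dfs_tnf = [
    pd.DataFrame(X[t], index=label_dict["N"], columns=label_dict["F"])
    for t in range(T)
]

# 'NTF' output: one (T, F) dataframe per entity.
dfs_ntf = [
    pd.DataFrame(X[:, n, :], index=label_dict["T"], columns=label_dict["F"])
    for n in range(N)
]
```

With the library, the equivalent calls would be to_dfs(X, label_dict, output_df_format='TNF') and to_dfs(X, label_dict, output_df_format='NTF').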