opttscluster module

OptTSCluster class

class tscluster.opttscluster.OptTSCluster(n_clusters: int, scheme: str = 'z1c0', *, n_allow_assignment_change: None | int = None, use_sum_distance: bool = False, warm_start: bool = True, use_MILP_centroid: bool = True, random_state: None | int = None)

Class for optimal time-series clustering. Throughout this documentation and the code, ‘z’ refers to cluster centers and ‘c’ to label assignments. Calling the constructor creates an OptTSCluster object.

Parameters:

n_clusters: int

number of clusters

scheme: {‘z0c0’, ‘z0c1’, ‘z1c0’, ‘z1c1’}, default=‘z1c0’

The clustering scheme to use. One of:

‘z0c0’ means fixed center, fixed assignment
‘z0c1’ means fixed center, changing assignment
‘z1c0’ means changing center, fixed assignment
‘z1c1’ means changing center, changing assignment

The scheme must be a changing-assignment scheme (either ‘z1c1’ or ‘z0c1’) when label changes are constrained (with either n_allow_assignment_change or lagrangian_multiplier).
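The scheme string encodes two independent switches, one for centers and one for assignments. A minimal sketch of how it can be decoded is given here, together with the array shapes documented under Attributes. The helpers decode_scheme, check_constraint, and expected_shapes are hypothetical names used for illustration only; they are not part of tscluster, and how the library itself reacts to an invalid combination is not specified here.

```python
from typing import Optional, Tuple

VALID_SCHEMES = {"z0c0", "z0c1", "z1c0", "z1c1"}


def decode_scheme(scheme: str) -> Tuple[bool, bool]:
    """Return (centers_change, assignments_change) for a scheme string."""
    if scheme not in VALID_SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    # 'z1' means changing centers, 'c1' means changing assignments.
    return scheme[1] == "1", scheme[3] == "1"


def check_constraint(scheme: str, n_allow_assignment_change: Optional[int]) -> None:
    """Reject a label-change budget on a fixed-assignment scheme (illustrative)."""
    _, assignments_change = decode_scheme(scheme)
    if n_allow_assignment_change is not None and not assignments_change:
        raise ValueError("a label-change budget needs scheme 'z0c1' or 'z1c1'")


def expected_shapes(scheme: str, T: int, N: int, F: int, k: int):
    """Shapes of cluster_centers_ and labels_ as documented for each scheme."""
    centers_change, assignments_change = decode_scheme(scheme)
    centers = (T, k, F) if centers_change else (k, F)
    labels = (N, T) if assignments_change else (N,)
    return centers, labels
```

For example, expected_shapes("z1c0", T=4, N=10, F=2, k=3) gives a (4, 3, 2) center array and a length-10 label vector.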

n_allow_assignment_change: int or None, default=None

total number of label changes to allow

use_sum_distance: bool, default=False

Indicates whether to use the sum of distances as the objective. This is the sum of the distances between the points in each time series and their centroids.

warm_start: bool, default=True

Indicates whether to use k-means to initialize the centroids (Z) and their assignments (C).

use_MILP_centroid: bool, default=True

If True, the cluster_centers_ attribute holds the cluster centers obtained from the MILP solution; otherwise it holds the average of the data points per time step.

random_state: int or None, default=None

Sets the random seed used when initializing with k-means or when drawing the initial samples for constraint generation.

Attributes:

cluster_centers_: returns the cluster centers. If the scheme uses fixed centers, returns a k x F 2D array, where k is the number of clusters and F is the number of features. If the scheme uses changing centers, returns a T x k x F 3D array, where T is the number of time steps, k is the number of clusters, and F is the number of features.
fitted_data_shape_: returns a tuple of the shape of the fitted data in TNF format, e.g. (T, N, F), where T, N, and F are the number of time steps, entities, and features respectively.
labels_: returns the assignment labels. Values are integers in the range [0, k-1], where k is the number of clusters. If the scheme uses fixed assignment, returns a 1D array of size N, where N is the number of entities; a value of j at the i-th index means that entity i is assigned to the j-th cluster at all time steps. If the scheme uses changing assignment, returns an N x T 2D array, where N is the number of entities and T is the number of time steps; a value of j at the i-th row and t-th column means that entity i is assigned to the j-th cluster at the t-th time step.
label_dict_: returns a dictionary of the labels whose keys are ‘T’, ‘N’, and ‘F’ (the time-step, entity, and feature axes respectively). The value of each key is a list of the names/labels along that axis.
n_changes_: returns the total number of label changes
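
As a rough illustration of the changing-assignment layout of labels_, the sketch below builds a toy N x T array with NumPy and counts label changes. Treating a change as any disagreement between consecutive time steps is an assumption about how n_changes_ is counted, not something stated in this reference.

```python
import numpy as np

# Toy labels_ for N=3 entities over T=4 time steps (changing-assignment layout).
labels = np.array([
    [0, 0, 0, 0],  # entity 0 stays in cluster 0
    [0, 1, 1, 0],  # entity 1 moves twice
    [2, 2, 1, 1],  # entity 2 moves once
])

# labels[i, t] is the cluster of entity i at time step t.
cluster_of_entity1_at_t2 = int(labels[1, 2])

# Assumption: a "change" is any disagreement between consecutive time steps.
changed = labels[:, 1:] != labels[:, :-1]
total_changes = int(changed.sum())
```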

Methods

`fit`(X[, label_dict, verbose, print_to])	Method for fitting the model by solving the MILP model.
`get_dynamic_entities`()	returns the dynamic entities and their number of changes.
`get_index_of_label`(labels[, axis])	function to return the integer indexes of some given labelled items in self.label_dict_.
`get_label_of_index`(indexes[, axis])	function to return the labels of some given integer indexes as labelled in self.label_dict_.
`get_model_size`(X)	Method to return the size of the model as a tuple of (v, c).
`get_named_cluster_centers`([label_dict])	Method to return the cluster centers with custom names of time steps and features.
`get_named_labels`([label_dict])	Method to return a data frame of the label assignments with custom names of time steps and entities.
`set_label_dict`(value)	Method to manually set the label_dict_.

fit(X: npt.NDArray[np.float64], label_dict: dict | None = None, verbose: bool = True, print_to: TextIO = sys.stdout, **kwargs) → OptTSCluster

Method for fitting the model by solving the MILP model.

Parameters:

X: numpy array

Input time series data. Should be a 3-dimensional array in TNF format.

label_dict: dict, default=None

A dictionary of the labels of X. Keys should be ‘T’, ‘N’, and ‘F’ (the time-step, entity, and feature axes respectively). The value of each key is a list such that the value of key:

‘T’ is a list of names/labels of each time step, used as the index of each dataframe during fit. Default is range(0, T), where T is the number of time steps in the fitted data.
‘N’ is a list of names/labels of each entity, used as the index of the dataframe. Default is range(0, N), where N is the number of entities/observations in the fitted data.
‘F’ is a list of names/labels of each feature, used as the columns of each dataframe. Default is range(0, F), where F is the number of features in the fitted data.

The data_loader function from tscluster.preprocessing.utils can help in obtaining the label_dict of a dataset.
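
A label_dict can also be written by hand. The sketch below pairs a toy TNF array with such a dictionary; the label values are made up for illustration, and the length check reflects the presumed requirement that each list matches the size of its axis.

```python
import numpy as np

T, N, F = 4, 3, 2
# TNF layout: X[t, n, f] is feature f of entity n at time step t.
X = np.arange(T * N * F, dtype=np.float64).reshape(T, N, F)

# Hypothetical labels, one list per axis.
label_dict = {
    "T": [f"2020-Q{q}" for q in range(1, T + 1)],
    "N": ["north", "south", "east"],
    "F": ["income", "population"],
}

# Presumed requirement: each list is as long as the matching axis of X.
lengths_match = all(
    len(label_dict[key]) == X.shape[axis]
    for axis, key in enumerate(("T", "N", "F"))
)

# Feature "income" of entity "east" at time step "2020-Q2".
value = X[1, 2, 0]
```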

verbose: bool, default=True

If True, some model training information will be printed out. Set to False to suppress printouts.

print_to: TextIO, default=sys.stdout

An object with a write method, used to write the model’s printout information during training. Default is standard output.

Returns:

self: The fitted OptTSCluster object.
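
A hedged usage sketch of fit follows, using only the constructor and method signatures documented on this page. The wrapper fit_quietly is a hypothetical helper; it imports tscluster lazily so the snippet can be loaded without the package (or the MILP solver it needs) installed, and the call itself is left commented out for that reason.

```python
import io

import numpy as np


def fit_quietly(X: np.ndarray, n_clusters: int = 2):
    """Fit OptTSCluster on a TNF array and return (model, captured log)."""
    # Lazy import: only needed when the function is actually called.
    from tscluster.opttscluster import OptTSCluster

    log = io.StringIO()  # any object with a write method can be passed as print_to
    model = OptTSCluster(n_clusters=n_clusters, scheme="z1c0", random_state=0)
    model.fit(X, verbose=True, print_to=log)  # fit returns the fitted object
    return model, log.getvalue()


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20, 3))  # T=5 time steps, N=20 entities, F=3 features

# Uncomment when tscluster and a MILP solver are available:
# model, log = fit_quietly(X)
```

Redirecting print_to to an in-memory buffer keeps the training log available for inspection without cluttering standard output.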

get_dynamic_entities() → Tuple[List[int64], List[int64]]

Returns the dynamic entities and their number of changes. Both lists are sorted by the number of cluster changes in descending order.

Returns:

dynamic entities: list

A 1-D array of the indexes of the entities that change cluster at least once.

number of changes: list

A 1-D array of the number of changes for each dynamic entity, such that the i-th element is the number of cluster changes for the i-th dynamic entity.
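
The two returned lists can be pictured with a small NumPy sketch that derives them from a toy N x T label array. This is an illustration of the described output, not the library's implementation; counting a change as any disagreement between consecutive time steps is an assumption.

```python
import numpy as np

# Toy changing-assignment labels: N=4 entities, T=4 time steps.
labels = np.array([
    [0, 0, 0, 0],  # static entity
    [0, 1, 1, 0],  # 2 changes
    [2, 2, 1, 1],  # 1 change
    [1, 0, 1, 0],  # 3 changes
])

# Assumed definition: one change per disagreement between consecutive steps.
counts = (labels[:, 1:] != labels[:, :-1]).sum(axis=1)

# Keep entities with at least one change, sorted by change count, descending.
dynamic = np.flatnonzero(counts > 0)
order = np.argsort(-counts[dynamic], kind="stable")
dynamic_entities = dynamic[order].tolist()
n_changes = counts[dynamic][order].tolist()
```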

get_index_of_label(labels: List[str], axis: str = 'N') → List[int]

Function to return the integer indexes of some given labelled items in self.label_dict_. The returned indexes are 0-based.

Parameters:

labels: list

A list of the label(s) whose integer indexes should be returned.

axis: str, default=‘N’

Can be any of {‘T’, ‘N’, ‘F’}:

If ‘T’, the values in the labels parameter are interpreted as time labels (as stored in self.label_dict_[‘T’]).
If ‘N’, the values in the labels parameter are interpreted as entity labels (as stored in self.label_dict_[‘N’]).
If ‘F’, the values in the labels parameter are interpreted as feature labels (as stored in self.label_dict_[‘F’]).

Returns:

list: a list of the integer indexes of the labels in the given axis dimension.

get_label_of_index(indexes: List[int], axis: str = 'N') → List[str]

Function to return the labels of some given integer indexes as labelled in self.label_dict_. The indexes are assumed to be 0-based.

Parameters:

indexes: list

A list of the index(es) whose labels should be returned.

axis: str, default=‘N’

Can be any of {‘T’, ‘N’, ‘F’}:

If ‘T’, the values in the indexes parameter are interpreted as the time indexes whose labels (as stored in self.label_dict_[‘T’]) should be returned.
If ‘N’, the values in the indexes parameter are interpreted as the entity indexes whose labels (as stored in self.label_dict_[‘N’]) should be returned.
If ‘F’, the values in the indexes parameter are interpreted as the feature indexes whose labels (as stored in self.label_dict_[‘F’]) should be returned.

Returns:

list: a list of the labels of the given integer indexes in the given axis dimension.
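
The two lookups are inverses of each other along a chosen axis. A minimal pure-Python sketch of that relationship is shown here; index_of_label and label_of_index are hypothetical stand-ins operating on a plain dictionary, not the library's methods.

```python
from typing import Dict, List

# Hypothetical label dictionary in the documented 'T' / 'N' / 'F' layout.
label_dict: Dict[str, list] = {
    "T": ["2019", "2020", "2021"],
    "N": ["a", "b", "c", "d"],
    "F": ["x", "y"],
}


def index_of_label(labels: list, axis: str = "N") -> List[int]:
    """Map labels to their 0-based positions along the given axis."""
    return [label_dict[axis].index(label) for label in labels]


def label_of_index(indexes: List[int], axis: str = "N") -> list:
    """Map 0-based positions along the given axis back to their labels."""
    return [label_dict[axis][i] for i in indexes]


entity_positions = index_of_label(["c", "a"])        # entity axis by default
time_labels = label_of_index([1, 2], axis="T")       # time axis
```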

get_model_size(X: ndarray[Any, dtype[float64]]) → Tuple

Method to return the size of the model as a tuple of (v, c), where v is the number of variables and c is the number of constraints.

Parameters:

X: numpy array

Input time series data. Should be a 3-dimensional array in TNF format.

Returns:

number of variables: The number of variables in the model
number of constraints: The number of constraints in the model

get_named_cluster_centers(label_dict: dict | None = None) → List[pd.DataFrame]

Method to return the cluster centers with custom names of time steps and features.

Parameters:

label_dict: dict, default=None

A dictionary whose keys are ‘T’, ‘N’, and ‘F’ (the time-step, entity, and feature axes respectively). The value of each key is a list such that the value of key:

‘T’ is a list of names/labels of each time step to be used as the index of each dataframe. If None, range(0, T) is used, where T is the number of time steps in the fitted data.
‘N’ is a list of names/labels of each entity to be used as the index of the dataframe. If None, range(0, N) is used, where N is the number of entities/observations in the fitted data.
‘F’ is a list of names/labels of each feature to be used as the columns of each dataframe. If None, range(0, F) is used, where F is the number of features in the fitted data.

If label_dict is None, self.label_dict_ is used.

Returns:

list: A list of k pandas DataFrames. Where k is the number of clusters. The i-th dataframe in the list is a T x F dataframe of the values of the cluster centers of the i-th cluster.
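
To make the returned structure concrete, the sketch below builds the same kind of list from a toy T x k x F center array with pandas. It illustrates the documented output shape under made-up labels; it is not the library's implementation.

```python
import numpy as np
import pandas as pd

T, k, F = 3, 2, 2
# Toy changing-center array in the documented T x k x F layout.
centers = np.arange(T * k * F, dtype=float).reshape(T, k, F)

# Hypothetical time-step and feature labels.
label_dict = {"T": ["2019", "2020", "2021"], "F": ["x", "y"]}

# One T x F DataFrame per cluster: rows are time steps, columns are features.
frames = [
    pd.DataFrame(centers[:, j, :], index=label_dict["T"], columns=label_dict["F"])
    for j in range(k)
]

# Center of cluster 1 for feature "y" at time step "2020".
value = frames[1].loc["2020", "y"]
```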

get_named_labels(label_dict: dict | None = None) → pd.DataFrame

Method to return a data frame of the label assignments with custom names of time steps and entities.

Parameters:

label_dict: dict, default=None

A dictionary whose keys are ‘T’, ‘N’, and ‘F’ (the time-step, entity, and feature axes respectively). The value of each key is a list such that the value of key:

‘T’ is a list of names/labels of each time step to be used as the index of each dataframe. If None, range(0, T) is used, where T is the number of time steps in the fitted data.
‘N’ is a list of names/labels of each entity to be used as the index of the dataframe. If None, range(0, N) is used, where N is the number of entities/observations in the fitted data.
‘F’ is a list of names/labels of each feature to be used as the columns of each dataframe. If None, range(0, F) is used, where F is the number of features in the fitted data.

If label_dict is None, self.label_dict_ is used.

Returns:

pd.DataFrame: A pandas DataFrame with shape (N, T). The value in the n-th row and t-th column is an integer indicating the cluster assignment of the n-th entity/observation at time t.
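
The (N, T) layout can be pictured with a short pandas sketch that wraps a toy label array in entity and time-step names. The labels are made up, and the snippet only mirrors the documented output shape rather than the library's code.

```python
import numpy as np
import pandas as pd

# Toy changing-assignment labels: N=2 entities, T=3 time steps.
labels = np.array([
    [0, 0, 1],
    [1, 1, 1],
])

# Hypothetical entity and time-step names.
entity_names = ["a", "b"]
time_names = ["2019", "2020", "2021"]

# Rows are entities, columns are time steps.
named_labels = pd.DataFrame(labels, index=entity_names, columns=time_names)

# Cluster of entity "a" at time step "2021".
cluster = int(named_labels.loc["a", "2021"])
```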

set_label_dict(value: dict) → None

Method to manually set the label_dict_.

Parameters:

value: dict

The value to set as label_dict_. Should be a dict with all of ‘T’, ‘N’, and ‘F’ (case sensitive; the time-step, entity, and feature axes respectively) as keys. The value of each key is a list of labels for that key in the data. If your data have no labels for a key, set its value to None.

Returns:

dict: a dictionary whose keys are ‘T’, ‘N’, and ‘F’; and values are lists of the labels of each key.