Estimate optimal number of clusters.

Estimates the optimal number of clusters (k) using various methods.

mt_cluster_k(
  data,
  use = "ln_trajectories",
  dimensions = c("xpos", "ypos"),
  kseq = 2:15,
  compute = c("stability", "gap", "jump", "slope"),
  method = "hclust",
  weights = rep(1, length(dimensions)),
  pointwise = TRUE,
  minkowski_p = 2,
  hclust_method = "ward.D",
  kmeans_nstart = 10,
  n_bootstrap = 10,
  model_based = FALSE,
  n_gap = 10,
  na_rm = FALSE,
  verbose = FALSE
)

Arguments

data: a mousetrap data object created using one of the mt_import functions (see mt_example for details). Alternatively, a trajectory array can be provided directly (in this case use will be ignored).
use: a character string specifying which trajectory data should be used.
dimensions: a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively.
kseq: a numeric vector specifying set of candidates for k. Defaults to 2:15, implying that all values of k within that range are compared using the metrics specified in compute.
compute: character vector specifying the to be computed measures. Can be any subset of c("stability","gap","jump","slope").
method: character string specifying the type of clustering procedure for the stability-based method. Either hclust or kmeans.
weights: numeric vector specifying the relative importance of the variables specified in dimensions. Defaults to a vector of 1s implying equal importance. Technically, each variable is rescaled so that the standard deviation matches the corresponding value in weights. To use the original variables, set weights = NULL.
pointwise: boolean specifying the way in which dissimilarity between the trajectories is measured. If TRUE (the default), mt_distmat measures the average dissimilarity and then sums the results. If FALSE, mt_distmat measures dissimilarity once (by treating the various points as independent dimensions). This is only relevant if method is "hclust". See mt_distmat for further details.
minkowski_p: an integer specifying the distance metric for the cluster solution. minkowski_p = 1 computes the city-block distance, minkowski_p = 2 (the default) computes the Euclidian distance, minkowski_p = 3 the cubic distance, etc. Only relevant if method is "hclust". See mt_distmat for further details.
hclust_method: character string specifying the linkage criterion used. Passed on to the method argument of hclust. Default is set to ward.D. Only relevant if method is "hclust".
kmeans_nstart: integer specifying the number of reruns of the kmeans procedure. Larger numbers minimize the risk of finding local minima. Passed on to the nstart argument of kmeans. Only relevant if method is "kmeans".
n_bootstrap: an integer specifying the number of bootstrap comparisons used by stability. See cStability.
model_based: boolean specifying whether the model-based or the model-free should be used by stability, when method is kmeans. See cStability and Haslbeck & Wulff (2020).
n_gap: integer specifying the number of simulated datasets used by gap. See Tibshirani et al. (2001).
na_rm: logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distance between trajectories.
verbose: logical indicating whether function should report its progress.

Value

A list containing two lists that store the results of the different methods. kopt contains the estimated k for each of the methods specified in compute. paths contains the values for each k in kseq as computed by each of the methods specified in compute. The values in kopt are optima for each of the vectors in paths.

Details

mt_cluster_k estimates the number of clusters (k) using four commonly used k-selection methods (specified via compute): cluster stability (stability), the gap statistic (gap), the jump statistic (jump), and the slope statistic (slope).

Cluster stability methods select k as the number of clusters for which the assignment of objects to clusters is most stable across bootstrap samples. This function implements the model-based and model-free methods described by Haslbeck & Wulff (2020). See references.

The remaining three methods select k as the value that optimizes the gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic (Sugar & James, 2013), and the slope statistic (Fujita, Takahashi, & Patriota, 2014), respectively.

For clustering trajectories, it is often useful that the endpoints of all trajectories share the same direction, e.g., that all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use length normalized trajectories (see mt_length_normalize; Wulff et al., 2019).

References

Haslbeck, J. M. B., & Wulff, D. U. (2020). Estimating the Number of Clusters via a Corrected Clustering Instability. Computational Statistics, 35, 1879–1894.

Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131-145). New York, NY: Routledge.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Sugar, C. A., & James, G. M. (2013). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.

Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.

Author

Dirk U. Wulff

Jonas M. B. Haslbeck

Examples