Estimates the optimal number of clusters (k) using various methods.
mt_cluster_k(
data,
use = "ln_trajectories",
dimensions = c("xpos", "ypos"),
kseq = 2:15,
compute = c("stability", "gap", "jump", "slope"),
method = "hclust",
weights = rep(1, length(dimensions)),
pointwise = TRUE,
minkowski_p = 2,
hclust_method = "ward.D",
kmeans_nstart = 10,
n_bootstrap = 10,
model_based = FALSE,
n_gap = 10,
na_rm = FALSE,
verbose = FALSE
)
a mousetrap data object created using one of the mt_import functions (see mt_example for details). Alternatively, a trajectory array can be provided directly (in this case, use will be ignored).
a character string specifying which trajectory data should be used.
a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively.
a numeric vector specifying the set of candidates for k. Defaults to 2:15, implying that all values of k within that range are compared using the metrics specified in compute.
character vector specifying the measures to be computed. Can be any subset of c("stability", "gap", "jump", "slope").
character string specifying the type of clustering procedure for the stability-based method. Either "hclust" or "kmeans".
numeric vector specifying the relative importance of the variables specified in dimensions. Defaults to a vector of 1s, implying equal importance. Technically, each variable is rescaled so that its standard deviation matches the corresponding value in weights. To use the original variables, set weights = NULL.
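The rescaling described above can be pictured with a small base R sketch. The array layout [trajectory, point, dimension], the helper name rescale_dims, and the use of one pooled standard deviation per dimension are assumptions made for illustration; the package's internal scaling may differ in detail.

```r
# Hypothetical helper (not part of mousetrap): rescale each dimension of a
# trajectory array [trajectory, point, dimension] so that its pooled standard
# deviation equals the corresponding entry in weights.
rescale_dims <- function(traj, weights) {
  stopifnot(dim(traj)[3] == length(weights))
  for (d in seq_along(weights)) {
    s <- sd(as.vector(traj[, , d]), na.rm = TRUE)
    traj[, , d] <- traj[, , d] / s * weights[d]
  }
  traj
}

# Toy data: 5 trajectories, 20 points each, with very different spreads
set.seed(1)
traj <- array(NA_real_, dim = c(5, 20, 2),
              dimnames = list(NULL, NULL, c("xpos", "ypos")))
traj[, , "xpos"] <- rnorm(100, sd = 3)
traj[, , "ypos"] <- rnorm(100, sd = 10)

# Give ypos twice the importance of xpos
scaled <- rescale_dims(traj, weights = c(1, 2))
sds <- apply(scaled, 3, function(v) sd(as.vector(v)))
sds
```

After rescaling, the spread of each dimension reflects only its weight, so a dimension recorded on a larger scale no longer dominates the distance computation.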
boolean specifying the way in which dissimilarity between the trajectories is measured. If TRUE (the default), mt_distmat measures the average dissimilarity and then sums the results. If FALSE, mt_distmat measures dissimilarity once (by treating the various points as independent dimensions). This is only relevant if method is "hclust". See mt_distmat for further details.
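To make the distinction concrete, here is a minimal base R sketch for two toy trajectories. It assumes that the pointwise variant computes a Euclidean distance between corresponding points and sums these, and that the alternative computes one Euclidean distance over all coordinates; this is an interpretation of the description above, not a reproduction of the mt_distmat internals.

```r
# Two toy trajectories, each a [point, dimension] matrix
a <- cbind(xpos = c(0, 1, 2), ypos = c(0, 1, 2))
b <- cbind(xpos = c(3, 4, 2), ypos = c(4, 5, 2))

# Pointwise (assumed): distance between corresponding points, then summed
d_pointwise <- sum(sqrt(rowSums((a - b)^2)))

# Non-pointwise (assumed): one distance, treating every coordinate of every
# point as an independent dimension
d_single <- sqrt(sum((a - b)^2))

c(pointwise = d_pointwise, single = d_single)
```

The two variants generally give different values for the same pair of trajectories, which can change the resulting cluster solution.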
an integer specifying the distance metric for the cluster solution. minkowski_p = 1 computes the city-block distance, minkowski_p = 2 (the default) computes the Euclidean distance, minkowski_p = 3 the cubic distance, etc. Only relevant if method is "hclust". See mt_distmat for further details.
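The effect of the Minkowski exponent can be checked with stats::dist from base R. This sketch only illustrates the metric itself on two points; it does not call mt_distmat.

```r
# Two points that differ by 3 on the first and 4 on the second coordinate
x <- rbind(c(0, 0), c(3, 4))

d1 <- dist(x, method = "minkowski", p = 1)  # city-block: 3 + 4
d2 <- dist(x, method = "minkowski", p = 2)  # Euclidean: sqrt(3^2 + 4^2)
d3 <- dist(x, method = "minkowski", p = 3)  # cubic: (3^3 + 4^3)^(1/3)

c(p1 = as.numeric(d1), p2 = as.numeric(d2), p3 = as.numeric(d3))
```

Larger exponents give more weight to the coordinate with the largest difference.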
character string specifying the linkage criterion used. Passed on to the method argument of hclust. Defaults to "ward.D". Only relevant if method is "hclust".
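For orientation, this is how the linkage criterion is passed to stats::hclust in plain R. The toy data and the choice of two clusters are assumptions for illustration only.

```r
# Toy data: two well-separated groups of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

# Hierarchical clustering on the distance matrix using Ward linkage
tree <- hclust(dist(x), method = "ward.D")

# Cut the tree into k = 2 clusters
cl <- cutree(tree, k = 2)
table(cl)
```

Other values accepted by hclust, such as "complete" or "average", can be substituted in the same place.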
integer specifying the number of reruns of the kmeans procedure. Larger numbers reduce the risk of settling on a local minimum. Passed on to the nstart argument of kmeans. Only relevant if method is "kmeans".
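The role of nstart can be seen by calling stats::kmeans directly. The toy data below are an assumption for illustration; the sketch compares the total within-cluster sum of squares of a single random start with the best of ten starts.

```r
# Toy data: three groups of 30 points each
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
           cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 4, sd = 0.5)))

# One random start versus the best solution out of ten random starts
fit_single <- kmeans(x, centers = 3, nstart = 1)
fit_multi  <- kmeans(x, centers = 3, nstart = 10)

c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)
```

kmeans keeps the run with the lowest total within-cluster sum of squares, so additional starts cost computation time but make a poor solution less likely.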
an integer specifying the number of bootstrap comparisons used by stability. See cStability.
boolean specifying whether the model-based or the model-free instability method should be used by stability when method is "kmeans". See cStability and Haslbeck & Wulff (2020).
integer specifying the number of simulated datasets used by gap. See Tibshirani et al. (2001).
logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distance between trajectories.
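The column-wise removal described above can be sketched in base R as follows. The array layout [trajectory, point, dimension] and the helper name drop_na_positions are assumptions for illustration, not the package's own code.

```r
# Hypothetical helper (not part of mousetrap): drop every recorded position
# that contains an NA in any trajectory or dimension.
drop_na_positions <- function(traj) {
  keep <- apply(traj, 2, function(p) !anyNA(p))
  traj[, keep, , drop = FALSE]
}

# Toy array: 3 trajectories, 5 positions, 2 dimensions
traj <- array(seq_len(3 * 5 * 2), dim = c(3, 5, 2))
traj[2, 4, 1] <- NA  # one missing value at the 4th position

cleaned <- drop_na_positions(traj)
dim(cleaned)
```

A single missing value therefore shortens every trajectory by one position, which keeps all trajectories the same length for the distance computation.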
logical indicating whether function should report its progress.
A list containing two lists that store the results of the different methods. kopt contains the estimated k for each of the methods specified in compute. paths contains the values for each k in kseq as computed by each of the methods specified in compute. The values in kopt are the optima of the corresponding vectors in paths.
mt_cluster_k estimates the number of clusters (k) using four commonly used k-selection methods (specified via compute): cluster stability (stability), the gap statistic (gap), the jump statistic (jump), and the slope statistic (slope).
Cluster stability methods select k as the number of clusters for which the assignment of objects to clusters is most stable across bootstrap samples. This function implements the model-based and model-free methods described by Haslbeck & Wulff (2020). See references.
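The idea behind stability-based selection can be sketched in base R. The helper below is a simplified, uncorrected model-free variant written for illustration: it clusters two bootstrap samples with kmeans and measures how often pairs of shared objects are grouped differently. The helper name instability, the toy data, and the omission of the correction proposed by Haslbeck & Wulff (2020) are all assumptions; this is not the implementation used by cStability.

```r
# Toy data: three well-separated groups of 30 points each
set.seed(1)
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.3), rnorm(30, centers[i, 2], 0.3))
}))

# Hypothetical helper: mean pairwise disagreement between two bootstrap
# clusterings, evaluated on the objects that occur in both samples.
instability <- function(x, k, n_bootstrap = 10) {
  n <- nrow(x)
  mean(replicate(n_bootstrap, {
    i1 <- sample(n, replace = TRUE)
    i2 <- sample(n, replace = TRUE)
    common <- intersect(i1, i2)
    c1 <- kmeans(x[i1, ], centers = k, nstart = 10)$cluster[match(common, i1)]
    c2 <- kmeans(x[i2, ], centers = k, nstart = 10)$cluster[match(common, i2)]
    # TRUE where a pair of objects shares a cluster in the respective sample
    same1 <- outer(c1, c1, "==")
    same2 <- outer(c2, c2, "==")
    mean(same1[upper.tri(same1)] != same2[upper.tri(same2)])
  }))
}

kseq <- 2:5
inst <- sapply(kseq, function(k) instability(x, k))
names(inst) <- kseq
inst
```

Lower values indicate more stable assignments. A raw measure like this one tends to be biased across values of k, which is the motivation for the corrected version described in the reference.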
The remaining three methods select k as the value that optimizes the gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic (Sugar & James, 2003), and the slope statistic (Fujita, Takahashi, & Patriota, 2014), respectively.
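As an example of how such a statistic yields an estimate of k, the following base R sketch computes a simplified jump statistic with kmeans. It assumes squared Euclidean distortion (identity covariance), the transformation power p/2, and candidates starting at k = 2; the toy data and variable names are illustrative, and the package's own computation may differ.

```r
# Toy data: three well-separated groups of 30 points each
set.seed(1)
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.3), rnorm(30, centers[i, 2], 0.3))
}))

n <- nrow(x)
p <- ncol(x)
kseq <- 2:6

# Distortion: within-cluster sum of squares per observation and dimension.
# The one-cluster distortion is computed directly from the column means.
d1 <- sum(scale(x, scale = FALSE)^2) / (n * p)
dk <- sapply(kseq, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss / (n * p)
})

# Transform the distortions with power -p/2 and take successive differences
transformed <- c(d1, dk)^(-p / 2)
jumps <- diff(transformed)
names(jumps) <- kseq

# The estimate is the k with the largest jump
k_jump <- kseq[which.max(jumps)]
k_jump
```

The vector jumps plays the role of a path over kseq, and k_jump is its optimum, mirroring the relationship between paths and kopt in the returned list.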
For clustering trajectories, it is often useful that the endpoints of all trajectories share the same direction, e.g., that all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use length-normalized trajectories (see mt_length_normalize; Wulff et al., 2019).
Haslbeck, J. M. B., & Wulff, D. U. (2020). Estimating the Number of Clusters via a Corrected Clustering Instability. Computational Statistics, 35, 1879–1894.
Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131–145). New York, NY: Routledge.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411–423.
Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463), 750–763.
Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27–39.
mt_distmat for more information about how the distance matrix is computed when the hclust method is used.
mt_cluster for performing trajectory clustering with a specified number of clusters.
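if (FALSE) {
# Load the package, which also provides the KH2017 example data
library(mousetrap)
# Length normalize trajectories
KH2017 <- mt_length_normalize(KH2017)
# Find k using all four methods (this can take a while)
results <- mt_cluster_k(KH2017, use = "ln_trajectories")
# Retrieve the estimated k per method
results$kopt
# Retrieve the underlying values for each k in kseq
results$paths
}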