Estimates the optimal number of clusters (k) using various methods.
mt_cluster_k(
  data,
  use = "sp_trajectories",
  dimensions = c("xpos", "ypos"),
  kseq = 2:15,
  compute = c("stability", "gap", "jump", "slope"),
  method = "hclust",
  weights = rep(1, length(dimensions)),
  pointwise = TRUE,
  minkowski_p = 2,
  hclust_method = "ward.D",
  kmeans_nstart = 10,
  n_bootstrap = 10,
  model_based = FALSE,
  n_gap = 10,
  na_rm = FALSE,
  verbose = FALSE
)
data | a mousetrap data object created using one of the mt_import functions (see mt_example for details). Alternatively, a trajectory array can be provided directly (in this case use will be ignored). |
---|---|
use | a character string specifying which trajectory data should be used. |
dimensions | a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively. |
kseq | a numeric vector specifying the set of candidates for k. Defaults to 2:15, implying that all values of k within that range are compared using the metrics specified in compute. |
compute | character vector specifying the measures to be computed. Can be any subset of c("stability", "gap", "jump", "slope") (the default). |
method | character string specifying the type of clustering procedure for the stability-based method. Either "hclust" (the default) or "kmeans". |
weights | numeric vector specifying the relative importance of the variables specified in dimensions. Defaults to a vector of 1s, that is, equal importance. |
pointwise | boolean specifying the way in which dissimilarity between the trajectories is measured. Defaults to TRUE; see mt_distmat for how the distance matrix is computed. |
minkowski_p | an integer specifying the distance metric for the cluster solution. Defaults to 2, the Euclidean distance. |
hclust_method | character string specifying the linkage criterion used. Passed on to the method argument of hclust. Defaults to "ward.D". |
kmeans_nstart | integer specifying the number of reruns of the kmeans procedure. Larger numbers reduce the risk of settling on a local minimum. Passed on to the nstart argument of kmeans. |
n_bootstrap | an integer specifying the number of bootstrap comparisons used by stability. |
model_based | boolean specifying whether the model-based or the model-free method should be used by stability. |
n_gap | integer specifying the number of simulated datasets used by gap. |
na_rm | logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distance between trajectories. |
verbose | logical indicating whether the function should report its progress. |
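The column-wise removal described for na_rm can be illustrated in base R. The following is a minimal sketch on a hypothetical matrix of x positions (rows are trajectories, columns are recorded positions); it only mirrors the behaviour described above and is not the package's internal code.

```r
# Hypothetical x positions: 3 trajectories (rows) x 5 recorded positions (columns)
xpos <- matrix(
  c(0, 1,  2, 3, 4,
    0, 2, NA, 5, 6,
    0, 1,  3, 4, 5),
  nrow = 3, byrow = TRUE
)

# Keep only the positions (columns) that are complete for every trajectory
complete_positions <- colSums(is.na(xpos)) == 0
xpos_clean <- xpos[, complete_positions, drop = FALSE]

# The third position is dropped for all trajectories, not only the one with the NA
dim(xpos_clean)
```

Because every trajectory keeps the same set of positions, distances between trajectories can still be computed position by position.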
A list containing two lists that store the results of the different methods. kopt contains the estimated k for each of the methods specified in compute. paths contains the values for each k in kseq as computed by each of the methods specified in compute. The values in kopt are optima for each of the vectors in paths.
mt_cluster_k estimates the number of clusters (k) using four commonly used k-selection methods (specified via compute): cluster stability (stability), the gap statistic (gap), the jump statistic (jump), and the slope statistic (slope).
Cluster stability methods select k as the number of clusters for which the assignment of objects to clusters is most stable across bootstrap samples. This function implements the model-based and model-free methods described by Haslbeck & Wulff (2016). See references.
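The idea behind stability-based selection can be sketched in base R. The code below is an illustrative simplification in the spirit of a model-free approach: it clusters pairs of bootstrap samples, compares the co-membership of the objects both samples share, and picks the k with the lowest average disagreement. The toy data, the helper names co_membership and instability, and the use of unique bootstrap indices are assumptions made for this example. It is not the mousetrap implementation and omits the normalization discussed by Haslbeck & Wulff (2016).

```r
set.seed(1)

# Toy data (assumption): three well-separated 2-D groups of 30 points each
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.2), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.2), ncol = 2)
)

# For every pair of objects: are both in the same cluster?
co_membership <- function(labels) {
  m <- outer(labels, labels, "==")
  m[upper.tri(m)]
}

# Average disagreement in co-membership between the clusterings of two
# bootstrap samples, evaluated on the objects that both samples contain
instability <- function(x, k, n_bootstrap = 10) {
  n <- nrow(x)
  disagreement <- numeric(n_bootstrap)
  for (b in seq_len(n_bootstrap)) {
    # Simplification: duplicated draws are collapsed to unique objects
    idx_a <- unique(sample(n, replace = TRUE))
    idx_b <- unique(sample(n, replace = TRUE))
    lab_a <- cutree(hclust(dist(x[idx_a, , drop = FALSE]), method = "ward.D"), k)
    lab_b <- cutree(hclust(dist(x[idx_b, , drop = FALSE]), method = "ward.D"), k)
    shared <- intersect(idx_a, idx_b)
    co_a <- co_membership(lab_a[match(shared, idx_a)])
    co_b <- co_membership(lab_b[match(shared, idx_b)])
    disagreement[b] <- mean(co_a != co_b)
  }
  mean(disagreement)
}

kseq <- 2:6
instab <- sapply(kseq, function(k) instability(x, k))

# Most stable solution = lowest instability
k_opt <- kseq[which.min(instab)]
```

For data with three clearly separated groups, the instability at k = 3 should be at or near zero, because every bootstrap sample recovers the same grouping.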
The remaining three methods select k as the value that optimizes the gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic (Sugar & James, 2003), and the slope statistic (Fujita, Takahashi, & Patriota, 2014), respectively.
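As an illustration of one of these criteria, the jump statistic can be sketched in a few lines of base R. The example assumes toy data, an identity covariance (so that the distortion reduces to the average squared Euclidean distance to the nearest cluster centre, per dimension), and the common transformation power Y = p / 2. It is a sketch under these assumptions, not the code used by mt_cluster_k.

```r
set.seed(2)

# Toy data (assumption): three well-separated 2-D groups of 50 points each
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.3), ncol = 2)
)
n <- nrow(x)
p <- ncol(x)
kseq <- 1:8

# Distortion d_k: average squared distance to the nearest centre, per dimension
distortion <- sapply(kseq, function(k) {
  kmeans(x, centers = k, nstart = 50)$tot.withinss / (n * p)
})

# Jumps in the transformed distortion d_k^(-Y), with Y = p / 2 and d_0^(-Y) = 0
transformed <- c(0, distortion^(-p / 2))
jumps <- diff(transformed)

# The estimate is the k with the largest jump
k_opt <- kseq[which.max(jumps)]
```

The large nstart value is a deliberate choice for the sketch: a poor k-means solution at the true k would shrink the corresponding jump and could shift the estimate.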
For clustering trajectories, it is often useful that the endpoints of all trajectories share the same direction, e.g., that all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use spatialized trajectories (see mt_spatialize; Wulff et al., 2019; Haslbeck et al., 2018).
Haslbeck, J., & Wulff, D. U. (2016). Estimating the Number of Clusters via Normalized Cluster Instability. arXiv preprint arXiv:1608.07494.
Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131-145). New York, NY: Routledge.
Haslbeck, J. M. B., Wulff, D. U., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2018). Advanced mouse- and hand-tracking analysis: Detecting and visualizing clusters in movement trajectories. Manuscript in preparation.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.
Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.
mt_distmat for more information about how the distance matrix is computed when the hclust method is used.
mt_cluster for performing trajectory clustering with a specified number of clusters.
if (FALSE) {
  # Spatialize trajectories
  KH2017 <- mt_spatialize(KH2017)

  # Find k
  results <- mt_cluster_k(KH2017, use="sp_trajectories")

  # Retrieve results
  results$kopt
  results$paths
}