object GapStatistic

The GapStatistic object is used to help determine the optimal number of clusters for a clusterer by comparing results to a reference distribution.

See also

web.stanford.edu/~hastie/Papers/gap.pdf

Linear Supertypes
AnyRef, Any

Value Members

  1. def cumDistance(x: MatrixD, cl: Clusterer, clustr: Array[Int], k: Int): VectorD

    Compute a sum of pairwise distances between points in each cluster (in one direction).

    x

    the vectors/points to be clustered stored as rows of a matrix

    cl

    the Clusterer used to compute the distance metric

    clustr

    the cluster assignments

    k

    the number of clusters
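
    The "one direction" wording suggests that each unordered pair of points is counted once (i < j) rather than twice. A minimal sketch of that idea in plain Scala follows. It is not the library's implementation: the `CumDistanceSketch` object, the `Array[Array[Double]]` representation (standing in for MatrixD and VectorD) and the squared Euclidean `sqDist` (standing in for the Clusterer's distance metric) are all assumptions made for illustration.

    ```scala
    object CumDistanceSketch {
      type Point = Array[Double]

      // Squared Euclidean distance; an assumed stand-in for the Clusterer's metric.
      def sqDist(u: Point, v: Point): Double =
        u.indices.map(j => (u(j) - v(j)) * (u(j) - v(j))).sum

      // For each cluster c in 0 until k, sum dist(x(i), x(j)) over the pairs i < j
      // whose points are both assigned to c, so every pair is counted exactly once.
      def cumDistance(x: Array[Point], clustr: Array[Int], k: Int,
                      dist: (Point, Point) => Double = sqDist): Array[Double] = {
        val sums = Array.fill(k)(0.0)
        for (i <- x.indices; j <- i + 1 until x.length if clustr(i) == clustr(j)) {
          sums(clustr(i)) += dist(x(i), x(j))
        }
        sums
      }
    }
    ```

    Counting each pair once halves the work compared with summing over all ordered pairs; a caller that needs the two-directional total can simply double each entry.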

  2. def kMeansPP(x: MatrixD, kMax: Int, algo: Algorithm = HARTIGAN, b: Int = 1, useSVD: Boolean = true, plot: Boolean = false): (KMeansPPClusterer, Array[Int], Int)

    Return a KMeansPPClusterer clustering on the given points with an optimal number of clusters k chosen using the Gap statistic.

    x

    the vectors/points to be clustered stored as rows of a matrix

    kMax

    the upper bound on the number of clusters

    algo

    the reassignment algorithm used by KMeansPlusPlusClusterer

    b

    the number of reference distributions to create (default = 1)

    useSVD

    use SVD to account for the shape of the points (default = true)

    plot

    whether or not to plot the logs of the within-SSEs (default = false)
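
    How k is chosen from the Gap statistic can be sketched as follows, using the definitions in the paper listed under "See also": Gap(k) is the mean over the b reference data sets of log(W*_kb) minus log(W_k), and the chosen k is the smallest one with Gap(k) >= Gap(k+1) - s_(k+1). Whether this method applies exactly that stopping rule is not stated on this page, so treat the code as an illustration of the technique only. The `GapSelectionSketch` object and its function names are hypothetical, and the within-SSE values are assumed to have been computed already.

    ```scala
    object GapSelectionSketch {
      // Gap(k): mean of log(W*_kb) over the reference data sets, minus log(W_k).
      def gap(w: Double, wRef: Seq[Double]): Double =
        wRef.map(v => math.log(v)).sum / wRef.size - math.log(w)

      // s_k: standard deviation of log(W*_kb), inflated by sqrt(1 + 1/B).
      def simError(wRef: Seq[Double]): Double = {
        val logs = wRef.map(v => math.log(v))
        val mean = logs.sum / logs.size
        val sd = math.sqrt(logs.map(l => (l - mean) * (l - mean)).sum / logs.size)
        sd * math.sqrt(1.0 + 1.0 / logs.size)
      }

      // w(i) and wRef(i) hold the within-SSE values for k = i + 1 clusters.
      // Returns the smallest k with Gap(k) >= Gap(k+1) - s_(k+1), or kMax if none.
      def chooseK(w: Seq[Double], wRef: Seq[Seq[Double]]): Int = {
        val gaps = w.indices.map(i => gap(w(i), wRef(i)))
        val errs = wRef.map(simError)
        (0 until w.size - 1)
          .find(i => gaps(i) >= gaps(i + 1) - errs(i + 1))
          .map(_ + 1)
          .getOrElse(w.size)
      }
    }
    ```

    With a single reference data set (b = 1) the standard deviation in this sketch is zero, so the rule reduces to picking the first k at which the gap stops increasing.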

  3. def reference(x: MatrixD, useSVD: Boolean = true, stream: Int = 0): MatrixD

    Compute a reference distribution based on a set of points.

    x

    the vectors/points to be clustered stored as rows of a matrix

    useSVD

    use SVD to account for the shape of the points (default = true)
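
    The simplest reference distribution described in the paper listed under "See also" draws each feature uniformly over the range observed in the data. The sketch below shows only that bounding-box variant, i.e. roughly what one might expect with useSVD = false; the SVD-based variant, which first rotates the data onto its principal axes, is omitted. The `ReferenceSketch` object and its `seed` parameter are hypothetical: the documented signature takes a `stream` parameter instead, whose meaning is not described on this page.

    ```scala
    import scala.util.Random

    object ReferenceSketch {
      // Draw x.length points uniformly over the axis-aligned bounding box of x.
      // No SVD rotation is applied in this sketch.
      def reference(x: Array[Array[Double]], seed: Long = 0L): Array[Array[Double]] = {
        val rng = new Random(seed)
        val dims = x.head.indices
        // Per-column minimum and maximum of the observed points.
        val lo = dims.map(j => x.map(p => p(j)).reduce((a, b) => math.min(a, b)))
        val hi = dims.map(j => x.map(p => p(j)).reduce((a, b) => math.max(a, b)))
        Array.fill(x.length) {
          dims.map(j => lo(j) + rng.nextDouble() * (hi(j) - lo(j))).toArray
        }
      }
    }
    ```

    Clustering such a structureless sample gives the baseline within-SSE against which the within-SSE of the real data is compared.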

  4. def withinSSE(x: MatrixD, cl: Clusterer, clustr: Array[Int], k: Int): Double

    Compute the within sum of squared errors in terms of distances between points within a cluster (in one direction).

    x

    the vectors/points to be clustered stored as rows of a matrix

    cl

    the Clusterer used to compute the distance metric

    clustr

    the cluster assignments

    k

    the number of clusters
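
    One way to express a within sum of squared errors through pairwise distances is to divide each cluster's one-direction sum of squared Euclidean distances by the cluster size; for squared Euclidean distance this equals the sum of squared deviations from the cluster centroid. The sketch below assumes that normalization and that metric, neither of which is confirmed by this page. The `WithinSSESketch` object is hypothetical and uses plain arrays in place of MatrixD and the Clusterer.

    ```scala
    object WithinSSESketch {
      type Point = Array[Double]

      // Squared Euclidean distance; an assumed stand-in for the Clusterer's metric.
      def sqDist(u: Point, v: Point): Double =
        u.indices.map(j => (u(j) - v(j)) * (u(j) - v(j))).sum

      // Sum over clusters of D_c / n_c, where D_c is the sum of squared distances
      // over pairs i < j inside cluster c and n_c is the size of cluster c.
      def withinSSE(x: Array[Point], clustr: Array[Int], k: Int): Double = {
        val d = Array.fill(k)(0.0)   // one-direction pairwise sums per cluster
        val n = Array.fill(k)(0)     // cluster sizes
        for (i <- x.indices) {
          n(clustr(i)) += 1
          for (j <- i + 1 until x.length if clustr(i) == clustr(j)) {
            d(clustr(i)) += sqDist(x(i), x(j))
          }
        }
        // Empty clusters contribute nothing.
        (0 until k).filter(c => n(c) > 0).map(c => d(c) / n(c)).sum
      }
    }
    ```

    For the four points (0, 0), (3, 4), (10, 10), (10, 12) split into two clusters of two, the per-cluster terms are 25 / 2 and 4 / 2, which match the squared deviations from the centroids (1.5, 2) and (10, 11).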