Clustering basics

A clustering algorithm divides the original dataset into groups, called clusters. In Java-ML this is done with the cluster method of the Clusterer interface.

Creating and running a clustering algorithm

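  import java.io.File;
  import net.sf.javaml.clustering.Clusterer;
  import net.sf.javaml.clustering.KMeans;
  import net.sf.javaml.core.Dataset;
  import net.sf.javaml.tools.data.FileHandler;

  /* Load a dataset: the class label is in column 4, fields are separated by "," */
  Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
  /* Create a new instance of the KMeans algorithm, with no options
   * specified. By default this will generate 4 clusters. */
  Clusterer km = new KMeans();
  /* Cluster the data. The result is returned as an array of datasets,
   * with each dataset representing a cluster. */
  Dataset[] clusters = km.cluster(data);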

The code above loads the example iris data set. Next, it creates an instance of the K-means algorithm and uses it to cluster the data. The results are returned as an array of Datasets, where each Dataset represents one cluster.

Note that there is no guarantee that every original Instance will occur in the clusters, or that each Instance occurs only once. Some algorithms allow overlapping clusters, and others remove 'noisy' data points. This behavior is algorithm specific; you can find more information on the API page for each algorithm.
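One way to see whether a particular result has overlapping clusters or dropped data points is to count how often each original item occurs across all clusters. The following is a minimal sketch using only the Java standard library. The class ClusterCoverage and its methods are hypothetical helpers and not part of Java-ML, and plain lists of strings stand in for Dataset and Instance. Items that compare equal are counted together, so this sketch assumes the items in the original data are distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Java-ML): checks how clusters cover the original data. */
class ClusterCoverage {

    /** Counts, for every item of the original data, how often it occurs across all clusters. */
    static <T> Map<T, Integer> occurrences(Collection<T> original,
                                           List<? extends Collection<T>> clusters) {
        Map<T, Integer> counts = new LinkedHashMap<>();
        for (T item : original) {
            counts.put(item, 0);
        }
        for (Collection<T> cluster : clusters) {
            for (T item : cluster) {
                // Only items that belong to the original data are counted.
                counts.computeIfPresent(item, (key, n) -> n + 1);
            }
        }
        return counts;
    }

    /** Items that occur in no cluster, for example points removed as noise. */
    static <T> List<T> missing(Map<T, Integer> counts) {
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Items that occur more than once, for example points in overlapping clusters. */
    static <T> List<T> repeated(Map<T, Integer> counts) {
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("a", "b", "c", "d");
        // "b" is in two clusters (overlap); "d" is in none (treated as noise).
        List<List<String>> clusters = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("b", "c"));
        Map<String, Integer> counts = occurrences(original, clusters);
        System.out.println("Missing:  " + missing(counts));
        System.out.println("Repeated: " + repeated(counts));
    }
}
```

Assuming Dataset behaves as a list of Instance objects, the same check could probably be applied to a real result by passing the original data and Arrays.asList(clusters). Whether two Instances count as the same item then depends on how the Instance implementation defines equality.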