Classification cross-validation

In this tutorial we discuss how to perform cross-validation with Java-ML.

We assume that you know how to load data from a file, how to create a classifier and how to work with the PerformanceMeasure class.

Cross-validation in Java-ML is done with the CrossValidation class. The code below shows how to use it.


  import java.io.File;
  import java.util.Map;
  import net.sf.javaml.classification.Classifier;
  import net.sf.javaml.classification.KNearestNeighbors;
  import net.sf.javaml.classification.evaluation.CrossValidation;
  import net.sf.javaml.classification.evaluation.PerformanceMeasure;
  import net.sf.javaml.core.Dataset;
  import net.sf.javaml.tools.data.FileHandler;

  /* Load the data; the class label is in column 4, fields are comma-separated */
  Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
  /* Construct a KNN classifier that uses 5 neighbors */
  Classifier knn = new KNearestNeighbors(5);
  /* Construct a new cross-validation instance with the KNN classifier */
  CrossValidation cv = new CrossValidation(knn);
  /* Perform cross-validation on the data set (10-fold by default) */
  Map<Object, PerformanceMeasure> p = cv.crossValidation(data);

This example first loads the iris data set and then constructs a K-nearest neighbors classifier that uses 5 neighbors to classify instances.
In the next step we create a cross-validation instance with the constructed classifier.
Finally, we instruct the cross-validation to run on the loaded data. By default a 10-fold cross-validation is performed, and the result for each class is returned in a Map that maps each class label to its corresponding PerformanceMeasure.
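
Each PerformanceMeasure stores the basic counts (true/false positives and negatives) for one class, from which derived statistics can be obtained. As a minimal sketch, assuming the getAccuracy() accessor of PerformanceMeasure (accessors for other measures such as precision and recall follow the same pattern), the per-class results could be printed like this:

  /* Print the accuracy obtained for each class label */
  for (Map.Entry<Object, PerformanceMeasure> entry : p.entrySet()) {
      System.out.println(entry.getKey() + ": " + entry.getValue().getAccuracy());
  }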

Using the same folds for multiple runs

  /* Same imports as in the previous example, plus java.util.Random */

  /* Load data */
  Dataset data = FileHandler.loadDataset(new File("devtools/data/iris.data"), 4, ",");
  /* Construct KNN classifier */
  Classifier knn = new KNearestNeighbors(5);
  /* Construct a new cross-validation instance with the KNN classifier */
  CrossValidation cv = new CrossValidation(knn);
  /* 5-fold CV: the first two runs share a seed, the third uses a different one */
  Map<Object, PerformanceMeasure> p = cv.crossValidation(data, 5, new Random(1));
  Map<Object, PerformanceMeasure> q = cv.crossValidation(data, 5, new Random(1));
  Map<Object, PerformanceMeasure> r = cv.crossValidation(data, 5, new Random(25));


The example above performs three rounds of cross-validation on the data set. The first two runs use exactly the same folds because the random generator that creates the folds is initialized with the same seed. The third run uses a different seed and is therefore performed on different folds.

While this example uses the same classifier for all three runs, you can swap in a different classifier and thus compare several classifiers on exactly the same folds, as the sketch below illustrates.
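
As a sketch of this idea, reusing only the classes from the snippets above, the following evaluates two KNN variants on identical folds by seeding both runs with the same value:

  /* Two classifier variants; any Classifier implementation could be used */
  Classifier knn3 = new KNearestNeighbors(3);
  Classifier knn7 = new KNearestNeighbors(7);
  /* The identical seed guarantees that both classifiers see the same folds */
  Map<Object, PerformanceMeasure> p3 = new CrossValidation(knn3).crossValidation(data, 5, new Random(1));
  Map<Object, PerformanceMeasure> p7 = new CrossValidation(knn7).crossValidation(data, 5, new Random(1));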