Regression with a functional covariate: Cross-validation analysis of the Tecator data set

Continuing on Section 3.4 of the vignette, we revisit the code used to obtain out-of-sample test error rates, and extend the analysis to a leave-one-out cross-validation (LOOCV) scheme.

Easy train/test split

The iprior() function includes an argument to conveniently instruct which data samples should be used for training, and any remaining data used for testing. The out-of-sample test error rates would then be reported together. The examples in the vignette can then be conducted as follows:

data(tecator, package = "caret")
fat <- endpoints[, 2]
absorp <- -t(diff(t(absorp)))  # take first differences
mod1 <- iprior(fat, absorp, train.samp = 1:172, method = "mixed")
## Running 5 initial EM iterations
## ================================================================================
## Now switching to direct optimisation
## iter   10 value 222.642108
## final  value 222.642108 
## converged

The prediction error (training and test) can then be obtained easily:

get_prederror(mod1)
## Training RMSE     Test RMSE 
##      2.890732      2.890353

LOOCV experiment

With the above conveniences, it is easy to wrap this in loop to perform k-fold cross-validation; this is done in the iprior_cv() function. We now analyse the predictive performance of I-prior models using a LOOCV scheme. For all n=215 samples, one observation pair is left out and the model trained; the prediction error is obtained for the observation that was left out. This is repeated for all n=215 samples, and the average of the prediction errors calculated.

For the linear RKHS, the code to peform the LOOCV in the iprior package is as follows:

(mod1.cv <- iprior_cv(fat, absorp, method = "em",
                      control = list(stop.crit = 1e-2), folds = Inf))
## Results from Leave-one-out Cross Validation 
## Training RMSE: 2.869906 
## Test RMSE    : 2.331397

Notice the argument folds = Inf—since the iprior_cv() function basically performs a k-fold cross validation experiment, setting folds to be equal to sample size or higher tells the function to perform LOOCV. Also note that the EM algorithm was used to fit the model, and the stopping criterion relaxed to 1e-2—this offered faster convergence without affecting predictive abilities. The resulting fit gives training and test mean squared error (MSE) for the cross-validation experiment.

The rest of the code for the remaining models are given below. As this takes quite a long time to run, this has been run locally and the results saved into the data tecator.cv within the iprior package.

mod2.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "poly2",
                     est.offset = TRUE, control = list(stop.crit = 1e-2))
mod3.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "poly3",
                     est.offset = TRUE, control = list(stop.crit = 1e-2))
mod4.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "fbm",
                     control = list(stop.crit = 1e-2))
mod5.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "fbm",
                     est.hurst = TRUE, control = list(stop.crit = 1e-2))
mod6.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "se",
                     est.lengthscale = TRUE, control = list(stop.crit = 1e-2))

# Function to tabulate the results
tecator_tab_cv <- function() {
  tab <- t(sapply(list(mod1.cv, mod2.cv, mod3.cv, mod4.cv, mod5.cv, mod6.cv),
                  function(mod) {
                    res <- as.numeric(apply(sqrt(mod$mse[, -1]), 2, mean))
                    c("Training MSE" = res[1], "Test MSE" = res[2])
                    }))
  rownames(tab) <- c("Linear", "Quadratic", "Cubic", "fBm-0.5", "fBm-MLE",
                     "SE-MLE")
  tab
}

The results are tabulated below.

Results for the LOOCV experiment for various I-prior models.
Training RMSE Test RMSE
Linear 2.87 2.33
Quadratic 2.98 2.66
Cubic 2.97 2.64
fBm-0.5 0.09 0.50
fBm-MLE 0.01 0.46
SE-MLE 0.36 2.07