--- title: "Regression with a functional covariate: Cross-validation analysis of the Tecator data set" author: "Haziq Jamil" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Regression with a functional covariate: Cross-validation analysis of the Tecator data set} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{iprior,caret} --- ```{r, include = FALSE} library(iprior) ``` Continuing on Section 3.4 of the [vignette](https://phd.haziqj.ml/iprior_paper.pdf), we revisit the code used to obtain out-of-sample test error rates, and extend the analysis to a leave-one-out cross-validation (LOOCV) scheme. # Easy train/test split The `iprior()` function includes an argument to conveniently instruct which data samples should be used for training, and any remaining data used for testing. The out-of-sample test error rates would then be reported together. The examples in the vignette can then be conducted as follows: ```{r} data(tecator, package = "caret") fat <- endpoints[, 2] absorp <- -t(diff(t(absorp))) # take first differences mod1 <- iprior(fat, absorp, train.samp = 1:172, method = "mixed") ``` The prediction error (training and test) can then be obtained easily: ```{r} get_prederror(mod1) ``` # LOOCV experiment With the above conveniences, it is easy to wrap this in loop to perform $k$-fold cross-validation; this is done in the `iprior_cv()` function. We now analyse the predictive performance of I-prior models using a LOOCV scheme. For all `n=215` samples, one observation pair is left out and the model trained; the prediction error is obtained for the observation that was left out. This is repeated for all `n=215` samples, and the average of the prediction errors calculated. For the linear RKHS, the code to peform the LOOCV in the `iprior` package is as follows: ```{r, eval = FALSE} (mod1.cv <- iprior_cv(fat, absorp, method = "em", control = list(stop.crit = 1e-2), folds = Inf)) ``` ```{r, echo = FALSE} data(tecator.cv, package = "iprior") print(tecator.cv[[1]], "RMSE") ``` Notice the argument `folds = Inf`---since the `iprior_cv()` function basically performs a $k$-fold cross validation experiment, setting `folds` to be equal to sample size or higher tells the function to perform LOOCV. Also note that the EM algorithm was used to fit the model, and the stopping criterion relaxed to `1e-2`---this offered faster convergence without affecting predictive abilities. The resulting fit gives training and test mean squared error (MSE) for the cross-validation experiment. The rest of the code for the remaining models are given below. As this takes quite a long time to run, this has been run locally and the results saved into the data `tecator.cv` within the `iprior` package. ```{r, eval = FALSE} mod2.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "poly2", est.offset = TRUE, control = list(stop.crit = 1e-2)) mod3.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "poly3", est.offset = TRUE, control = list(stop.crit = 1e-2)) mod4.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "fbm", control = list(stop.crit = 1e-2)) mod5.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "fbm", est.hurst = TRUE, control = list(stop.crit = 1e-2)) mod6.cv <- iprior_cv(fat, absorp, method = "em", folds = Inf, kernel = "se", est.lengthscale = TRUE, control = list(stop.crit = 1e-2)) # Function to tabulate the results tecator_tab_cv <- function() { tab <- t(sapply(list(mod1.cv, mod2.cv, mod3.cv, mod4.cv, mod5.cv, mod6.cv), function(mod) { res <- as.numeric(apply(sqrt(mod$mse[, -1]), 2, mean)) c("Training MSE" = res[1], "Test MSE" = res[2]) })) rownames(tab) <- c("Linear", "Quadratic", "Cubic", "fBm-0.5", "fBm-MLE", "SE-MLE") tab } ``` The results are tabulated below. ```{r, echo = FALSE} knitr::kable(iprior::dec_plac(tecator.cv[[7]], 2), align = "r", caption = "Results for the LOOCV experiment for various I-prior models.") ```