How to find the prediction intervals of a Gaussian Process Regression via the caret and kernlab packages?

by Rafael Pedrollo de Paes | Last Updated: July 25, 2019, 21:19

I am trying to use a Gaussian Process Regression (GPR) model to predict hourly streamflow discharges in a river. I have obtained good results applying the caret::train() function with the kernlab-based "gaussprRadial" method.

Since uncertainty quantification is one of the main inherent advantages of GPR, I would like to know whether anyone could help me access the prediction intervals for the test dataset.

I'll include an extract of the code I've been working on. Since my real data are huge (and, honestly, I don't know how to post them here), I'll illustrate with data(airquality). The goal in this particular example is to predict airquality Ozone, using lagged airquality Temperature variables as predictors.

rm(list = ls())
data(airquality)

airquality = na.omit(as.data.frame(airquality)); str(airquality)

library(tidyverse)
airquality$Ozone %>% plot(type = 'l')
lines(airquality$Temp, col = 2)

# Build the lagged predictor/target data frame (lead/lag from dplyr)
df_lags = data.frame(Ozone_lead1 = lead(airquality$Ozone, n = 1L),
                     Ozone_lag1 = lag(airquality$Ozone, n = 1L),
                     Ozone_lag2 = lag(airquality$Ozone, n = 2L),
                     Ozone = airquality$Ozone,
                     Temp_lag1 = lag(airquality$Temp, n = 1L),
                     Temp = airquality$Temp)


sum(is.na(df_lags))          # Count NAs
library(magrittr)
df_lags %<>% na.omit()       # Remove lines with NAs
sum(is.na(df_lags))          # Count NAs

ESM_train = data.frame(df_lags[1:81, ]) # Training Observed 75% dataset
ESM_test = data.frame(df_lags[82:nrow(df_lags), ]) # Testing Observed 25% dataset

grid_gaussprRadial = expand.grid(sigma = c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 2)) # Candidate sigma values for the GPR tuning grid

# TRAIN MODEL ############################
# Tuning set
library(caret)
set.seed(111)
cvCtrl <- trainControl(
  method ="repeatedcv",
  repeats = 1,
  number = 20,
  allowParallel = TRUE,
  verboseIter = TRUE,
  savePredictions = "final")

# Train (approx. 4 seconds of run time)
set.seed(111)
system.time(Model_train <- caret::train(Ozone ~ Temp + Temp_lag1,
                                        trControl = cvCtrl,
                                        data = ESM_train,
                                        metric = "MAE", # MAE, since low (minimum) values are my main interest
                                        preProcess = c("center", "scale"),
                                        method = "gaussprRadial", # RBF kernel GPR
                                        tuneGrid = grid_gaussprRadial,
                                        maxit = 1000, # leftover nnet arguments; apparently
                                        linout = 1))  # not used by the kernlab gausspr fit

plot(Model_train)
Model_train
ESM_results_train <- Model_train$resample %>% mutate(Model = "") # K-fold Training measures

# Collect the TRAIN results of interest into a data frame
Ozone_Obs_Tr = Model_train$pred$obs
Ozone_sim = Model_train$pred$pred
Resid = Ozone_Obs_Tr - Ozone_sim
train_results = data.frame(Ozone_Obs_Tr,
                           Ozone_sim,
                           Resid)

# Plot Obs x Simulated train results
library(ggplot2)
ggplot(data = train_results, aes(x = Ozone_Obs_Tr, y = Ozone_sim)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "black")


# TEST MODEL ############################
# From "ESM_test" dataframe, we predict ESM Ozone time series, adding it in "ESM_forecasted" dataframe
ESM_forecasted = ESM_test %>%                                              
  mutate(Ozone_Pred = predict(Model_train, newdata = ESM_test, variance.model = TRUE))
str(ESM_forecasted)

# Collect the TEST results of interest into a data frame
Ozone_Obs = ESM_forecasted$Ozone
Ozone_Pred = ESM_forecasted$Ozone_Pred

# Plot Obs x Predicted TEST results
ggplot(data = ESM_forecasted, aes(x = Ozone_Obs, y = Ozone_Pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "black")


# Model performance #####
library(hydroGOF)
gof_TR = gof(Ozone_sim, Ozone_Obs_Tr)
gof_TEST = gof(Ozone_Pred,Ozone_Obs)
Performances = data.frame(
                          Train = gof_TR,
                          Test = gof_TEST
                          ); Performances

# Plot the TEST prediction
attach(ESM_forecasted)
plot(Ozone_Obs, type = "l", xlab = "", ylab = "", ylim = range(Ozone_Obs, Ozone_Pred))
lines(Ozone_Pred , col = "coral2", lty = 2, lwd = 2)
legend("top", legend = c("Ozone Obs Test", "Ozone Pred Test"),
       col=c(1, "coral2"), lty = 1:2, cex = 0.7, text.font = 4, inset = 0.01, box.lty=0, lwd = 2)

This last plot generates the following graph:

[Figure: Observed and predicted Ozone test time series]

The next, and last, step would be to extract the prediction intervals, which are based on a Gaussian distribution around each predicted point, so that I can plot them on top of this last plot.
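
For reference, if a vector of per-point predictive standard deviations were available (call it Ozone_SD; one possible way to obtain it is sketched after the tgp example further below), my understanding is that the 95% Gaussian interval and its overlay on this last plot would look roughly like this:

# Sketch only: assumes a vector "Ozone_SD" of predictive standard deviations exists
Upper_Bound <- ESM_forecasted$Ozone_Pred + 1.96 * Ozone_SD
Lower_Bound <- ESM_forecasted$Ozone_Pred - 1.96 * Ozone_SD

lines(Upper_Bound, col = "grey60", lty = 3)   # added to the base-R test plot above
lines(Lower_Bound, col = "grey60", lty = 3)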

The caret::train() approach returned better predictions than, for instance, plain kernlab::gausspr() or the tgp::bgp() function, and for both of those alternatives I could obtain prediction intervals; a kernlab sketch follows the tgp example below.

For example, the prediction intervals from tgp::bgp() can be picked up like this (here Ozone_Pred would be the object returned by bgp()):

Upper_Bound <- Ozone_Pred$ZZ.q2 # upper predictive quantile
Lower_Bound <- Ozone_Pred$ZZ.q1 # lower predictive quantile
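
For the kernlab route, what I have in mind is roughly the sketch below (my assumption, not something I have confirmed within the caret workflow): refit kernlab::gausspr() with the sigma that caret selected and variance.model = TRUE, then ask predict() for the standard deviations. Note that gausspr() scales its inputs by default, which roughly mirrors the center/scale preprocessing used in caret::train().

library(kernlab)

best_sigma <- Model_train$bestTune[[1]]   # the RBF sigma selected by caret

gpr_fit <- gausspr(Ozone ~ Temp + Temp_lag1,
                   data = ESM_train,
                   kernel = "rbfdot",
                   kpar = list(sigma = best_sigma),
                   variance.model = TRUE)   # needed for sd/variance predictions

Ozone_Pred_k <- as.numeric(predict(gpr_fit, ESM_test))                       # point predictions
Ozone_SD     <- as.numeric(predict(gpr_fit, ESM_test, type = "sdeviation"))  # predictive sd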

Ideally, though, I hope the required standard deviations can be found directly from the caret::train() fit, by typing something like:

Model_train$...

or maybe, with:

Ozone_Pred$...
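
For what it's worth, the caret object can at least be inspected along these lines (I don't know whether any of these components actually carries the standard deviations):

names(Model_train)                  # components of the caret train object
class(Model_train$finalModel)       # the underlying kernlab "gausspr" fit
slotNames(Model_train$finalModel)   # its S4 slots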

Moreover, in the linked question "Can MAD (median absolute deviation) or MAE (mean absolute error) be used to calculate prediction intervals?", Stephan Kolassa explained that prediction intervals can be estimated from the MAE, or even the RMSE. But I am not sure this applies to my case, since the MAE I obtained is just the comparison between observed and predicted Ozone in this example.
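
If that residual-based route were applicable here, I imagine it would look roughly like this sketch (assuming approximately Gaussian, constant-variance residuals, and using the cross-validated residuals in train_results$Resid from above):

resid_sd <- sd(train_results$Resid)   # spread of the cross-validated residuals

Upper_Bound_res <- ESM_forecasted$Ozone_Pred + 1.96 * resid_sd
Lower_Bound_res <- ESM_forecasted$Ozone_Pred - 1.96 * resid_sd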

Please, this solution is very important to me! I think I am close to obtaining my main results, but I don't know what else to try. Thanks a lot, friends!


