Bert Lloyd
2011-Sep-10  20:42 UTC
[R] npreg: plotting out of sample, extremely large bandwidths
Hello r-help,
I am using the excellent np package to conduct a nonparametric kernel
regression and am having some trouble plotting the results.
I have 2 covariates, x1 and x2, and a continuous outcome variable y. I
am conducting a nonparametric regression of y on x1 and x2. The one
somewhat unusual feature of these data is that, to be included in the
dataset, x1 must be at least as large as x2.
The basics of the analysis are to calculate the correct bandwidth
using npregbw, use npreg to estimate the nonparametric regression
(with the previously calculated bandwidth as an input), and plot the
results using plot (which calls npplot). A simple simulated example is
given below.
Two things are happening in the analysis that I do not understand:
First, although all the data have x1>=x2, plot returns a plot with
estimates for all values of (x1,x2), including where x1<x2 and
therefore there are no data. I have tried using the exdat option to
specify that the predictions should be carried out only on the
existing data, but this does not seem to work. (See example below.)
How can I tell plot / npplot to plot only within the range of the
data? Alternatively, can I specify the exact set of points at which
the plotting should occur?
Second, when I calculate the optimal bandwidth using npregbw, the
calculated bandwidth can be enormous, of order 10^7. I suspect this
may not be specifically related to my x1>=x2 issue, though, but is a
more general matter of sample size.
Are there standard recommendations for what to do when the bandwidth
is so large? Would an alternative method of calculating the bandwidth
be better?
Many thanks in advance for any help with these issues.
The simulated example follows below. I am not sure if this is relevant
but I am using R 2.13.1 via RStudio 0.94.102 on a Windows 7 64-bit PC.
The version of the np package is 0.40-9.
rm(list=ls())
library("np")
set.seed(1)
num_obs <- 400
# generate data
x1 <- runif(num_obs,min=0,max=1)
x2 <- runif(num_obs,min=0,max=1)
e <- rnorm(num_obs,0,1)
y <- x1^2 + x2^2 + x1*x2 + e
mydata <- data.frame(x1=x1,x2=x2,y=y,e=e)
summary(mydata)
rm(x1,x2,y,e)
# PROBLEM 1: plotting out of sample
# example 1: do not restrict data to x1>=x2
# calculate bandwidth using npregbw()
bandwidth <- npregbw(formula = y ~ x1 + x2,
    regtype="ll",
    bwmethod="cv.aic",
    data=mydata)
summary(bandwidth)
# perform nonparametric regression using bandwidth
results <- npreg(bws=bandwidth)
summary(results)
# use plot to plot results
plot(results,view="fixed",theta=300)
# example 2: restrict data to observations for which x1>=x2
mydata2 <- mydata[mydata$x1>=mydata$x2,]
# calculate bandwidth
bandwidth2 <- npregbw(formula=y ~ x1 + x2,
      regtype="ll",
      bwmethod="cv.aic",
      data=mydata2)
summary(bandwidth2)
# perform nonparametric regression using bandwidth2
results2 <- npreg(bws=bandwidth2)
summary(results2)
# use plot to plot results
# note that the plot is not limited to points where x1>=x2
plot(results2,view="fixed",theta=300)
evaluate <- data.frame(x1=mydata2$x1>=mydata2$x2)
# use exdata option to specify the set of (x1,x2) points
# note that this produces exactly the same plot
plot(results2,exdata=evaluate,view="fixed",theta=300)
# try specifying the evalution points in npreg itself
results3 <- npreg(bws=bandwidth2,exdata=evaluate)
summary(results3)
# the results are the same
plot(results3,view="fixed",theta=300)
plot(results3,exdata=evaluate,view="fixed",theta=300)
# PROBLEM 2: Bandwidth estimated in the tens of millions
# I suspect this is a question of sample size and not
# specifically related to the restriction that x1>=x2
# because (a) it did not occur in the above examples
# with 200 observations such that x1>=x2 and (b) it
# does occur in the final example below, which does not
# restrict the data to x1>=x2 but has only 100 observations
num_obs <- 200
# generate data
x1 <- runif(num_obs,min=0,max=1)
x2 <- runif(num_obs,min=0,max=1)
e <- rnorm(num_obs,0,1)
y <- x1^2 + x2^2 + x1*x2 + e
mydata <- data.frame(x1=x1,x2=x2,y=y,e=e)
summary(mydata)
rm(x1,x2,y,e)
# "reasonable" bandwidth with all observations
bandwidth <- npregbw(formula = y ~ x1 + x2,
    regtype="ll",
    bwmethod="cv.aic",
    data=mydata)
summary(bandwidth)
# now observe that the bandwidth for x1 increases by
# a factor of ~10^7 when we use only those points with
# x1>=x2
mydata <- mydata[mydata$x1>=mydata$x2,]
summary(mydata)
bandwidth <- npregbw(formula = y ~ x1 + x2,
    regtype="ll",
    bwmethod="cv.aic",
    data=mydata)
summary(bandwidth)
# however, the same occurs with 100 observations but
# not restricting to x1>=x2
num_obs <- 100
x1 <- runif(num_obs,min=0,max=1)
x2 <- runif(num_obs,min=0,max=1)
e <- rnorm(num_obs,0,1)
y <- x1^2 + x2^2 + x1*x2 + e
mydata <- data.frame(x1=x1,x2=x2,y=y,e=e)
summary(mydata)
rm(x1,x2,y,e)
bandwidth <- npregbw(formula = y ~ x1 + x2,
    regtype="ll",
    bwmethod="cv.aic",
    data=mydata)
summary(bandwidth)
