Bert Lloyd
2011-Sep-10 20:42 UTC
[R] npreg: plotting out of sample, extremely large bandwidths
Hello r-help, I am using the excellent np package to conduct a nonparametric kernel regression and am having some trouble plotting the results. I have 2 covariates, x1 and x2, and a continuous outcome variable y. I am conducting a nonparametric regression of y on x1 and x2. The one somewhat unusual feature of these data is that, to be included in the dataset, x1 must be at least as large as x2. The basics of the analysis are to calculate the correct bandwidth using npregbw, use npreg to estimate the nonparametric regression (with the previously calculated bandwidth as an input), and plot the results using plot (which calls npplot). A simple simulated example is given below. Two things are happening in the analysis that I do not understand: First, although all the data have x1>=x2, plot returns a plot with estimates for all values of (x1,x2), including where x1<x2 and therefore there are no data. I have tried using the exdat option to specify that the predictions should be carried out only on the existing data, but this does not seem to work. (See example below.) How can I tell plot / npplot to plot only within the range of the data? Alternatively, can I specify the exact set of points at which the plotting should occur? Second, when I calculate the optimal bandwidth using npregbw, the calculated bandwidth can be enormous, of order 10^7. I suspect this may not be specifically related to my x1>=x2 issue, though, but is a more general matter of sample size. Are there standard recommendations for what to do when the bandwidth is so large? Would an alternative method of calculating the bandwidth be better? Many thanks in advance for any help with these issues. The simulated example follows below. I am not sure if this is relevant but I am using R 2.13.1 via RStudio 0.94.102 on a Windows 7 64-bit PC. The version of the np package is 0.40-9. rm(list=ls()) library("np") set.seed(1) num_obs <- 400 # generate data x1 <- runif(num_obs,min=0,max=1) x2 <- runif(num_obs,min=0,max=1) e <- rnorm(num_obs,0,1) y <- x1^2 + x2^2 + x1*x2 + e mydata <- data.frame(x1=x1,x2=x2,y=y,e=e) summary(mydata) rm(x1,x2,y,e) # PROBLEM 1: plotting out of sample # example 1: do not restrict data to x1>=x2 # calculate bandwidth using npregbw() bandwidth <- npregbw(formula = y ~ x1 + x2, regtype="ll", bwmethod="cv.aic", data=mydata) summary(bandwidth) # perform nonparametric regression using bandwidth results <- npreg(bws=bandwidth) summary(results) # use plot to plot results plot(results,view="fixed",theta=300) # example 2: restrict data to observations for which x1>=x2 mydata2 <- mydata[mydata$x1>=mydata$x2,] # calculate bandwidth bandwidth2 <- npregbw(formula=y ~ x1 + x2, regtype="ll", bwmethod="cv.aic", data=mydata2) summary(bandwidth2) # perform nonparametric regression using bandwidth2 results2 <- npreg(bws=bandwidth2) summary(results2) # use plot to plot results # note that the plot is not limited to points where x1>=x2 plot(results2,view="fixed",theta=300) evaluate <- data.frame(x1=mydata2$x1>=mydata2$x2) # use exdata option to specify the set of (x1,x2) points # note that this produces exactly the same plot plot(results2,exdata=evaluate,view="fixed",theta=300) # try specifying the evalution points in npreg itself results3 <- npreg(bws=bandwidth2,exdata=evaluate) summary(results3) # the results are the same plot(results3,view="fixed",theta=300) plot(results3,exdata=evaluate,view="fixed",theta=300) # PROBLEM 2: Bandwidth estimated in the tens of millions # I suspect this is a question of sample size and not # specifically related to the restriction that x1>=x2 # because (a) it did not occur in the above examples # with 200 observations such that x1>=x2 and (b) it # does occur in the final example below, which does not # restrict the data to x1>=x2 but has only 100 observations num_obs <- 200 # generate data x1 <- runif(num_obs,min=0,max=1) x2 <- runif(num_obs,min=0,max=1) e <- rnorm(num_obs,0,1) y <- x1^2 + x2^2 + x1*x2 + e mydata <- data.frame(x1=x1,x2=x2,y=y,e=e) summary(mydata) rm(x1,x2,y,e) # "reasonable" bandwidth with all observations bandwidth <- npregbw(formula = y ~ x1 + x2, regtype="ll", bwmethod="cv.aic", data=mydata) summary(bandwidth) # now observe that the bandwidth for x1 increases by # a factor of ~10^7 when we use only those points with # x1>=x2 mydata <- mydata[mydata$x1>=mydata$x2,] summary(mydata) bandwidth <- npregbw(formula = y ~ x1 + x2, regtype="ll", bwmethod="cv.aic", data=mydata) summary(bandwidth) # however, the same occurs with 100 observations but # not restricting to x1>=x2 num_obs <- 100 x1 <- runif(num_obs,min=0,max=1) x2 <- runif(num_obs,min=0,max=1) e <- rnorm(num_obs,0,1) y <- x1^2 + x2^2 + x1*x2 + e mydata <- data.frame(x1=x1,x2=x2,y=y,e=e) summary(mydata) rm(x1,x2,y,e) bandwidth <- npregbw(formula = y ~ x1 + x2, regtype="ll", bwmethod="cv.aic", data=mydata) summary(bandwidth)