Li Jin
2018-Apr-03  00:07 UTC
[R] xgboost: problems with predictions for count data [SEC=UNCLASSIFIED]
Hi All,
I tried to use xgboost to model and predict count data. However, the
predictions are not as expected, as shown below.
# sponge count data in library(spm)
library(spm)
data(sponge)
data(sponge.grid)
names(sponge)
[1] "easting"  "northing" "sponge"   "tpi3"     "var7"     "entro7"   "bs34"     "bs11"
names(sponge.grid)
[1] "easting"  "northing" "tpi3"     "var7"     "entro7"   "bs34"     "bs11"
range(sponge[, c(3)])
[1]  1 39 # count sample data
# the expected predictions are:
set.seed(1234)
gbmpred1 <- gbmpred(sponge[, -c(3)], sponge[, 3], sponge.grid[, c(1:2)],
sponge.grid, family = "poisson", n.cores=2)
range(gbmpred1$Predictions)
[1] 10.04643 31.39230 # the expected predictions
# Here are results from xgboost
# use count:poisson
library(xgboost)
xgbst2.1 <- xgboost(data = as.matrix(sponge[, -c(3)]), label = sponge[, 3],
                    max_depth = 2, eta = 0.001, nthread = 6, nrounds = 3000,
                    objective = "count:poisson")
xgbstpred2 <- predict(xgbst2.1, as.matrix(sponge.grid))
head(xgbstpred2)
range(xgbstpred2)
[1] 1.109032 4.083049 # much lower than expected
table(xgbstpred2)
1.10903215408325 1.26556181907654   3.578040599823 4.08304929733276  # only four predictions, why?
           36535             2714            40930            15351
plot(gbmpred1$Predictions, xgbstpred2) # Fig 1
# use reg:linear
xgbst2.2 <- xgboost(data = as.matrix(sponge[, -c(3)]), label = sponge[, 3],
                    max_depth = 2, eta = 0.001, nthread = 6, nrounds = 3000,
                    objective = "reg:linear")
xgbstpred2.2 <- predict(xgbst2.2, as.matrix(sponge.grid))
head(xgbstpred2.2)
table(xgbstpred2.2)
range(xgbstpred2.2)
[1]  9.019174 23.060669 # this is much closer to, but still lower than, expected
plot(gbmpred1$Predictions, xgbstpred2.2) # Fig 2
# use count:poisson and subsample = 0.5
set.seed(1234)
param <- list(max_depth = 2, eta = 0.001, gamma = 0.001, subsample = 0.5,
              silent = 1, nthread = 6, objective = "count:poisson")
xgbst2.4 <- xgboost(data = as.matrix(sponge[, -c(3)]), label = sponge[, 3],
                    params = param, nrounds = 3000)
xgbstpred2.4 <- predict(xgbst2.4, as.matrix(sponge.grid))
head(xgbstpred2.4)
table(xgbstpred2.4)
range(xgbstpred2.4)
[1] 1.188561 3.986767 # this is much lower than expected
plot(gbmpred1$Predictions, xgbstpred2.4) # Fig 3
plot(xgbstpred2.2, xgbstpred2.4) # Fig 4
All these were run in R 3.3.3 on Windows:
> Sys.info()
                     sysname                      release
                   "Windows"                      "7 x64"
                     version
"build 7601, Service Pack 1"
                     machine
                    "x86-64"
Have I mis-specified or missed some parameters? Or is there a bug in xgboost? I
am grateful for any help.
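One possible explanation is sketched below as a hypothesis only. It rests on two assumptions about the xgboost version used that I did not set or check in the runs above: that the initial prediction (base_score) defaults to 0.5, and that count:poisson caps each leaf weight at max_delta_step = 0.7 by default. If both hold, the predictions cannot move far from 0.5 on the log scale with eta = 0.001 and nrounds = 3000, and the implied ceiling can be computed in base R:

```r
# Assumed xgboost defaults (NOT taken from the runs above; treat as hypotheses)
base_score     <- 0.5   # assumed initial prediction on the count scale
max_delta_step <- 0.7   # assumed default cap on leaf weights for count:poisson

# Settings actually used in the xgboost calls above
eta     <- 0.001
nrounds <- 3000

# count:poisson models log(mean), so each round can raise the log-scale
# prediction by at most eta * max_delta_step.
max_margin  <- log(base_score) + nrounds * eta * max_delta_step
upper_bound <- exp(max_margin)
print(upper_bound)  # compare with max(xgbstpred2) = 4.083049

# Rounds needed, under the same assumptions, before the largest gbm
# prediction (31.39230) would even be reachable
target        <- 31.39230
rounds_needed <- ceiling((log(target) - log(base_score)) / (eta * max_delta_step))
print(rounds_needed)
```

If the computed ceiling agrees with the largest xgboost prediction, that would point to too small a value of eta * nrounds (or of max_delta_step or base_score) rather than to a bug, and raising any of these would be the natural next thing to try.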
Kind regards,
Jin
Jin Li, PhD | Spatial Modeller / Computational Statistician
National Earth and Marine Observations | Environmental Geoscience Division
t:  +61 2 6249 9899    www.ga.gov.au<http://www.ga.gov.au/>
