thr3ads.net - R help - [R] Cross validation, one more time (hopefully the last) [Mar 2005]

If this information is useful, please help other people find it:
Share via:

Trevor Wiens

2005-Mar-17 00:59 UTC

[R] Cross validation, one more time (hopefully the last)

I apologize for posting on this question again, but unfortunately, I don't
have and can't get access to MASS for at least three weeks. I have found
some code on the web however which implements the prediction error algorithm in
cv.glm.

http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf

Now I've tried to adapt it to my purposes, but since I'm not deeply
familiar with R programming, I don't know why it doesn't work. Now
checking the r-help list faq it seems this is an appropriate question.

I've included my attempted function below. The error I get is:

logcv(basp.data, form, 'basp', 'recordyear')
Error in order(na.last, decreasing, ...) : 
        Argument 1 is not a vector

My questions are, why doesn't this work, and how do I fix it.

I'm using the formula function to create the formula that I'm sending to
my function. And the mdata is a data.frame. I'm assumed that if I passed the
column names as strings (response variable - rvar, fold variable - fvar) this
would work. Apparently however it doesn't.

Lastly, since I don't have access to MASS and there are apparently many
examples of doing this kind of thing in MASS, could someone tell me if this
function looks approximately correct?

Thanks

T

------------------------------------------------

logcv <- function(mdata, formula, rvar, fvar) {
require(Hmisc)

# sort by fold variable
sorted <- mdata[order(mdata$fvar), ]

# get fold values and count for each group
vardesc <- describe(sorted$fvar)$values
fvarlist <- as.integer(dimnames(vardesc)[[2]])
k <- length(fvarlist)
countlist <- vardesc[1,1]
for (i in 2:k)
{
countlist[i] <- vardesc[1,i]
}
n <- length(sorted$fvar)

# fit to all the mdata
fit.all <- glm(formula, sorted, family=binomial)
pred.all <- ifelse( predict(fit.all, type="response") < 0.5, 0,
1)

#setup
pred.c <- list()
error.i <- vector(length=k)

for (i in 1:k) 
{
fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]),
family=binomial)
pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted$fvar ==
fvarlist[i]), type="response") < 0.5, 0, 1)
pred.c[[i]] = pred.i
pred.all.i <- ifelse(predict(fit.i, newdata=sorted,
type="response") < 0.5, 0, 1)
error.i[i] <- sum(sorted$rvar != pred.all.i)/n
}
pred.cc <- unlist(pred.c)
delta.cv.k <- sum(sorted$rvar != pred.cc)/n
p.k <- countlist/n
delta.app <- mean(sorted$rvar != pred.all)/n

delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)

print(delta.acv.k)
}


-- 
Trevor Wiens 
twiens at interbaun.com

Trevor Wiens

2005-Mar-17 04:31 UTC

head link

[R] Cross validation, one more time (hopefully the last)

On Wed, 16 Mar 2005 17:59:01 -0700
Trevor Wiens <twiens at interbaun.com> wrote:
> I apologize for posting on this question again, but unfortunately, I
don't have and can't get access to MASS for at least three weeks. I have
found some code on the web however which implements the prediction error
algorithm in cv.glm.
> 
> http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf
> 
> Now I've tried to adapt it to my purposes, but since I'm not deeply
familiar with R programming, I don't know why it doesn't work. Now
checking the r-help list faq it seems this is an appropriate question.
> 
OK. I've determined why that didn't work. But I'm still unsure if
I've implemented the algorithm correctly. Any suggestions for testing would
be appreciated. The corrected function is attached.

Thanks for your assistance.

------------------------
logcv <- function(mdata, formula, rvar, fvar) {
require(Hmisc)

# determine index of variables
rpos <- match(rvar, names(mdata))
fpos <- match(fvar, names(mdata))

# sort by fold variable
sorted <- mdata[order(mdata[[fpos]]), ]

# get fold values and count for each group
vardesc <- describe(sorted[[fpos]])$values
fvarlist <- as.integer(dimnames(vardesc)[[2]])
k <- length(fvarlist)
countlist <- vardesc[1,1]
for (i in 2:k)
{
countlist[i] <- vardesc[1,i]
}
n <- length(sorted[[fpos]])

# fit to all the mdata
fit.all <- glm(formula, sorted, family=binomial)
pred.all <- ifelse( predict(fit.all, type="response") < 0.5, 0,
1)

#setup
pred.c <- list()
error.i <- vector(length=k)

for (i in 1:k) 
{
fit.i <- glm(formula, subset(sorted, sorted[[fpos]] != fvarlist[i]),
family=binomial)
pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted[[fpos]] ==
fvarlist[i]), type="response") < 0.5, 0, 1)
pred.c[[i]] = pred.i
pred.all.i <- ifelse(predict(fit.i, newdata=sorted,
type="response") < 0.5, 0, 1)
error.i[i] <- sum(sorted[[rpos]] != pred.all.i)/n
}
pred.cc <- unlist(pred.c)
delta.cv.k <- sum(sorted[[rpos]] != pred.cc)/n
p.k <- countlist/n
delta.app <- mean(sorted[[rpos]] != pred.all)/n

delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)

return(delta.acv.k)
}

----------

T
-- 
Trevor Wiens 
twiens at interbaun.com

Possibly Parallel Threads

Search for more seemingly similar threads

R help - Mar 2005 - Cross validation, one more time (hopefully the last)

[R] Cross validation, one more time (hopefully the last)

[R] Cross validation, one more time (hopefully the last)

Possibly Parallel Threads