Doran, Harold
2009-Nov-29 19:40 UTC
[R] optim or nlminb for minimization, which to believe?
I have constructed the function mml2 (below) based on the likelihood function described in the minimal LaTeX pasted at the end, for anyone who wants to look at it. The function finds parameter estimates for a basic Rasch (IRT) model. Without the gradient, either nlminb or optim returns the correct parameter estimates and, in the case of optim, the correct standard errors. By correct, I mean they match another software program as well as the rasch() function in the ltm package.

Now, when I pass the gradient to optim, I get a message of successful convergence, but the parameter estimates are not correct -- though they are *very* close. When I use nlminb with the gradient, I get a message of false convergence, and again the estimates are off, but again very close. This is best illustrated via the examples:

### Sample data set
set.seed(1234)
tmp <- data.frame(item1 = sample(c(0,1), 20, replace = TRUE),
                  item2 = sample(c(0,1), 20, replace = TRUE),
                  item3 = sample(c(0,1), 20, replace = TRUE),
                  item4 = sample(c(0,1), 20, replace = TRUE),
                  item5 = sample(c(0,1), 20, replace = TRUE))

## Use function mml2 (below) with optim, with the gradient
> mml2(tmp, Q = 10)
$par
[1] -0.2438733  0.4889333 -0.2438733  0.4889333  0.7464162

$value
[1] 63.86376

$counts
function gradient
      45        6

$convergence
[1] 0

$message
NULL

$hessian
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 4.095479 0.000000 0.000000 0.000000 0.000000
[2,] 0.000000 3.986293 0.000000 0.000000 0.000000
[3,] 0.000000 0.000000 4.095479 0.000000 0.000000
[4,] 0.000000 0.000000 0.000000 3.986293 0.000000
[5,] 0.000000 0.000000 0.000000 0.000000 3.800898

## Same function, but nlminb with the gradient
> mml2(tmp, Q = 10)
$par
[1] -0.2456398  0.4948889 -0.2456398  0.4948889  0.7516308

$objective
[1] 63.86364

$convergence
[1] 1

$message
[1] "false convergence (8)"

$iterations
[1] 4

$evaluations
function gradient
      41        4

### nlminb with the gradient turned off
> mml2(tmp, Q = 10)
$par
[1] -0.2517961  0.4898682 -0.2517961  0.4898682  0.7500994

$objective
[1] 63.8635

$convergence
[1] 0

$message
[1] "relative convergence (4)"

$iterations
[1] 8

$evaluations
function gradient
      11       64

### optim with the gradient turned off
> mml2(tmp, Q = 10)
$par
[1] -0.2517990  0.4898676 -0.2517990  0.4898676  0.7500906

$value
[1] 63.8635

$counts
function gradient
      22        7

$convergence
[1] 0

$message
NULL

$hessian
           [,1]       [,2]       [,3]       [,4]       [,5]
[1,]  3.6311153 -0.3992959 -0.4224747 -0.3992959 -0.3764526
[2,] -0.3992959  3.5338195 -0.3992959 -0.3960956 -0.3798141
[3,] -0.4224747 -0.3992959  3.6311153 -0.3992959 -0.3764526
[4,] -0.3992959 -0.3960956 -0.3992959  3.5338195 -0.3798141
[5,] -0.3764526 -0.3798141 -0.3764526 -0.3798141  3.3784816

The parameter estimates with and without the gradient are so close that I am inclined to believe the gradient is correct and the problem lies elsewhere. It seems odd that optim reports convergence while nlminb does not when the gradient is used; and with the gradient, in either case, the estimates differ from what I believe are the correct values. So at this point I am unclear whether the problem is in the way the functions are called, in how I am passing the gradient, or in the way I have constructed the gradient itself.
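As an aside, both optimizers can print their progress through the standard control arguments, which may help anyone reproducing this. The calls below assume startVal, fn, and gradient as defined inside mml2 (further down):

## Print progress at every iteration
opt <- optim(startVal, fn, gr = gradient, method = "BFGS",
             control = list(trace = 1, REPORT = 1))
## nlminb's equivalent:
opt <- nlminb(startVal, fn, gradient = gradient,
              control = list(trace = 1))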
Below is the function, and also some LaTeX for those interested in the likelihood function.

Thanks for any reactions,
Harold

> sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ltm_0.9-2      polycor_0.7-7  sfsmisc_1.0-9  mvtnorm_0.9-8  msm_0.9.4
[6] MASS_7.3-3     MiscPsycho_1.5 statmod_1.4.1

loaded via a namespace (and not attached):
[1] splines_2.10.0  survival_2.35-7 tools_2.10.0

mml2 <- function(data, Q, startVal = NULL, gr = TRUE, ...){
    if(!is.null(startVal) && length(startVal) != ncol(data)){
        stop("Length of argument startVal not equal to the number of parameters estimated")
    }
    if(is.null(startVal)){
        p <- colMeans(data)
        startVal <- as.vector(log((1 - p)/p))
    }
    qq <- gauss.quad.prob(Q, dist = 'normal')  # requires the statmod package
    rr1 <- matrix(0, nrow = Q, ncol = nrow(data))
    data <- as.matrix(data)
    L <- nrow(data)
    C <- ncol(data)
    fn <- function(b){
        for(j in 1:Q){
            for(i in 1:L){
                rr1[j, i] <- exp(sum(dbinom(data[i, ], 1,
                                 plogis(qq$nodes[j] - b), log = TRUE))) * qq$weights[j]
            }
        }
        -sum(log(colSums(rr1)))
    }
    gradient <- function(b){
        p <- outer(qq$nodes, b, plogis) * L
        x <- t(matrix(colSums(data), nrow = C, ncol = Q))
        w <- matrix(qq$weights, Q, C)
        rr2 <- (-x + p) * w
        -colSums(rr2)
    }
    opt <- optim(startVal, fn, gr = gradient, hessian = TRUE, method = "BFGS")
    #opt <- nlminb(startVal, fn, gradient = gradient)
    #list("coefficients" = opt$par, "LogLik" = -opt$value,
    #     "Std.Error" = 1/sqrt(diag(opt$hessian)))
    opt
}

### LaTeX describing the likelihood function
\documentclass{article}
\usepackage{latexsym}
\usepackage{amssymb,amsmath,bm}
\newcommand{\Prb}{\operatorname{Prob}}
\title{Marginal Maximum Likelihood for IRT Models}
\author{Harold C. Doran}
\begin{document}
\maketitle
The likelihood function is written as:
\begin{equation}
\label{eqn:mml}
f(x) = \prod^N_{i=1}\int_{\Theta}\prod^K_{j=1}p(x|\theta,\beta)f(\theta)\,d\theta
\end{equation}
\noindent where $N$ indexes the total number of individuals, $K$ indexes the total number of items, $p(x|\theta,\beta)$ is the data likelihood, and $f(\theta)$ is a population distribution. For the Rasch model, the data likelihood is:
\begin{equation}
p(x|\theta,\beta) = \prod^N_{i=1}\prod^K_{j=1} \Pr(x_{ij} = 1 | \theta_i, \beta_j)^{x_{ij}} \left[1 - \Pr(x_{ij} = 1 | \theta_i, \beta_j)\right]^{(1-x_{ij})}
\end{equation}
\begin{equation}
\label{rasch}
\Pr(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 + e^{-(\theta_i-\beta_j)}} \quad i = 1, \ldots, N; \; j = 1, \ldots, K
\end{equation}
\noindent where $\theta_i$ is the ability of person $i$ and $\beta_j$ is the location parameter for item $j$. The integral in Equation~(\ref{eqn:mml}) has no closed-form expression, so it is approximated using Gauss-Hermite quadrature as:
\begin{equation}
\label{eqn:mml:approx}
f(x) \approx \prod^N_{i=1}\sum_{q=1}^{Q}\prod^K_{j=1}p(x|\theta_q,\beta)w_q
\end{equation}
\noindent where $\theta_q$ is the node and $w_q$ the weight at quadrature point $q$. With Equation~(\ref{eqn:mml:approx}), the remaining challenge is to find $\underset{\beta}{\operatorname{arg\,max}} \, f(x)$.
\end{document}
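P.S. For anyone who wants to convince themselves that the quadrature piece behaves as intended, here is a quick standalone check (not part of mml2): a rule from gauss.quad.prob() should reproduce a normal expectation computed by integrate(). The test function g and the value 0.5 are arbitrary stand-ins for a single Rasch item response probability.

library(statmod)

## g(theta) plays the role of one Rasch item response probability
## with an arbitrary item location of 0.5
g  <- function(theta) plogis(theta - 0.5)

## quadrature approximation of E[g(theta)], theta ~ N(0, 1)
qq <- gauss.quad.prob(10, dist = "normal")
sum(g(qq$nodes) * qq$weights)

## reference value by adaptive quadrature
integrate(function(z) g(z) * dnorm(z), -Inf, Inf)$value

The two numbers should agree to several decimal places.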
Hans W Borchers
2009-Nov-29 23:16 UTC
[R] optim or nlminb for minimization, which to believe?
Your function named 'gradient' is not the correct gradient. Take as an example the following point x0, very near the true minimum:

x0 <- c(-0.2517964, 0.4898680, -0.2517962, 0.4898681, 0.7500995)

Then you get

> gradient(x0)
[1] -0.0372110470  0.0001816991 -0.0372102284  0.0001820976  0.0144292657

but the numerical gradient is different:

> library(numDeriv)
> grad(fn, x0)
[1] -6.151645e-07 -5.507219e-07  1.969143e-07 -1.563892e-07 -4.955502e-08

That is, the derivative is close to 0 in every direction -- as it should be at an optimum. No wonder optim et al. get confused when applying this 'gradient'.

Regards,
Hans Werner
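For what it's worth, I believe the missing piece in the analytic gradient is the posterior weighting: differentiating log L_i brings in the factor A_iq / L_i (the person-by-node likelihood over the person's marginal likelihood), not the raw quadrature weight w_q alone. Below is a sketch of what the gradient of fn could look like, written against the same objects as in mml2; here X stands for the N x K 0/1 data matrix (the as.matrix(data) from inside mml2) and qq for the gauss.quad.prob object. A sketch under those assumptions, not vetted code:

## Sketch of an analytic gradient for fn (the negative marginal
## log-likelihood). Since d log A_iq / d b_j = p_qj - x_ij, the chain
## rule gives d fn / d b_j = sum_{i,q} pi_iq * (x_ij - p_qj), where
## pi_iq = w_q * A_iq / L_i are the posterior node weights.
gradient2 <- function(b) {
    Pq <- plogis(outer(qq$nodes, b, "-"))   # Q x K matrix of p_qj
    ## A[i, q] = prod_j p_qj^x_ij * (1 - p_qj)^(1 - x_ij)
    A  <- exp(X %*% t(log(Pq)) + (1 - X) %*% t(log(1 - Pq)))
    Li <- as.vector(A %*% qq$weights)       # marginal likelihood of person i
    Pi <- sweep(A, 1, Li, "/") *
          matrix(qq$weights, nrow(X), length(qq$weights), byrow = TRUE)
    colSums(t(Pi) %*% X - colSums(Pi) * Pq) # length-K gradient
}

A quick way to vet it is numDeriv again: max(abs(gradient2(x0) - grad(fn, x0))) should be tiny. And if the algebra is not worth the trouble, a slow but safe fallback is gr = function(b) numDeriv::grad(fn, b) in the optim() call (or gradient = ... for nlminb).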
Doran, Harold wrote:
> I have constructed the function mml2 (below) based on the likelihood
> function described in the minimal LaTeX pasted at the end, for anyone
> who wants to look at it. The function finds parameter estimates for a
> basic Rasch (IRT) model. [...]