Perry de Valpine
2023-Feb-11 18:01 UTC
[Rd] inconsistent use of optim's parscale in Hessian vs gradient
Dear r-devel,

I noticed that optim's finite-difference derivative approximations apply the ndeps increments on the par/parscale scale for the gradient (as documented in help(optim)) but on the par scale for the Hessian. (The same goes for optimHess, which calls the same Hessian code.) This seems odd, so I am wondering whether it is intended. Maybe this is all well known, and thanks if all it takes is pointing me to any existing information on it.

Here is an example (all toy values). When parscale has its default value of 1, the issue doesn't arise. By printing the input, we can see which grid points are being used.

foo <- function(x) {print(x); x^4}
.External2(stats:::C_optimhess, 1, foo, NULL,
           list(fnscale = 1, parscale = 1, ndeps = 0.02))

With parscale = 1 and ndeps = 0.02, the grid points for the 2nd derivative at x = 1 are

x + 0.02, x - 0.02

and then the grid points for the 1st derivative around each of those are

x + 0.02 + 0.02, x + 0.02 - 0.02 for the first (around x + 0.02), and
x - 0.02 + 0.02, x - 0.02 - 0.02 for the second (around x - 0.02).

This all makes sense, and there is no problem yet. (There is one more function evaluation than needed, but that's not the point.) (BTW, I'm evaluating the Hessian at 1 instead of at the optimum at 0 for illustration; it is really just the grid points that illustrate the issue.)

But with parscale != 1, we can see that the epsilons are used on different scales at different steps:

.External2(stats:::C_optimhess, 1, foo, NULL,
           list(fnscale = 1, parscale = 3, ndeps = 0.02))

The grid points for the 2nd derivative at x = 1 are the same as above, so they are on the par scale:

x + 0.02, x - 0.02

but the grid points for the 1st derivatives then use par/parscale, giving

x + 0.02 + 0.06, x + 0.02 - 0.06 for the first, and
x - 0.02 + 0.06, x - 0.02 - 0.06 for the second.

The 0.06 increments come from 0.02 increments on par/parscale, which are then multiplied by parscale = 3 before foo is called. Is this intended?

I believe the 2nd-derivative increments on the scale of par occur at lines 461-462 of src/library/stats/src/optim.c, in optimhess, and the 1st-derivative increments on the scale of par/parscale occur at lines 138 and 146 of the same file, in fmingr.

A curious consequence in two dimensions is that the Hessian with respect to x[1] and then x[2] uses a different four grid points than the Hessian with respect to x[2] and then x[1], if they have different parscales.

foo2 <- function(x) {print(x); sum(x^4)}
.External2(stats:::C_optimhess, c(1, 1), foo2, NULL,
           list(fnscale = 1, parscale = c(3, 2), ndeps = c(0.02, 0.02)))

This prints the input x values from 16 calls to foo2, comprising 4 calls for each of the 4 Hessian elements. The grid points for dx[1] dx[2] are lines 3-4 and 7-8 of the output, which are

[1] 1.02 1.04
[1] 1.02 0.96
[1] 0.98 1.04
[1] 0.98 0.96

So the dx[1] steps are on par while the dx[2] steps are on par/parscale. The grid points for dx[2] dx[1] are lines 9-10 and 13-14 of the output, which are

[1] 1.06 1.02
[1] 0.94 1.02
[1] 1.06 0.98
[1] 0.94 0.98

So the dx[2] steps are on par while the dx[1] steps are on par/parscale.

The code on lines 472-477 of optim.c symmetrizes the two approximations of the same second derivative (from dx[1] dx[2] and dx[2] dx[1]) by averaging. This may be helpful numerically in any case, but was it intended to average over two potentially different grid schemes?

That these are the grid points being used can be confirmed by direct computation and comparison to the value returned by optimHess.

Thanks for any insight on this.

Perry
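
P.S. For concreteness, here is a minimal sketch of the direct computation for the 1-D example with parscale = 3, under my reading of the grid points above (the names d and D are just ad hoc labels for the two step sizes, not anything in stats):

foo <- function(x) x^4
d <- 0.02      # outer (2nd-derivative) step, applied on the par scale
D <- 0.02 * 3  # inner (1st-derivative) step: ndeps on par/parscale, i.e. ndeps * parscale on par
(foo(1 + d + D) - foo(1 + d - D) - foo(1 - d + D) + foo(1 - d - D)) / (4 * d * D)
optimHess(1, foo, control = list(parscale = 3, ndeps = 0.02))

If I have traced the grid points correctly, both lines return the same value (about 12.016, versus the exact second derivative 12 at x = 1).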