I am trying to fit a rank-frequency distribution with 3 unknowns (a, b and k) to a set of data. This is my data set: y <- c(37047647,27083970,23944887,22536157,20133224, 20088720,18774883,18415648,17103717,13580739,12350767, 8682289,7496355,7248810,7022120,6396495,6262477,6005496, 5065887,4594147,2853307,2745322,454572,448397,275136,268771) and this is the fit I'm trying to do: nlsfit <- nls(y ~ a * x^k * b^x, start=list(a=5,k=1,b=3)) (It's a Yule distribution.) However, I keep getting: "Error in nls(y ~ a * x^k * b^x, start = list(a = 5, k = 1, b = 3)) : singular gradient" I guess this has something to do with the parameter start values. I was wondering, is there a fully automated way of estimating parameters which doesn't need start values close to the final estimates? I know other programs do it, so is it possible in R? Thanks, Andrew Wilson
Martin Maechler has pointed out to me that I omitted to include my values for x when circulating my query. They are a simple rank list from 1 to 26: x <- 1:26 Thanks, Andrew
Dr Andrew Wilson <eia018 at comp.lancs.ac.uk> writes:> I am trying to fit a rank-frequency distribution with 3 unknowns (a, b > and k) to a set of data. > > This is my data set: > > y <- c(37047647,27083970,23944887,22536157,20133224, > 20088720,18774883,18415648,17103717,13580739,12350767, > 8682289,7496355,7248810,7022120,6396495,6262477,6005496, > 5065887,4594147,2853307,2745322,454572,448397,275136,268771) > > and this is the fit I'm trying to do: > > nlsfit <- nls(y ~ a * x^k * b^x, start=list(a=5,k=1,b=3)) > > (It's a Yule distribution.) > > However, I keep getting: > > "Error in nls(y ~ a * x^k * b^x, start = list(a = 5, k = 1, b = 3)) : > singular gradient" > > I guess this has something to do with the parameter start values. > > I was wondering, is there a fully automated way of estimating parameters > which doesn't need start values close to the final estimates? I know > other programs do it, so is it possible in R?You don't seem to have an x anywhere. Are you making the (apparently not uncommon) mistake of trying to use a program for fitting nonlinear relations by least squares to fit a probability density? If so, look for fitdistr() instead. -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
Since x <- 1:26 and your y's are all positive, your model, ignoring the error term, is mathematically isomorphic to the following: > x <- 1:26 > (fit <- lm(y~x+log(x))) Call: lm(formula = y ~ x + log(x)) Coefficients: (Intercept) x log(x) 35802074 -371008 -8222922 With reasonable starting values, I would expect "a" to converge to roughly exp(35802074), "k" to (-8222922), and "b" to exp(-371008). With values of these magnitudes for "a" and "b", the "nls" optimization is highly ill conditioned. What do these numbers represent? By using "nls" you are assuming implicitly the following: y = a*x^k*b^x + e, where the e's are independent normal errors with mean 0 and constant variance. Meanwhile, the linear model I fit above assumes a different noise model: log(y) = log(a) + k*log(x) + x*log(b) + e, where these e's are also independent normal, mean 0, constant variance. If you have no preference for one noise model over the other, I suggest you use the linear model I estimated, declare victory and worry about something else. If you insist on estimating the multiplicative model, you should start by dividing y by some number like 1e6 or 1e7 and consider reparameterizing the problem if that is not adequate. Have you consulted a good book on nonlinear regression? The two references cited in "?nls" are both excellent: Bates, D.M. and Watts, D.G. (1988) _Nonlinear Regression Analysis and Its Applications_, Wiley Bates, D. M. and Chambers, J. M. (1992) _Nonlinear models._ Chapter 10 of _Statistical Models in S_ eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole. hope this helps. spencer graves Dr Andrew Wilson wrote:>I am trying to fit a rank-frequency distribution with 3 unknowns (a, b >and k) to a set of data. > >This is my data set: > >y <- c(37047647,27083970,23944887,22536157,20133224, >20088720,18774883,18415648,17103717,13580739,12350767, >8682289,7496355,7248810,7022120,6396495,6262477,6005496, >5065887,4594147,2853307,2745322,454572,448397,275136,268771) > >and this is the fit I'm trying to do: > >nlsfit <- nls(y ~ a * x^k * b^x, start=list(a=5,k=1,b=3)) > >(It's a Yule distribution.) > >However, I keep getting: > >"Error in nls(y ~ a * x^k * b^x, start = list(a = 5, k = 1, b = 3)) : >singular gradient" > >I guess this has something to do with the parameter start values. > >I was wondering, is there a fully automated way of estimating parameters >which doesn't need start values close to the final estimates? I know >other programs do it, so is it possible in R? > >Thanks, >Andrew Wilson > >______________________________________________ >R-help at stat.math.ethz.ch mailing list >https://www.stat.math.ethz.ch/mailman/listinfo/r-help > >
If the numbers are letter frequencies, I would suggest Poisson regression using "glm" the default link is logarithms, and that should work quite well for you. hope this helps. spencer graves ################################### Very many thanks for your help. >> What do these numbers represent? They are letter frequencies arranged in rank order. (A very big sample that I got off the web for testing, but my own data - rank frequencies of various linguistic entities, including letter frequencies - are likely to be similar.) Basically, I am testing the goodness of fit of three or four equations: - the one I posted (Yule's equation) - Zipf's equation (y = a * b^x, if I remember rightly, but the paper's at home, so I may be wrong...) - a parameter-free equation Regards, Andrew Wilson #################################### Since x <- 1:26 and your y's are all positive, your model, ignoring the error term, is mathematically isomorphic to the following:> x <- 1:26 > (fit <- lm(y~x+log(x)))Call: lm(formula = y ~ x + log(x)) Coefficients: (Intercept) x log(x) 35802074 -371008 -8222922 With reasonable starting values, I would expect "a" to converge to roughly exp(35802074), "k" to (-8222922), and "b" to exp(-371008). With values of these magnitudes for "a" and "b", the "nls" optimization is highly ill conditioned. What do these numbers represent? By using "nls" you are assuming implicitly the following: y = a*x^k*b^x + e, where the e's are independent normal errors with mean 0 and constant variance. Meanwhile, the linear model I fit above assumes a different noise model: log(y) = log(a) + k*log(x) + x*log(b) + e, where these e's are also independent normal, mean 0, constant variance. If you have no preference for one noise model over the other, I suggest you use the linear model I estimated, declare victory and worry about something else. If you insist on estimating the multiplicative model, you should start by dividing y by some number like 1e6 or 1e7 and consider reparameterizing the problem if that is not adequate. Have you consulted a good book on nonlinear regression? The two references cited in "?nls" are both excellent: Bates, D.M. and Watts, D.G. (1988) _Nonlinear Regression Analysis and Its Applications_, Wiley Bates, D. M. and Chambers, J. M. (1992) _Nonlinear models._ Chapter 10 of _Statistical Models in S_ eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole. hope this helps. spencer graves Dr Andrew Wilson wrote:>I am trying to fit a rank-frequency distribution with 3 unknowns (a, b >and k) to a set of data. > >This is my data set: > >y <- c(37047647,27083970,23944887,22536157,20133224, >20088720,18774883,18415648,17103717,13580739,12350767, >8682289,7496355,7248810,7022120,6396495,6262477,6005496, >5065887,4594147,2853307,2745322,454572,448397,275136,268771) > >and this is the fit I'm trying to do: > >nlsfit <- nls(y ~ a * x^k * b^x, start=list(a=5,k=1,b=3)) > >(It's a Yule distribution.) > >However, I keep getting: > >"Error in nls(y ~ a * x^k * b^x, start = list(a = 5, k = 1, b = 3)) : >singular gradient" > >I guess this has something to do with the parameter start values. > >I was wondering, is there a fully automated way of estimating parameters >which doesn't need start values close to the final estimates? I know >other programs do it, so is it possible in R? > >Thanks, >Andrew Wilson > >______________________________________________ >R-help at stat.math.ethz.ch mailing list >https://www.stat.math.ethz.ch/mailman/listinfo/r-help > >
Your starting values for the parameters are no even in the general ballpark. Here is what I got for the final fit: Parameters: Estimate Std. Error t value Pr(>|t|) a 3.806e+07 1.732e+06 21.971 <2e-16 *** k -3.391e-02 6.903e-02 -0.491 0.628 b 9.000e-01 1.240e-02 72.612 <2e-16 *** As you can see only b has the same order of magnitude as your starting b and a is different by 7 orders! In general, an nls procedure needs starting values that are "close". The same is true for any general nonlinear optimization algorithm. Only in special circumstances a nonlinear algorithm will converge from any starting point - one I know of is the convexity of the objective function. I do not think there is any nls program that will find starting values automatically for an arbitrary nonlinear model. It is possible for specific models, but is very much model dependent. Your model, for example, can be easily linearized by taking logs of both sides, i.e. mm <- lm(log(y) ~ x + I(log(x))) Then doing aa <- exp(coef(mm)[1]) bb <- exp(coef(mm)[2]) kk <- coef(mm)[3] mm1 <- nls(y ~ a * x^k * b^x, start=list(a=aa,k=kk,b=bb)) results in convergence with no problems in 7 iterations. Hope this helps, Andy __________________________________ Andy Jaworski 518-1-01 Process Laboratory 3M Corporate Research Laboratory ----- E-mail: apjaworski at mmm.com Tel: (651) 733-6092 Fax: (651) 736-3122 |---------+--------------------------------> | | Dr Andrew Wilson | | | <eia018 at comp.lancs.ac| | | .uk> | | | Sent by: | | | r-help-bounces at stat.m| | | ath.ethz.ch | | | | | | | | | 11/25/2003 03:09 | | | | |---------+--------------------------------> >-----------------------------------------------------------------------------------------------------------------------------| | | | To: r-help at stat.math.ethz.ch | | cc: | | Subject: [R] Parameter estimation in nls | >-----------------------------------------------------------------------------------------------------------------------------| I am trying to fit a rank-frequency distribution with 3 unknowns (a, b and k) to a set of data. This is my data set: y <- c(37047647,27083970,23944887,22536157,20133224, 20088720,18774883,18415648,17103717,13580739,12350767, 8682289,7496355,7248810,7022120,6396495,6262477,6005496, 5065887,4594147,2853307,2745322,454572,448397,275136,268771) and this is the fit I'm trying to do: nlsfit <- nls(y ~ a * x^k * b^x, start=list(a=5,k=1,b=3)) (It's a Yule distribution.) However, I keep getting: "Error in nls(y ~ a * x^k * b^x, start = list(a = 5, k = 1, b = 3)) : singular gradient" I guess this has something to do with the parameter start values. I was wondering, is there a fully automated way of estimating parameters which doesn't need start values close to the final estimates? I know other programs do it, so is it possible in R? Thanks, Andrew Wilson ______________________________________________ R-help at stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help