Hi folks,
I've looked through the list archives and online resources, but I
haven't really found an answer to this -- it's pretty basic, but I'm
(very much) not a statistician, and I just want to check that my
solution is statistically sound.
Basically, I have a data file containing two columns of data, call it data.tsv:
year	count
1999	3
2000	5
2001	9
2002	30
2003	62
2004	154
2005	245
2006	321
These look exponential to me, so what I want to do is plot these
points on a graph with linear axes, and add an exponential curve over
the top. I also want to give an R-squared for the fit.
The way I did it was like so:
# Read in the data, make a copy of it, and take logs
data = read.table("data.tsv", header=TRUE)
log.data = data
log.data$count = log(log.data$count)
# Fit a model to the logs of the data
model = lm(log.data$count ~ year, data = log.data)
# Plot the original data points on a graph
plot(data)
# Draw in the exponents of the model's output
lines(data$year, exp(fitted(model)))
Is this the right way to do it? log-ing the data and then exp-ing the
results seems like a bit of a long-winded way to achieve the desired
effect. Is the R-squared given by summary(model) a valid measurement
of the fit of the points to an exponential curve, and should I use
multiple R-squared or adjusted R-squared?
The R-squared I get from this method (0.98 multiple) seems a little
high going by the deviation of the last data point from the curve --
you'll see what I mean if you try it.
Thanks in advance for any help!
Yours gratefully,
Andrew.
On 24-Jul-07 01:09:06, Andrew Clegg wrote:> Hi folks, > > I've looked through the list archives and online resources, but I > haven't really found an answer to this -- it's pretty basic, but I'm > (very much) not a statistician, and I just want to check that my > solution is statistically sound. > > Basically, I have a data file containing two columns of data, call it > data.tsv: > > year count > 1999 3 > 2000 5 > 2001 9 > 2002 30 > 2003 62 > 2004 154 > 2005 245 > 2006 321 > > These look exponential to me, so what I want to do is plot these > points on a graph with linear axes, and add an exponential curve over > the top. I also want to give an R-squared for the fit. > > The way I did it was like so: > > ># Read in the data, make a copy of it, and take logs > data = read.table("data.tsv", header=TRUE) > log.data = data > log.data$count = log(log.data$count) > ># Fit a model to the logs of the data > model = lm(log.data$count ~ year, data = log.data) > ># Plot the original data points on a graph > plot(data) > ># Draw in the exponents of the model's output > lines(data$year, exp(fitted(model))) > > > Is this the right way to do it? log-ing the data and then exp-ing the > results seems like a bit of a long-winded way to achieve the desired > effect. Is the R-squared given by summary(model) a valid measurement > of the fit of the points to an exponential curve, and should I use > multiple R-squared or adjusted R-squared? > > The R-squared I get from this method (0.98 multiple) seems a little > high going by the deviation of the last data point from the curve -- > you'll see what I mean if you try it.I just did. From the plot of log(count) against year, with the plot of the linear fit of log(count)~year superimposed, I see indications of a non-linear relationship. The departures of the data from the fit follow a rather systematic pattern. Initially the data increase more slowly than the fit, and lie below it. Then they increase faster and corss over above it. Then the data increase less fast than the fit, and the final data point is below the fit. There are not enough data to properly identify the non-linearity, but the overall appearance of the data plot suggests to me that you should be considering one of the "growth curve" models. Many such models start of with an increasing rate of growth, which then slows down, and typically levels off to an asymptote. The apparent large discrepancy of your final data point could be compatible with this kind of behaviour. At this point, knowledge of what kind of thing is represented by your "count" variable might be helpful. If, for instance, it is the count of the numbers of individuals of a species in an area, then independent knowledge of growth mechanisms may help to narrow down the kind of model you should be tring to fit. As to your question about "Is this the right way to do it" (i.e. fitting an exponential curve by doing a linear fit of the logarithm), generally speaking the answer is "Yes". But of course you need to be confident that "exponential" is the right curve to be fitting in the first place. If it's the wrong type of curve to be considering, then it's not "the right way to do it"! Hoping this help[s, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 24-Jul-07 Time: 10:08:33 ------------------------------ XFMail ------------------------------
I think your way is probably the easiest (shockingly). For instance, here are
some alternatives - I think in both cases you have to calculate the
coefficient of determination (R^2) manually. My understanding is that
multiple R^2 in your case is the usual R^2 because you only have one
predictor variable, and the adjusted R^2 considers the degrees of freedom and
penalizes for additional predictors. Which is better... depends? (Perhaps
more stats-savvy people can help you on that one. I'm a chemical engineer so
I unjustifiably claim ignorance).
## Data input
input <-
"Year	Count
1999	3
2000	5
2001	9
2002	30
2003	62
2004	154
2005	245
2006	321"
dat <- read.table(textConnection(input),header=TRUE)
dat[,] <- lapply(dat,function(x) x-x[1])
          # shifting in origin; will need to add back in later
## Nonlinear least squares
plot(dat)
out <- nls(Count~b0*exp(b1*Year),data=dat,
           start=list(b0=1,b1=1))
lines(dat[,1],fitted(out),col=2)
out <- nls(Count~b0+b1*Year+b2*Year^2,data=dat, #polynomial
           start=list(b0=0,b1=1,b2=1))
lines(dat[,1],fitted(out),col=3)
## Optim
f <- function(.pars,.dat,.fun) sum((.dat[,2]-.fun(.pars,.dat[,1]))^2)
fitFun <- function(b,x) cbind(1,x,x^2)%*%b
expFun <- function(b,x) b[1]*exp(b[2]*x)
plot(dat)
out <- optim(c(0,1,1),f,.dat=dat,.fun=fitFun)
lines(dat[,1],fitFun(out$par,dat[,1]),col=2)
out <- optim(c(1,1),f,.dat=dat,.fun=expFun)
lines(dat[,1],expFun(out$par,dat[,1]),col=3)
--- Andrew Clegg <andrew.clegg at gmail.com> wrote:
> Hi folks,
> 
> I've looked through the list archives and online resources, but I
> haven't really found an answer to this -- it's pretty basic, but
I'm
> (very much) not a statistician, and I just want to check that my
> solution is statistically sound.
> 
> Basically, I have a data file containing two columns of data, call it
> data.tsv:
> 
> year	count
> 1999	3
> 2000	5
> 2001	9
> 2002	30
> 2003	62
> 2004	154
> 2005	245
> 2006	321
> 
> These look exponential to me, so what I want to do is plot these
> points on a graph with linear axes, and add an exponential curve over
> the top. I also want to give an R-squared for the fit.
> 
> The way I did it was like so:
> 
> 
> # Read in the data, make a copy of it, and take logs
> data = read.table("data.tsv", header=TRUE)
> log.data = data
> log.data$count = log(log.data$count)
> 
> # Fit a model to the logs of the data
> model = lm(log.data$count ~ year, data = log.data)
> 
> # Plot the original data points on a graph
> plot(data)
> 
> # Draw in the exponents of the model's output
> lines(data$year, exp(fitted(model)))
> 
> 
> Is this the right way to do it? log-ing the data and then exp-ing the
> results seems like a bit of a long-winded way to achieve the desired
> effect. Is the R-squared given by summary(model) a valid measurement
> of the fit of the points to an exponential curve, and should I use
> multiple R-squared or adjusted R-squared?
> 
> The R-squared I get from this method (0.98 multiple) seems a little
> high going by the deviation of the last data point from the curve --
> you'll see what I mean if you try it.
> 
> Thanks in advance for any help!
> 
> Yours gratefully,
> 
> Andrew.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>