Hi,
I have a set of measurements that are made on a daily basis over many
years. I would like to produce a *non-parametric* smooth of these data to
estimate the seasonal cycle - to achieve this, I have been using the cyclic
cubic splines from the mgcv package. This works superbly in most
situations, but not all.
The problem is that for various practical reasons the data is not available
all year around at some sites - in some cases only in summer for example.
The code below (hopefully) should give you an indication of the problem.
#Setup
set.seed(42)
n <- 500
#Generate some randomly distributed observations during (southern) summer
t <- rnorm(n,mean=0,sd=25)%%365
logchl <- cos(t/365*2*pi)^2 +rnorm(n,sd=0.025)
#Now add a seasonal GAM
library(mgcv)
mdl <- gam(logchl ~ s(t,bs="cc"),knots=list(t=c(0,365)))
#Plot
plot(t,logchl,xlab="Day of year",ylab="Log
Chl",xlim=c(0,365),xaxs="i")
t.grid <- data.frame(t=seq(0,365,len=500))
lines(t.grid$t,predict(mdl,newdata=t.grid),lwd=2,col="red",xpd=NA)
When you run it, you can see that the spline smoother is extrapolating well
outside the range of the data during the data-sparse middle of the year.
This is, after all, consistent with how a spline is defined. But in my case
it causes a lot of problems - in some cases the extrapolations can be
really extreme e.g. reaching +/-1e10. This is a pain.
I agree that trying to produce an estimate of the seasonal cycle in this
case is a bit silly - I'm doing it more for consistency with the rest of
the analysis. Ideally I would like to produce something that combines the
features of a gam with linear interpolation e.g. still smoothes where
there is data, but in the absence of data (or low data density regions),
essentially does a linear interpolation between the high-density regions.
Is there anyway that this can be achieved, either in mgcv or other
packages?
An alternative approach might be to make the weight given to an observation
proportional to the density of data in that region - I haven't tried this
as yet.
I'd love to hear peoples opinions these two approaches matter, or
suggestions for other approaches.
Best wishes,
Mark
List of approaches tried thus far
--------------------------------------------
Removing all spline knots in the data-poor regions
Adaptive smoother bases ie s(t,bs="ad")
Loess smoothers
Non-parametric kernel regressions (e.g. ksmooth())
[[alternative HTML version deleted]]