Hi All,
This question is not directly related to R itself; it is about how to
deal with missing data. I want to build wind roses, i.e. circular
histograms of wind directions and associated speeds, to look for trends
or changes in wind patterns over several decades at some meteorological
stations. My database contains hourly records of wind direction and
speed over the past 50 years, so it is obviously huge. Of course, a lot
of data are missing, and this is causing problems.
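To make clear what I mean by a wind rose, here is a minimal sketch of
the binning step, written in Python purely for illustration. The
function name, the 8 sectors and the speed class edges are my own
assumptions, and missing records (marked as None here) are simply
skipped and counted:

```python
def wind_rose_counts(records, n_sectors=8, speed_edges=(0.0, 5.0, 10.0)):
    """Count records per (direction sector, speed class), skipping missing ones.

    records: iterable of (direction_deg, speed) pairs; None marks a missing value.
    Sector 0 is centred on north (0 degrees); sectors proceed clockwise.
    speed_edges are hypothetical lower bounds of the speed classes.
    """
    width = 360.0 / n_sectors
    counts = [[0] * len(speed_edges) for _ in range(n_sectors)]
    n_missing = 0
    for direction, speed in records:
        if direction is None or speed is None:
            n_missing += 1
            continue
        # Shift by half a sector so that sector 0 spans [-width/2, +width/2).
        sector = int(((direction + width / 2.0) % 360.0) // width)
        # Speed class = index of the last edge that the speed reaches.
        speed_class = max(sum(1 for e in speed_edges if speed >= e) - 1, 0)
        counts[sector][speed_class] += 1
    return counts, n_missing


# Toy hourly records: (direction in degrees, speed); one record is missing.
toy = [(350, 3), (10, 7), (90, 12), (None, None), (180, 2)]
counts, n_missing = wind_rose_counts(toy)
```

With raw data the counts in each sector only reflect the hours that were
actually recorded, which is exactly where the artefacts come from.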
Two major problems arise from the temporal distribution of wind records:
1) Data are missing because of station shutdowns (consecutive missing
data over days, weeks, months and even years for some stations).
2) In the past, wind records were taken only during daytime, whereas
recent records cover both day and night.
On top of these situations, data can also be missing "at random". The
analysis is complicated by the fact that wind direction is a circular
variable, so specific tools must be used to handle it. I know there are
different ways to deal with missing data, such as multiple imputation,
but most assume that the variables are Gaussian. Moreover, when a record
is missing in the database, it is missing for all variables, so it is
apparently not possible to use other variables to estimate the missing
wind records.
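To illustrate why the circular nature matters, here is a small sketch
(Python, for illustration only; the function name is hypothetical) of
the usual vector-averaging mean direction. Directions of 350 and 10
degrees average to north, whereas the arithmetic mean would give 180:

```python
import math


def circular_mean_deg(directions):
    """Mean direction in degrees in [0, 360), ignoring None values.

    Also returns the mean resultant length (between 0 and 1), which is
    close to 1 when the directions are tightly concentrated.
    """
    angles = [math.radians(d) for d in directions if d is not None]
    if not angles:
        return None, None
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    mean = math.degrees(math.atan2(s, c)) % 360.0
    return mean, math.hypot(c, s)


mean_dir, r_bar = circular_mean_deg([350, 10, None])
# mean_dir is north up to rounding error (it may print as a value just
# below 360), while the arithmetic mean of 350 and 10 is 180.
```

Any imputation scheme that averages directions as ordinary numbers would
run into the same problem.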
For now I'm considering the following:
- Look at copula functions to build a bivariate distribution of wind
direction and speed, and simulate values from it to fill in missing
data. Produce several estimates of each missing value to assess the
variability of the final results. The bivariate distribution should be
modelled for every 5- or 10-year interval to accommodate a possible
trend in the data.
- Time series approach: it seems that wind direction and wind speed are
autocorrelated over . But this seems to be due to non-stationarity, since
computing the autocorrelation on the first derivative destroys everything
(the correlation of wind direction is computed using the
circular-circular correlation coefficient as defined by Mardia 1976).
- Correlate with other meteo stations: this is a problem because wind
patterns are affected by topography, for instance, and even nearby
stations may have different wind patterns. The correlation between
stations is also questionable, since a N wind will first affect northern
stations while a S wind will first affect southern stations, so I guess
the lagged correlation between stations may appear lower than it should
be.
- Neural networks: a data-driven approach, but since missing data are
missing for all variables, I do not have many inputs to feed into the
network.
- Data weighting: this sounds stupid, but I tried to weight the data
according to the time difference between records. Data next to a missing
value receive more weight than others, and the weight increases with the
number of missing values between two records. I thought about that
because I remember using Voronoi polygons in spatial statistics to
weight data according to the monitoring network density. However, I'm not
confident in this approach because I don't like the idea of giving a
higher weight to a record simply because it is surrounded by missing
values.
- Do nothing! Sometimes it's better to consider raw data rather than
applying questionable techniques. Computing wind roses with raw data
sure produces artefacts but....
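For the data weighting idea in the list above, this is roughly what I
have in mind, sketched in Python for illustration only. It is a
one-dimensional analogue of Voronoi polygons; the function name, the
handling of the end points and the normalisation are my own assumptions:

```python
def voronoi_time_weights(times):
    """1-D Voronoi-style weights for sorted observation times (e.g. hours).

    Each record gets half the gap to its previous neighbour plus half the
    gap to its next neighbour, so records bordering a long run of missing
    values weigh more. End points reuse their single gap on both sides
    (an assumption). Weights are normalised to sum to 1.
    """
    n = len(times)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    raw = []
    for i in range(n):
        left = times[i] - times[i - 1] if i > 0 else times[1] - times[0]
        right = times[i + 1] - times[i] if i < n - 1 else times[-1] - times[-2]
        raw.append((left + right) / 2.0)
    total = sum(raw)
    return [w / total for w in raw]


# Hours with a record; hours 3, 4 and 5 are missing.
weights = voronoi_time_weights([0, 1, 2, 6, 7])
```

Here the records at hours 2 and 6, which border the gap, receive 2.5
times the weight of the others, which is precisely the behaviour I am
not comfortable with.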
Well, now you know more or less that I do not know a lot about the topic
of missing data and desperately need your help :) If you have any hints
on which techniques I could use, or general advice, please let me know.
Thanks a lot,
Aziz