jeff6868
2012-Apr-23 10:42 UTC
[R] take data from a file to another according to their correlation coefficient
Hi everyone.

I have a question about some work in R that I have to do for my job. I have temperature data from 70 weather stations. One data file corresponds to one station for one year (so 70 files per year). Each file looks like this (important: each file contains NAs):

    time              data
    01/01/2008 00:00  -0.25
    01/01/2008 00:15  -0.18
    01/01/2008 00:30  -0.25
    01/01/2008 00:45  -0.25

(one column with the date and time every 15 minutes for the whole year, and one column with data).

I have already computed correlation matrices between my weather stations (in order to find the nearest ones). For example:

              Station1  Station2  Station3  [...]
    Station1  1         0.9       0.8
    Station2  0.9       1         0.7
    Station3  0.8       0.7       1
    [...]

Now I would like to fill the NA gaps of a station with data from another station, chosen according to their correlation coefficient.

Take Station 1 as an example: if the station most correlated with Station 1 is Station 2, the NA gaps of Station 1 should be filled with the data from Station 2 for the same date and time (or the same lines, since I compute the correlations within the same year).

So for the year 2008 (for example), if the highest correlation among all the stations is between Station 1 and Station 2, and if the data are:

    STATION 1
    time              data
    01/01/2008 00:00  1
    01/01/2008 00:15  2
    01/01/2008 00:30  *NA*
    01/01/2008 00:45  4

and

    STATION 2 (same year, same times)
    time              data
    01/01/2008 00:00  8
    01/01/2008 00:15  9
    01/01/2008 00:30  *10*
    01/01/2008 00:45  11

then the Station 1 file should become:

    STATION 1
    time              data
    01/01/2008 00:00  1
    01/01/2008 00:15  2
    01/01/2008 00:30  *10*
    01/01/2008 00:45  4

I hope you understand what I would like to do :)
Thanks a lot for your ideas and your replies!
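For reference, the literal operation described above could be sketched in R roughly as follows. This is only an illustration under assumptions: the stations are held as a named list of data frames with `time` and `data` columns, the Station3 values are made up, and `fill_from_best` is a hypothetical helper name, not an existing function.

```r
# Toy data from the example in the post; the Station3 values are invented.
times <- c("01/01/2008 00:00", "01/01/2008 00:15",
           "01/01/2008 00:30", "01/01/2008 00:45")
stations <- list(
  Station1 = data.frame(time = times, data = c(1, 2, NA, 4)),
  Station2 = data.frame(time = times, data = c(8, 9, 10, 11)),
  Station3 = data.frame(time = times, data = c(5, 7, 6, 9))
)

# Correlation matrix as given in the post.
cor_mat <- matrix(c(1.0, 0.9, 0.8,
                    0.9, 1.0, 0.7,
                    0.8, 0.7, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(names(stations), names(stations)))

# Hypothetical helper: copy values from the most correlated station
# into the NA gaps of `target`, matching rows by timestamp.
fill_from_best <- function(target, stations, cor_mat) {
  r <- cor_mat[target, ]
  r[target] <- NA                  # a station must not be its own donor
  donor <- names(which.max(r))     # which.max() skips NA entries
  x <- stations[[target]]
  y <- stations[[donor]]
  idx <- match(x$time, y$time)     # align the two files on the time column
  gaps <- is.na(x$data)
  x$data[gaps] <- y$data[idx[gaps]]
  x
}

filled <- fill_from_best("Station1", stations, cor_mat)
print(filled)
```

Gaps that are also missing in the donor station stay NA in this sketch; a fallback to the second most correlated station would need an extra loop.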
Sarah Goslee
2012-Apr-23 11:29 UTC
[R] take data from a file to another according to their correlation coefficient
Hi,

Even your example should show why this is a bad way to fill in missing weather data: you end up with a sequence for Station 1 of 1, 2, 10, 4, even though that's certainly wrong, because Station 2 is reliably 7 units above Station 1. "Correlated" doesn't mean "identical."

There are other, better options. If you're only missing a single value, interpolation between the values you do have for that station is likely better. If you're missing lots, regression of that station on another correlated station would be the more reasonable way to do what you're proposing here.

But in fact interpolation of weather data is very complicated, and the subject of a lot of research. The most realistic methods use elevation as a covariate. These may well be overkill for your situation, though, unless you are missing whole days of data.

Sarah
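The two simpler alternatives mentioned in the reply (interpolation within the station, and regression on a correlated station) might look something like this in base R. It is a minimal sketch on the four example values from the question, not a recommended procedure for real weather data; the variable names are assumptions chosen for illustration.

```r
# Example values from the question, with timestamps parsed as date-times.
times <- as.POSIXct(c("01/01/2008 00:00", "01/01/2008 00:15",
                      "01/01/2008 00:30", "01/01/2008 00:45"),
                    format = "%d/%m/%Y %H:%M", tz = "UTC")
dat <- data.frame(time = times,
                  s1 = c(1, 2, NA, 4),     # Station 1, with a gap
                  s2 = c(8, 9, 10, 11))    # Station 2, complete
gaps <- is.na(dat$s1)

# Option 1: linear interpolation in time, using only Station 1 itself.
s1_interp <- dat$s1
s1_interp[gaps] <- approx(x = as.numeric(dat$time[!gaps]),
                          y = dat$s1[!gaps],
                          xout = as.numeric(dat$time[gaps]))$y

# Option 2: regress Station 1 on Station 2 and predict the missing values.
# lm() drops the incomplete rows by default (na.action = na.omit).
fit <- lm(s1 ~ s2, data = dat)
s1_reg <- dat$s1
s1_reg[gaps] <- predict(fit, newdata = dat[gaps, , drop = FALSE])

print(cbind(dat, s1_interp, s1_reg))
```

On this toy example both approaches fill the gap with 3 rather than 10, because the fitted relationship is s1 = s2 - 7. With real data the two would generally differ, and neither accounts for elevation or other covariates.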