thr3ads.net - R help - [R] relation in aggregated data [Jul 2010]

Petr PIKAL

2010-Jul-07 14:24 UTC

[R] relation in aggregated data

Dear all

My question is more about statistics than about R, although it can be 
demonstrated in R. It concerns the pros and cons of looking for a 
relationship in aggregated data. I have two variables that may be related, 
and I measure them regularly over some period (say a year), but I cannot 
measure them at the same time (i.e. I cannot measure x together with the 
corresponding value of y; usually I have 3 or more values of x and only one 
value of y per day). 

I can compute aggregated values (say quarterly). My questions are:

1.      Is such an approach sound? Can I use it?
2.      What problems could arise?
3.      Is there any other method to inspect variables which may be 
related but cannot be measured directly at the same time?

My opinion is that inspecting aggregated values is not very sound and that 
there can be many traps, especially if there are only a few aggregated 
values. Below you can see my examples.

If you have an opinion on this issue, please let me know.

Best regards
Petr

Let us take a relation between x and y:

set.seed(555)
x <- rnorm(120)
y <- 5*x+3+rnorm(120)
plot(x, y)

As you can see, the plot shows a clear relation. Now I make a factor for 
aggregation.

fac <- rep(1:4,each=30)

xprum <- tapply(x, fac, mean)
yprum <- tapply(y, fac, mean)
plot(xprum, yprum)

The relationship is completely gone. Now let us make some other fake data:

xn <- runif(120)*rep(1:4, each=30)
yn <- runif(120)*rep(1:4, each=30)
plot(xn,yn)

There is no visible relation; xn and yn are independent of each other, but 
both are related to the aggregation factor.

xprumn <- tapply(xn, fac, mean)
yprumn <- tapply(yn, fac, mean)
plot(xprumn, yprumn)

Here you can see a perfect relation, which is due only to the aggregation factor.
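
The plots above can be backed up with numbers. The following is a minimal sketch, not part of the original post, that reruns both examples and prints the correlation before and after aggregation. The helper `agg_cor` is a hypothetical name introduced here for illustration. Keep in mind that a correlation computed from only four group means is itself very uncertain.

```r
# Same simulated data as in the examples above
set.seed(555)
x <- rnorm(120)
y <- 5 * x + 3 + rnorm(120)
fac <- rep(1:4, each = 30)

# Independent variables that both scale with the aggregation factor
xn <- runif(120) * fac
yn <- runif(120) * fac

# Hypothetical helper: correlation between group means of a and b
agg_cor <- function(a, b, g) {
  cor(as.numeric(tapply(a, g, mean)), as.numeric(tapply(b, g, mean)))
}

res <- c(raw_xy   = cor(x, y),
         agg_xy   = agg_cor(x, y, fac),
         raw_xnyn = cor(xn, yn),
         agg_xnyn = agg_cor(xn, yn, fac))
print(round(res, 2))
```

Comparing the raw and aggregated values side by side shows how far the aggregated picture can drift from the one in the raw data.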

Joris Meys

2010-Jul-07 15:33 UTC

Your examples are pretty extreme... Combining 120 data points into 4
points is of course never going to give a result. Try:

fac <- rep(1:8,each=15)
xprum <- tapply(x, fac, mean)
yprum <- tapply(y, fac, mean)
plot(xprum, yprum)

Relation is not obvious, but visible.

Yes, you lose information. Yes, your hypothesis changes. But in the
case you describe, averaging the x-values for every day (so you get an
average linked to one y value) seems like a possibility, provided you take
that into account when formulating the hypothesis. Ideally, you
should take the standard error of the average into account in the
analysis, but this is complicated and often not done, and in most cases
ignoring it does not influence the results enough to matter.
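
A minimal sketch of the daily-averaging idea, under assumptions that are not in the thread: the data are simulated, with three noisy x readings and one y value per day, and all variable names are made up for illustration. It computes the daily mean of x and its standard error, then regresses y on the daily mean. It does not model the standard error in the fit; it only shows how to obtain it so that its size can be judged.

```r
set.seed(1)
n_days <- 90
true_x <- rnorm(n_days)                  # unobserved daily level (assumption)
day <- rep(seq_len(n_days), each = 3)    # three x readings per day

# Simulated measurements: noisy x readings, one y per day
x_obs <- true_x[day] + rnorm(length(day), sd = 0.5)
y_obs <- 5 * true_x + 3 + rnorm(n_days)

# Daily mean of x and its standard error
x_bar <- as.numeric(tapply(x_obs, day, mean))
x_se  <- as.numeric(tapply(x_obs, day, function(v) sd(v) / sqrt(length(v))))

# Ordinary regression of y on the daily mean of x
fit <- lm(y_obs ~ x_bar)
print(coef(fit))
print(summary(x_se))
```

If the standard errors are small relative to the spread of the daily means, ignoring them should matter little; if they are large, the fitted slope tends to be pulled toward zero.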

YMMV

Cheers


-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

Petr PIKAL

2010-Jul-08 08:03 UTC

Thank you

Actually, when I do this myself I always try to compute daily or weekly 
averages if possible. However, this analysis was done by one of my 
colleagues, and the aggregation was basically done by campaign. There are 
4 to 6 campaigns per year, and sometimes there is an apparent relationship 
in the aggregated data and sometimes there is not. My opinion is that I 
cannot say much about the exact relations unless I have other clues, such 
as the expected underlying laws of physics.

Thanks again

Best regards
Petr




Joris Meys

2010-Jul-08 08:44 UTC

Depending on the data and the research question, a meta-analytic
approach might be appropriate. You can see every campaign as a
"study". See the package metafor for example. You can only draw very
general conclusions, but at least your inference will be closer to
correct.
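
To illustrate the "every campaign is a study" idea, here is a rough base-R sketch of a fixed-effect pooling of per-campaign correlations via Fisher's z transform. It is not the metafor workflow itself (that package provides more complete models, including random effects), and it rests on assumptions not stated in the thread: the data are simulated, x and y are assumed to be available as paired daily values within each campaign, and all names are hypothetical.

```r
set.seed(42)
n_camp <- 5
n_days <- 40                                   # paired daily values per campaign (assumption)
camp <- rep(seq_len(n_camp), each = n_days)

# Simulated data: a campaign-level shift affects both variables
shift <- rnorm(n_camp, sd = 2)[camp]
x <- rnorm(length(camp)) + shift
y <- 0.5 * (x - shift) + rnorm(length(camp)) + shift

# Correlation within each campaign, treated as one "study" each
r_i <- sapply(split(seq_along(camp), camp), function(i) cor(x[i], y[i]))
n_i <- as.numeric(table(camp))

z_i <- atanh(r_i)          # Fisher z transform
v_i <- 1 / (n_i - 3)       # approximate sampling variance of z
w_i <- 1 / v_i             # inverse-variance weights

z_pool  <- sum(w_i * z_i) / sum(w_i)
se_pool <- sqrt(1 / sum(w_i))
r_pool  <- tanh(z_pool)
ci      <- tanh(z_pool + c(-1, 1) * qnorm(0.975) * se_pool)

print(round(c(pooled_r = r_pool, lower = ci[1], upper = ci[2]), 2))
```

Because each correlation is computed inside a campaign, the campaign-level shift that would dominate a plot of campaign means does not enter the pooled estimate.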

Cheers
Joris

