thr3ads.net - R help - [R] Joining two datasets - recursive procedure? [Mar 2015]

Bert Gunter

2015-Mar-22 17:32 UTC

[R] Fwd: Joining two datasets - recursive procedure?

Nonsense. You are not telling us something or I have failed to
understand something.

Consider:

v1 = c("a","b")
v2 = c("a","a")

It is not possible to change the value of a sum of values
corresponding to v2="a" without also changing that for v1, which is
not supposed to change according to my understanding of your
specification.
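
A minimal sketch of this counterexample, assuming made-up v4 values (the data frame `d` and the helpers `by_v1` and `by_v2` are hypothetical names used only for illustration):

```r
# Two rows, two v1 levels, one v2 level; the v4 values are invented.
d <- data.frame(v1 = c("a", "b"), v2 = c("a", "a"), v4 = c(1, 2),
                stringsAsFactors = FALSE)

# Hypothetical helpers: v4 totals per level of v1 and of v2.
by_v1 <- function(x) tapply(x$v4, x$v1, sum)
by_v2 <- function(x) tapply(x$v4, x$v2, sum)

# Move the v2 = "a" total by changing a single row ...
d2 <- d
d2$v4[1] <- d2$v4[1] + 0.5

by_v2(d2) - by_v2(d)  # the v2 = "a" total has changed
by_v1(d2) - by_v1(d)  # ... and so has the v1 = "a" total
```

In this layout every row is its own v1 level, so any change to a v2 total must show up in some v1 total.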

So I'm done.

-- Bert


Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Sun, Mar 22, 2015 at 8:28 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
> Sorry forgot to keep the rest of the group in the loop - Luca
> ---------- Forwarded message ----------
> From: Luca Meyer <lucam1968 at gmail.com>
> Date: 2015-03-22 16:27 GMT+01:00
> Subject: Re: [R] Joining two datasets - recursive procedure?
> To: Bert Gunter <gunter.berton at gene.com>
>
>
> Hi Bert,
>
> That is exactly what I am trying to achieve. Please notice that negative v4
> values are allowed. I have done a similar task in the past manually by
> recursively altering the v4 distribution across v3 categories within each
> fixed v1&v2 combination, so I am quite positive it can be achieved. But
> honestly, it took me forever to do it manually, and since this is likely to
> be an exercise I need to repeat from time to time, I wish I could learn how
> to do it programmatically....
>
> Thanks again for any further suggestion you might have,
>
> Luca
>
>
> 2015-03-22 16:05 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>
>> Oh, wait a minute ...
>>
>> You still want the marginals for the other columns to be as originally?
>>
>> If so, then this is impossible in general as the sum of all the values
>> must be what they were originally and you cannot therefore choose your
>> values for V3 arbitrarily.
>>
>> Or at least, that seems to be what you are trying to do.
>>
>> -- Bert
>>
>> On Sun, Mar 22, 2015 at 7:55 AM, Bert Gunter <bgunter at gene.com> wrote:
>> > I would have thought that this is straightforward given my previous
>> > email...
>> >
>> > Just set z to what you want -- e.g., all B values to 29/number of B's,
>> > and all C values to 2.567/number of C's (etc. for more categories).
>> >
>> > A slick but sort of cheat way to do this programmatically -- in the
>> > sense that it relies on the implementation of factor() rather than its
>> > API -- is:
>> >
>> > y <- f1$v3  ## to simplify the notation; could be done using with()
>> > z <- (c(29,2.567)/table(y))[c(y)]
>> >
>> > Then proceed to z1 as I previously described
>> >
>> > -- Bert
>> >
>> > On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >> Hi Bert, hello R-experts,
>> >>
>> >> I am close to a solution but I still need one hint w.r.t. the following
>> >> procedure (available also from
>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)
>> >>
>> >> rm(list=ls())
>> >>
>> >> # this is (an extract of) the INPUT file I have:
>> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273,
>> >> 1.42917, 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >> row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
>> >> 204L, 206L))
>> >>
>> >> # this is the procedure that Bert suggested (slightly adjusted):
>> >> z <- rnorm(nrow(f1)) ## or anything you want
>> >> z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
>> >> aggregate(v4~v1*v2,f1,sum)
>> >> aggregate(z1~v1*v2,f1,sum)
>> >> aggregate(v4~v3,f1,sum)
>> >> aggregate(z1~v3,f1,sum)
>> >>
>> >> My question to you is: how can I set z so that I can obtain specific
>> >> values for z1-v4 in the v3 aggregation?
>> >> In other words, how can I configure the procedure so that e.g. B=29 and
>> >> C=2.56723 after running the procedure:
>> >> aggregate(z1~v3,f1,sum)
>> >>
>> >> Thank you,
>> >>
>> >> Luca
>> >>
>> >> PS: to avoid any doubts you might have about who I am the following is
>> >> my web page: http://lucameyer.wordpress.com/
>> >>
>> >>
>> >> 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>> >>>
>> >>> ... or cleaner:
>> >>>
>> >>> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))
>> >>>
>> >>>
>> >>> Just for curiosity, was this homework? (in which case I should
>> >>> probably have not provided you an answer -- that is, assuming that I
>> >>> HAVE provided an answer).
>> >>>
>> >>> Cheers,
>> >>> Bert
>> >>>
>> >>> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote:
>> >>> > z <- rnorm(nrow(f1)) ## or anything you want
>> >>> > z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
>> >>> >
>> >>> >
>> >>> > aggregate(v4~v1,f1,sum)
>> >>> > aggregate(z1~v1,f1,sum)
>> >>> > aggregate(v4~v2,f1,sum)
>> >>> > aggregate(z1~v2,f1,sum)
>> >>> > aggregate(v4~v3,f1,sum)
>> >>> > aggregate(z1~v3,f1,sum)
>> >>> >
>> >>> >
>> >>> > Cheers,
>> >>> > Bert
>> >>> >
>> >>> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >> Hi Bert,
>> >>> >>
>> >>> >> Thank you for your message. I am looking into ave() and tapply() as you
>> >>> >> suggested, but at the same time I have prepared an example of input and
>> >>> >> output files, just in case you or someone else would like to make an
>> >>> >> attempt to generate code that goes from input to output.
>> >>> >>
>> >>> >> Please see below or download it from
>> >>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>> >>> >>
>> >>> >> # this is (an extract of) the INPUT file I have:
>> >>> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> >>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> >>> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273,
>> >>> >> 1.42917, 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> >>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >>> >> row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
>> >>> >> 204L, 206L))
>> >>> >>
>> >>> >> # this is (an extract of) the OUTPUT file I would like to obtain:
>> >>> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> >>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> >>> >> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806, 0.00295,
>> >>> >> 1.77918, 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> >>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >>> >> row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
>> >>> >> 204L, 206L))
>> >>> >>
>> >>> >> # please notice that while the aggregated v4 on v3 has changed ...
>> >>> >> aggregate(f1[,c("v4")],list(f1$v3),sum)
>> >>> >> aggregate(f2[,c("v4")],list(f2$v3),sum)
>> >>> >>
>> >>> >> # ... the aggregated v4 over v1xv2 has remained unchanged:
>> >>> >> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>> >>> >> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
>> >>> >>
>> >>> >> Thank you very much in advance for your assistance.
>> >>> >>
>> >>> >> Luca
>> >>> >>
>> >>> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>> >>> >>>
>> >>> >>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
>> >>> >>> for which ave() is a wrapper.
>> >>> >>>
>> >>> >>> 2. You still need to heed the rest of Jeff's advice.
>> >>> >>> Cheers,
>> >>> >>> Bert
>> >>> >>>
>> >>> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >>> > Hi Jeff & other R-experts,
>> >>> >>> >
>> >>> >>> > Thank you for your note. I have tried myself to solve the issue without
>> >>> >>> > success.
>> >>> >>> >
>> >>> >>> > Following your suggestion, I am providing a sample of the dataset I am
>> >>> >>> > using below (also downloadable in plain text from
>> >>> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>> >>> >>> >
>> >>> >>> > #this is an extract of the overall dataset (n=1200 cases)
>> >>> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >>> >>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> >>> >>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> >>> >>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
>> >>> >>> > 3.43806581506388, 0.002733567617055, 1.42917483425029,
>> >>> >>> > 1.05786640463504, 0.000420548864162308, 2.37232740842861,
>> >>> >>> > 3.01835841813241, 0, 1.13430282139936, 0.928725667117666,
>> >>> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >>> >>> > row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
>> >>> >>> > 204L, 206L))
>> >>> >>> >
>> >>> >>> > I need to find an automated procedure that allows me to adjust v3
>> >>> >>> > marginals while maintaining v1xv2 marginals unchanged.
>> >>> >>> >
>> >>> >>> > That is: modify the v4 values you can find by running:
>> >>> >>> >
>> >>> >>> > aggregate(f1[,c("v4")],list(f1$v3),sum)
>> >>> >>> >
>> >>> >>> > while maintaining constant the values you can find by running:
>> >>> >>> >
>> >>> >>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>> >>> >>> >
>> >>> >>> > Now does it make sense?
>> >>> >>> >
>> >>> >>> > Please notice I have tried to build some syntax that tries to modify
>> >>> >>> > values within each v1xv2 combination by computing sum of v4, row
>> >>> >>> > percentage in terms of v4, and there is where my effort is blocked. Not
>> >>> >>> > really sure how I should proceed. Any suggestion?
>> >>> >>> >
>> >>> >>> > Thanks,
>> >>> >>> >
>> >>> >>> > Luca
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>> >>> >>> >
>> >>> >>> >> I don't understand your description. The standard practice on this list
>> >>> >>> >> is to provide a reproducible R example [1] of the kind of data you are
>> >>> >>> >> working with (and any code you have tried) to go along with your
>> >>> >>> >> description. In this case, that would be two dputs of your input data
>> >>> >>> >> frames and a dput of an output data frame (generated by hand from your
>> >>> >>> >> input data frame). (Probably best to not use the full number of input
>> >>> >>> >> values just to keep the size down.) We could then make an attempt to
>> >>> >>> >> generate code that goes from input to output.
>> >>> >>> >>
>> >>> >>> >> Of course, if you post that hard work using HTML then it will get
>> >>> >>> >> corrupted (much like the text below from your earlier emails) and we
>> >>> >>> >> won't be able to use it. Please learn to post from your email software
>> >>> >>> >> using plain text when corresponding with this mailing list.
>> >>> >>> >>
>> >>> >>> >> [1]
>> >>> >>> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> >>> >>> >>
>> >>> >>> >> ---------------------------------------------------------------------------
>> >>> >>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>> >>> >>> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >>> >>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>> >>> >>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >>> >>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >>> >>> >> ---------------------------------------------------------------------------
>> >>> >>> >> Sent from my phone. Please excuse my brevity.
>> >>> >>> >>
>> >>> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >>> >> >Thanks for your input Michael,
>> >>> >>> >> >
>> >>> >>> >> >The continuous variable I have measures quantities (down to the 3rd
>> >>> >>> >> >decimal level) so unfortunately they are not frequencies.
>> >>> >>> >> >
>> >>> >>> >> >Any more specific suggestions on how that could be tackled?
>> >>> >>> >> >
>> >>> >>> >> >Thanks & kind regards,
>> >>> >>> >> >
>> >>> >>> >> >Luca
>> >>> >>> >> >
>> >>> >>> >> >
>> >>> >>> >> >===
>> >>> >>> >> >
>> >>> >>> >> >Michael Friendly wrote:
>> >>> >>> >> >I'm not sure I understand completely what you want to do, but
>> >>> >>> >> >if the data were frequencies, it sounds like a task for fitting a
>> >>> >>> >> >loglinear model with the model formula
>> >>> >>> >> >
>> >>> >>> >> >~ V1*V2 + V3
>> >>> >>> >> >
>> >>> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote:
>> >>> >>> >> >> Hello,
>> >>> >>> >> >>
>> >>> >>> >> >> I am facing a quite challenging task (at least to me) and I was
>> >>> >>> >> >> wondering if someone could advise how R could assist me to speed the
>> >>> >>> >> >> task up.
>> >>> >>> >> >>
>> >>> >>> >> >> I am dealing with a dataset with 3 discrete variables and one
>> >>> >>> >> >> continuous variable. The discrete variables are:
>> >>> >>> >> >>
>> >>> >>> >> >> V1: 8 modalities
>> >>> >>> >> >> V2: 13 modalities
>> >>> >>> >> >> V3: 13 modalities
>> >>> >>> >> >>
>> >>> >>> >> >> The continuous variable V4 is a decimal number always greater than
>> >>> >>> >> >> zero in the marginals of each of the 3 variables but it is sometimes
>> >>> >>> >> >> equal to zero (and sometimes negative) in the joint tables.
>> >>> >>> >> >>
>> >>> >>> >> >> I have got 2 files:
>> >>> >>> >> >>
>> >>> >>> >> >> => one with distribution of all possible combinations of V1xV2 (some
>> >>> >>> >> >> of which are zero or negative) and
>> >>> >>> >> >> => one with the marginal distribution of V3.
>> >>> >>> >> >>
>> >>> >>> >> >> I am trying to build the long and narrow dataset V1xV2xV3 in such a
>> >>> >>> >> >> way that each V1xV2 cell does not get modified and V3 fits as closely
>> >>> >>> >> >> as possible to its marginal distribution. Does it make sense?
>> >>> >>> >> >>
>> >>> >>> >> >> To be even more specific, my 2 input files look like the following.
>> >>> >>> >> >>
>> >>> >>> >> >> FILE 1
>> >>> >>> >> >> V1,V2,V4
>> >>> >>> >> >> A, A, 24.251
>> >>> >>> >> >> A, B, 1.065
>> >>> >>> >> >> (...)
>> >>> >>> >> >> B, C, 0.294
>> >>> >>> >> >> B, D, 2.731
>> >>> >>> >> >> (...)
>> >>> >>> >> >> H, L, 0.345
>> >>> >>> >> >> H, M, 0.000
>> >>> >>> >> >>
>> >>> >>> >> >> FILE 2
>> >>> >>> >> >> V3, V4
>> >>> >>> >> >> A, 1.575
>> >>> >>> >> >> B, 4.294
>> >>> >>> >> >> C, 10.044
>> >>> >>> >> >> (...)
>> >>> >>> >> >> L, 5.123
>> >>> >>> >> >> M, 3.334
>> >>> >>> >> >>
>> >>> >>> >> >> What I need to achieve is a file such as the following
>> >>> >>> >> >>
>> >>> >>> >> >> FILE 3
>> >>> >>> >> >> V1, V2, V3, V4
>> >>> >>> >> >> A, A, A, ???
>> >>> >>> >> >> A, A, B, ???
>> >>> >>> >> >> (...)
>> >>> >>> >> >> D, D, E, ???
>> >>> >>> >> >> D, D, F, ???
>> >>> >>> >> >> (...)
>> >>> >>> >> >> H, M, L, ???
>> >>> >>> >> >> H, M, M, ???
>> >>> >>> >> >>
>> >>> >>> >> >> Please notice that FILE 3 needs to be such that if I aggregate on
>> >>> >>> >> >> V1+V2 I recover exactly FILE 1 and that if I aggregate on V3 I can
>> >>> >>> >> >> recover a file as close as possible to FILE 2 (ideally the same file).
>> >>> >>> >> >>
>> >>> >>> >> >> Can anyone suggest how I could do that with R?
>> >>> >>> >> >>
>> >>> >>> >> >> Thank you very much indeed for any assistance you are able to
>> >>> >>> >> >> provide.
>> >>> >>> >> >>
>> >>> >>> >> >> Kind regards,
>> >>> >>> >> >>
>> >>> >>> >> >> Luca
>> >>> >>> >> >
>> >>> >>> >> >       [[alternative HTML version deleted]]
>> >>> >>> >> >
>> >>> >>> >> >______________________________________________
>> >>> >>> >> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> >>> >> >PLEASE do read the posting guide
>> >>> >>> >> >http://www.R-project.org/posting-guide.html
>> >>> >>> >> >and provide commented, minimal, self-contained, reproducible code.
>> >>> >>> >>
>> >>> >>> >>

Luca Meyer

2015-Mar-22 20:12 UTC


[R] Fwd: Joining two datasets - recursive procedure?

Hi Bert,

Maybe I did not explain myself clearly enough. But let me show you with a
manual example that indeed what I would like to do is feasible.

The following is also available for download from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0

rm(list=ls())

This is the usual (extract of the) INPUT file I have:

f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273,
1.42917, 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
204L, 206L))

These are the initial marginal distributions:

aggregate(v4~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)

First I order the file such that I have nicely listed 6 distinct v1xv2
combinations.

f1 <- f1[order(f1$v1,f1$v2),]

Then I compute (manually) the relative importance of each v1xv2 combination:

tAA <-
(18.18530+1.42917)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
# this is for combination v1=A & v2=A
tAB <-
(3.43806+1.05786)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
# this is for combination v1=A & v2=B
tAC <-
(0.00273+0.00042)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
# this is for combination v1=A & v2=C
tBA <-
(2.37232+1.13430)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
# this is for combination v1=B & v2=A
tBB <-
(3.01835+0.92872)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
# this is for combination v1=B & v2=B
tBC <-
(0.00000+0.00000)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
# this is for combination v1=B & v2=C
# and just to make sure I have not made mistakes the following should be
# equal to 1
tAA+tAB+tAC+tBA+tBB+tBC
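
As a hedged sketch, the six shares above could also be computed in one step with tapply() instead of by hand. The data frame is rebuilt here so the snippet runs on its own; `cell_total` and `share` are names introduced only for illustration.

```r
# Same 12 rows as the extract above (row names omitted).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# v4 total of every v1 x v2 combination, as a v1-by-v2 matrix.
cell_total <- with(f1, tapply(v4, list(v1 = v1, v2 = v2), sum))

# Relative importance of each combination; share["A", "A"] plays the role of tAA.
share <- cell_total / sum(cell_total)

share
sum(share)  # should be equal to 1, as in the manual check
```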

Next, I know I need to increase v4 any time v3=B and the total increase I
need to have over the whole dataset is 29-27.01676=1.98324. In turn, I need
to diminish v4 any time v3=C by the same amount (4.55047-2.56723=1.98324).
This aspect was perhaps not clear at first: I need to move v4 across v3
categories, but the totals will always remain unchanged.

Since I want the data alteration to be proportional to the v1xv2
combinations I do the following:

f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="A" &
f1$v3=="B", f1$v4+(tAA*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="A" &
f1$v3=="C", f1$v4-(tAA*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="B" &
f1$v3=="B", f1$v4+(tAB*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="B" &
f1$v3=="C", f1$v4-(tAB*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="C" &
f1$v3=="B", f1$v4+(tAC*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="C" &
f1$v3=="C", f1$v4-(tAC*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="A" &
f1$v3=="B", f1$v4+(tBA*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="A" &
f1$v3=="C", f1$v4-(tBA*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="B" &
f1$v3=="B", f1$v4+(tBB*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="B" &
f1$v3=="C", f1$v4-(tBB*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="C" &
f1$v3=="B", f1$v4+(tBC*1.98324),
f1$v4)
f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="C" &
f1$v3=="C", f1$v4-(tBC*1.98324),
f1$v4)

These are the final marginal distributions:

aggregate(v4~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)

Can this procedure be made programmatic so that I can run it on the
(8x13x13) categories matrix? If so, how would you do it? I am having a really
hard time doing it with some (semi)automatic procedure.
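
One possible sketch of the manual steps above in vectorized form, using ave() as suggested earlier in the thread. It assumes only two v3 categories (B and C) and exactly one row per v3 category inside every v1xv2 combination, as in this extract; it does not cover the full 13-category case. The names `target_B`, `shift`, `share`, `direction` and `f2` are introduced only for illustration.

```r
# Same 12 rows as the extract above (row names omitted).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

target_B <- 29                                   # desired v4 total for v3 = B
shift <- target_B - sum(f1$v4[f1$v3 == "B"])     # amount to move from C to B

# Share of the grand total held by each row's v1 x v2 combination
# (the tAA, tAB, ... values, repeated on every row of the combination).
share <- ave(f1$v4, f1$v1, f1$v2, FUN = sum) / sum(f1$v4)

# +1 for rows that gain (v3 = B), -1 for rows that lose (v3 = C).
direction <- ifelse(f1$v3 == "B", 1, -1)

f2 <- f1
f2$v4 <- f1$v4 + direction * share * shift

aggregate(v4 ~ v1 * v2, f2, sum)  # unchanged relative to f1
aggregate(v4 ~ v3, f2, sum)       # B moves to the target, C absorbs the shift
```

Within each v1xv2 combination the B row gains exactly what the C row loses, which is why the v1xv2 totals stay put under the stated assumptions.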

Thank you very much indeed once more :)

Luca


"C", "A", "B", "C", "A",
> >> >>> >>> > "B", "C",
"A", "B", "C"), v3 = c("B",
"B", "B", "C", "C", "C",
> >> >>> >>> > "B", "B",
"B", "C", "C", "C"), v4 =
c(18.1853007621835,
> >> >>> >>> > 3.43806581506388,
> >> >>> >>> > 0.002733567617055,
1.42917483425029, 1.05786640463504,
> >> >>> >>> > 0.000420548864162308,
> >> >>> >>> > 2.37232740842861, 3.01835841813241,
0, 1.13430282139936,
> >> >>> >>> > 0.928725667117666,
> >> >>> >>> > 0)), .Names = c("v1",
"v2", "v3", "v4"), class = "data.frame",
> >> >>> >>> > row.names
> >> >>> >>> > > >> >>>
>>> > c(2L,
> >> >>> >>> > 9L, 11L, 41L, 48L, 50L, 158L, 165L,
167L, 197L, 204L, 206L))
> >> >>> >>> >
> >> >>> >>> > I need to find a automated
procedure that allows me to adjust
> v3
> >> >>> >>> > marginals
> >> >>> >>> > while maintaining v1xv2 marginals
unchanged.
> >> >>> >>> >
> >> >>> >>> > That is: modify the v4 values you
can find by running:
> >> >>> >>> >
> >> >>> >>> >
aggregate(f1[,c("v4")],list(f1$v3),sum)
> >> >>> >>> >
> >> >>> >>> > while maintaining costant the
values you can find by running:
> >> >>> >>> >
> >> >>> >>> >
aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
> >> >>> >>> >
> >> >>> >>> > Now does it make sense?
> >> >>> >>> >
> >> >>> >>> > Please notice I have tried to build
some syntax that tries to
> >> modify
> >> >>> >>> > values
> >> >>> >>> > within each v1xv2 combination by
computing sum of v4, row
> >> percentage
> >> >>> >>> > in
> >> >>> >>> > terms of v4, and there is where my
effort is blocked. Not
> really
> >> >>> >>> > sure
> >> >>> >>> > how I
> >> >>> >>> > should proceed. Any suggestion?
> >> >>> >>> >
> >> >>> >>> > Thanks,
> >> >>> >>> >
> >> >>> >>> > Luca
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > 2015-03-19 2:38 GMT+01:00 Jeff
Newmiller <
> >> jdnewmil at dcn.davis.ca.us>:
> >> >>> >>> >
> >> >>> >>> >> I don't understand your
description. The standard practice on
> >> this
> >> >>> >>> >> list
> >> >>> >>> >> is
> >> >>> >>> >> to provide a reproducible R
example [1] of the kind of data
> you
> >> are
> >> >>> >>> >> working
> >> >>> >>> >> with (and any code you have
tried) to go along with your
> >> >>> >>> >> description.
> >> >>> >>> >> In
> >> >>> >>> >> this case, that would be two
dputs of your input data frames
> >> and a
> >> >>> >>> >> dput
> >> >>> >>> >> of
> >> >>> >>> >> an output data frame (generated
by hand from your input data
> >> >>> >>> >> frame).
> >> >>> >>> >> (Probably best to not use the
full number of input values
> just
> >> to
> >> >>> >>> >> keep
> >> >>> >>> >> the
> >> >>> >>> >> size down.) We could then make
an attempt to generate code
> that
> >> >>> >>> >> goes
> >> >>> >>> >> from
> >> >>> >>> >> input to output.
> >> >>> >>> >>
> >> >>> >>> >> Of course, if you post that
hard work using HTML then it will
> >> get
> >> >>> >>> >> corrupted (much like the text
below from your earlier emails)
> >> and
> >> >>> >>> >> we
> >> >>> >>> >> won't
> >> >>> >>> >> be able to use it. Please learn
to post from your email
> software
> >> >>> >>> >> using
> >> >>> >>> >> plain text when corresponding
with this mailing list.
> >> >>> >>> >>
> >> >>> >>> >> [1]
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >>
> >>
>
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >>
> >>
> ---------------------------------------------------------------------------
> >> >>> >>> >> Jeff Newmiller                 
The     .....
> >>  .....  Go
> >> >>> >>> >> Live...
> >> >>> >>> >> DCN:<jdnewmil at
dcn.davis.ca.us>        Basics: ##.#.
>  ##.#.
> >> >>> >>> >> Live
> >> >>> >>> >> Go...
> >> >>> >>> >>                                
Live:   OO#.. Dead:
> OO#..
> >> >>> >>> >> Playing
> >> >>> >>> >> Research Engineer
(Solar/Batteries            O.O#.
>  #.O#.
> >> >>> >>> >> with
> >> >>> >>> >> /Software/Embedded Controllers)
.OO#.
>  .OO#.
> >> >>> >>> >> rocks...1k
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >>
> >>
> ---------------------------------------------------------------------------
> >> >>> >>> >> Sent from my phone. Please
excuse my brevity.
> >> >>> >>> >>
> >> >>> >>> >> On March 18, 2015 9:05:37 AM
PDT, Luca Meyer <
> >> lucam1968 at gmail.com>
> >> >>> >>> >> wrote:
> >> >>> >>> >> >Thanks for you input
Michael,
> >> >>> >>> >> >
> >> >>> >>> >> >The continuous variable I
have measures quantities (down to
> the
> >> >>> >>> >> > 3rd
> >> >>> >>> >> >decimal level) so
unfortunately are not frequencies.
> >> >>> >>> >> >
> >> >>> >>> >> >Any more specific
suggestions on how that could be tackled?
> >> >>> >>> >> >
> >> >>> >>> >> >Thanks & kind regards,
> >> >>> >>> >> >
> >> >>> >>> >> >Luca
> >> >>> >>> >> >
> >> >>> >>> >> >
> >> >>> >>> >> >==> >>
>>> >>> >> >
> >> >>> >>> >> >Michael Friendly wrote:
> >> >>> >>> >> >I'm not sure I
understand completely what you want to do,
> but
> >> >>> >>> >> >if the data were
frequencies, it sounds like task for
> fitting a
> >> >>> >>> >> >loglinear model with the
model formula
> >> >>> >>> >> >
> >> >>> >>> >> >~ V1*V2 + V3
> >> >>> >>> >> >
> >> >>> >>> >> >On 3/18/2015 2:17 AM, Luca
Meyer wrote:
> >> >>> >>> >> >>* Hello,
> >> >>> >>> >> >*>>* I am facing a
quite challenging task (at least to me)
> and
> >> I
> >> >>> >>> >> > was
> >> >>> >>> >> >wondering
> >> >>> >>> >> >*>* if someone could
advise how R could assist me to speed
> the
> >> >>> >>> >> > task
> >> >>> >>> >> > up.
> >> >>> >>> >> >*>>* I am dealing
with a dataset with 3 discrete variables
> and
> >> one
> >> >>> >>> >> >continuous
> >> >>> >>> >> >*>* variable. The
discrete variables are:
> >> >>> >>> >> >*>>* V1: 8 modalities
> >> >>> >>> >> >*>* V2: 13 modalities
> >> >>> >>> >> >*>* V3: 13 modalities
> >> >>> >>> >> >*>>* The continuous
variable V4 is a decimal number always
> >> greater
> >> >>> >>> >> > than
> >> >>> >>> >> >zero in
> >> >>> >>> >> >*>* the marginals of
each of the 3 variables but it is
> >> sometimes
> >> >>> >>> >> > equal
> >> >>> >>> >> >to zero
> >> >>> >>> >> >*>* (and sometimes
negative) in the joint tables.
> >> >>> >>> >> >*>>* I have got 2
files:
> >> >>> >>> >> >*>>* => one with
distribution of all possible combinations
> of
> >> >>> >>> >> > V1xV2
> >> >>> >>> >> >(some of
> >> >>> >>> >> >*>* which are zero or
neagtive) and
> >> >>> >>> >> >*>* => one with the
marginal distribution of V3.
> >> >>> >>> >> >*>>* I am trying to
build the long and narrow dataset
> V1xV2xV3
> >> in
> >> >>> >>> >> > such
> >> >>> >>> >> >a way
> >> >>> >>> >> >*>* that each V1xV2 cell
does not get modified and V3 fits
> as
> >> >>> >>> >> > closely
> >> >>> >>> >> >as
> >> >>> >>> >> >*>* possible to its
marginal distribution. Does it make
> sense?
> >> >>> >>> >> >*>>* To be even more
specific, my 2 input files look like
> the
> >> >>> >>> >> >following.

David Winsemius

2015-Mar-22 21:11 UTC

[R] Joining two datasets - recursive procedure?

On Mar 22, 2015, at 1:12 PM, Luca Meyer wrote:
> Hi Bert,
> 
> Maybe I did not explain myself clearly enough. But let me show you with a
> manual example that indeed what I would like to do is feasible.
> 
> The following is also available for download from
> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
> 
> rm(list=ls())
> 
> This is the usual (extract of the) INPUT file I have:
> 
> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917,
> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names = c(2L,
> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> 
> These are the initial marginal distributions:
> 
> aggregate(v4~v1*v2,f1,sum)
> aggregate(v4~v3,f1,sum)
> 
> First I order the file such that I have nicely listed 6 distinct v1xv2
> combinations.
> 
> f1 <- f1[order(f1$v1,f1$v2),]
> 
> Then I compute (manually) the relative importance of each v1xv2
> combination:
> 
> tAA <-
> (18.18530+1.42917)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=A & v2=A
> tAB <-
> (3.43806+1.05786)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=A & v2=B
> tAC <-
> (0.00273+0.00042)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=A & v2=C
> tBA <-
> (2.37232+1.13430)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=B & v2=A
> tBB <-
> (3.01835+0.92872)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=B & v2=B
> tBC <-
> (0.00000+0.00000)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=B & v2=C
> # and just to make sure I have not made mistakes the following should be equal to 1
> tAA+tAB+tAC+tBA+tBB+tBC
> 
> Next, I know I need to increase v4 any time v3=B and the total increase I
> need to have over the whole dataset is 29-27.01676=1.98324. In turn, I need
> to diminish v4 any time V3=C by the same amount (4.55047-2.56723=1.98324).
> This aspect was perhaps not clear at first. I need to move v4 across v3
> categories, but the totals will always remain unchanged.
> 
> Since I want the data alteration to be proportional to the v1xv2
> combinations I do the following:
> 
> f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="A" & f1$v3=="B", f1$v4+(tAA*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="A" & f1$v3=="C", f1$v4-(tAA*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="B" & f1$v3=="B", f1$v4+(tAB*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="B" & f1$v3=="C", f1$v4-(tAB*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="C" & f1$v3=="B", f1$v4+(tAC*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="A" & f1$v2=="C" & f1$v3=="C", f1$v4-(tAC*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="A" & f1$v3=="B", f1$v4+(tBA*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="A" & f1$v3=="C", f1$v4-(tBA*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="B" & f1$v3=="B", f1$v4+(tBB*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="B" & f1$v3=="C", f1$v4-(tBB*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="C" & f1$v3=="B", f1$v4+(tBC*1.98324), f1$v4)
> f1$v4 <- ifelse (f1$v1=="B" & f1$v2=="C" & f1$v3=="C", f1$v4-(tBC*1.98324), f1$v4)
> 
> 
Seems that this could be done a lot more simply with a lookup matrix and
ordinary indexing:

> lookarr <- array(NA,
    dim=c(length(unique(f1$v1)), length(unique(f1$v2)), length(unique(f1$v3))),
    dimnames=list(unique(f1$v1), unique(f1$v2), unique(f1$v3)))
> lookarr[] <- c(tAA,tAA,tAB,tAB,tAC,tAC,tBA,tBA,tBB,tBB,tBC,tBC)
> lookarr["A","B","C"]
[1] 0.1250369
> lookarr[ with(f1, cbind(v1, v2, v3)) ]
 [1] 6.213554e-01 1.110842e-01 1.424236e-01 1.250369e-01 9.978703e-05
 [6] 0.000000e+00 6.213554e-01 1.110842e-01 1.424236e-01 1.250369e-01
[11] 9.978703e-05 0.000000e+00
> f1$v4mod <- f1$v4*lookarr[ with(f1, cbind(v1,v2,v3)) ]
> f1
    v1 v2 v3       v4        v4mod
2    A  A  B 18.18530 1.129954e+01
41   A  A  C  1.42917 1.587582e-01
9    A  B  B  3.43806 4.896610e-01
48   A  B  C  1.05786 1.322716e-01
11   A  C  B  0.00273 2.724186e-07
50   A  C  C  0.00042 0.000000e+00
158  B  A  B  2.37232 1.474054e+00
197  B  A  C  1.13430 1.260028e-01
165  B  B  B  3.01835 4.298844e-01
204  B  B  C  0.92872 1.161243e-01
167  B  C  B  0.00000 0.000000e+00
206  B  C  C  0.00000 0.000000e+00

-- 
david.
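
A minimal sketch of the same lookup idea, assuming the shares (tAA, tAB, ...)
are each v1xv2 cell's fraction of the grand total, as in the manual
computation quoted above. Building the lookup with tapply() keeps the dimnames
attached, so nothing depends on typing the values in R's column-major fill
order. The names f1s, tot and share are illustrative only, and the data frame
is rebuilt here without the original row names.

```r
# Same 12 rows as the f1 extract (original row names omitted)
f1s <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# v1 x v2 totals as a matrix with dimnames, then shares of the grand total
tot   <- tapply(f1s$v4, list(f1s$v1, f1s$v2), sum)
share <- tot / sum(f1s$v4)

# Character-matrix indexing: one (v1, v2) lookup per row of f1s
f1s$share <- share[cbind(f1s$v1, f1s$v2)]

# Equivalent without a lookup table: ave() repeats the group sum on every row
f1s$share2 <- ave(f1s$v4, f1s$v1, f1s$v2, FUN = sum) / sum(f1s$v4)
```

Both columns hold the same per-row share; the ave() form scales to any number
of categories without building an array first.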

> These are the final marginal distributions:
> 
> aggregate(v4~v1*v2,f1,sum)
> aggregate(v4~v3,f1,sum)
> 
> Can this procedure be made programmatic so that I can run it on the
> (8x13x13) categories matrix? If so, how would you do it? I am having a really
> hard time doing it with some (semi)automatic procedure.
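
One way the manual steps quoted above might be made programmatic is sketched
below. It rests on two assumptions that hold for the 12-row extract but are not
confirmed for the full data: every v1xv2 cell has exactly one row per v3 level,
and the requested v3 totals add up to the current grand total. Under those
assumptions each v3 level's required change is spread over the v1xv2 cells in
proportion to each cell's share of the grand total. shift_v3() and its target
argument are hypothetical names, not part of any package.

```r
# The f1 extract, rebuilt without the original row names
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns adjusted v4 values whose v3 totals equal `target`
# (a named numeric vector, one entry per v3 level)
shift_v3 <- function(f, target) {
  # Assumption: every v1 x v2 cell has exactly one row for each v3 level
  stopifnot(all(table(f$v1, f$v2, f$v3) == 1))

  current <- vapply(split(f$v4, f$v3), sum, numeric(1))  # current v3 totals
  delta   <- target[names(current)] - current            # change needed per v3 level
  # v1 x v2 totals can only stay fixed if the changes cancel out
  stopifnot(abs(sum(delta)) < 1e-8)

  # Share of the grand total held by each row's v1 x v2 cell
  share <- ave(f$v4, f$v1, f$v2, FUN = sum) / sum(f$v4)

  f$v4 + share * unname(delta[as.character(f$v3)])
}

f1$v4new <- shift_v3(f1, c(B = 29, C = 2.56723))

aggregate(v4new ~ v3, f1, sum)        # v3 totals after the shift
aggregate(v4new ~ v1 + v2, f1, sum)   # compare with aggregate(v4 ~ v1 + v2, f1, sum)
```

If some v1xv2xv3 combinations are absent in the full data, the first check
stops the function; the simple proportional rule would then no longer keep the
cell totals fixed and a different allocation would be needed.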
> 
> Thank you very much indeed once more :)
> 
> Luca
> 
> 
> 2015-03-22 18:32 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
> 
>> Nonsense. You are not telling us something or I have failed to
>> understand something.
>> 
>> Consider:
>> 
>> v1 = c("a","b")
>> v2 = c("a","a")
>> 
>> It is not possible to change the value of a sum of values
>> corresponding to v2="a" without also changing that for v1, which is
>> not supposed to change according to my understanding of your
>> specification.
>> 
>> So I'm done.
>> 
>> -- Bert
>> 
>> 
>> 
>> On Sun, Mar 22, 2015 at 8:28 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>>> Sorry forgot to keep the rest of the group in the loop - Luca
>>> ---------- Forwarded message ----------
>>> From: Luca Meyer <lucam1968 at gmail.com>
>>> Date: 2015-03-22 16:27 GMT+01:00
>>> Subject: Re: [R] Joining two datasets - recursive procedure?
>>> To: Bert Gunter <gunter.berton at gene.com>
>>> 
>>> 
>>> Hi Bert,
>>> 
>>> That is exactly what I am trying to achieve. Please notice that negative v4
>>> values are allowed. I have done a similar task in the past manually by
>>> recursively altering the v4 distribution across v3 categories within each
>>> fixed v1&v2 combination, so I am quite positive it can be achieved, but
>>> honestly it took me forever to do it manually, and since this is likely to
>>> be an exercise I need to repeat from time to time I wish I could learn how
>>> to do it programmatically....
>>> 
>>> Thanks again for any further suggestion you might have,
>>> 
>>> Luca
>>> 
>>> 
>>> 2015-03-22 16:05 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>> 
>>>> Oh, wait a minute ...
>>>> 
>>>> You still want the marginals for the other columns to be as originally?
>>>> 
>>>> If so, then this is impossible in general as the sum of all the values
>>>> must be what they were originally and you cannot therefore choose your
>>>> values for V3 arbitrarily.
>>>> 
>>>> Or at least, that seems to be what you are trying to do.
>>>> 
>>>> -- Bert
>>>> 
>>>> 
>>>> On Sun, Mar 22, 2015 at 7:55 AM, Bert Gunter <bgunter at gene.com> wrote:
>>>>> I would have thought that this is straightforward given my previous
>>>>> email...
>>>>> 
>>>>> Just set z to what you want -- e.g., all B values to 29/number of B's,
>>>>> and all C values to 2.567/number of C's (etc. for more categories).
>>>>> 
>>>>> A slick but sort of cheat way to do this programmatically -- in the
>>>>> sense that it relies on the implementation of factor() rather than its
>>>>> API -- is:
>>>>> 
>>>>> y <- f1$v3  ## to simplify the notation; could be done using with()
>>>>> z <- (c(29,2.567)/table(y))[c(y)]
>>>>> 
>>>>> Then proceed to z1 as I previously described
>>>>> 
>>>>> -- Bert
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>>>>>> Hi Bert, hello R-experts,
>>>>>> 
>>>>>> I am close to a solution but I still need one hint w.r.t. the following
>>>>>> procedure (available also from
>>>>>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)
>>>>>> 
>>>>>> rm(list=ls())
>>>>>> 
>>>>>> # this is (an extract of) the INPUT file I have:
>>>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
>>>>>> "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
>>>>>> "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
>>>>>> "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042, 2.37232,
>>>>>> 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"), class
>>>>>> = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L,
>>>>>> 197L, 204L, 206L))
>>>>>> 
>>>>>> # this is the procedure that Bert suggested (slightly adjusted):
>>>>>> z <- rnorm(nrow(f1)) ## or anything you want
>>>>>> z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
>>>>>> aggregate(v4~v1*v2,f1,sum)
>>>>>> aggregate(z1~v1*v2,f1,sum)
>>>>>> aggregate(v4~v3,f1,sum)
>>>>>> aggregate(z1~v3,f1,sum)
>>>>>> 
>>>>>> My question to you is: how can I set z so that I can obtain specific values
>>>>>> for z1-v4 in the v3 aggregation?
>>>>>> In other words, how can I configure the procedure so that e.g. B=29 and
>>>>>> C=2.56723 after running the procedure:
>>>>>> aggregate(z1~v3,f1,sum)
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> Luca
>>>>>> 
>>>>>> PS: to avoid any doubts you might have about who I am the following is my
>>>>>> web page: http://lucameyer.wordpress.com/
>>>>>> 
>>>>>> 
>>>>>> 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>>>>>> 
>>>>>>> ... or cleaner:
>>>>>>> 
>>>>>>> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))
>>>>>>> 
>>>>>>> 
>>>>>>> Just for curiosity, was this homework? (in which case I should
>>>>>>> probably have not provided you an answer -- that is, assuming that I
>>>>>>> HAVE provided an answer).
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Bert
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote:
>>>>>>>> z <- rnorm(nrow(f1)) ## or anything you want
>>>>>>>> z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
>>>>>>>> 
>>>>>>>> 
>>>>>>>> aggregate(v4~v1,f1,sum)
>>>>>>>> aggregate(z1~v1,f1,sum)
>>>>>>>> aggregate(v4~v2,f1,sum)
>>>>>>>> aggregate(z1~v2,f1,sum)
>>>>>>>> aggregate(v4~v3,f1,sum)
>>>>>>>> aggregate(z1~v3,f1,sum)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Bert
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>>>>>>>>> Hi Bert,
>>>>>>>>> 
>>>>>>>>> Thank you for your message. I am looking into ave() and tapply() as you
>>>>>>>>> suggested but at the same time I have prepared an example of input and
>>>>>>>>> output files, just in case you or someone else would like to make an
>>>>>>>>> attempt to generate code that goes from input to output.
>>>>>>>>> 
>>>>>>>>> Please see below or download it from
>>>>>>>>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>>>>>>>>> 
>>>>>>>>> # this is (an extract of) the INPUT file I have:
>>>>>>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>>>>>>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>>>>>>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>>>>>>>>> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273,
>>>>>>>>> 1.42917,
>>>>>>>>> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>>>>>>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>>>>>>>>> row.names = c(2L,
>>>>>>>>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>>>>>>>> 
>>>>>>>>> # this is (an extract of) the OUTPUT file I would like to obtain:
>>>>>>>>> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>>>>>>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>>>>>>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>>>>>>>>> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806, 0.00295,
>>>>>>>>> 1.77918,
>>>>>>>>> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>>>>>>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>>>>>>>>> row.names = c(2L,
>>>>>>>>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>>>>>>>> 
>>>>>>>>> # please notice that while the aggregated v4 on v3 has changed ...
>>>>>>>>> aggregate(f1[,c("v4")],list(f1$v3),sum)
>>>>>>>>> aggregate(f2[,c("v4")],list(f2$v3),sum)
>>>>>>>>> 
>>>>>>>>> # ... the aggregated v4 over v1xv2 has remained unchanged:
>>>>>>>>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>>>>>>>>> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
>>>>>>>>> 
>>>>>>>>> Thank you very much in advance for your assistance.
>>>>>>>>> 
>>>>>>>>> Luca
>>>>>>>>> 
>>>>>>>>> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>>>>>>>>> 
>>>>>>>>>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
>>>>>>>>>> for which ave() is a wrapper.
>>>>>>>>>> 
>>>>>>>>>> 2. You still need to heed the rest of Jeff's advice.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Bert
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi Jeff & other R-experts,
>>>>>>>>>>> 
>>>>>>>>>>> Thank you for your note. I have tried myself to solve the issue without
>>>>>>>>>>> success.
>>>>>>>>>>> 
>>>>>>>>>>> Following your suggestion, I am providing a sample of the dataset I am
>>>>>>>>>>> using below (also downloadable in plain text from
>>>>>>>>>>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>>>>>>>>>>> 
>>>>>>>>>>> #this is an extract of the overall dataset (n=1200 cases)
>>>>>>>>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>>>>>>>>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>>>>>>>>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>>>>>>>>>>> "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
>>>>>>>>>>> 3.43806581506388,
>>>>>>>>>>> 0.002733567617055, 1.42917483425029, 1.05786640463504,
>>>>>>>>>>> 0.000420548864162308,
>>>>>>>>>>> 2.37232740842861, 3.01835841813241, 0, 1.13430282139936,
>>>>>>>>>>> 0.928725667117666,
>>>>>>>>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>>>>>>>>>>> row.names = c(2L,
>>>>>>>>>>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>>>>>>>>>> 
>>>>>>>>>>> I need to find an automated procedure that allows me to adjust v3
>>>>>>>>>>> marginals while maintaining v1xv2 marginals unchanged.
>>>>>>>>>>> 
>>>>>>>>>>> That is: modify the v4 values you can find by running:
>>>>>>>>>>> 
>>>>>>>>>>> aggregate(f1[,c("v4")],list(f1$v3),sum)
>>>>>>>>>>> 
>>>>>>>>>>> while maintaining constant the values you can find by running:
>>>>>>>>>>> 
>>>>>>>>>>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>>>>>>>>>>> 
>>>>>>>>>>> Now does it make sense?
>>>>>>>>>>> 
>>>>>>>>>>> Please notice I have tried to build some syntax that tries to modify
>>>>>>>>>>> values within each v1xv2 combination by computing sum of v4, row
>>>>>>>>>>> percentage in terms of v4, and there is where my effort is blocked. Not
>>>>>>>>>>> really sure how I should proceed. Any suggestion?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Luca
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>>>>>>>>>>> 
>>>>>>>>>>>> I don't understand your description. The standard practice on this
>>>>>>>>>>>> list is to provide a reproducible R example [1] of the kind of data you
>>>>>>>>>>>> are working with (and any code you have tried) to go along with your
>>>>>>>>>>>> description. In this case, that would be two dputs of your input data
>>>>>>>>>>>> frames and a dput of an output data frame (generated by hand from your
>>>>>>>>>>>> input data frame). (Probably best to not use the full number of input
>>>>>>>>>>>> values just to keep the size down.) We could then make an attempt to
>>>>>>>>>>>> generate code that goes from input to output.
>>>>>>>>>>>> 
>>>>>>>>>>>> Of course, if you post that hard work using HTML then it will get
>>>>>>>>>>>> corrupted (much like the text below from your earlier emails) and we
>>>>>>>>>>>> won't be able to use it. Please learn to post from your email software
>>>>>>>>>>>> using plain text when corresponding with this mailing list.
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>>>>>>>>>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>>>>>>>>>>> 
>>>>>>>>>>>> ---------------------------------------------------------------------------
>>>>>>>>>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>>>>>>>>> DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
>>>>>>>>>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>>>>>>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>>>>>>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>>>>>>>>> ---------------------------------------------------------------------------
>>>>>>>>>>>> Sent from my phone. Please excuse my brevity.
>>>>>>>>>>>> 
>>>>>>>>>>>> On March 18, 2015 9:05:37 AM
PDT, Luca Meyer <
>>>> lucam1968 at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Thanks for you input
Michael,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The continuous variable I
have measures quantities (down to
>> the
>>>>>>>>>>>>> 3rd
>>>>>>>>>>>>> decimal level) so
unfortunately are not frequencies.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Any more specific
suggestions on how that could be tackled?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks & kind regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Luca
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Michael Friendly wrote:
>>>>>>>>>>>>> I'm not sure I understand completely what you want to do, but
>>>>>>>>>>>>> if the data were frequencies, it sounds like a task for fitting a
>>>>>>>>>>>>> loglinear model with the model formula
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ~ V1*V2 + V3
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 3/18/2015 2:17 AM, Luca Meyer wrote:
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am facing a quite challenging task (at least to me) and I was
>>>>>>>>>>>>>> wondering if someone could advise how R could assist me to speed the
>>>>>>>>>>>>>> task up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am dealing with a dataset with 3 discrete variables and one
>>>>>>>>>>>>>> continuous variable. The discrete variables are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> V1: 8 modalities
>>>>>>>>>>>>>> V2: 13 modalities
>>>>>>>>>>>>>> V3: 13 modalities
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The continuous variable V4 is a decimal number always greater than
>>>>>>>>>>>>>> zero in the marginals of each of the 3 variables but it is sometimes
>>>>>>>>>>>>>> equal to zero (and sometimes negative) in the joint tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have got 2 files:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> => one with distribution of all possible combinations of V1xV2
>>>>>>>>>>>>>> (some of which are zero or negative) and
>>>>>>>>>>>>>> => one with the marginal distribution of V3.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying to build the long and narrow dataset V1xV2xV3 in such
>>>>>>>>>>>>>> a way that each V1xV2 cell does not get modified and V3 fits as
>>>>>>>>>>>>>> closely as possible to its marginal distribution. Does it make sense?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To be even more specific, my 2 input files look like the following.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FILE 1
>>>>>>>>>>>>>> V1,V2,V4
>>>>>>>>>>>>>> A, A, 24.251
>>>>>>>>>>>>>> A, B, 1.065
>>>>>>>>>>>>>> (...)
>>>>>>>>>>>>>> B, C, 0.294
>>>>>>>>>>>>>> B, D, 2.731
>>>>>>>>>>>>>> (...)
>>>>>>>>>>>>>> H, L, 0.345
>>>>>>>>>>>>>> H, M, 0.000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FILE 2
>>>>>>>>>>>>>> V3, V4
>>>>>>>>>>>>>> A, 1.575
>>>>>>>>>>>>>> B, 4.294
>>>>>>>>>>>>>> C, 10.044
>>>>>>>>>>>>>> (...)
>>>>>>>>>>>>>> L, 5.123
>>>>>>>>>>>>>> M, 3.334
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What I need to achieve is a file such as the following
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FILE 3
>>>>>>>>>>>>>> V1, V2, V3, V4
>>>>>>>>>>>>>> A, A, A, ???
>>>>>>>>>>>>>> A, A, B, ???
>>>>>>>>>>>>>> (...)
>>>>>>>>>>>>>> D, D, E, ???
>>>>>>>>>>>>>> D, D, F, ???
>>>>>>>>>>>>>> (...)
>>>>>>>>>>>>>> H, M, L, ???
>>>>>>>>>>>>>> H, M, M, ???
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please notice that FILE 3 needs to be such that if I aggregate on
>>>>>>>>>>>>>> V1+V2 I recover exactly FILE 1 and that if I aggregate on V3 I can
>>>>>>>>>>>>>> recover a file as close as possible to FILE 2 (ideally the same file).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone suggest how I could do that with R?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you very much indeed for any assistance you are able to
>>>>>>>>>>>>>> provide.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Luca
>>>>>>>>>>>>> 
>>>>>>>>>>>>>      [[alternative HTML version deleted]]

David Winsemius
Alameda, CA, USA
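
For reference, the task in the original post (a V1xV2xV3 table that aggregates back to FILE 1 exactly and to FILE 2 as closely as possible) has at least one simple solution: split every V1xV2 cell across the V3 levels in proportion to the V3 marginal. What follows is only a minimal sketch of that idea, not the approach anyone in the thread settled on. The data values are made up for illustration, and the sketch assumes the two files share the same grand total; if they do not, the V3 margin is reproduced only up to the factor sum(file1$V4) / sum(file2$V4). Negative cells are allowed.

```r
# Hypothetical stand-ins for FILE 1 and FILE 2 (values invented for illustration;
# the totals of both V4 columns are chosen to agree).
file1 <- data.frame(
  V1 = c("A", "A", "B", "B"),
  V2 = c("A", "B", "A", "B"),
  V4 = c(24.251, 1.065, -0.294, 2.731)
)
file2 <- data.frame(
  V3 = c("A", "B", "C"),
  V4 = c(10.000, 4.294, 13.459)
)

# Share of the grand total carried by each V3 level.
shares <- data.frame(V3 = file2$V3, share = file2$V4 / sum(file2$V4))

# Cartesian product of the V1xV2 rows with the V3 levels
# (merge() with by = NULL crosses every row of x with every row of y).
file3 <- merge(file1, shares, by = NULL)

# Allocate each V1xV2 cell across V3 in proportion to the V3 shares.
file3$V4 <- file3$V4 * file3$share
file3 <- file3[order(file3$V1, file3$V2, file3$V3), c("V1", "V2", "V3", "V4")]

# Aggregate back to compare against the two inputs.
back1 <- aggregate(V4 ~ V1 + V2, data = file3, FUN = sum)
back2 <- aggregate(V4 ~ V3, data = file3, FUN = sum)
```

Because the shares sum to 1, back1 reproduces file1 for every V1xV2 cell regardless of sign. This is one of infinitely many tables meeting the two constraints; if the values were counts, it would coincide with the fitted values of the ~ V1*V2 + V3 loglinear model mentioned earlier in the thread.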
