I would have thought that this is straightforward given my previous email...

Just set z to what you want -- e.g., all B values to 29/number of B's, and all C values to 2.567/number of C's (etc. for more categories).

A slick but sort of cheat way to do this programmatically -- in the sense that it relies on the implementation of factor() rather than its API -- is:

y <- f1$v3 ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]

Then proceed to z1 as I previously described (a self-contained version of the whole procedure is sketched at the end of this message).

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
Clifford Stoll


On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
> Hi Bert, hello R-experts,
>
> I am close to a solution but I still need one hint w.r.t. the following
> procedure (available also from
> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)
>
> rm(list=ls())
>
> # this is (an extract of) the INPUT file I have:
> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
> "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
> "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
> "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042, 2.37232,
> 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"),
> class = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L,
> 167L, 197L, 204L, 206L))
>
> # this is the procedure that Bert suggested (slightly adjusted):
> z <- rnorm(nrow(f1)) ## or anything you want
> z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
> aggregate(v4~v1*v2,f1,sum)
> aggregate(z1~v1*v2,f1,sum)
> aggregate(v4~v3,f1,sum)
> aggregate(z1~v3,f1,sum)
>
> My question to you is: how can I set z so that I can obtain specific values
> for z1-v4 in the v3 aggregation?
> In other words, how can I configure the procedure so that e.g. B=29 and
> C=2.56723 after running the procedure:
> aggregate(z1~v3,f1,sum)
>
> Thank you,
>
> Luca
>
> PS: to avoid any doubts you might have about who I am, the following is my
> web page: http://lucameyer.wordpress.com/
>
>
> 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>
>> ... or cleaner:
>>
>> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))
>>
>> Just for curiosity, was this homework? (in which case I should
>> probably have not provided you an answer -- that is, assuming that I
>> HAVE provided an answer).
>>
>> Cheers,
>> Bert
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>> (650) 467-7374
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>> Clifford Stoll
>>
>>
>> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote:
>> > z <- rnorm(nrow(f1)) ## or anything you want
>> > z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
>> >
>> > aggregate(v4~v1,f1,sum)
>> > aggregate(z1~v1,f1,sum)
>> > aggregate(v4~v2,f1,sum)
>> > aggregate(z1~v2,f1,sum)
>> > aggregate(v4~v3,f1,sum)
>> > aggregate(z1~v3,f1,sum)
>> >
>> > Cheers,
>> > Bert
>> >
>> > Bert Gunter
>> > Genentech Nonclinical Biostatistics
>> > (650) 467-7374
>> >
>> > "Data is not information. Information is not knowledge. And knowledge
>> > is certainly not wisdom."
>> > Clifford Stoll
>> >
>> >
>> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >> Hi Bert,
>> >>
>> >> Thank you for your message.
>> >> I am looking into ave() and tapply() as you suggested, but at the same
>> >> time I have prepared an example of input and output files, just in case
>> >> you or someone else would like to make an attempt to generate code that
>> >> goes from input to output.
>> >>
>> >> Please see below or download it from
>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>> >>
>> >> # this is (an extract of) the INPUT file I have:
>> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273,
>> >> 1.42917, 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >> row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>> >>
>> >> # this is (an extract of) the OUTPUT file I would like to obtain:
>> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> >> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806, 0.00295,
>> >> 1.77918, 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >> row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>> >>
>> >> # please notice that while the aggregated v4 on v3 has changed ...
>> >> aggregate(f1[,c("v4")],list(f1$v3),sum)
>> >> aggregate(f2[,c("v4")],list(f2$v3),sum)
>> >>
>> >> # ... the aggregated v4 over v1xv2 has remained unchanged:
>> >> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>> >> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
>> >>
>> >> Thank you very much in advance for your assistance.
>> >>
>> >> Luca
>> >>
>> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>> >>>
>> >>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
>> >>> for which ave() is a wrapper.
>> >>>
>> >>> 2. You still need to heed the rest of Jeff's advice.
>> >>>
>> >>> Cheers,
>> >>> Bert
>> >>>
>> >>> Bert Gunter
>> >>> Genentech Nonclinical Biostatistics
>> >>> (650) 467-7374
>> >>>
>> >>> "Data is not information. Information is not knowledge. And knowledge
>> >>> is certainly not wisdom."
>> >>> Clifford Stoll
>> >>>
>> >>>
>> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> > Hi Jeff & other R-experts,
>> >>> >
>> >>> > Thank you for your note. I have tried myself to solve the issue
>> >>> > without success.
>> >>> > >> >>> > Following your suggestion, I am providing a sample of the dataset I >> >>> > am >> >>> > using below (also downloadble in plain text from >> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0): >> >>> > >> >>> > #this is an extract of the overall dataset (n=1200 cases) >> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", >> >>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >> >>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >> >>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835, >> >>> > 3.43806581506388, >> >>> > 0.002733567617055, 1.42917483425029, 1.05786640463504, >> >>> > 0.000420548864162308, >> >>> > 2.37232740842861, 3.01835841813241, 0, 1.13430282139936, >> >>> > 0.928725667117666, >> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >> >>> > row.names >> >>> > >> >>> > c(2L, >> >>> > 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) >> >>> > >> >>> > I need to find a automated procedure that allows me to adjust v3 >> >>> > marginals >> >>> > while maintaining v1xv2 marginals unchanged. >> >>> > >> >>> > That is: modify the v4 values you can find by running: >> >>> > >> >>> > aggregate(f1[,c("v4")],list(f1$v3),sum) >> >>> > >> >>> > while maintaining costant the values you can find by running: >> >>> > >> >>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) >> >>> > >> >>> > Now does it make sense? >> >>> > >> >>> > Please notice I have tried to build some syntax that tries to modify >> >>> > values >> >>> > within each v1xv2 combination by computing sum of v4, row percentage >> >>> > in >> >>> > terms of v4, and there is where my effort is blocked. Not really >> >>> > sure >> >>> > how I >> >>> > should proceed. Any suggestion? >> >>> > >> >>> > Thanks, >> >>> > >> >>> > Luca >> >>> > >> >>> > >> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>: >> >>> > >> >>> >> I don't understand your description. The standard practice on this >> >>> >> list >> >>> >> is >> >>> >> to provide a reproducible R example [1] of the kind of data you are >> >>> >> working >> >>> >> with (and any code you have tried) to go along with your >> >>> >> description. >> >>> >> In >> >>> >> this case, that would be two dputs of your input data frames and a >> >>> >> dput >> >>> >> of >> >>> >> an output data frame (generated by hand from your input data >> >>> >> frame). >> >>> >> (Probably best to not use the full number of input values just to >> >>> >> keep >> >>> >> the >> >>> >> size down.) We could then make an attempt to generate code that >> >>> >> goes >> >>> >> from >> >>> >> input to output. >> >>> >> >> >>> >> Of course, if you post that hard work using HTML then it will get >> >>> >> corrupted (much like the text below from your earlier emails) and >> >>> >> we >> >>> >> won't >> >>> >> be able to use it. Please learn to post from your email software >> >>> >> using >> >>> >> plain text when corresponding with this mailing list. >> >>> >> >> >>> >> [1] >> >>> >> >> >>> >> >> >>> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example >> >>> >> >> >>> >> >> >>> >> --------------------------------------------------------------------------- >> >>> >> Jeff Newmiller The ..... ..... Go >> >>> >> Live... >> >>> >> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. >> >>> >> Live >> >>> >> Go... >> >>> >> Live: OO#.. Dead: OO#.. >> >>> >> Playing >> >>> >> Research Engineer (Solar/Batteries O.O#. #.O#. 
>> >>> >> with >> >>> >> /Software/Embedded Controllers) .OO#. .OO#. >> >>> >> rocks...1k >> >>> >> >> >>> >> >> >>> >> --------------------------------------------------------------------------- >> >>> >> Sent from my phone. Please excuse my brevity. >> >>> >> >> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com> >> >>> >> wrote: >> >>> >> >Thanks for you input Michael, >> >>> >> > >> >>> >> >The continuous variable I have measures quantities (down to the >> >>> >> > 3rd >> >>> >> >decimal level) so unfortunately are not frequencies. >> >>> >> > >> >>> >> >Any more specific suggestions on how that could be tackled? >> >>> >> > >> >>> >> >Thanks & kind regards, >> >>> >> > >> >>> >> >Luca >> >>> >> > >> >>> >> > >> >>> >> >==>> >>> >> > >> >>> >> >Michael Friendly wrote: >> >>> >> >I'm not sure I understand completely what you want to do, but >> >>> >> >if the data were frequencies, it sounds like task for fitting a >> >>> >> >loglinear model with the model formula >> >>> >> > >> >>> >> >~ V1*V2 + V3 >> >>> >> > >> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote: >> >>> >> >>* Hello, >> >>> >> >*>>* I am facing a quite challenging task (at least to me) and I >> >>> >> > was >> >>> >> >wondering >> >>> >> >*>* if someone could advise how R could assist me to speed the >> >>> >> > task >> >>> >> > up. >> >>> >> >*>>* I am dealing with a dataset with 3 discrete variables and one >> >>> >> >continuous >> >>> >> >*>* variable. The discrete variables are: >> >>> >> >*>>* V1: 8 modalities >> >>> >> >*>* V2: 13 modalities >> >>> >> >*>* V3: 13 modalities >> >>> >> >*>>* The continuous variable V4 is a decimal number always greater >> >>> >> > than >> >>> >> >zero in >> >>> >> >*>* the marginals of each of the 3 variables but it is sometimes >> >>> >> > equal >> >>> >> >to zero >> >>> >> >*>* (and sometimes negative) in the joint tables. >> >>> >> >*>>* I have got 2 files: >> >>> >> >*>>* => one with distribution of all possible combinations of >> >>> >> > V1xV2 >> >>> >> >(some of >> >>> >> >*>* which are zero or neagtive) and >> >>> >> >*>* => one with the marginal distribution of V3. >> >>> >> >*>>* I am trying to build the long and narrow dataset V1xV2xV3 in >> >>> >> > such >> >>> >> >a way >> >>> >> >*>* that each V1xV2 cell does not get modified and V3 fits as >> >>> >> > closely >> >>> >> >as >> >>> >> >*>* possible to its marginal distribution. Does it make sense? >> >>> >> >*>>* To be even more specific, my 2 input files look like the >> >>> >> >following. >> >>> >> >*>>* FILE 1 >> >>> >> >*>* V1,V2,V4 >> >>> >> >*>* A, A, 24.251 >> >>> >> >*>* A, B, 1.065 >> >>> >> >*>* (...) >> >>> >> >*>* B, C, 0.294 >> >>> >> >*>* B, D, 2.731 >> >>> >> >*>* (...) >> >>> >> >*>* H, L, 0.345 >> >>> >> >*>* H, M, 0.000 >> >>> >> >*>>* FILE 2 >> >>> >> >*>* V3, V4 >> >>> >> >*>* A, 1.575 >> >>> >> >*>* B, 4.294 >> >>> >> >*>* C, 10.044 >> >>> >> >*>* (...) >> >>> >> >*>* L, 5.123 >> >>> >> >*>* M, 3.334 >> >>> >> >*>>* What I need to achieve is a file such as the following >> >>> >> >*>>* FILE 3 >> >>> >> >*>* V1, V2, V3, V4 >> >>> >> >*>* A, A, A, ??? >> >>> >> >*>* A, A, B, ??? >> >>> >> >*>* (...) >> >>> >> >*>* D, D, E, ??? >> >>> >> >*>* D, D, F, ??? >> >>> >> >*>* (...) >> >>> >> >*>* H, M, L, ??? >> >>> >> >*>* H, M, M, ??? 
>> >>> >> >*>>* Please notice that FILE 3 need to be such that if I aggregate >> >>> >> > on >> >>> >> >V1+V2 I >> >>> >> >*>* recover exactly FILE 1 and that if I aggregate on V3 I can >> >>> >> > recover >> >>> >> >a file >> >>> >> >*>* as close as possible to FILE 3 (ideally the same file). >> >>> >> >*>>* Can anyone suggest how I could do that with R? >> >>> >> >*>>* Thank you very much indeed for any assistance you are able to >> >>> >> >provide. >> >>> >> >*>>* Kind regards, >> >>> >> >*>>* Luca* >> >>> >> > >> >>> >> > [[alternative HTML version deleted]] >> >>> >> > >> >>> >> >______________________________________________ >> >>> >> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help >> >>> >> >PLEASE do read the posting guide >> >>> >> >http://www.R-project.org/posting-guide.html >> >>> >> >and provide commented, minimal, self-contained, reproducible code. >> >>> >> >> >>> >> >> >>> > >> >>> > [[alternative HTML version deleted]] >> >>> > >> >>> > ______________________________________________ >> >>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> >>> > https://stat.ethz.ch/mailman/listinfo/r-help >> >>> > PLEASE do read the posting guide >> >>> > http://www.R-project.org/posting-guide.html >> >>> > and provide commented, minimal, self-contained, reproducible code. >> >> >> >> > >
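
Putting the pieces of the exchange above together, a self-contained version of the suggested procedure might look like the following sketch (the data.frame() call rebuilds the same f1 as the dput above, just without the original row names):

# rebuild the example data
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), 4),
  v3 = rep(rep(c("B", "C"), each = 3), 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0)
)

# spread the desired v3 totals (29 for B, 2.567 for C) evenly over the rows
# of each v3 level, then recenter within each v1 x v2 cell so that those
# cell totals are untouched
y  <- f1$v3                              # to simplify the notation
z  <- (c(29, 2.567) / table(y))[c(y)]    # each B row gets 29/6, each C row 2.567/6
z1 <- with(f1, v4 + z - ave(z, v1, v2, FUN = mean))

aggregate(v4 ~ v1 * v2, f1, sum)   # original v1 x v2 totals
aggregate(z1 ~ v1 * v2, f1, sum)   # identical: the centering removes the cell means
aggregate(v4 ~ v3, f1, sum)        # original v3 totals
aggregate(z1 ~ v3, f1, sum)        # adjusted v3 totals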
Oh, wait a minute ...

You still want the marginals for the other columns to stay as they were originally?

If so, then this is impossible in general: the sum of all the values must remain what it was originally, so you cannot choose your target values for V3 arbitrarily. (A quick consistency check along these lines is sketched just after this message.)

Or at least, that seems to be what you are trying to do.

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
Clifford Stoll


On Sun, Mar 22, 2015 at 7:55 AM, Bert Gunter <bgunter at gene.com> wrote:
> I would have thought that this is straightforward given my previous email...
>
> Just set z to what you want -- e.g., all B values to 29/number of B's,
> and all C values to 2.567/number of C's (etc. for more categories).
>
> y <- f1$v3 ## to simplify the notation; could be done using with()
> z <- (c(29,2.567)/table(y))[c(y)]
>
> Then proceed to z1 as I previously described
>
> (...)
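
Bert's constraint can be checked directly for this example: the construction z1 <- v4 + z - ave(z, v1, v2, FUN = mean) adds an amount that sums to zero within every v1 x v2 cell, so the grand total of z1 always equals the grand total of v4, and any set of target v3 totals therefore has to add up to that same grand total. A quick check, reusing the f1 from the sketch above:

sum(f1$v4)                    # 31.56723
sum(c(B = 29, C = 2.56723))   # 31.56723 -- so the requested targets are feasible
# a target such as c(B = 29, C = 10) could never be reached while the
# v1 x v2 totals are held fixed, because it does not match the grand total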
Hi Bert,

Thanks again for your assistance. Unfortunately, when I apply the additional code you suggest I get B=40.23326 & C=-8.66603, and not B=29 & C=2.56723. Any idea why that might be happening? (A short diagnostic of where the difference comes from follows just after this message.)

Please see below or on https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0 the code I am running:

rm(list=ls())

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
"B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
"C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042, 2.37232,
3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"),
class = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L,
167L, 197L, 204L, 206L))

# this is the procedure that Bert suggested (slightly adjusted):
y <- f1$v3 ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]
# z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)

Thanks again,

Luca

2015-03-22 15:55 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:

> I would have thought that this is straightforward given my previous
> email...
>
> Just set z to what you want -- e.g., all B values to 29/number of B's,
> and all C values to 2.567/number of C's (etc. for more categories).
>
> y <- f1$v3 ## to simplify the notation; could be done using with()
> z <- (c(29,2.567)/table(y))[c(y)]
>
> Then proceed to z1 as I previously described
>
> (...)
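
As for where B=40.23326 and C=-8.66603 come from: aggregating z1 by v3 returns the original v3 totals of v4 plus the v3 totals of the centered adjustment z - ave(z, v1, v2), not the values that were put into z. To land exactly on the targets, z would have to be chosen so that this centered adjustment, summed by v3, equals the target totals minus the original ones. A small diagnostic, again reusing the f1 and z from the first sketch:

adj <- with(f1, z - ave(z, v1, v2, FUN = mean))  # what actually gets added to v4
tapply(f1$v4, f1$v3, sum)         # original v3 totals:  B 27.01676, C 4.55047
tapply(adj, f1$v3, sum)           # added per v3 level:  B +13.2165, C -13.2165
tapply(f1$v4 + adj, f1$v3, sum)   # hence B 40.23326 and C -8.66603, as reported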
Sorry, I forgot to keep the rest of the group in the loop - Luca

---------- Forwarded message ----------
From: Luca Meyer <lucam1968 at gmail.com>
Date: 2015-03-22 16:27 GMT+01:00
Subject: Re: [R] Joining two datasets - recursive procedure?
To: Bert Gunter <gunter.berton at gene.com>

Hi Bert,

That is exactly what I am trying to achieve. Please notice that negative v4 values are allowed.

I have done a similar task in the past manually, by recursively altering the v4 distribution across the v3 categories within each fixed v1 & v2 combination, so I am quite positive it can be achieved. Honestly, though, it took me forever to do it manually, and since this is likely to be an exercise I need to repeat from time to time, I wish I could learn how to do it programmatically... (a sketch of such an iterative procedure is included at the end of this message).

Thanks again for any further suggestion you might have,

Luca

2015-03-22 16:05 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:

> Oh, wait a minute ...
>
> You still want the marginals for the other columns to be as originally?
>
> If so, then this is impossible in general as the sum of all the values
> must be what they were originally and you cannot therefore choose your
> values for V3 arbitrarily.
>
> Or at least, that seems to be what you are trying to do.
>
> -- Bert
>
> (...)
the following > >> procedure (available also from > >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0) > >> > >> rm(list=ls()) > >> > >> # this is (an extract of) the INPUT file I have: > >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", > >> "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", > >> "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C", > >> "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917, 1.05786, 0.00042, > 2.37232, > >> 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"), > class > >> = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, > 167L, > >> 197L, 204L, 206L)) > >> > >> # this is the procedure that Bert suggested (slightly adjusted): > >> z <- rnorm(nrow(f1)) ## or anything you want > >> z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5) > >> aggregate(v4~v1*v2,f1,sum) > >> aggregate(z1~v1*v2,f1,sum) > >> aggregate(v4~v3,f1,sum) > >> aggregate(z1~v3,f1,sum) > >> > >> My question to you is: how can I set z so that I can obtain specific > values > >> for z1-v4 in the v3 aggregation? > >> In other words, how can I configure the procedure so that e.g. B=29 and > >> C=2.56723 after running the procedure: > >> aggregate(z1~v3,f1,sum) > >> > >> Thank you, > >> > >> Luca > >> > >> PS: to avoid any doubts you might have about who I am the following is > my > >> web page: http://lucameyer.wordpress.com/ > >> > >> > >> 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: > >>> > >>> ... or cleaner: > >>> > >>> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean)) > >>> > >>> > >>> Just for curiosity, was this homework? (in which case I should > >>> probably have not provided you an answer -- that is, assuming that I > >>> HAVE provided an answer). > >>> > >>> Cheers, > >>> Bert > >>> > >>> Bert Gunter > >>> Genentech Nonclinical Biostatistics > >>> (650) 467-7374 > >>> > >>> "Data is not information. Information is not knowledge. And knowledge > >>> is certainly not wisdom." > >>> Clifford Stoll > >>> > >>> > >>> > >>> > >>> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote: > >>> > z <- rnorm(nrow(f1)) ## or anything you want > >>> > z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean)) > >>> > > >>> > > >>> > aggregate(v4~v1,f1,sum) > >>> > aggregate(z1~v1,f1,sum) > >>> > aggregate(v4~v2,f1,sum) > >>> > aggregate(z1~v2,f1,sum) > >>> > aggregate(v4~v3,f1,sum) > >>> > aggregate(z1~v3,f1,sum) > >>> > > >>> > > >>> > Cheers, > >>> > Bert > >>> > > >>> > Bert Gunter > >>> > Genentech Nonclinical Biostatistics > >>> > (650) 467-7374 > >>> > > >>> > "Data is not information. Information is not knowledge. And knowledge > >>> > is certainly not wisdom." > >>> > Clifford Stoll > >>> > > >>> > > >>> > > >>> > > >>> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> > wrote: > >>> >> Hi Bert, > >>> >> > >>> >> Thank you for your message. I am looking into ave() and tapply() as > you > >>> >> suggested but at the same time I have prepared a example of input > and > >>> >> output > >>> >> files, just in case you or someone else would like to make an > attempt > >>> >> to > >>> >> generate a code that goes from input to output. 
> >>> >> > >>> >> Please see below or download it from > >>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0 > >>> >> > >>> >> # this is (an extract of) the INPUT file I have: > >>> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", > >>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", > >>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", > >>> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, > >>> >> 1.42917, > >>> >> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872, > >>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", > >>> >> row.names > >>> >> c(2L, > >>> >> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) > >>> >> > >>> >> # this is (an extract of) the OUTPUT file I would like to obtain: > >>> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", > >>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", > >>> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", > >>> >> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,0.00295, > >>> >> 1.77918, > >>> >> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872, > >>> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", > >>> >> row.names > >>> >> c(2L, > >>> >> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) > >>> >> > >>> >> # please notice that while the aggregated v4 on v3 has changed ? > >>> >> aggregate(f1[,c("v4")],list(f1$v3),sum) > >>> >> aggregate(f2[,c("v4")],list(f2$v3),sum) > >>> >> > >>> >> # ? the aggregated v4 over v1xv2 has remained unchanged: > >>> >> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) > >>> >> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum) > >>> >> > >>> >> Thank you very much in advance for your assitance. > >>> >> > >>> >> Luca > >>> >> > >>> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: > >>> >>> > >>> >>> 1. Still not sure what you mean, but maybe look at ?ave and > ?tapply, > >>> >>> for which ave() is a wrapper. > >>> >>> > >>> >>> 2. You still need to heed the rest of Jeff's advice. > >>> >>> > >>> >>> Cheers, > >>> >>> Bert > >>> >>> > >>> >>> Bert Gunter > >>> >>> Genentech Nonclinical Biostatistics > >>> >>> (650) 467-7374 > >>> >>> > >>> >>> "Data is not information. Information is not knowledge. And > knowledge > >>> >>> is certainly not wisdom." > >>> >>> Clifford Stoll > >>> >>> > >>> >>> > >>> >>> > >>> >>> > >>> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> > >>> >>> wrote: > >>> >>> > Hi Jeff & other R-experts, > >>> >>> > > >>> >>> > Thank you for your note. I have tried myself to solve the issue > >>> >>> > without > >>> >>> > success. 
> >>> >>> > > >>> >>> > Following your suggestion, I am providing a sample of the > dataset I > >>> >>> > am > >>> >>> > using below (also downloadble in plain text from > >>> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0): > >>> >>> > > >>> >>> > #this is an extract of the overall dataset (n=1200 cases) > >>> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", > "B", > >>> >>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", > >>> >>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", > >>> >>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835, > >>> >>> > 3.43806581506388, > >>> >>> > 0.002733567617055, 1.42917483425029, 1.05786640463504, > >>> >>> > 0.000420548864162308, > >>> >>> > 2.37232740842861, 3.01835841813241, 0, 1.13430282139936, > >>> >>> > 0.928725667117666, > >>> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", > >>> >>> > row.names > >>> >>> > > >>> >>> > c(2L, > >>> >>> > 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) > >>> >>> > > >>> >>> > I need to find a automated procedure that allows me to adjust v3 > >>> >>> > marginals > >>> >>> > while maintaining v1xv2 marginals unchanged. > >>> >>> > > >>> >>> > That is: modify the v4 values you can find by running: > >>> >>> > > >>> >>> > aggregate(f1[,c("v4")],list(f1$v3),sum) > >>> >>> > > >>> >>> > while maintaining costant the values you can find by running: > >>> >>> > > >>> >>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) > >>> >>> > > >>> >>> > Now does it make sense? > >>> >>> > > >>> >>> > Please notice I have tried to build some syntax that tries to > modify > >>> >>> > values > >>> >>> > within each v1xv2 combination by computing sum of v4, row > percentage > >>> >>> > in > >>> >>> > terms of v4, and there is where my effort is blocked. Not really > >>> >>> > sure > >>> >>> > how I > >>> >>> > should proceed. Any suggestion? > >>> >>> > > >>> >>> > Thanks, > >>> >>> > > >>> >>> > Luca > >>> >>> > > >>> >>> > > >>> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller < > jdnewmil at dcn.davis.ca.us>: > >>> >>> > > >>> >>> >> I don't understand your description. The standard practice on > this > >>> >>> >> list > >>> >>> >> is > >>> >>> >> to provide a reproducible R example [1] of the kind of data you > are > >>> >>> >> working > >>> >>> >> with (and any code you have tried) to go along with your > >>> >>> >> description. > >>> >>> >> In > >>> >>> >> this case, that would be two dputs of your input data frames > and a > >>> >>> >> dput > >>> >>> >> of > >>> >>> >> an output data frame (generated by hand from your input data > >>> >>> >> frame). > >>> >>> >> (Probably best to not use the full number of input values just > to > >>> >>> >> keep > >>> >>> >> the > >>> >>> >> size down.) We could then make an attempt to generate code that > >>> >>> >> goes > >>> >>> >> from > >>> >>> >> input to output. > >>> >>> >> > >>> >>> >> Of course, if you post that hard work using HTML then it will > get > >>> >>> >> corrupted (much like the text below from your earlier emails) > and > >>> >>> >> we > >>> >>> >> won't > >>> >>> >> be able to use it. Please learn to post from your email software > >>> >>> >> using > >>> >>> >> plain text when corresponding with this mailing list. 
>> >>> >>
>> >>> >> [1]
>> >>> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> >>> >>
>> >>> >> ---------------------------------------------------------------------------
>> >>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>> >>> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>> >>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >>> >> ---------------------------------------------------------------------------
>> >>> >> Sent from my phone. Please excuse my brevity.
>> >>> >>
>> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >> > Thanks for your input Michael,
>> >>> >> >
>> >>> >> > The continuous variable I have measures quantities (down to the 3rd
>> >>> >> > decimal level), so unfortunately they are not frequencies.
>> >>> >> >
>> >>> >> > Any more specific suggestions on how that could be tackled?
>> >>> >> >
>> >>> >> > Thanks & kind regards,
>> >>> >> >
>> >>> >> > Luca
>> >>> >> >
>> >>> >> > ==
>> >>> >> >
>> >>> >> > Michael Friendly wrote:
>> >>> >> > I'm not sure I understand completely what you want to do, but
>> >>> >> > if the data were frequencies, it sounds like a task for fitting a
>> >>> >> > loglinear model with the model formula
>> >>> >> >
>> >>> >> > ~ V1*V2 + V3
>> >>> >> >
>> >>> >> > On 3/18/2015 2:17 AM, Luca Meyer wrote:
>> >>> >> >> Hello,
>> >>> >> >>
>> >>> >> >> I am facing a quite challenging task (at least to me) and I was
>> >>> >> >> wondering if someone could advise how R could assist me to speed the
>> >>> >> >> task up.
>> >>> >> >>
>> >>> >> >> I am dealing with a dataset with 3 discrete variables and one
>> >>> >> >> continuous variable. The discrete variables are:
>> >>> >> >>
>> >>> >> >> V1: 8 modalities
>> >>> >> >> V2: 13 modalities
>> >>> >> >> V3: 13 modalities
>> >>> >> >>
>> >>> >> >> The continuous variable V4 is a decimal number always greater than
>> >>> >> >> zero in the marginals of each of the 3 variables, but it is sometimes
>> >>> >> >> equal to zero (and sometimes negative) in the joint tables.
>> >>> >> >>
>> >>> >> >> I have got 2 files:
>> >>> >> >>
>> >>> >> >> => one with the distribution of all possible combinations of V1xV2
>> >>> >> >>    (some of which are zero or negative) and
>> >>> >> >> => one with the marginal distribution of V3.
>> >>> >> >>
>> >>> >> >> I am trying to build the long and narrow dataset V1xV2xV3 in such a
>> >>> >> >> way that each V1xV2 cell does not get modified and V3 fits as closely
>> >>> >> >> as possible to its marginal distribution. Does it make sense?
>> >>> >> >>
>> >>> >> >> To be even more specific, my 2 input files look like the following.
>> >>> >> >>
>> >>> >> >> FILE 1
>> >>> >> >> V1,V2,V4
>> >>> >> >> A, A, 24.251
>> >>> >> >> A, B, 1.065
>> >>> >> >> (...)
>> >>> >> >> B, C, 0.294
>> >>> >> >> B, D, 2.731
>> >>> >> >> (...)
>> >>> >> >> H, L, 0.345
>> >>> >> >> H, M, 0.000
>> >>> >> >>
>> >>> >> >> FILE 2
>> >>> >> >> V3, V4
>> >>> >> >> A, 1.575
>> >>> >> >> B, 4.294
>> >>> >> >> C, 10.044
>> >>> >> >> (...)
>> >>> >> >> L, 5.123
>> >>> >> >> M, 3.334
>> >>> >> >>
>> >>> >> >> What I need to achieve is a file such as the following
>> >>> >> >>
>> >>> >> >> FILE 3
>> >>> >> >> V1, V2, V3, V4
>> >>> >> >> A, A, A, ???
>> >>> >> >> A, A, B, ???
>> >>> >> >> (...)
>> >>> >> >> D, D, E, ???
>> >>> >> >> D, D, F, ???
>> >>> >> >> (...)
>> >>> >> >> H, M, L, ???
>> >>> >> >> H, M, M, ???
>> >>> >> >>
>> >>> >> >> Please notice that FILE 3 needs to be such that if I aggregate on
>> >>> >> >> V1+V2 I recover exactly FILE 1, and that if I aggregate on V3 I
>> >>> >> >> recover a file as close as possible to FILE 2 (ideally the same file).
>> >>> >> >>
>> >>> >> >> Can anyone suggest how I could do that with R?
>> >>> >> >>
>> >>> >> >> Thank you very much indeed for any assistance you are able to provide.
>> >>> >> >>
>> >>> >> >> Kind regards,
>> >>> >> >>
>> >>> >> >> Luca
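For the FILE 1 + FILE 2 question above, a minimal sketch of the proportional split is given below. This is not code posted in the thread; it is the allocation that the suggested loglinear formula ~ V1*V2 + V3 corresponds to when the values are frequencies, and the names file1, file2 and file3 are placeholders holding only the few rows shown above. It reproduces FILE 1 exactly on aggregation, but matches FILE 2 only up to proportionality, which is the gap the rest of the thread is about.

## assumed toy versions of FILE 1 and FILE 2 (only the rows quoted above)
file1 <- data.frame(V1 = c("A", "A", "B"),
                    V2 = c("A", "B", "C"),
                    V4 = c(24.251, 1.065, 0.294))
file2 <- data.frame(V3 = c("A", "B", "C"),
                    V4 = c(1.575, 4.294, 10.044))

## cross every V1xV2 cell with every V3 level, then split each cell's V4
## according to the V3 shares implied by file2
file3 <- merge(file1, data.frame(V3 = file2$V3), by = NULL)   # Cartesian product
file3$V4 <- file3$V4 * file2$V4[match(file3$V3, file2$V3)] / sum(file2$V4)

aggregate(V4 ~ V1 + V2, data = file3, sum)  # reproduces file1 exactly
aggregate(V4 ~ V3, data = file3, sum)       # proportional to file2; exact only when
                                            # sum(file1$V4) == sum(file2$V4)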
Nonsense. You are not telling us something, or I have failed to understand
something. Consider:

v1 <- c("a","b")
v2 <- c("a","a")

It is not possible to change the sum of the values corresponding to v2 == "a"
without also changing the sums for v1, which are not supposed to change
according to my understanding of your specification.

So I'm done.

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll


On Sun, Mar 22, 2015 at 8:28 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
> Sorry, I forgot to keep the rest of the group in the loop - Luca
>
> ---------- Forwarded message ----------
> From: Luca Meyer <lucam1968 at gmail.com>
> Date: 2015-03-22 16:27 GMT+01:00
> Subject: Re: [R] Joining two datasets - recursive procedure?
> To: Bert Gunter <gunter.berton at gene.com>
>
> Hi Bert,
>
> That is exactly what I am trying to achieve. Please notice that negative v4
> values are allowed. I have done a similar task in the past manually, by
> recursively altering the v4 distribution across the v3 categories within
> each fixed v1 & v2 combination, so I am quite positive it can be achieved.
> But honestly it took me forever to do it manually, and since this is likely
> to be an exercise I need to repeat from time to time, I wish I could learn
> how to do it programmatically....
>
> Thanks again for any further suggestion you might have,
>
> Luca
>
>
> 2015-03-22 16:05 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>
>> Oh, wait a minute ...
>>
>> You still want the marginals for the other columns to stay as they were
>> originally?
>>
>> If so, then this is impossible in general, as the sum of all the values
>> must be what it was originally, and you cannot therefore choose your
>> values for V3 arbitrarily.
>>
>> Or at least, that seems to be what you are trying to do.
>>
>> -- Bert
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>> (650) 467-7374
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>> Clifford Stoll
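To make the z construction discussed above concrete: for any choice of z, the quantity z1 <- v4 + z - ave(z, v1, v2, FUN = mean) leaves every v1xv2 sum exactly as it was, because the per-cell mean of z is subtracted back out; whether the v3 sums can also be steered onto chosen targets is precisely the feasibility point Bert raises. A minimal sketch, not from the thread, assuming the targets sum to sum(f1$v4) and that every v1xv2 cell contains each v3 level equally often (as in the 12-row extract):

## f1 as defined earlier in the thread
tgt <- c(B = 29, C = 2.56723)                   # desired v3 sums (from the thread)
cur <- tapply(f1$v4, f1$v3, sum)[names(tgt)]    # current v3 sums, in target order
n   <- table(f1$v3)[names(tgt)]                 # rows per v3 level, in target order
delta <- setNames(as.numeric(tgt - cur) / as.numeric(n), names(tgt))

z  <- as.numeric(delta[f1$v3])                  # per-row shift, constant within a v3 level
z1 <- with(f1, v4 + z - ave(z, v1, v2, FUN = mean))

aggregate(z1 ~ v1 * v2, data = f1, sum)   # identical to aggregate(v4 ~ v1*v2, f1, sum)
aggregate(z1 ~ v3, data = f1, sum)        # ~29 and ~2.56723; exact when the targets
                                          # sum to sum(f1$v4) and the cells are balanced

When those assumptions fail, a single pass like this will generally not land exactly on the targets, and one would have to iterate the adjustment or solve for z as a small linear system instead.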