... or cleaner:

z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))

Just for curiosity, was this homework? (in which case I should
probably have not provided you an answer -- that is, assuming that I
HAVE provided an answer).

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll

On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote:
> z <- rnorm(nrow(f1)) ## or anything you want
> z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
>
> aggregate(v4~v1,f1,sum)
> aggregate(z1~v1,f1,sum)
> aggregate(v4~v2,f1,sum)
> aggregate(z1~v2,f1,sum)
> aggregate(v4~v3,f1,sum)
> aggregate(z1~v3,f1,sum)
>
> Cheers,
> Bert
>
> On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> Hi Bert,
>>
>> Thank you for your message. I am looking into ave() and tapply() as you
>> suggested, but at the same time I have prepared an example of input and
>> output files, in case you or someone else would like to attempt to
>> generate code that goes from input to output.
>>
>> Please see below or download it from
>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>>
>> # this is (an extract of) the INPUT file I have:
>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917,
>> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
>> c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>
>> # this is (an extract of) the OUTPUT file I would like to obtain:
>> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806, 0.00295, 1.77918,
>> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
>> c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>
>> # please notice that while the aggregated v4 on v3 has changed ...
>> aggregate(f1[,c("v4")],list(f1$v3),sum)
>> aggregate(f2[,c("v4")],list(f2$v3),sum)
>>
>> # ... the aggregated v4 over v1xv2 has remained unchanged:
>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
>>
>> Thank you very much in advance for your assistance.
>>
>> Luca
>>
>> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
>>> for which ave() is a wrapper.
>>>
>>> 2. You still need to heed the rest of Jeff's advice.
>>>
>>> Cheers,
>>> Bert
>>>
>>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>>>> Hi Jeff & other R-experts,
>>>>
>>>> Thank you for your note. I have tried to solve the issue myself,
>>>> without success.
>>>>
>>>> Following your suggestion, I am providing a sample of the dataset I am
>>>> using below (also downloadable in plain text from
>>>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>>>>
>>>> #this is an extract of the overall dataset (n=1200 cases)
>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>>>> "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
>>>> 3.43806581506388, 0.002733567617055, 1.42917483425029,
>>>> 1.05786640463504, 0.000420548864162308, 2.37232740842861,
>>>> 3.01835841813241, 0, 1.13430282139936, 0.928725667117666,
>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
>>>> c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>>>
>>>> I need to find an automated procedure that allows me to adjust the v3
>>>> marginals while keeping the v1xv2 marginals unchanged.
>>>>
>>>> That is: modify the v4 values you can find by running:
>>>>
>>>> aggregate(f1[,c("v4")],list(f1$v3),sum)
>>>>
>>>> while keeping constant the values you can find by running:
>>>>
>>>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>>>>
>>>> Now does it make sense?
>>>>
>>>> Please notice I have tried to build some syntax that modifies values
>>>> within each v1xv2 combination by computing the sum of v4 and the row
>>>> percentage in terms of v4, and that is where my effort is blocked. I am
>>>> not really sure how I should proceed. Any suggestion?
>>>>
>>>> Thanks,
>>>>
>>>> Luca
>>>>
>>>> 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>>>>> I don't understand your description. The standard practice on this
>>>>> list is to provide a reproducible R example [1] of the kind of data
>>>>> you are working with (and any code you have tried) to go along with
>>>>> your description. In this case, that would be two dputs of your input
>>>>> data frames and a dput of an output data frame (generated by hand from
>>>>> your input data frame). (Probably best to not use the full number of
>>>>> input values just to keep the size down.) We could then make an
>>>>> attempt to generate code that goes from input to output.
>>>>>
>>>>> Of course, if you post that hard work using HTML then it will get
>>>>> corrupted (much like the text below from your earlier emails) and we
>>>>> won't be able to use it. Please learn to post from your email software
>>>>> using plain text when corresponding with this mailing list.
>>>>>
>>>>> [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>> ---------------------------------------------------------------------------
>>>>> Sent from my phone. Please excuse my brevity.
>>>>>
>>>>> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com> wrote:
>>>>>> Thanks for your input Michael,
>>>>>>
>>>>>> The continuous variable I have measures quantities (down to the 3rd
>>>>>> decimal level), so unfortunately they are not frequencies.
>>>>>>
>>>>>> Any more specific suggestions on how that could be tackled?
>>>>>>
>>>>>> Thanks & kind regards,
>>>>>>
>>>>>> Luca
>>>>>>
>>>>>> ==
>>>>>>
>>>>>> Michael Friendly wrote:
>>>>>> I'm not sure I understand completely what you want to do, but
>>>>>> if the data were frequencies, it sounds like a task for fitting a
>>>>>> loglinear model with the model formula
>>>>>>
>>>>>> ~ V1*V2 + V3
>>>>>>
>>>>>> On 3/18/2015 2:17 AM, Luca Meyer wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am facing a quite challenging task (at least to me) and I was
>>>>>>> wondering if someone could advise how R could assist me to speed
>>>>>>> the task up.
>>>>>>>
>>>>>>> I am dealing with a dataset with 3 discrete variables and one
>>>>>>> continuous variable. The discrete variables are:
>>>>>>>
>>>>>>> V1: 8 modalities
>>>>>>> V2: 13 modalities
>>>>>>> V3: 13 modalities
>>>>>>>
>>>>>>> The continuous variable V4 is a decimal number always greater than
>>>>>>> zero in the marginals of each of the 3 variables, but it is
>>>>>>> sometimes equal to zero (and sometimes negative) in the joint tables.
>>>>>>>
>>>>>>> I have got 2 files:
>>>>>>>
>>>>>>> => one with the distribution of all possible combinations of V1xV2
>>>>>>> (some of which are zero or negative) and
>>>>>>> => one with the marginal distribution of V3.
>>>>>>>
>>>>>>> I am trying to build the long and narrow dataset V1xV2xV3 in such a
>>>>>>> way that each V1xV2 cell does not get modified and V3 fits as
>>>>>>> closely as possible to its marginal distribution. Does it make sense?
>>>>>>>
>>>>>>> To be even more specific, my 2 input files look like the following.
>>>>>>>
>>>>>>> FILE 1
>>>>>>> V1,V2,V4
>>>>>>> A, A, 24.251
>>>>>>> A, B, 1.065
>>>>>>> (...)
>>>>>>> B, C, 0.294
>>>>>>> B, D, 2.731
>>>>>>> (...)
>>>>>>> H, L, 0.345
>>>>>>> H, M, 0.000
>>>>>>>
>>>>>>> FILE 2
>>>>>>> V3, V4
>>>>>>> A, 1.575
>>>>>>> B, 4.294
>>>>>>> C, 10.044
>>>>>>> (...)
>>>>>>> L, 5.123
>>>>>>> M, 3.334
>>>>>>>
>>>>>>> What I need to achieve is a file such as the following
>>>>>>>
>>>>>>> FILE 3
>>>>>>> V1, V2, V3, V4
>>>>>>> A, A, A, ???
>>>>>>> A, A, B, ???
>>>>>>> (...)
>>>>>>> D, D, E, ???
>>>>>>> D, D, F, ???
>>>>>>> (...)
>>>>>>> H, M, L, ???
>>>>>>> H, M, M, ???
>>>>>>>
>>>>>>> Please notice that FILE 3 needs to be such that if I aggregate on
>>>>>>> V1+V2 I recover exactly FILE 1, and that if I aggregate on V3 I
>>>>>>> recover a file as close as possible to FILE 2 (ideally the same file).
>>>>>>>
>>>>>>> Can anyone suggest how I could do that with R?
>>>>>>>
>>>>>>> Thank you very much indeed for any assistance you are able to provide.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Luca
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
Hi Bert, hello R-experts,
I am close to a solution but I still need one hint w.r.t. the following
procedure (available also from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)
rm(list=ls())
# this is (an extract of) the INPUT file I have:
f1 <- structure(list(
  v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
  v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C", "C"),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0)),
  .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
  row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
# this is the procedure that Bert suggested (slightly adjusted):
z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)
My question to you is: how can I set z so that I can obtain specific values
for z1-v4 in the v3 aggregation?
In other words, how can I configure the procedure so that e.g. B=29 and
C=2.56723 after running the procedure:
aggregate(z1~v3,f1,sum)
Thank you,
Luca
PS: to avoid any doubts you might have about who I am, the following is my
web page: http://lucameyer.wordpress.com/
2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
> ... or cleaner:
> [...]
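Why the procedure leaves the v1 x v2 totals alone: the term z - ave(z, v1, v2, FUN = mean) is the deviation of z from its own cell mean, so it sums to zero inside every v1 x v2 cell, whatever z is. A minimal sketch checking this on the 12-row extract follows. The fixed seed, the helper names d, cell_before and cell_after, and the rebuilt data frame (original row names dropped) are assumptions made for illustration, not part of the thread.

```r
# Rebuild the 12-row extract from the thread (original row names dropped).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

set.seed(1)            # assumption: fixed seed, only so the run is repeatable
z <- rnorm(nrow(f1))   # arbitrary perturbation, as in the thread

# Deviation of z from its v1 x v2 cell mean; sums to zero within each cell.
d  <- with(f1, z - ave(z, v1, v2, FUN = mean))
z1 <- f1$v4 + d

cell_before <- aggregate(v4 ~ v1 + v2, f1, sum)
cell_after  <- aggregate(z1 ~ v1 + v2, cbind(f1, z1 = z1), sum)

# Cell totals agree up to floating-point error ...
stopifnot(isTRUE(all.equal(cell_before$v4, cell_after$z1)))

# ... while the v3 totals have generally moved.
aggregate(z1 ~ v3, cbind(f1, z1 = z1), sum)
```

Rounding z1 to 5 digits, as in the adjusted procedure, can reintroduce differences in the last decimal place of the cell totals.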
I would have thought that this is straightforward given my previous email... Just set z to what you want -- e,g, all B values to 29/number of B's, and all C values to 2.567/number of C's (etc. for more categories). A slick but sort of cheat way to do this programmatically -- in the sense that it relies on the implementation of factor() rather than its API -- is: y <- f1$v3 ## to simplify the notation; could be done using with() z <- (c(29,2.567)/table(y))[c(y)] Then proceed to z1 as I previously described -- Bert Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374 "Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." Clifford Stoll On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> wrote:> Hi Bert, hello R-experts, > > I am close to a solution but I still need one hint w.r.t. the following > procedure (available also from > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0) > > rm(list=ls()) > > # this is (an extract of) the INPUT file I have: > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", > "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", > "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C", > "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917, 1.05786, 0.00042, 2.37232, > 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"), class > = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, > 197L, 204L, 206L)) > > # this is the procedure that Bert suggested (slightly adjusted): > z <- rnorm(nrow(f1)) ## or anything you want > z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5) > aggregate(v4~v1*v2,f1,sum) > aggregate(z1~v1*v2,f1,sum) > aggregate(v4~v3,f1,sum) > aggregate(z1~v3,f1,sum) > > My question to you is: how can I set z so that I can obtain specific values > for z1-v4 in the v3 aggregation? > In other words, how can I configure the procedure so that e.g. 
B=29 and > C=2.56723 after running the procedure: > aggregate(z1~v3,f1,sum) > > Thank you, > > Luca > > PS: to avoid any doubts you might have about who I am the following is my > web page: http://lucameyer.wordpress.com/ > > > 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: >> >> ... or cleaner: >> >> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean)) >> >> >> Just for curiosity, was this homework? (in which case I should >> probably have not provided you an answer -- that is, assuming that I >> HAVE provided an answer). >> >> Cheers, >> Bert >> >> Bert Gunter >> Genentech Nonclinical Biostatistics >> (650) 467-7374 >> >> "Data is not information. Information is not knowledge. And knowledge >> is certainly not wisdom." >> Clifford Stoll >> >> >> >> >> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote: >> > z <- rnorm(nrow(f1)) ## or anything you want >> > z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean)) >> > >> > >> > aggregate(v4~v1,f1,sum) >> > aggregate(z1~v1,f1,sum) >> > aggregate(v4~v2,f1,sum) >> > aggregate(z1~v2,f1,sum) >> > aggregate(v4~v3,f1,sum) >> > aggregate(z1~v3,f1,sum) >> > >> > >> > Cheers, >> > Bert >> > >> > Bert Gunter >> > Genentech Nonclinical Biostatistics >> > (650) 467-7374 >> > >> > "Data is not information. Information is not knowledge. And knowledge >> > is certainly not wisdom." >> > Clifford Stoll >> > >> > >> > >> > >> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote: >> >> Hi Bert, >> >> >> >> Thank you for your message. I am looking into ave() and tapply() as you >> >> suggested but at the same time I have prepared a example of input and >> >> output >> >> files, just in case you or someone else would like to make an attempt >> >> to >> >> generate a code that goes from input to output. 
>> >> >> >> Please see below or download it from >> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0 >> >> >> >> # this is (an extract of) the INPUT file I have: >> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", >> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, >> >> 1.42917, >> >> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872, >> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >> >> row.names >> >> c(2L, >> >> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) >> >> >> >> # this is (an extract of) the OUTPUT file I would like to obtain: >> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", >> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >> >> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,0.00295, >> >> 1.77918, >> >> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872, >> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >> >> row.names >> >> c(2L, >> >> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) >> >> >> >> # please notice that while the aggregated v4 on v3 has changed ? >> >> aggregate(f1[,c("v4")],list(f1$v3),sum) >> >> aggregate(f2[,c("v4")],list(f2$v3),sum) >> >> >> >> # ? the aggregated v4 over v1xv2 has remained unchanged: >> >> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) >> >> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum) >> >> >> >> Thank you very much in advance for your assitance. >> >> >> >> Luca >> >> >> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: >> >>> >> >>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply, >> >>> for which ave() is a wrapper. >> >>> >> >>> 2. You still need to heed the rest of Jeff's advice. 
>> >>> >> >>> Cheers, >> >>> Bert >> >>> >> >>> Bert Gunter >> >>> Genentech Nonclinical Biostatistics >> >>> (650) 467-7374 >> >>> >> >>> "Data is not information. Information is not knowledge. And knowledge >> >>> is certainly not wisdom." >> >>> Clifford Stoll >> >>> >> >>> >> >>> >> >>> >> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> >> >>> wrote: >> >>> > Hi Jeff & other R-experts, >> >>> > >> >>> > Thank you for your note. I have tried myself to solve the issue >> >>> > without >> >>> > success. >> >>> > >> >>> > Following your suggestion, I am providing a sample of the dataset I >> >>> > am >> >>> > using below (also downloadble in plain text from >> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0): >> >>> > >> >>> > #this is an extract of the overall dataset (n=1200 cases) >> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", >> >>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >> >>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >> >>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835, >> >>> > 3.43806581506388, >> >>> > 0.002733567617055, 1.42917483425029, 1.05786640463504, >> >>> > 0.000420548864162308, >> >>> > 2.37232740842861, 3.01835841813241, 0, 1.13430282139936, >> >>> > 0.928725667117666, >> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >> >>> > row.names >> >>> > >> >>> > c(2L, >> >>> > 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) >> >>> > >> >>> > I need to find a automated procedure that allows me to adjust v3 >> >>> > marginals >> >>> > while maintaining v1xv2 marginals unchanged. 
>> >>> > >> >>> > That is: modify the v4 values you can find by running: >> >>> > >> >>> > aggregate(f1[,c("v4")],list(f1$v3),sum) >> >>> > >> >>> > while maintaining costant the values you can find by running: >> >>> > >> >>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) >> >>> > >> >>> > Now does it make sense? >> >>> > >> >>> > Please notice I have tried to build some syntax that tries to modify >> >>> > values >> >>> > within each v1xv2 combination by computing sum of v4, row percentage >> >>> > in >> >>> > terms of v4, and there is where my effort is blocked. Not really >> >>> > sure >> >>> > how I >> >>> > should proceed. Any suggestion? >> >>> > >> >>> > Thanks, >> >>> > >> >>> > Luca >> >>> > >> >>> > >> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>: >> >>> > >> >>> >> I don't understand your description. The standard practice on this >> >>> >> list >> >>> >> is >> >>> >> to provide a reproducible R example [1] of the kind of data you are >> >>> >> working >> >>> >> with (and any code you have tried) to go along with your >> >>> >> description. >> >>> >> In >> >>> >> this case, that would be two dputs of your input data frames and a >> >>> >> dput >> >>> >> of >> >>> >> an output data frame (generated by hand from your input data >> >>> >> frame). >> >>> >> (Probably best to not use the full number of input values just to >> >>> >> keep >> >>> >> the >> >>> >> size down.) We could then make an attempt to generate code that >> >>> >> goes >> >>> >> from >> >>> >> input to output. >> >>> >> >> >>> >> Of course, if you post that hard work using HTML then it will get >> >>> >> corrupted (much like the text below from your earlier emails) and >> >>> >> we >> >>> >> won't >> >>> >> be able to use it. Please learn to post from your email software >> >>> >> using >> >>> >> plain text when corresponding with this mailing list. 
>> >>> >>
>> >>> >> [1]
>> >>> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> >>> >>
>> >>> >> ---------------------------------------------------------------------------
>> >>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>> >>> >> DCN:<jdnewmil at dcn.davis.ca.us>      Basics: ##.#.       ##.#.  Live Go...
>> >>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>> >>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >>> >> ---------------------------------------------------------------------------
>> >>> >> Sent from my phone. Please excuse my brevity.
>> >>> >>
>> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >> >Thanks for your input Michael,
>> >>> >> >
>> >>> >> >The continuous variable I have measures quantities (down to the 3rd
>> >>> >> >decimal level), so unfortunately the values are not frequencies.
>> >>> >> >
>> >>> >> >Any more specific suggestions on how that could be tackled?
>> >>> >> >
>> >>> >> >Thanks & kind regards,
>> >>> >> >
>> >>> >> >Luca
>> >>> >> >
>> >>> >> >
>> >>> >> >==
>> >>> >> >
>> >>> >> >Michael Friendly wrote:
>> >>> >> >I'm not sure I understand completely what you want to do, but
>> >>> >> >if the data were frequencies, it sounds like a task for fitting a
>> >>> >> >loglinear model with the model formula
>> >>> >> >
>> >>> >> >~ V1*V2 + V3
>> >>> >> >
>> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote:
>> >>> >> >> Hello,
>> >>> >> >>
>> >>> >> >> I am facing a quite challenging task (at least to me) and I was
>> >>> >> >> wondering if someone could advise how R could assist me to speed the
>> >>> >> >> task up.
>> >>> >> >>
>> >>> >> >> I am dealing with a dataset with 3 discrete variables and one
>> >>> >> >> continuous variable.
>> >>> >> >> The discrete variables are:
>> >>> >> >>
>> >>> >> >> V1: 8 modalities
>> >>> >> >> V2: 13 modalities
>> >>> >> >> V3: 13 modalities
>> >>> >> >>
>> >>> >> >> The continuous variable V4 is a decimal number always greater than
>> >>> >> >> zero in the marginals of each of the 3 variables, but it is sometimes
>> >>> >> >> equal to zero (and sometimes negative) in the joint tables.
>> >>> >> >>
>> >>> >> >> I have got 2 files:
>> >>> >> >>
>> >>> >> >> => one with the distribution of all possible combinations of V1xV2
>> >>> >> >> (some of which are zero or negative) and
>> >>> >> >> => one with the marginal distribution of V3.
>> >>> >> >>
>> >>> >> >> I am trying to build the long and narrow dataset V1xV2xV3 in such a
>> >>> >> >> way that each V1xV2 cell does not get modified and V3 fits as closely
>> >>> >> >> as possible to its marginal distribution. Does it make sense?
>> >>> >> >>
>> >>> >> >> To be even more specific, my 2 input files look like the following.
>> >>> >> >>
>> >>> >> >> FILE 1
>> >>> >> >> V1,V2,V4
>> >>> >> >> A, A, 24.251
>> >>> >> >> A, B, 1.065
>> >>> >> >> (...)
>> >>> >> >> B, C, 0.294
>> >>> >> >> B, D, 2.731
>> >>> >> >> (...)
>> >>> >> >> H, L, 0.345
>> >>> >> >> H, M, 0.000
>> >>> >> >>
>> >>> >> >> FILE 2
>> >>> >> >> V3, V4
>> >>> >> >> A, 1.575
>> >>> >> >> B, 4.294
>> >>> >> >> C, 10.044
>> >>> >> >> (...)
>> >>> >> >> L, 5.123
>> >>> >> >> M, 3.334
>> >>> >> >>
>> >>> >> >> What I need to achieve is a file such as the following
>> >>> >> >>
>> >>> >> >> FILE 3
>> >>> >> >> V1, V2, V3, V4
>> >>> >> >> A, A, A, ???
>> >>> >> >> A, A, B, ???
>> >>> >> >> (...)
>> >>> >> >> D, D, E, ???
>> >>> >> >> D, D, F, ???
>> >>> >> >> (...)
>> >>> >> >> H, M, L, ???
>> >>> >> >> H, M, M, ???
>> >>> >> >>
>> >>> >> >> Please notice that FILE 3 needs to be such that if I aggregate on
>> >>> >> >> V1+V2 I recover exactly FILE 1 and that if I aggregate on V3 I can
>> >>> >> >> recover a file as close as possible to FILE 2 (ideally the same file).
>> >>> >> >>
>> >>> >> >> Can anyone suggest how I could do that with R?
>> >>> >> >>
>> >>> >> >> Thank you very much indeed for any assistance you are able to provide.
>> >>> >> >>
>> >>> >> >> Kind regards,
>> >>> >> >>
>> >>> >> >> Luca
>> >>> >> >
>> >>> >> > [[alternative HTML version deleted]]
>> >>> >> >
>> >>> >> >______________________________________________
>> >>> >> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> >> >PLEASE do read the posting guide
>> >>> >> >http://www.R-project.org/posting-guide.html
>> >>> >> >and provide commented, minimal, self-contained, reproducible code.
>> >>> >>
>> >>> >>
>> >>> >
>> >>
>> >>
>> >
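The request quoted above (change the v3 marginal of v4 while leaving every v1xv2 sum untouched) can be illustrated with a minimal sketch. This is not the adjustment used to build the hand-made f2 in the thread, and it is only one of many possible adjustments: it assumes a balanced layout (each v1xv2 cell contains every v3 level equally often) and a target marginal with the same grand total as the data. The target values and the names `current`, `target`, `n_rows` and `shift` are hypothetical, chosen for illustration only.

```r
# Extract of the data posted in the thread (v4 rounded to 5 decimals).
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), times = 4),
  v3 = rep(rep(c("B", "C"), each = 3), times = 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0),
  stringsAsFactors = FALSE
)

# Assumption: balanced layout, i.e. every v1 x v2 cell holds each v3 level
# the same number of times. The shortcut below is only valid in that case.
counts <- table(paste(f1$v1, f1$v2), f1$v3)
stopifnot(length(unique(as.vector(counts))) == 1)

# Hypothetical target for the v3 marginal: move 0.5 units from level B to C.
# The target must keep the grand total of v4 unchanged.
current <- tapply(f1$v4, f1$v3, sum)
target  <- current + c(B = -0.5, C = 0.5)
stopifnot(isTRUE(all.equal(sum(target), sum(current))))

# Spread the change required for each v3 level evenly over its rows.
n_rows <- tapply(f1$v4, f1$v3, length)
shift  <- as.numeric((target - current) / n_rows)

f2 <- f1
f2$v4 <- f1$v4 + shift[match(f1$v3, names(current))]

# The v3 marginal now matches the target ...
aggregate(v4 ~ v3, f2, sum)
# ... while the v1 x v2 sums are the same before and after.
aggregate(v4 ~ v1 + v2, f1, sum)
aggregate(v4 ~ v1 + v2, f2, sum)
```

Because the shifts sum to zero within each v1xv2 cell, the cell totals are preserved. Note that an additive shift like this can push small values below zero, which the original question allows for the joint table but which may not suit other data.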