Nonsense. You are not telling us something, or I have failed to understand something.

Consider:

v1 = c("a","b")
v2 = c("a","a")

It is not possible to change the sum of the values corresponding to v2="a" without also changing the sums for v1, which are not supposed to change according to my understanding of your specification.

So I'm done.

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
Clifford Stoll

On Sun, Mar 22, 2015 at 8:28 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
> Sorry, forgot to keep the rest of the group in the loop - Luca
>
> ---------- Forwarded message ----------
> From: Luca Meyer <lucam1968 at gmail.com>
> Date: 2015-03-22 16:27 GMT+01:00
> Subject: Re: [R] Joining two datasets - recursive procedure?
> To: Bert Gunter <gunter.berton at gene.com>
>
> Hi Bert,
>
> That is exactly what I am trying to achieve. Please notice that negative
> v4 values are allowed. I have done a similar task in the past manually,
> by recursively altering the v4 distribution across v3 categories within
> each fixed v1&v2 combination, so I am quite positive it can be achieved.
> Honestly, it took me forever to do it manually, and since this is an
> exercise I am likely to repeat from time to time, I wish I could learn
> how to do it programmatically...
>
> Thanks again for any further suggestion you might have,
>
> Luca
>
> 2015-03-22 16:05 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>
>> Oh, wait a minute ...
>>
>> You still want the marginals for the other columns to be as they were
>> originally?
>>
>> If so, then this is impossible in general, as the sum of all the values
>> must be what it was originally, and you cannot therefore choose your
>> values for V3 arbitrarily.
>>
>> Or at least, that seems to be what you are trying to do.
>>
>> -- Bert
>>
>> On Sun, Mar 22, 2015 at 7:55 AM, Bert Gunter <bgunter at gene.com> wrote:
>> > I would have thought that this is straightforward given my previous
>> > email...
>> >
>> > Just set z to what you want -- e.g., all B values to 29/number of B's,
>> > and all C values to 2.567/number of C's (etc. for more categories).
>> >
>> > A slick but sort-of-cheat way to do this programmatically -- in the
>> > sense that it relies on the implementation of factor() rather than its
>> > API -- is:
>> >
>> > y <- f1$v3 ## to simplify the notation; could be done using with()
>> > z <- (c(29,2.567)/table(y))[c(y)]
>> >
>> > Then proceed to z1 as I previously described.
>> >
>> > -- Bert
>> >
>> > On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >> Hi Bert, hello R-experts,
>> >>
>> >> I am close to a solution but I still need one hint w.r.t.
>> >> the following procedure (available also from
>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>> >>
>> >> rm(list=ls())
>> >>
>> >> # this is (an extract of) the INPUT file I have:
>> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B",
>> >> "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B",
>> >> "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917,
>> >> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872, 0)),
>> >> .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >> row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L,
>> >> 204L, 206L))
>> >>
>> >> # this is the procedure that Bert suggested (slightly adjusted):
>> >> z <- rnorm(nrow(f1)) ## or anything you want
>> >> z1 <- round(with(f1, v4 + z - ave(z, v1, v2, FUN = mean)), digits = 5)
>> >> aggregate(v4 ~ v1 * v2, f1, sum)
>> >> aggregate(z1 ~ v1 * v2, f1, sum)
>> >> aggregate(v4 ~ v3, f1, sum)
>> >> aggregate(z1 ~ v3, f1, sum)
>> >>
>> >> My question to you is: how can I set z so that I can obtain specific
>> >> values for z1 - v4 in the v3 aggregation? In other words, how can I
>> >> configure the procedure so that e.g. B=29 and C=2.56723 after running:
>> >> aggregate(z1 ~ v3, f1, sum)
>> >>
>> >> Thank you,
>> >>
>> >> Luca
>> >>
>> >> PS: to avoid any doubts you might have about who I am, the following
>> >> is my web page: http://lucameyer.wordpress.com/
>> >>
>> >> 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>> >>>
>> >>> ... or cleaner:
>> >>>
>> >>> z1 <- with(f1, v4 + z - ave(z, v1, v2, FUN = mean))
>> >>>
>> >>> Just out of curiosity, was this homework? (In which case I should
>> >>> probably not have provided you an answer -- that is, assuming that I
>> >>> HAVE provided an answer.)
>> >>>
>> >>> Cheers,
>> >>> Bert
>> >>>
>> >>> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote:
>> >>> > z <- rnorm(nrow(f1)) ## or anything you want
>> >>> > z1 <- f1$v4 + z - with(f1, ave(z, v1, v2, FUN = mean))
>> >>> >
>> >>> > aggregate(v4 ~ v1, f1, sum)
>> >>> > aggregate(z1 ~ v1, f1, sum)
>> >>> > aggregate(v4 ~ v2, f1, sum)
>> >>> > aggregate(z1 ~ v2, f1, sum)
>> >>> > aggregate(v4 ~ v3, f1, sum)
>> >>> > aggregate(z1 ~ v3, f1, sum)
>> >>> >
>> >>> > Cheers,
>> >>> > Bert
>> >>> >
>> >>> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >> Hi Bert,
>> >>> >>
>> >>> >> Thank you for your message. I am looking into ave() and tapply()
>> >>> >> as you suggested, but in the meantime I have prepared an example
>> >>> >> of input and output files, just in case you or someone else would
>> >>> >> like to make an attempt to generate code that goes from input to
>> >>> >> output.
>> >>> >>
>> >>> >> Please see below or download it from
>> >>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>> >>> >>
>> >>> >> # this is (an extract of) the INPUT file I have:
>> >>> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B",
>> >>> >> "B", "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C",
>> >>> >> "A", "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C",
>> >>> >> "C", "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,
>> >>> >> 0.00273, 1.42917, 1.05786, 0.00042, 2.37232, 3.01835, 0,
>> >>> >> 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"),
>> >>> >> class = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L,
>> >>> >> 158L, 165L, 167L, 197L, 204L, 206L))
>> >>> >>
>> >>> >> # this is (an extract of) the OUTPUT file I would like to obtain:
>> >>> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B",
>> >>> >> "B", "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C",
>> >>> >> "A", "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C",
>> >>> >> "C", "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,
>> >>> >> 0.00295, 1.77918, 1.05786, 0.0002, 2.37232, 3.01835, 0,
>> >>> >> 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"),
>> >>> >> class = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L,
>> >>> >> 158L, 165L, 167L, 197L, 204L, 206L))
>> >>> >>
>> >>> >> # please notice that while the aggregated v4 on v3 has changed ...
>> >>> >> aggregate(f1[,c("v4")], list(f1$v3), sum)
>> >>> >> aggregate(f2[,c("v4")], list(f2$v3), sum)
>> >>> >>
>> >>> >> # ... the aggregated v4 over v1xv2 has remained unchanged:
>> >>> >> aggregate(f1[,c("v4")], list(f1$v1, f1$v2), sum)
>> >>> >> aggregate(f2[,c("v4")], list(f2$v1, f2$v2), sum)
>> >>> >>
>> >>> >> Thank you very much in advance for your assistance.
>> >>> >>
>> >>> >> Luca
>> >>> >>
>> >>> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>> >>> >>>
>> >>> >>> 1.
>> >>> >>> Still not sure what you mean, but maybe look at ?ave and
>> >>> >>> ?tapply, for which ave() is a wrapper.
>> >>> >>>
>> >>> >>> 2. You still need to heed the rest of Jeff's advice.
>> >>> >>>
>> >>> >>> Cheers,
>> >>> >>> Bert
>> >>> >>>
>> >>> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >>> >>> > Hi Jeff & other R-experts,
>> >>> >>> >
>> >>> >>> > Thank you for your note. I have tried myself to solve the
>> >>> >>> > issue, without success.
>> >>> >>> >
>> >>> >>> > Following your suggestion, I am providing a sample of the
>> >>> >>> > dataset I am using below (also downloadable in plain text from
>> >>> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>> >>> >>> >
>> >>> >>> > # this is an extract of the overall dataset (n=1200 cases)
>> >>> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B",
>> >>> >>> > "B", "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C",
>> >>> >>> > "A", "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C",
>> >>> >>> > "C", "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
>> >>> >>> > 3.43806581506388, 0.002733567617055, 1.42917483425029,
>> >>> >>> > 1.05786640463504, 0.000420548864162308, 2.37232740842861,
>> >>> >>> > 3.01835841813241, 0, 1.13430282139936, 0.928725667117666,
>> >>> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>> >>> >>> > row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L,
>> >>> >>> > 197L, 204L, 206L))
>> >>> >>> >
>> >>> >>> > I need to find an automated procedure that allows me to adjust v3
>> >>> >>> > marginals while maintaining the v1xv2 marginals unchanged.
>> >>> >>> >
>> >>> >>> > That is: modify the v4 values you can find by running:
>> >>> >>> >
>> >>> >>> > aggregate(f1[,c("v4")], list(f1$v3), sum)
>> >>> >>> >
>> >>> >>> > while keeping constant the values you can find by running:
>> >>> >>> >
>> >>> >>> > aggregate(f1[,c("v4")], list(f1$v1, f1$v2), sum)
>> >>> >>> >
>> >>> >>> > Now does it make sense?
>> >>> >>> >
>> >>> >>> > Please notice I have tried to build some syntax that modifies
>> >>> >>> > values within each v1xv2 combination by computing the sum of
>> >>> >>> > v4 and the row percentage in terms of v4, and that is where my
>> >>> >>> > effort is blocked. Not really sure how I should proceed. Any
>> >>> >>> > suggestion?
>> >>> >>> >
>> >>> >>> > Thanks,
>> >>> >>> >
>> >>> >>> > Luca
>> >>> >>> >
>> >>> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>> >>> >>> >
>> >>> >>> >> I don't understand your description. The standard practice on
>> >>> >>> >> this list is to provide a reproducible R example [1] of the
>> >>> >>> >> kind of data you are working with (and any code you have
>> >>> >>> >> tried) to go along with your description. In this case, that
>> >>> >>> >> would be two dputs of your input data frames and a dput of an
>> >>> >>> >> output data frame (generated by hand from your input data
>> >>> >>> >> frame). (Probably best not to use the full number of input
>> >>> >>> >> values, just to keep the size down.) We could then make an
>> >>> >>> >> attempt to generate code that goes from input to output.
>> >>> >>> >> >> >>> >>> >> Of course, if you post that hard work using HTML then it will >> get >> >>> >>> >> corrupted (much like the text below from your earlier emails) >> and >> >>> >>> >> we >> >>> >>> >> won't >> >>> >>> >> be able to use it. Please learn to post from your email software >> >>> >>> >> using >> >>> >>> >> plain text when corresponding with this mailing list. >> >>> >>> >> >> >>> >>> >> [1] >> >>> >>> >> >> >>> >>> >> >> >>> >>> >> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example >> >>> >>> >> >> >>> >>> >> >> >>> >>> >> >> --------------------------------------------------------------------------- >> >>> >>> >> Jeff Newmiller The ..... >> ..... Go >> >>> >>> >> Live... >> >>> >>> >> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. >> >>> >>> >> Live >> >>> >>> >> Go... >> >>> >>> >> Live: OO#.. Dead: OO#.. >> >>> >>> >> Playing >> >>> >>> >> Research Engineer (Solar/Batteries O.O#. #.O#. >> >>> >>> >> with >> >>> >>> >> /Software/Embedded Controllers) .OO#. .OO#. >> >>> >>> >> rocks...1k >> >>> >>> >> >> >>> >>> >> >> >>> >>> >> >> --------------------------------------------------------------------------- >> >>> >>> >> Sent from my phone. Please excuse my brevity. >> >>> >>> >> >> >>> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer < >> lucam1968 at gmail.com> >> >>> >>> >> wrote: >> >>> >>> >> >Thanks for you input Michael, >> >>> >>> >> > >> >>> >>> >> >The continuous variable I have measures quantities (down to the >> >>> >>> >> > 3rd >> >>> >>> >> >decimal level) so unfortunately are not frequencies. >> >>> >>> >> > >> >>> >>> >> >Any more specific suggestions on how that could be tackled? 
>> >>> >>> >> > >> >>> >>> >> >Thanks & kind regards, >> >>> >>> >> > >> >>> >>> >> >Luca >> >>> >>> >> > >> >>> >>> >> > >> >>> >>> >> >==>> >>> >>> >> > >> >>> >>> >> >Michael Friendly wrote: >> >>> >>> >> >I'm not sure I understand completely what you want to do, but >> >>> >>> >> >if the data were frequencies, it sounds like task for fitting a >> >>> >>> >> >loglinear model with the model formula >> >>> >>> >> > >> >>> >>> >> >~ V1*V2 + V3 >> >>> >>> >> > >> >>> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote: >> >>> >>> >> >>* Hello, >> >>> >>> >> >*>>* I am facing a quite challenging task (at least to me) and >> I >> >>> >>> >> > was >> >>> >>> >> >wondering >> >>> >>> >> >*>* if someone could advise how R could assist me to speed the >> >>> >>> >> > task >> >>> >>> >> > up. >> >>> >>> >> >*>>* I am dealing with a dataset with 3 discrete variables and >> one >> >>> >>> >> >continuous >> >>> >>> >> >*>* variable. The discrete variables are: >> >>> >>> >> >*>>* V1: 8 modalities >> >>> >>> >> >*>* V2: 13 modalities >> >>> >>> >> >*>* V3: 13 modalities >> >>> >>> >> >*>>* The continuous variable V4 is a decimal number always >> greater >> >>> >>> >> > than >> >>> >>> >> >zero in >> >>> >>> >> >*>* the marginals of each of the 3 variables but it is >> sometimes >> >>> >>> >> > equal >> >>> >>> >> >to zero >> >>> >>> >> >*>* (and sometimes negative) in the joint tables. >> >>> >>> >> >*>>* I have got 2 files: >> >>> >>> >> >*>>* => one with distribution of all possible combinations of >> >>> >>> >> > V1xV2 >> >>> >>> >> >(some of >> >>> >>> >> >*>* which are zero or neagtive) and >> >>> >>> >> >*>* => one with the marginal distribution of V3. >> >>> >>> >> >*>>* I am trying to build the long and narrow dataset V1xV2xV3 >> in >> >>> >>> >> > such >> >>> >>> >> >a way >> >>> >>> >> >*>* that each V1xV2 cell does not get modified and V3 fits as >> >>> >>> >> > closely >> >>> >>> >> >as >> >>> >>> >> >*>* possible to its marginal distribution. Does it make sense? 
>> >>> >>> >> >*>>* To be even more specific, my 2 input files look like the >> >>> >>> >> >following. >> >>> >>> >> >*>>* FILE 1 >> >>> >>> >> >*>* V1,V2,V4 >> >>> >>> >> >*>* A, A, 24.251 >> >>> >>> >> >*>* A, B, 1.065 >> >>> >>> >> >*>* (...) >> >>> >>> >> >*>* B, C, 0.294 >> >>> >>> >> >*>* B, D, 2.731 >> >>> >>> >> >*>* (...) >> >>> >>> >> >*>* H, L, 0.345 >> >>> >>> >> >*>* H, M, 0.000 >> >>> >>> >> >*>>* FILE 2 >> >>> >>> >> >*>* V3, V4 >> >>> >>> >> >*>* A, 1.575 >> >>> >>> >> >*>* B, 4.294 >> >>> >>> >> >*>* C, 10.044 >> >>> >>> >> >*>* (...) >> >>> >>> >> >*>* L, 5.123 >> >>> >>> >> >*>* M, 3.334 >> >>> >>> >> >*>>* What I need to achieve is a file such as the following >> >>> >>> >> >*>>* FILE 3 >> >>> >>> >> >*>* V1, V2, V3, V4 >> >>> >>> >> >*>* A, A, A, ??? >> >>> >>> >> >*>* A, A, B, ??? >> >>> >>> >> >*>* (...) >> >>> >>> >> >*>* D, D, E, ??? >> >>> >>> >> >*>* D, D, F, ??? >> >>> >>> >> >*>* (...) >> >>> >>> >> >*>* H, M, L, ??? >> >>> >>> >> >*>* H, M, M, ??? >> >>> >>> >> >*>>* Please notice that FILE 3 need to be such that if I >> aggregate >> >>> >>> >> > on >> >>> >>> >> >V1+V2 I >> >>> >>> >> >*>* recover exactly FILE 1 and that if I aggregate on V3 I can >> >>> >>> >> > recover >> >>> >>> >> >a file >> >>> >>> >> >*>* as close as possible to FILE 3 (ideally the same file). >> >>> >>> >> >*>>* Can anyone suggest how I could do that with R? >> >>> >>> >> >*>>* Thank you very much indeed for any assistance you are >> able to >> >>> >>> >> >provide. 
>> >>> >>> >> >*>>* Kind regards, >> >>> >>> >> >*>>* Luca* >> >>> >>> >> > >> >>> >>> >> > [[alternative HTML version deleted]] >> >>> >>> >> > >> >>> >>> >> >______________________________________________ >> >>> >>> >> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, >> see >> >>> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help >> >>> >>> >> >PLEASE do read the posting guide >> >>> >>> >> >http://www.R-project.org/posting-guide.html >> >>> >>> >> >and provide commented, minimal, self-contained, reproducible >> code. >> >>> >>> >> >> >>> >>> >> >> >>> >>> > >> >>> >>> > [[alternative HTML version deleted]] >> >>> >>> > >> >>> >>> > ______________________________________________ >> >>> >>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, >> see >> >>> >>> > https://stat.ethz.ch/mailman/listinfo/r-help >> >>> >>> > PLEASE do read the posting guide >> >>> >>> > http://www.R-project.org/posting-guide.html >> >>> >>> > and provide commented, minimal, self-contained, reproducible >> code. >> >>> >> >> >>> >> >> >> >> >> >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Hi Bert,

Maybe I did not explain myself clearly enough, but let me show you with a
manual example that what I would like to do is indeed feasible. The
following is also available for download from
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0

rm(list=ls())

# This is, as usual, (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
"B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
"C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
2.37232, 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3",
"v4"), class = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L,
158L, 165L, 167L, 197L, 204L, 206L))

# These are the initial marginal distributions:
aggregate(v4 ~ v1 * v2, f1, sum)
aggregate(v4 ~ v3, f1, sum)

# First I order the file so that the 6 distinct v1xv2 combinations are
# nicely listed:
f1 <- f1[order(f1$v1, f1$v2),]

# Then I compute (manually) the relative importance of each v1xv2
# combination as its share of the overall v4 total, 31.56723 = sum(f1$v4):
tAA <- (18.18530+1.42917)/31.56723 # combination v1=A & v2=A
tAB <- (3.43806+1.05786)/31.56723  # combination v1=A & v2=B
tAC <- (0.00273+0.00042)/31.56723  # combination v1=A & v2=C
tBA <- (2.37232+1.13430)/31.56723  # combination v1=B & v2=A
tBB <- (3.01835+0.92872)/31.56723  # combination v1=B & v2=B
tBC <-
(0.00000+0.00000)/31.56723 # combination v1=B & v2=C

# and just to make sure I have not made mistakes, the following should be
# equal to 1:
tAA+tAB+tAC+tBA+tBB+tBC

Next, I know I need to increase v4 whenever v3=B, and the total increase I
need over the whole dataset is 29-27.01676=1.98324. In turn, I need to
diminish v4 whenever v3=C by the same amount (4.55047-2.56723=1.98324).
This aspect was perhaps not clear at first: I need to move v4 across v3
categories, but the overall total always remains unchanged.

Since I want the data alteration to be proportional to the v1xv2
combinations, I do the following:

f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="A" & f1$v3=="B", f1$v4+(tAA*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="A" & f1$v3=="C", f1$v4-(tAA*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="B" & f1$v3=="B", f1$v4+(tAB*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="B" & f1$v3=="C", f1$v4-(tAB*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="C" & f1$v3=="B", f1$v4+(tAC*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="C" & f1$v3=="C", f1$v4-(tAC*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="A" & f1$v3=="B", f1$v4+(tBA*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="A" & f1$v3=="C", f1$v4-(tBA*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="B" & f1$v3=="B", f1$v4+(tBB*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="B" & f1$v3=="C", f1$v4-(tBB*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="C" & f1$v3=="B", f1$v4+(tBC*1.98324), f1$v4)
f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="C" & f1$v3=="C", f1$v4-(tBC*1.98324), f1$v4)

# These are the final marginal distributions:
aggregate(v4 ~ v1 * v2, f1, sum)
aggregate(v4 ~ v3, f1, sum)

Can this procedure be made programmatic so that I can run it on the full
(8x13x13) categories matrix? If so, how would you do it? I am having a
really hard time doing it with some (semi)automatic procedure.
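[Editorial note: the manual reallocation above can be collapsed into a few vectorized lines with ave(). This sketch is not from the original thread; it assumes, as in the example, exactly two v3 categories ("B" gains what "C" loses), one row per v3 category within each v1 x v2 cell, and the example's target sum of 29 for v3 == "B":]

```r
# Programmatic version of the manual procedure above (assumptions: two v3
# categories, one row each per v1 x v2 cell; target B-sum 29 from the
# example). f1 is rebuilt from the dput quoted in the message.
f1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("A", "B", "C"), 4),
  v3 = rep(rep(c("B", "C"), each = 3), 2),
  v4 = c(18.18530, 3.43806, 0.00273, 1.42917, 1.05786, 0.00042,
         2.37232, 3.01835, 0, 1.13430, 0.92872, 0)
)

delta <- 29 - sum(f1$v4[f1$v3 == "B"])    # total amount to move into B

# per-row weight = share of the overall v4 total held by the row's
# v1 x v2 cell (the tAA, tAB, ... computed by hand above)
w <- ave(f1$v4, f1$v1, f1$v2, FUN = sum) / sum(f1$v4)

# add w*delta to each B row and remove it from the matching C row; the
# net change within every cell is zero, so the v1 x v2 sums are untouched
f1$v4 <- f1$v4 + ifelse(f1$v3 == "B", 1, -1) * w * delta

aggregate(v4 ~ v1 * v2, f1, sum)  # unchanged
aggregate(v4 ~ v3, f1, sum)       # B ~ 29, C ~ 2.56723
```

[Because the weights sum to 1 over the cells and each cell contributes one B row, the B marginal lands exactly on the target; the same lines scale unchanged to the full 8x13x13 data as long as the two assumptions hold.]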
Thank you very much indeed once more :)

Luca

2015-03-22 18:32 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
> Nonsense. You are not telling us something or I have failed to
> understand something.
>
> [remainder of the earlier thread, quoted in full above, snipped]
> >> >>> >>> >> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. > ##.#. > >> >>> >>> >> Live > >> >>> >>> >> Go... > >> >>> >>> >> Live: OO#.. Dead: > OO#.. > >> >>> >>> >> Playing > >> >>> >>> >> Research Engineer (Solar/Batteries O.O#. > #.O#. > >> >>> >>> >> with > >> >>> >>> >> /Software/Embedded Controllers) .OO#. > .OO#. > >> >>> >>> >> rocks...1k > >> >>> >>> >> > >> >>> >>> >> > >> >>> >>> >> > >> > --------------------------------------------------------------------------- > >> >>> >>> >> Sent from my phone. Please excuse my brevity. > >> >>> >>> >> > >> >>> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer < > >> lucam1968 at gmail.com> > >> >>> >>> >> wrote: > >> >>> >>> >> >Thanks for you input Michael, > >> >>> >>> >> > > >> >>> >>> >> >The continuous variable I have measures quantities (down to > the > >> >>> >>> >> > 3rd > >> >>> >>> >> >decimal level) so unfortunately are not frequencies. > >> >>> >>> >> > > >> >>> >>> >> >Any more specific suggestions on how that could be tackled? > >> >>> >>> >> > > >> >>> >>> >> >Thanks & kind regards, > >> >>> >>> >> > > >> >>> >>> >> >Luca > >> >>> >>> >> > > >> >>> >>> >> > > >> >>> >>> >> >==> >> >>> >>> >> > > >> >>> >>> >> >Michael Friendly wrote: > >> >>> >>> >> >I'm not sure I understand completely what you want to do, > but > >> >>> >>> >> >if the data were frequencies, it sounds like task for > fitting a > >> >>> >>> >> >loglinear model with the model formula > >> >>> >>> >> > > >> >>> >>> >> >~ V1*V2 + V3 > >> >>> >>> >> > > >> >>> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote: > >> >>> >>> >> >>* Hello, > >> >>> >>> >> >*>>* I am facing a quite challenging task (at least to me) > and > >> I > >> >>> >>> >> > was > >> >>> >>> >> >wondering > >> >>> >>> >> >*>* if someone could advise how R could assist me to speed > the > >> >>> >>> >> > task > >> >>> >>> >> > up. 
> >> >>> >>> >> >*>>* I am dealing with a dataset with 3 discrete variables > and > >> one > >> >>> >>> >> >continuous > >> >>> >>> >> >*>* variable. The discrete variables are: > >> >>> >>> >> >*>>* V1: 8 modalities > >> >>> >>> >> >*>* V2: 13 modalities > >> >>> >>> >> >*>* V3: 13 modalities > >> >>> >>> >> >*>>* The continuous variable V4 is a decimal number always > >> greater > >> >>> >>> >> > than > >> >>> >>> >> >zero in > >> >>> >>> >> >*>* the marginals of each of the 3 variables but it is > >> sometimes > >> >>> >>> >> > equal > >> >>> >>> >> >to zero > >> >>> >>> >> >*>* (and sometimes negative) in the joint tables. > >> >>> >>> >> >*>>* I have got 2 files: > >> >>> >>> >> >*>>* => one with distribution of all possible combinations > of > >> >>> >>> >> > V1xV2 > >> >>> >>> >> >(some of > >> >>> >>> >> >*>* which are zero or neagtive) and > >> >>> >>> >> >*>* => one with the marginal distribution of V3. > >> >>> >>> >> >*>>* I am trying to build the long and narrow dataset > V1xV2xV3 > >> in > >> >>> >>> >> > such > >> >>> >>> >> >a way > >> >>> >>> >> >*>* that each V1xV2 cell does not get modified and V3 fits > as > >> >>> >>> >> > closely > >> >>> >>> >> >as > >> >>> >>> >> >*>* possible to its marginal distribution. Does it make > sense? > >> >>> >>> >> >*>>* To be even more specific, my 2 input files look like > the > >> >>> >>> >> >following. > >> >>> >>> >> >*>>* FILE 1 > >> >>> >>> >> >*>* V1,V2,V4 > >> >>> >>> >> >*>* A, A, 24.251 > >> >>> >>> >> >*>* A, B, 1.065 > >> >>> >>> >> >*>* (...) > >> >>> >>> >> >*>* B, C, 0.294 > >> >>> >>> >> >*>* B, D, 2.731 > >> >>> >>> >> >*>* (...) > >> >>> >>> >> >*>* H, L, 0.345 > >> >>> >>> >> >*>* H, M, 0.000 > >> >>> >>> >> >*>>* FILE 2 > >> >>> >>> >> >*>* V3, V4 > >> >>> >>> >> >*>* A, 1.575 > >> >>> >>> >> >*>* B, 4.294 > >> >>> >>> >> >*>* C, 10.044 > >> >>> >>> >> >*>* (...) 
> >> >>> >>> >> >*>* L, 5.123 > >> >>> >>> >> >*>* M, 3.334 > >> >>> >>> >> >*>>* What I need to achieve is a file such as the following > >> >>> >>> >> >*>>* FILE 3 > >> >>> >>> >> >*>* V1, V2, V3, V4 > >> >>> >>> >> >*>* A, A, A, ??? > >> >>> >>> >> >*>* A, A, B, ??? > >> >>> >>> >> >*>* (...) > >> >>> >>> >> >*>* D, D, E, ??? > >> >>> >>> >> >*>* D, D, F, ??? > >> >>> >>> >> >*>* (...) > >> >>> >>> >> >*>* H, M, L, ??? > >> >>> >>> >> >*>* H, M, M, ??? > >> >>> >>> >> >*>>* Please notice that FILE 3 need to be such that if I > >> aggregate > >> >>> >>> >> > on > >> >>> >>> >> >V1+V2 I > >> >>> >>> >> >*>* recover exactly FILE 1 and that if I aggregate on V3 I > can > >> >>> >>> >> > recover > >> >>> >>> >> >a file > >> >>> >>> >> >*>* as close as possible to FILE 3 (ideally the same file). > >> >>> >>> >> >*>>* Can anyone suggest how I could do that with R? > >> >>> >>> >> >*>>* Thank you very much indeed for any assistance you are > >> able to > >> >>> >>> >> >provide. > >> >>> >>> >> >*>>* Kind regards, > >> >>> >>> >> >*>>* Luca* > >> >>> >>> >> > > >> >>> >>> >> > [[alternative HTML version deleted]] > >> >>> >>> >> > > >> >>> >>> >> >______________________________________________ > >> >>> >>> >> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and > more, > >> see > >> >>> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help > >> >>> >>> >> >PLEASE do read the posting guide > >> >>> >>> >> >http://www.R-project.org/posting-guide.html > >> >>> >>> >> >and provide commented, minimal, self-contained, reproducible > >> code. 
> >> >>> >>> >> > >> >>> >>> >> > >> >>> >>> > > >> >>> >>> > [[alternative HTML version deleted]] > >> >>> >>> > > >> >>> >>> > ______________________________________________ > >> >>> >>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, > >> see > >> >>> >>> > https://stat.ethz.ch/mailman/listinfo/r-help > >> >>> >>> > PLEASE do read the posting guide > >> >>> >>> > http://www.R-project.org/posting-guide.html > >> >>> >>> > and provide commented, minimal, self-contained, reproducible > >> code. > >> >>> >> > >> >>> >> > >> >> > >> >> > >> > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
On Mar 22, 2015, at 1:12 PM, Luca Meyer wrote:

> Hi Bert,
>
> Maybe I did not explain myself clearly enough. Let me show you with a
> manual example that what I would like to do is indeed feasible.
>
> The following is also available for download from
> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>
> rm(list=ls())
>
> This is, as usual, (an extract of) the INPUT file I have:
>
> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806, 0.00273, 1.42917,
> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
> row.names = c(2L,
> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>
> These are the initial marginal distributions:
>
> aggregate(v4~v1*v2,f1,sum)
> aggregate(v4~v3,f1,sum)
>
> First I order the file so that the 6 distinct v1xv2 combinations are
> listed together.
>
> f1 <- f1[order(f1$v1,f1$v2),]
>
> Then I compute (manually) the relative importance of each v1xv2 combination:
>
> tAA <-
> (18.18530+1.42917)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=A & v2=A
> tAB <-
> (3.43806+1.05786)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=A & v2=B
> tAC <-
> (0.00273+0.00042)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=A & v2=C
> tBA <-
> (2.37232+1.13430)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=B & v2=A
> tBB <-
> (3.01835+0.92872)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=B & v2=B
> tBC <-
> (0.00000+0.00000)/(18.18530+1.42917+3.43806+1.05786+0.00273+0.00042+2.37232+1.13430+3.01835+0.92872+0.00000+0.00000)
> # this is for combination v1=B & v2=C
>
> # and just to make sure I have not made mistakes, the following should be
> # equal to 1:
> tAA+tAB+tAC+tBA+tBB+tBC
>
> Next, I know I need to increase v4 whenever v3=B, and the total increase I
> need over the whole dataset is 29-27.01676=1.98324. In turn, I need
> to diminish v4 whenever v3=C by the same amount (4.55047-2.56723=1.98324).
> This aspect was perhaps not clear at first: I need to move v4 across v3
> categories, but the totals always remain unchanged.
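[The six shares computed by hand in the message above can be obtained in one step; a sketch, assuming the f1 data frame defined earlier in the thread:]

```r
# Per-cell totals of v4 over all v1 x v2 combinations, divided by the
# grand total; tt["A","A"] corresponds to tAA above, and so on.
tt <- tapply(f1$v4, list(f1$v1, f1$v2), sum) / sum(f1$v4)

sum(tt)  # the same sanity check as above: the shares should sum to 1
```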
>
> Since I want the data alteration to be proportional to the v1xv2
> combinations I do the following:
>
> f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="A" & f1$v3=="B", f1$v4+(tAA*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="A" & f1$v3=="C", f1$v4-(tAA*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="B" & f1$v3=="B", f1$v4+(tAB*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="B" & f1$v3=="C", f1$v4-(tAB*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="C" & f1$v3=="B", f1$v4+(tAC*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="A" & f1$v2=="C" & f1$v3=="C", f1$v4-(tAC*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="A" & f1$v3=="B", f1$v4+(tBA*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="A" & f1$v3=="C", f1$v4-(tBA*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="B" & f1$v3=="B", f1$v4+(tBB*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="B" & f1$v3=="C", f1$v4-(tBB*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="C" & f1$v3=="B", f1$v4+(tBC*1.98324), f1$v4)
> f1$v4 <- ifelse(f1$v1=="B" & f1$v2=="C" & f1$v3=="C", f1$v4-(tBC*1.98324), f1$v4)

Seems that this could be done a lot more simply with a lookup matrix and
ordinary indexing:

> lookarr <- array(NA,
+   dim=c(length(unique(f1$v1)), length(unique(f1$v2)), length(unique(f1$v3))),
+   dimnames=list(unique(f1$v1), unique(f1$v2), unique(f1$v3)))
> lookarr[] <- c(tAA,tAA,tAB,tAB,tAC,tAC,tBA,tBA,tBB, tBB, tBC, tBC)
> lookarr["A","B","C"]
[1] 0.1250369
> lookarr[ with(f1, cbind(v1, v2, v3)) ]
 [1] 6.213554e-01 1.110842e-01 1.424236e-01 1.250369e-01 9.978703e-05
 [6] 0.000000e+00 6.213554e-01 1.110842e-01 1.424236e-01 1.250369e-01
[11] 9.978703e-05 0.000000e+00
> f1$v4mod <- f1$v4*lookarr[ with(f1, cbind(v1,v2,v3)) ]
> f1
    v1 v2 v3       v4        v4mod
2    A  A  B 18.18530 1.129954e+01
41   A  A  C  1.42917 1.587582e-01
9    A  B  B  3.43806 4.896610e-01
48   A  B  C  1.05786 1.322716e-01
11   A  C  B  0.00273 2.724186e-07
50   A  C  C  0.00042 0.000000e+00
158  B  A  B  2.37232 1.474054e+00
197  B  A  C  1.13430 1.260028e-01
165  B  B  B  3.01835 4.298844e-01
204  B  B  C  0.92872 1.161243e-01
167  B  C  B  0.00000 0.000000e+00
206  B  C  C  0.00000 0.000000e+00

-- 
david.

> These are the final marginal distributions:
>
> aggregate(v4~v1*v2,f1,sum)
> aggregate(v4~v3,f1,sum)
>
> Can this procedure be made programmatic so that I can run it on the
> (8x13x13) categories matrix? If so, how would you do it? I have a really
> hard time doing it with some (semi)automatic procedure.
>
> Thank you very much indeed once more :)
>
> Luca
>
>
> 2015-03-22 18:32 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>
>> Nonsense. You are not telling us something or I have failed to
>> understand something.
>>
>> Consider:
>>
>> v1 = c("a","b")
>> v2 = c("a","a")
>>
>> It is not possible to change the value of a sum of values
>> corresponding to v2="a" without also changing that for v1, which is
>> not supposed to change according to my understanding of your
>> specification.
>>
>> So I'm done.
>>
>> -- Bert
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>> (650) 467-7374
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>> Clifford Stoll
>>
>>
>> On Sun, Mar 22, 2015 at 8:28 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>>> Sorry forgot to keep the rest of the group in the loop - Luca
>>> ---------- Forwarded message ----------
>>> From: Luca Meyer <lucam1968 at gmail.com>
>>> Date: 2015-03-22 16:27 GMT+01:00
>>> Subject: Re: [R] Joining two datasets - recursive procedure?
>>> To: Bert Gunter <gunter.berton at gene.com>
>>>
>>> Hi Bert,
>>>
>>> That is exactly what I am trying to achieve. Please notice that negative
>>> v4 values are allowed.
I have done a similar task in the past manually by >>> recursively alterating v4 distribution across v3 categories within fix >> each >>> v1&v2 combination so I am quite positive it can be achieved but honestly >> I >>> took me forever to do it manually and since this is likely to be an >>> exercise I need to repeat from time to time I wish I could learn how to >> do >>> it programmatically.... >>> >>> Thanks again for any further suggestion you might have, >>> >>> Luca >>> >>> >>> 2015-03-22 16:05 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: >>> >>>> Oh, wait a minute ... >>>> >>>> You still want the marginals for the other columns to be as originally? >>>> >>>> If so, then this is impossible in general as the sum of all the values >>>> must be what they were originally and you cannot therefore choose your >>>> values for V3 arbitrarily. >>>> >>>> Or at least, that seems to be what you are trying to do. >>>> >>>> -- Bert >>>> >>>> Bert Gunter >>>> Genentech Nonclinical Biostatistics >>>> (650) 467-7374 >>>> >>>> "Data is not information. Information is not knowledge. And knowledge >>>> is certainly not wisdom." >>>> Clifford Stoll >>>> >>>> >>>> >>>> >>>> On Sun, Mar 22, 2015 at 7:55 AM, Bert Gunter <bgunter at gene.com> wrote: >>>>> I would have thought that this is straightforward given my previous >>>> email... >>>>> >>>>> Just set z to what you want -- e,g, all B values to 29/number of B's, >>>>> and all C values to 2.567/number of C's (etc. for more categories). >>>>> >>>>> A slick but sort of cheat way to do this programmatically -- in the >>>>> sense that it relies on the implementation of factor() rather than its >>>>> API -- is: >>>>> >>>>> y <- f1$v3 ## to simplify the notation; could be done using with() >>>>> z <- (c(29,2.567)/table(y))[c(y)] >>>>> >>>>> Then proceed to z1 as I previously described >>>>> >>>>> -- Bert >>>>> >>>>> >>>>> Bert Gunter >>>>> Genentech Nonclinical Biostatistics >>>>> (650) 467-7374 >>>>> >>>>> "Data is not information. 
Information is not knowledge. And knowledge >>>>> is certainly not wisdom." >>>>> Clifford Stoll >>>>> >>>>> >>>>> >>>>> >>>>> On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 at gmail.com> >> wrote: >>>>>> Hi Bert, hello R-experts, >>>>>> >>>>>> I am close to a solution but I still need one hint w.r.t. the >> following >>>>>> procedure (available also from >>>>>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0) >>>>>> >>>>>> rm(list=ls()) >>>>>> >>>>>> # this is (an extract of) the INPUT file I have: >>>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", >> "B", >>>>>> "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", >> "A", >>>>>> "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", >> "C", >>>>>> "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917, 1.05786, 0.00042, >>>> 2.37232, >>>>>> 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", >> "v4"), >>>> class >>>>>> = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, >>>> 167L, >>>>>> 197L, 204L, 206L)) >>>>>> >>>>>> # this is the procedure that Bert suggested (slightly adjusted): >>>>>> z <- rnorm(nrow(f1)) ## or anything you want >>>>>> z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5) >>>>>> aggregate(v4~v1*v2,f1,sum) >>>>>> aggregate(z1~v1*v2,f1,sum) >>>>>> aggregate(v4~v3,f1,sum) >>>>>> aggregate(z1~v3,f1,sum) >>>>>> >>>>>> My question to you is: how can I set z so that I can obtain specific >>>> values >>>>>> for z1-v4 in the v3 aggregation? >>>>>> In other words, how can I configure the procedure so that e.g. B=29 >> and >>>>>> C=2.56723 after running the procedure: >>>>>> aggregate(z1~v3,f1,sum) >>>>>> >>>>>> Thank you, >>>>>> >>>>>> Luca >>>>>> >>>>>> PS: to avoid any doubts you might have about who I am the following >> is >>>> my >>>>>> web page: http://lucameyer.wordpress.com/ >>>>>> >>>>>> >>>>>> 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: >>>>>>> >>>>>>> ... 
or cleaner: >>>>>>> >>>>>>> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean)) >>>>>>> >>>>>>> >>>>>>> Just for curiosity, was this homework? (in which case I should >>>>>>> probably have not provided you an answer -- that is, assuming that I >>>>>>> HAVE provided an answer). >>>>>>> >>>>>>> Cheers, >>>>>>> Bert >>>>>>> >>>>>>> Bert Gunter >>>>>>> Genentech Nonclinical Biostatistics >>>>>>> (650) 467-7374 >>>>>>> >>>>>>> "Data is not information. Information is not knowledge. And >> knowledge >>>>>>> is certainly not wisdom." >>>>>>> Clifford Stoll >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> >> wrote: >>>>>>>> z <- rnorm(nrow(f1)) ## or anything you want >>>>>>>> z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean)) >>>>>>>> >>>>>>>> >>>>>>>> aggregate(v4~v1,f1,sum) >>>>>>>> aggregate(z1~v1,f1,sum) >>>>>>>> aggregate(v4~v2,f1,sum) >>>>>>>> aggregate(z1~v2,f1,sum) >>>>>>>> aggregate(v4~v3,f1,sum) >>>>>>>> aggregate(z1~v3,f1,sum) >>>>>>>> >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Bert >>>>>>>> >>>>>>>> Bert Gunter >>>>>>>> Genentech Nonclinical Biostatistics >>>>>>>> (650) 467-7374 >>>>>>>> >>>>>>>> "Data is not information. Information is not knowledge. And >> knowledge >>>>>>>> is certainly not wisdom." >>>>>>>> Clifford Stoll >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> >>>> wrote: >>>>>>>>> Hi Bert, >>>>>>>>> >>>>>>>>> Thank you for your message. I am looking into ave() and tapply() >> as >>>> you >>>>>>>>> suggested but at the same time I have prepared a example of input >>>> and >>>>>>>>> output >>>>>>>>> files, just in case you or someone else would like to make an >>>> attempt >>>>>>>>> to >>>>>>>>> generate a code that goes from input to output. 
>>>>>>>>> >>>>>>>>> Please see below or download it from >>>>>>>>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0 >>>>>>>>> >>>>>>>>> # this is (an extract of) the INPUT file I have: >>>>>>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", >> "B", >>>>>>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >>>>>>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >>>>>>>>> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, >>>>>>>>> 1.42917, >>>>>>>>> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872, >>>>>>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >>>>>>>>> row.names >>>>>>>>> c(2L, >>>>>>>>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) >>>>>>>>> >>>>>>>>> # this is (an extract of) the OUTPUT file I would like to obtain: >>>>>>>>> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", >> "B", >>>>>>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >>>>>>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >>>>>>>>> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,0.00295, >>>>>>>>> 1.77918, >>>>>>>>> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872, >>>>>>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >>>>>>>>> row.names >>>>>>>>> c(2L, >>>>>>>>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L)) >>>>>>>>> >>>>>>>>> # please notice that while the aggregated v4 on v3 has changed ? >>>>>>>>> aggregate(f1[,c("v4")],list(f1$v3),sum) >>>>>>>>> aggregate(f2[,c("v4")],list(f2$v3),sum) >>>>>>>>> >>>>>>>>> # ? the aggregated v4 over v1xv2 has remained unchanged: >>>>>>>>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) >>>>>>>>> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum) >>>>>>>>> >>>>>>>>> Thank you very much in advance for your assitance. 
>>>>>>>>> >>>>>>>>> Luca >>>>>>>>> >>>>>>>>> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>: >>>>>>>>>> >>>>>>>>>> 1. Still not sure what you mean, but maybe look at ?ave and >>>> ?tapply, >>>>>>>>>> for which ave() is a wrapper. >>>>>>>>>> >>>>>>>>>> 2. You still need to heed the rest of Jeff's advice. >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Bert >>>>>>>>>> >>>>>>>>>> Bert Gunter >>>>>>>>>> Genentech Nonclinical Biostatistics >>>>>>>>>> (650) 467-7374 >>>>>>>>>> >>>>>>>>>> "Data is not information. Information is not knowledge. And >>>> knowledge >>>>>>>>>> is certainly not wisdom." >>>>>>>>>> Clifford Stoll >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer < >> lucam1968 at gmail.com> >>>>>>>>>> wrote: >>>>>>>>>>> Hi Jeff & other R-experts, >>>>>>>>>>> >>>>>>>>>>> Thank you for your note. I have tried myself to solve the >> issue >>>>>>>>>>> without >>>>>>>>>>> success. >>>>>>>>>>> >>>>>>>>>>> Following your suggestion, I am providing a sample of the >>>> dataset I >>>>>>>>>>> am >>>>>>>>>>> using below (also downloadble in plain text from >>>>>>>>>>> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0): >>>>>>>>>>> >>>>>>>>>>> #this is an extract of the overall dataset (n=1200 cases) >>>>>>>>>>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", >>>> "B", >>>>>>>>>>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", >>>>>>>>>>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", >>>>>>>>>>> "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835, >>>>>>>>>>> 3.43806581506388, >>>>>>>>>>> 0.002733567617055, 1.42917483425029, 1.05786640463504, >>>>>>>>>>> 0.000420548864162308, >>>>>>>>>>> 2.37232740842861, 3.01835841813241, 0, 1.13430282139936, >>>>>>>>>>> 0.928725667117666, >>>>>>>>>>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", >>>>>>>>>>> row.names >>>>>>>>>>> >>>>>>>>>>> c(2L, >>>>>>>>>>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 
167L, 197L, 204L, 206L)) >>>>>>>>>>> >>>>>>>>>>> I need to find a automated procedure that allows me to adjust >> v3 >>>>>>>>>>> marginals >>>>>>>>>>> while maintaining v1xv2 marginals unchanged. >>>>>>>>>>> >>>>>>>>>>> That is: modify the v4 values you can find by running: >>>>>>>>>>> >>>>>>>>>>> aggregate(f1[,c("v4")],list(f1$v3),sum) >>>>>>>>>>> >>>>>>>>>>> while maintaining costant the values you can find by running: >>>>>>>>>>> >>>>>>>>>>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum) >>>>>>>>>>> >>>>>>>>>>> Now does it make sense? >>>>>>>>>>> >>>>>>>>>>> Please notice I have tried to build some syntax that tries to >>>> modify >>>>>>>>>>> values >>>>>>>>>>> within each v1xv2 combination by computing sum of v4, row >>>> percentage >>>>>>>>>>> in >>>>>>>>>>> terms of v4, and there is where my effort is blocked. Not >> really >>>>>>>>>>> sure >>>>>>>>>>> how I >>>>>>>>>>> should proceed. Any suggestion? >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Luca >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2015-03-19 2:38 GMT+01:00 Jeff Newmiller < >>>> jdnewmil at dcn.davis.ca.us>: >>>>>>>>>>> >>>>>>>>>>>> I don't understand your description. The standard practice on >>>> this >>>>>>>>>>>> list >>>>>>>>>>>> is >>>>>>>>>>>> to provide a reproducible R example [1] of the kind of data >> you >>>> are >>>>>>>>>>>> working >>>>>>>>>>>> with (and any code you have tried) to go along with your >>>>>>>>>>>> description. >>>>>>>>>>>> In >>>>>>>>>>>> this case, that would be two dputs of your input data frames >>>> and a >>>>>>>>>>>> dput >>>>>>>>>>>> of >>>>>>>>>>>> an output data frame (generated by hand from your input data >>>>>>>>>>>> frame). >>>>>>>>>>>> (Probably best to not use the full number of input values >> just >>>> to >>>>>>>>>>>> keep >>>>>>>>>>>> the >>>>>>>>>>>> size down.) We could then make an attempt to generate code >> that >>>>>>>>>>>> goes >>>>>>>>>>>> from >>>>>>>>>>>> input to output. 
>>>>>>>>>>>> >>>>>>>>>>>> Of course, if you post that hard work using HTML then it will >>>> get >>>>>>>>>>>> corrupted (much like the text below from your earlier emails) >>>> and >>>>>>>>>>>> we >>>>>>>>>>>> won't >>>>>>>>>>>> be able to use it. Please learn to post from your email >> software >>>>>>>>>>>> using >>>>>>>>>>>> plain text when corresponding with this mailing list. >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>> >> --------------------------------------------------------------------------- >>>>>>>>>>>> Jeff Newmiller The ..... >>>> ..... Go >>>>>>>>>>>> Live... >>>>>>>>>>>> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. >> ##.#. >>>>>>>>>>>> Live >>>>>>>>>>>> Go... >>>>>>>>>>>> Live: OO#.. Dead: >> OO#.. >>>>>>>>>>>> Playing >>>>>>>>>>>> Research Engineer (Solar/Batteries O.O#. >> #.O#. >>>>>>>>>>>> with >>>>>>>>>>>> /Software/Embedded Controllers) .OO#. >> .OO#. >>>>>>>>>>>> rocks...1k >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>> >> --------------------------------------------------------------------------- >>>>>>>>>>>> Sent from my phone. Please excuse my brevity. >>>>>>>>>>>> >>>>>>>>>>>> On March 18, 2015 9:05:37 AM PDT, Luca Meyer < >>>> lucam1968 at gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> Thanks for you input Michael, >>>>>>>>>>>>> >>>>>>>>>>>>> The continuous variable I have measures quantities (down to >> the >>>>>>>>>>>>> 3rd >>>>>>>>>>>>> decimal level) so unfortunately are not frequencies. >>>>>>>>>>>>> >>>>>>>>>>>>> Any more specific suggestions on how that could be tackled? 
>>>>>>>>>>>>> >>>>>>>>>>>>> Thanks & kind regards, >>>>>>>>>>>>> >>>>>>>>>>>>> Luca >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> ==>>>>>>>>>>>>> >>>>>>>>>>>>> Michael Friendly wrote: >>>>>>>>>>>>> I'm not sure I understand completely what you want to do, >> but >>>>>>>>>>>>> if the data were frequencies, it sounds like task for >> fitting a >>>>>>>>>>>>> loglinear model with the model formula >>>>>>>>>>>>> >>>>>>>>>>>>> ~ V1*V2 + V3 >>>>>>>>>>>>> >>>>>>>>>>>>> On 3/18/2015 2:17 AM, Luca Meyer wrote: >>>>>>>>>>>>>> * Hello, >>>>>>>>>>>>> *>>* I am facing a quite challenging task (at least to me) >> and >>>> I >>>>>>>>>>>>> was >>>>>>>>>>>>> wondering >>>>>>>>>>>>> *>* if someone could advise how R could assist me to speed >> the >>>>>>>>>>>>> task >>>>>>>>>>>>> up. >>>>>>>>>>>>> *>>* I am dealing with a dataset with 3 discrete variables >> and >>>> one >>>>>>>>>>>>> continuous >>>>>>>>>>>>> *>* variable. The discrete variables are: >>>>>>>>>>>>> *>>* V1: 8 modalities >>>>>>>>>>>>> *>* V2: 13 modalities >>>>>>>>>>>>> *>* V3: 13 modalities >>>>>>>>>>>>> *>>* The continuous variable V4 is a decimal number always >>>> greater >>>>>>>>>>>>> than >>>>>>>>>>>>> zero in >>>>>>>>>>>>> *>* the marginals of each of the 3 variables but it is >>>> sometimes >>>>>>>>>>>>> equal >>>>>>>>>>>>> to zero >>>>>>>>>>>>> *>* (and sometimes negative) in the joint tables. >>>>>>>>>>>>> *>>* I have got 2 files: >>>>>>>>>>>>> *>>* => one with distribution of all possible combinations >> of >>>>>>>>>>>>> V1xV2 >>>>>>>>>>>>> (some of >>>>>>>>>>>>> *>* which are zero or neagtive) and >>>>>>>>>>>>> *>* => one with the marginal distribution of V3. >>>>>>>>>>>>> *>>* I am trying to build the long and narrow dataset >> V1xV2xV3 >>>> in >>>>>>>>>>>>> such >>>>>>>>>>>>> a way >>>>>>>>>>>>> *>* that each V1xV2 cell does not get modified and V3 fits >> as >>>>>>>>>>>>> closely >>>>>>>>>>>>> as >>>>>>>>>>>>> *>* possible to its marginal distribution. Does it make >> sense? 
>>>>>>>>>>>>> *>>* To be even more specific, my 2 input files look like >> the >>>>>>>>>>>>> following. >>>>>>>>>>>>> *>>* FILE 1 >>>>>>>>>>>>> *>* V1,V2,V4 >>>>>>>>>>>>> *>* A, A, 24.251 >>>>>>>>>>>>> *>* A, B, 1.065 >>>>>>>>>>>>> *>* (...) >>>>>>>>>>>>> *>* B, C, 0.294 >>>>>>>>>>>>> *>* B, D, 2.731 >>>>>>>>>>>>> *>* (...) >>>>>>>>>>>>> *>* H, L, 0.345 >>>>>>>>>>>>> *>* H, M, 0.000 >>>>>>>>>>>>> *>>* FILE 2 >>>>>>>>>>>>> *>* V3, V4 >>>>>>>>>>>>> *>* A, 1.575 >>>>>>>>>>>>> *>* B, 4.294 >>>>>>>>>>>>> *>* C, 10.044 >>>>>>>>>>>>> *>* (...) >>>>>>>>>>>>> *>* L, 5.123 >>>>>>>>>>>>> *>* M, 3.334 >>>>>>>>>>>>> *>>* What I need to achieve is a file such as the following >>>>>>>>>>>>> *>>* FILE 3 >>>>>>>>>>>>> *>* V1, V2, V3, V4 >>>>>>>>>>>>> *>* A, A, A, ??? >>>>>>>>>>>>> *>* A, A, B, ??? >>>>>>>>>>>>> *>* (...) >>>>>>>>>>>>> *>* D, D, E, ??? >>>>>>>>>>>>> *>* D, D, F, ??? >>>>>>>>>>>>> *>* (...) >>>>>>>>>>>>> *>* H, M, L, ??? >>>>>>>>>>>>> *>* H, M, M, ??? >>>>>>>>>>>>> *>>* Please notice that FILE 3 need to be such that if I >>>> aggregate >>>>>>>>>>>>> on >>>>>>>>>>>>> V1+V2 I >>>>>>>>>>>>> *>* recover exactly FILE 1 and that if I aggregate on V3 I >> can >>>>>>>>>>>>> recover >>>>>>>>>>>>> a file >>>>>>>>>>>>> *>* as close as possible to FILE 3 (ideally the same file). >>>>>>>>>>>>> *>>* Can anyone suggest how I could do that with R? >>>>>>>>>>>>> *>>* Thank you very much indeed for any assistance you are >>>> able to >>>>>>>>>>>>> provide. >>>>>>>>>>>>> *>>* Kind regards, >>>>>>>>>>>>> *>>* Luca* >>>>>>>>>>>>> >>>>>>>>>>>>> [[alternative HTML version deleted]]David Winsemius Alameda, CA, USA
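[To close the loop on Luca's question of making the manual B/C adjustment programmatic: the per-cell shares and the twelve ifelse() statements can be collapsed into a few vectorized lines. A sketch, under the assumption that each v1 x v2 cell contains exactly one row per v3 category, as in the sample f1, and using the targets 29 and 2.56723 from the thread:]

```r
# Share of each v1 x v2 cell in the grand total, one value per row
# (the programmatic equivalent of tAA ... tBC above):
w <- ave(f1$v4, f1$v1, f1$v2, FUN = sum) / sum(f1$v4)

# Amount to move into v3 == "B" (and out of v3 == "C") so that the B
# marginal becomes 29; the grand total is preserved, so C falls to 2.56723:
delta <- 29 - sum(f1$v4[f1$v3 == "B"])

# Shift v4 proportionally: add the cell's share of delta to its B row and
# subtract it from its C row, leaving every v1 x v2 sum exactly as before:
f1$v4 <- f1$v4 + ifelse(f1$v3 == "B", w, -w) * delta

# Checks:
aggregate(v4 ~ v1 * v2, f1, sum)  # unchanged
aggregate(v4 ~ v3, f1, sum)       # B is now 29, C is 2.56723
```

[For the full 8x13x13 problem with a vector of v3 targets, the same idea applies per v3 category, with one delta per category summing to zero; this is a sketch only and has not been run against the full data.]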