David Freedman
2009-May-25 12:45 UTC
[R] long format - find age when another variable is first 'high'
Dear R,

I've got a data frame with children examined multiple times and at various ages. I'm trying to find the first age at which another variable (LDL-Cholesterol) is >= 130 mg/dL; for some children, this may never happen. I can do this with transformBy and ddply, but with 10,000 different children, these functions take some time on my PCs - is there a faster way to do this in R? My code on a small dataset follows.

Thanks very much,
David Freedman

    d <- data.frame(id=c(rep(1,3),rep(2,2),3), age=c(5,10,15,4,7,12), ldlc=c(132,120,125,105,142,160))
    d$high.ldlc <- ifelse(d$ldlc>=130,1,0)
    d
    library(plyr)
    d2 <- ddply(d, ~id, transform, plyr.minage=min(age[high.ldlc==1]))
    library(doBy)
    d2 <- transformBy(~id, da=d2, doby.minage=min(age[high.ldlc==1]))
    d2
ONKELINX, Thierry
2009-May-25 12:57 UTC
[R] long format - find age when another variable is first 'high'
Dear David,

You would speed things up if you first create a subset in which all values of ldlc are >= 130. Then you only have to find the lowest age for each child in this subset.

HTH,

Thierry
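The two steps Thierry describes can be sketched in base R roughly as follows. This is a minimal sketch rather than code posted in the thread: it reuses the example data from the original message, and the object names `high` and `first_age` are made up for illustration.

```r
# Example data from the original post
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

# Step 1: keep only the visits where LDL cholesterol is >= 130
high <- d[d$ldlc >= 130, ]

# Step 2: lowest age per child within that subset
# (tapply returns one value per id, named by id)
first_age <- tapply(high$age, high$id, min)
first_age
```

Children who never reach 130 do not appear in `first_age` at all, because their rows are dropped in step 1.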
Marc Schwartz
2009-May-25 13:52 UTC
[R] long format - find age when another variable is first 'high'
The first thing that I would do is to get rid of records that are not relevant to your question:

    > d
      id age ldlc high.ldlc
    1  1   5  132         1
    2  1  10  120         0
    3  1  15  125         0
    4  2   4  105         0
    5  2   7  142         1
    6  3  12  160         1

    # Get records with high ldl
    d.new <- subset(d, ldlc >= 130)

    > d.new
      id age ldlc high.ldlc
    1  1   5  132         1
    5  2   7  142         1
    6  3  12  160         1

That will help to reduce the total size of the dataset, perhaps substantially. It will also remove entire subjects that are not relevant (e.g., never have LDL >= 130).

Then get the minimum age for each of the remaining subjects:

    > aggregate(d.new$age, list(id = d.new$id), min)
      id  x
    1  1  5
    2  2  7
    3  3 12

Try that to see what sort of time reduction you observe.

HTH,

Marc Schwartz
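The original code attached the minimum age to every row of the data, whereas the aggregate() result has one row per child. If the per-row form is wanted, one possible way to get it is to merge the result back onto the full data. The following is only a sketch under assumptions: child 4 is a hypothetical addition to the example data (a child who never reaches 130 mg/dL), and the names `first.high` and `min.age` are illustrative.

```r
# Example data from the original post, plus a hypothetical child (id 4)
# who never reaches 130 mg/dL
d <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 4, 4),
                age  = c(5, 10, 15, 4, 7, 12, 6, 9),
                ldlc = c(132, 120, 125, 105, 142, 160, 110, 118))

# Minimum age among the high-LDL records, one row per child
d.new      <- subset(d, ldlc >= 130)
first.high <- aggregate(d.new$age, list(id = d.new$id), min)
names(first.high)[2] <- "min.age"   # aggregate() names the column "x"

# Left join back onto the full data; all.x = TRUE keeps children
# with no high value and gives them NA
d2 <- merge(d, first.high, by = "id", all.x = TRUE)
d2
```

With `all.x = TRUE`, the rows for child 4 are kept and get `NA` in `min.age`, instead of being dropped.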
Gabor Grothendieck
2009-May-25 14:14 UTC
[R] long format - find age when another variable is first 'high'
Depending on what you want (haven't checked the speed) you could try this one, where we have changed the ldlc in the first row so that it has none > 130 for id=1, just to illustrate that case as well:

    > d <- data.frame(id = c(rep(1, 3), rep(2, 2), 3), age=c(5, 10, 15, 4, 7, 12),
    +   ldlc=c(122, 120, 125, 105, 142, 160))
    > library(sqldf)
    > sqldf("select * from d left join (select id, min(age) min_age from d where ldlc > 130 group by id) using(id)")
      id age ldlc min_age
    1  1   5  122    <NA>
    2  1  10  120    <NA>
    3  1  15  125    <NA>
    4  2   4  105     7.0
    5  2   7  142     7.0
    6  3  12  160    12.0

    > # or this (which just gives the data frame of id and min_age):
    > sqldf("select id, min_age from d left join (select id, min(age) min_age from d where ldlc > 130 group by id) using(id) group by id")
      id min_age
    1  1    <NA>
    2  2     7.0
    3  3    12.0

    > # or this (which is similar but omits the NAs)
    > sqldf("select id, min(age) from d where ldlc > 130 group by id")
      id min(age)
    1  2        7
    2  3       12

See sqldf home page at: http://sqldf.googlecode.com
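None of the replies reports timings. One rough way to check an approach on data of the size David mentions is to time it on simulated data. The sketch below does this for the subset-then-aggregate approach. Everything about the simulated data (five visits per child, the age range, the LDL distribution) is an assumption made for illustration and does not come from the thread.

```r
# Simulated data (not from the thread): 10,000 children, 5 visits each
set.seed(1)
n.child <- 10000
sim <- data.frame(id   = rep(seq_len(n.child), each = 5),
                  age  = round(runif(5 * n.child, min = 2, max = 18), 1),
                  ldlc = round(rnorm(5 * n.child, mean = 110, sd = 25)))

# Time the subset-then-aggregate approach
timing <- system.time({
  high <- subset(sim, ldlc >= 130)
  res  <- aggregate(high$age, list(id = high$id), min)
})
timing

# Cross-check the result against a direct per-child computation
check <- tapply(high$age, high$id, min)
all(res$x == as.vector(check))
```

The same `system.time()` wrapper can be put around the ddply, transformBy, or sqldf calls to compare them on identical data.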