Ted Harding
2011-Apr-20 20:38 UTC
[R] [test message] Can R replicate this data manipulation in SAS?
Apologies for troubling the list, but it is a test that needs to be carried out. I am resending the message that I sent earlier on behalf of Paul Miller, but with a certain word used in the variables names of the SAS code changed to a different word. With thanks for your tolerance ... Ted. [*** PLEASE NOTE: I am sending this message on behalf of Paul Miller: Paul Miller <pjmiller_57 at yahoo.com> (to whom this message has also been copied). He has been trying to send it, but it has never got through. Please do not reply to me, but either to the list and/or to Paul at that address ***] =========================================================Hello Everyone, I'm learning R and am trying to get a better sense of what it will and will not do. I'm hearing in some places that R may not be able to accomplish all of the data manipulation tasks that SAS can. In others, I'm hearing that R can do pretty much any data manipulation that SAS can but the way in which it does so is likely to be quite different. Below is some SAS syntax that that codes Highly Active Antiretroviral Therapy (HAART) regimens in HIV patients by retaining the values of variables. Interspersed between the bits of code are printouts of data sets that are created in the process of coding. I'm hoping this will come through clearly and that people will be able to see exactly what is being done. Basically, the code keeps track of how many books people are on and what types of books they are taking during specific periods of time and decides whether that constitutes HAART or not. To me, this is a pretty tricky data manipulation in SAS. Is there any way to get the equivalent result in R? Thanks, Paul **** SAS syntax for coding HAART in HIV patients; **** Read in test data; data haart; input id book_class $ start_date :mmddyy. stop_date :mmddyy.; format start_date stop_date mmddyy8.; cards; 1004 NRTI 07/24/95 01/05/99 1004 NRTI 11/20/95 12/10/95 1004 NRTI 01/10/96 01/05/99 1004 PI 05/09/96 11/16/97 1004 NRTI 06/01/96 02/01/97 1004 NRTI 07/01/96 03/01/97 9999 PI 01/02/03 . 9999 NNRTI 04/05/06 07/08/09 ; run; proc print data=haart; run; book_ start_ stop_ Obs id class date date 1 1004 NRTI 07/24/95 01/05/99 2 1004 NRTI 11/20/95 12/10/95 3 1004 NRTI 01/10/96 01/05/99 4 1004 PI 05/09/96 11/16/97 5 1004 NRTI 06/01/96 02/01/97 6 1004 NRTI 07/01/96 03/01/97 7 9999 PI 01/02/03 . 8 9999 NNRTI 04/05/06 07/08/09 **** Reshape data into series with 1 date rather than separate starts and stops; data changes (drop=start_date stop_date where=(not missing(date))); set haart; date = start_date; change = 1; output; date = stop_date; change = -1; output; format date mmddyy10.; run; proc sort data=changes; by id date; run; proc print data=changes; run; book_ Obs id class date change 1 1004 NRTI 07/24/1995 1 2 1004 NRTI 11/20/1995 1 3 1004 NRTI 12/10/1995 -1 4 1004 NRTI 01/10/1996 1 5 1004 PI 05/09/1996 1 6 1004 NRTI 06/01/1996 1 7 1004 NRTI 07/01/1996 1 8 1004 NRTI 02/01/1997 -1 9 1004 NRTI 03/01/1997 -1 10 1004 PI 11/16/1997 -1 11 1004 NRTI 01/05/1999 -1 12 1004 NRTI 01/05/1999 -1 13 9999 PI 01/02/2003 1 14 9999 NNRTI 04/05/2006 1 15 9999 NNRTI 07/08/2009 -1 **** Get regimen information plus start and stop dates; data cumulative(drop=book_class change stop_date) stop_dates(keep=id regimen stop_date); set changes; by id date; if first.id then do; regimen = 0; NRTI = 0; NNRTI = 0; PI = 0; end; if book_class = 'NNRTI' then NNRTI + change; else if book_class = 'NRTI' then NRTI + change; else if book_class = 'PI ' then PI + change; if last.date then do; stop_date = date - 1; if regimen then output stop_dates; regimen + 1; allbooks = NNRTI + NRTI + PI; HAART = (NRTI >= 3 AND NNRTI=0 AND PI=0) OR (NRTI >= 2 AND (NNRTI >= 1 OR PI >= 1)) OR (NRTI = 1 AND NNRTI >= 1 AND PI >= 1); output cumulative; end; format stop_date mmddyy10.; run; proc print data=cumulative; run; Obs id date regimen NRTI NNRTI PI allbooks HAART 1 1004 07/24/1995 1 1 0 0 1 0 2 1004 11/20/1995 2 2 0 0 2 0 3 1004 12/10/1995 3 1 0 0 1 0 4 1004 01/10/1996 4 2 0 0 2 0 5 1004 05/09/1996 5 2 0 1 3 1 6 1004 06/01/1996 6 3 0 1 4 1 7 1004 07/01/1996 7 4 0 1 5 1 8 1004 02/01/1997 8 3 0 1 4 1 9 1004 03/01/1997 9 2 0 1 3 1 10 1004 11/16/1997 10 2 0 0 2 0 11 1004 01/05/1999 11 0 0 0 0 0 12 9999 01/02/2003 1 0 0 1 1 0 13 9999 04/05/2006 2 0 1 1 2 0 14 9999 07/08/2009 3 0 0 1 1 0 proc print data=stop_dates; run; Obs id regimen stop_date 1 1004 1 11/19/1995 2 1004 2 12/09/1995 3 1004 3 01/09/1996 4 1004 4 05/08/1996 5 1004 5 05/31/1996 6 1004 6 06/30/1996 7 1004 7 01/31/1997 8 1004 8 02/28/1997 9 1004 9 11/15/1997 10 1004 10 01/04/1999 11 9999 1 04/04/2006 12 9999 2 07/07/2009 **** Merge data to create regimens dataset; data regimens; retain id start_date stop_date; merge cumulative(rename=(date=start_date)) stop_dates; by id regimen; if allbooks; run; proc print data=regimens; run; Obs id start_date stop_date regimen NRTI NNRTI PI allbooks HAART 1 1004 07/24/1995 11/19/1995 1 1 0 0 1 0 2 1004 11/20/1995 12/09/1995 2 2 0 0 2 0 3 1004 12/10/1995 01/09/1996 3 1 0 0 1 0 4 1004 01/10/1996 05/08/1996 4 2 0 0 2 0 5 1004 05/09/1996 05/31/1996 5 2 0 1 3 1 6 1004 06/01/1996 06/30/1996 6 3 0 1 4 1 7 1004 07/01/1996 01/31/1997 7 4 0 1 5 1 8 1004 02/01/1997 02/28/1997 8 3 0 1 4 1 9 1004 03/01/1997 11/15/1997 9 2 0 1 3 1 10 1004 11/16/1997 01/04/1999 10 2 0 0 2 0 11 9999 01/02/2003 04/04/2006 1 0 0 1 1 0 12 9999 04/05/2006 07/07/2009 2 0 1 1 2 0 13 9999 07/08/2009 . 3 0 0 1 1 0 ========================================================= Paul Miller Paul Miller <pjmiller_57 at yahoo.com> -------------------------------------------------------------------- E-Mail: (Ted Harding) <ted.harding at wlandres.net> Fax-to-email: +44 (0)870 094 0861 Date: 20-Apr-11 Time: 19:59:21 ------------------------------ XFMail ------------------------------
I think this is kind of like asking "will your Land Rover make it up my driveway?", but I'll assume the question was asked in all seriousness. Here is one solution: ## **** Read in test data; dat <- read.table(textConnection("id drug start stop 1004 NRTI 07/24/95 01/05/99 1004 NRTI 11/20/95 12/10/95 1004 NRTI 01/10/96 01/05/99 1004 PI 05/09/96 11/16/97 1004 NRTI 06/01/96 02/01/97 1004 NRTI 07/01/96 03/01/97 9999 PI 01/02/03 NA 9999 NNRTI 04/05/06 07/08/09"), header=TRUE) closeAllConnections() dat$start <- as.Date(dat$start, format = "%m/%d/%y") dat$stop <- as.Date(dat$stop, format = "%m/%d/%y") ## **** Reshape data into series with 1 date rather than separate starts and ## stops; library(reshape) m.dat <- melt(dat, id = c("id", "drug")) m.dat <- m.dat[order(m.dat$id, m.dat$value),] m.dat$variable <- ifelse(m.dat$variable == "start", 1, -1) names(m.dat) <- c("id", "drug", "value", "date") m.dat ## **** Get regimen information plus start and stop dates; n.dat <- cast(m.dat, id + date ~ drug, fun.aggregate=sum, margins="grand_col") for (i in names(n.dat)[-c(1:2)]) { n.dat[i] <- cumsum(n.dat[i]) } n.dat <- ddply(n.dat, .(id), transform, regimen = 1:length(id)) n.dat ssd.dat <- ddply(n.dat, .(id), summarize, id = id[-1], regimen = regimen[-length(regimen)], start_date = date[-length(date)], stop_date = date[-1]) ssd.dat ## **** Merge data to create regimens dataset; all.dat <- merge(n.dat[-2], ssd.dat) all.dat <- all.dat[order(all.dat$id, all.dat$regimen), c("id", "start_date", "stop_date", "regimen", "NRTI", "NNRTI", "PI", "X.all.")] all.dat Best, Ista On Wed, Apr 20, 2011 at 2:59 PM, Ted Harding <ted.harding at wlandres.net> wrote:> [*** PLEASE NOTE: I am sending this message on behalf of > ?Paul Miller: > ?Paul Miller <pjmiller_57 at yahoo.com> > ?(to whom this message has also been copied). He has been > ?trying to send it, but it has never got through. Please > ?do ?not reply to me, but either to the list and/or to Paul > ?at that address ***] > =========================================================> Hello Everyone, > > I'm learning R and am trying to get a better sense of what it will and > will not > do. I'm hearing in some places that R may not be able to accomplish all > of the > data manipulation tasks that SAS can. In others, I'm hearing that R can do > pretty much any data manipulation that SAS can but the way in which it > does so > is likely to be quite different. > > Below is some SAS syntax that that codes Highly Active Antiretroviral > Therapy > (HAART) regimens in HIV patients by retaining the values of variables. > Interspersed between the bits of code are printouts of data sets that are > created in the process of coding. I'm hoping this will come through > clearly and > that people will be able to see exactly what is being done. Basically, > the code > keeps track of how many drugs people are on and what types of drugs they > are > taking during specific periods of time and decides whether that > constitutes > HAART or not. > > To me, this is a pretty tricky data manipulation in SAS. Is there any way > to > get the equivalent result in R? > > Thanks, > > Paul > > > **** SAS syntax for coding HAART in HIV patients; > **** Read in test data; > > data haart; > input id drug_class $ start_date :mmddyy. stop_date :mmddyy.; > format start_date stop_date mmddyy8.; > cards; > 1004 NRTI ?07/24/95 01/05/99 > 1004 NRTI ?11/20/95 12/10/95 > 1004 NRTI ?01/10/96 01/05/99 > 1004 PI ? ?05/09/96 11/16/97 > 1004 NRTI ?06/01/96 02/01/97 > 1004 NRTI ?07/01/96 03/01/97 > 9999 PI ? ?01/02/03 . > 9999 NNRTI 04/05/06 07/08/09 > ; > run; > > proc print data=haart; > run; > > ? ? ? ? ? ? ? drug_ ? ? ?start_ ? ? ? stop_ > Obs ? ? id ? ? class ? ? ? ?date ? ? ? ?date > 1 ? ? 1004 ? ?NRTI ? ? 07/24/95 ? ?01/05/99 > 2 ? ? 1004 ? ?NRTI ? ? 11/20/95 12/10/95 > 3 ? ? 1004 ? ?NRTI ? ? 01/10/96 ? ?01/05/99 > 4 ? ? 1004 ? ?PI ? ? ? 05/09/96 ? ?11/16/97 > 5 ? ? 1004 ? ?NRTI ? ? 06/01/96 ? ?02/01/97 > 6 ? ? 1004 ? ?NRTI ? ? 07/01/96 ? ?03/01/97 > 7 ? ? 9999 ? ?PI ? ? ? 01/02/03 ? ? ? ? ? . > 8 ? ? 9999 ? ?NNRTI ? ?04/05/06 ? ?07/08/09 > > **** Reshape data into series with 1 date rather than separate starts and > stops; > > data changes (drop=start_date stop_date where=(not missing(date))); > set haart; > date = start_date; > change = ?1; > output; > date = ?stop_date; > change = -1; > output; > format date mmddyy10.; > run; > > proc sort data=changes; > by id date; > run; > > proc print data=changes; > run; > > ? ? ? ? ? ? ? drug_ > Obs ? ? id ? ? class ? ? ? ? ?date ? ?change > ?1 ? ?1004 ? ?NRTI ? ? 07/24/1995 ? ? ? 1 > ?2 ? ?1004 ? ?NRTI ? ? 11/20/1995 ? ? ? 1 > ?3 ? ?1004 ? ?NRTI ? ? 12/10/1995 ? ? ?-1 > ?4 ? ?1004 ? ?NRTI ? ? 01/10/1996 ? ? ? 1 > ?5 ? ?1004 ? ?PI ? ? ? 05/09/1996 ? ? ? 1 > ?6 ? ?1004 ? ?NRTI ? ? 06/01/1996 ? ? ? 1 > ?7 ? ?1004 ? ?NRTI ? ? 07/01/1996 ? ? ? 1 > ?8 ? ?1004 ? ?NRTI ? ? 02/01/1997 ? ? ?-1 > ?9 ? ?1004 ? ?NRTI ? ? 03/01/1997 ? ? ?-1 > 10 ? ?1004 ? ?PI ? ? ? 11/16/1997 ? ? ?-1 > 11 ? ?1004 ? ?NRTI ? ? 01/05/1999 ? ? ?-1 > 12 ? ?1004 ? ?NRTI ? ? 01/05/1999 ? ? ?-1 > 13 ? ?9999 ? ?PI ? ? ? 01/02/2003 ? ? ? 1 > 14 ? ?9999 ? ?NNRTI ? ?04/05/2006 ? ? ? 1 > 15 ? ?9999 ? ?NNRTI ? ?07/08/2009 ? ? ?-1 > > **** Get regimen information plus start and stop dates; > > data cumulative(drop=drug_class change stop_date) > ? ? stop_dates(keep=id regimen stop_date); > set changes; > by id date; > > if first.id then do; > ?regimen = 0; > ?NRTI = 0; > ?NNRTI = 0; > ?PI = 0; > end; > > if drug_class = 'NNRTI' then NNRTI + change; > else if drug_class = 'NRTI' then NRTI + change; > else if drug_class = 'PI ?' then PI + change; > > if last.date then do; > ?stop_date = date - 1; > if regimen then output stop_dates; > ? regimen + 1; > ?alldrugs = NNRTI + NRTI + PI; > ?HAART = (NRTI >= 3 AND NNRTI=0 AND PI=0) OR > ? ?(NRTI >= 2 AND (NNRTI >= 1 OR PI >= 1)) OR > ? ?(NRTI = 1 AND NNRTI >= 1 AND PI >= 1); > output cumulative; > end; > > format stop_date mmddyy10.; > run; > > proc print data=cumulative; > run; > Obs ? ? id ? ? ? ? ? date ? ?regimen ? ?NRTI ? ?NNRTI ? ?PI ? ?alldrugs > ?HAART > ?1 ? ?1004 ? ?07/24/1995 ? ? ? ?1 ? ? ? ?1 ? ? ? 0 ? ? ? 0 ? ? ? ?1 > ? 0 > ?2 ? ?1004 ? ?11/20/1995 ? ? ? ?2 ? ? ? ?2 ? ? ? 0 ? ? ? 0 ? ? ? ?2 > ? 0 > ?3 ? ?1004 ? ?12/10/1995 ? ? ? ?3 ? ? ? ?1 ? ? ? 0 ? ? ? 0 ? ? ? ?1 > ? 0 > ?4 ? ?1004 ? ?01/10/1996 ? ? ? ?4 ? ? ? ?2 ? ? ? 0 ? ? ? 0 ? ? ? ?2 > ? 0 > ?5 ? ?1004 ? ?05/09/1996 ? ? ? ?5 ? ? ? ?2 ? ? ? 0 ? ? ? 1 ? ? ? ?3 > ? 1 > ?6 ? ?1004 ? ?06/01/1996 ? ? ? ?6 ? ? ? ?3 ? ? ? 0 ? ? ? 1 ? ? ? ?4 > ? 1 > ?7 ? ?1004 ? ?07/01/1996 ? ? ? ?7 ? ? ? ?4 ? ? ? 0 ? ? ? 1 ? ? ? ?5 > ? 1 > ?8 ? ?1004 ? ?02/01/1997 ? ? ? ?8 ? ? ? ?3 ? ? ? 0 ? ? ? 1 ? ? ? ?4 > ? 1 > ?9 ? ?1004 ? ?03/01/1997 ? ? ? ?9 ? ? ? ?2 ? ? ? 0 ? ? ? 1 ? ? ? ?3 > ? 1 > 10 ? ?1004 ? ?11/16/1997 ? ? ? 10 ? ? ? ?2 ? ? ? 0 ? ? ? 0 ? ? ? ?2 > ?0 > 11 ? ?1004 ? ?01/05/1999 ? ? ? 11 ? ? ? ?0 ? ? ? 0 ? ? ? 0 ? ? ? ?0 > ?0 > 12 ? ?9999 ? ?01/02/2003 ? ? ? ?1 ? ? ? ?0 ? ? ? 0 ? ? ? 1 ? ? ? ?1 > ?0 > 13 ? ?9999 ? ?04/05/2006 ? ? ? ?2 ? ? ? ?0 ? ? ? 1 ? ? ? 1 ? ? ? ?2 > ?0 > 14 ? ?9999 ? ?07/08/2009 ? ? ? ?3 ? ? ? ?0 ? ? ? 0 ? ? ? 1 ? ? ? ?1 > ?0 > > proc print data=stop_dates; > run; > > Obs ? ? id ? ? regimen ? ? stop_date > ?1 ? ?1004 ? ? ? ?1 ? ? ?11/19/1995 > ?2 ? ?1004 ? ? ? ?2 ? ? ?12/09/1995 > ?3 ? ?1004 ? ? ? ?3 ? ? ?01/09/1996 > ?4 ? ?1004 ? ? ? ?4 ? ? ?05/08/1996 > ?5 ? ?1004 ? ? ? ?5 ? ? ?05/31/1996 > ?6 ? ?1004 ? ? ? ?6 ? ? ?06/30/1996 > ?7 ? ?1004 ? ? ? ?7 ? ? ?01/31/1997 > ?8 ? ?1004 ? ? ? ?8 ? ? ?02/28/1997 > ?9 ? ?1004 ? ? ? ?9 ? ? ?11/15/1997 > 10 ? ?1004 ? ? ? 10 ? ? ?01/04/1999 > 11 ? ?9999 ? ? ? ?1 ? ? ?04/04/2006 > 12 ? ?9999 ? ? ? ?2 ? ? ?07/07/2009 > > **** Merge data to create regimens dataset; > > data regimens; > retain id start_date stop_date; > merge cumulative(rename=(date=start_date)) stop_dates; > by id regimen; > if alldrugs; > run; > > proc print data=regimens; > run; > > Obs ? ? id ? ? start_date ? ? stop_date ? ?regimen ? ?NRTI ? ?NNRTI ? ?PI > > alldrugs ? ?HAART > ?1 ? ?1004 ? ?07/24/1995 ? ?11/19/1995 ? ? ? ?1 ? ? ? ?1 ? ? ? 0 ? ? ? 0 > > ?1 ? ? ? ? 0 > ?2 ? ?1004 ? ?11/20/1995 ? ?12/09/1995 ? ? ? ?2 ? ? ? ?2 ? ? ? 0 ? ? ? 0 > > ?2 ? ? ? ? 0 > ?3 ? ?1004 ? ?12/10/1995 ? ?01/09/1996 ? ? ? ?3 ? ? ? ?1 ? ? ? 0 ? ? ? 0 > > ?1 ? ? ? ? 0 > ?4 ? ?1004 ? ?01/10/1996 ? ?05/08/1996 ? ? ? ?4 ? ? ? ?2 ? ? ? 0 ? ? ? 0 > > ?2 ? ? ? ? 0 > ?5 ? ?1004 ? ?05/09/1996 ? ?05/31/1996 ? ? ? ?5 ? ? ? ?2 ? ? ? 0 ? ? ? 1 > > ?3 ? ? ? ? 1 > ?6 ? ?1004 ? ?06/01/1996 ? ?06/30/1996 ? ? ? ?6 ? ? ? ?3 ? ? ? 0 ? ? ? 1 > > ?4 ? ? ? ? 1 > ?7 ? ?1004 ? ?07/01/1996 ? ?01/31/1997 ? ? ? ?7 ? ? ? ?4 ? ? ? 0 ? ? ? 1 > > ?5 ? ? ? ? 1 > ?8 ? ?1004 ? ?02/01/1997 ? ?02/28/1997 ? ? ? ?8 ? ? ? ?3 ? ? ? 0 ? ? ? 1 > > ?4 ? ? ? ? 1 > ?9 ? ?1004 ? ?03/01/1997 ? ?11/15/1997 ? ? ? ?9 ? ? ? ?2 ? ? ? 0 ? ? ? 1 > > ?3 ? ? ? ? 1 > 10 ? ?1004 ? ?11/16/1997 ? ?01/04/1999 ? ? ? 10 ? ? ? ?2 ? ? ? 0 ? ? ? 0 > > 2 ? ? ? ? 0 > 11 ? ?9999 ? ?01/02/2003 ? ?04/04/2006 ? ? ? ?1 ? ? ? ?0 ? ? ? 0 ? ? ? 1 > > 1 ? ? ? ? 0 > 12 ? ?9999 ? ?04/05/2006 ? ?07/07/2009 ? ? ? ?2 ? ? ? ?0 ? ? ? 1 ? ? ? 1 > > 2 ? ? ? ? 0 > 13 ? ?9999 ? ?07/08/2009 ? ? ? ? ? ? . ? ? ? ?3 ? ? ? ?0 ? ? ? 0 ? ? ? 1 > > 1 ? ? ? ? 0 > > =========================================================> > Paul Miller > Paul Miller <pjmiller_57 at yahoo.com> > > > -------------------------------------------------------------------- > E-Mail: (Ted Harding) <ted.harding at wlandres.net> > Fax-to-email: +44 (0)870 094 0861 > Date: 20-Apr-11 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Time: 19:59:21 > ------------------------------ XFMail ------------------------------ > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Ista Zahn Graduate student University of Rochester Department of Clinical and Social Psychology http://yourpsyche.org
Hi Paul, On Fri, Apr 22, 2011 at 10:21 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:> > Hello Everyone, > > It seems to me that Bert?s assertion about my question is not entirely accurate. > > In my question, I wrote: > > I'm hearing in some places that R may not be able to accomplish all of the data manipulation tasks that SAS can. In others, I'm hearing that R can do pretty much any data manipulation that SAS can but the way in which it does so is likely to be quite different. > > It's not clear to me how this represents a "wrong question," in the sense that it implies that R should solve the problem in the same way that SAS does. > > I really liked Ista?s solution, but now I?m wondering if it?s an attempt to mimic SAS and not a solution that?s natural to R. > > Is there another solution that is more natural in R? And if so, what would it look like?R is flexible, which is one of the things I like most about it. That flexibility means you can do almost any given task in more than one way. This can sometimes lead to the question "which is the best way", which may be interesting but in my experience is almost always a waste of time (this statement is ment to apply to data manipulation/analysis, not nesessarily to package development). Of course you don't want to make things more complicated than they have to be, and you of course want to get the right result in the end. But my view is that when manipulating/analyzing data one should strive for a "good enough" solution and not worry too much about finding the "best" one. So while I'm sure someone can improve on my solution, it works, is relatively easy to follow, and is not overly verbose. Good enough I say. I should also note that I don't understand SAS speak at all. I did NOT translate your SAS code (except for the HARRT part, where a direct translation seemed like the easies thing to do) -- in fact the first thing I did was copy your email to a text editor and delete all the SAS stuff. The way I arrived at the solution was just by looking at the output you wanted and producing it in R in the way that seemed most natural to me. Would someone else do it differently? Yes. Could someone else do it better? Yes. But is it worth the time to try and find a better way? I doubt it. Best, Ista> > Thanks, > > Paul-- Ista Zahn Graduate student University of Rochester Department of Clinical and Social Psychology http://yourpsyche.org
Gabor Grothendieck
2011-Apr-22 15:27 UTC
[R] Can R replicate this data manipulation in SAS?
Since this involves time series within each id group I thought it would be interesting to see if this could be formulated using zoo series. The approach is to read it in, and convert it to a long form by just stacking the start and stop times in a data frame and converting that to zoo using the concatenation of date and id as the time index. It makes use of the fact that - zoo can use non-standard "times" (here the concatenation of date and id) - read.zoo can split and aggregate and - the resulting object can be directly cumsum'd using cumsum.zoo . Finally we convert it back to the required form. This was mostly done by looking at the output rather than trying to follow the SAS code so some differences are possible. It does seem to give the same output in this case. Yet another approach might be to use model.matrix to create the NRTI/NNRTI/PI columns: model.matrix(~ drug - 1, long2) * long2$change but below we stick with zoo. Some of the code here could be transplanted into Ista's solution (which uses the reshape and plyr packages) to get a combination approach. Lines <- "1004 NRTI 07/24/95 01/05/99 1004 NRTI 11/20/95 12/10/95 1004 NRTI 01/10/96 01/05/99 1004 PI 05/09/96 11/16/97 1004 NRTI 06/01/96 02/01/97 1004 NRTI 07/01/96 03/01/97 9999 PI 01/02/03 . 9999 NNRTI 04/05/06 07/08/09" library(zoo) # need na.fill from development version of zoo source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo") # read in data and reshape to long form DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".", col.names = c("id", "drug", "date", "date"), check.names = FALSE) long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1)) # convert to zoo. date.id, the concatenation of date & id, becomes index. # Then take cumulative sums of columns and append regimen. long2 <- with(long, data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change)) z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum) z <- cumsum(na.fill(z, fill = 0)) z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along) # convert z to data frame for final processing id <- sub(".* ", "", time(z)) start_date = as.Date(sub(" .*", "", time(z))) d2 <- data.frame(id, start_date, coredata(z)) # from each group of rows in same id form the desired columns. # f processes one such group. f <- function(x) with(x, data.frame( id = head(id, -1), start_date = head(start_date, -1), stop_date = start_date[-1], head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI, HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) | (NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) | (NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1))) result <- do.call(rbind, by(d2, id, f)) rownames(result) <- NULL result The end result is this:> resultid start_date stop_date regimen NRTI NNRTI PI all HAART 1 1004 1995-07-24 1995-11-20 1 1 0 0 1 0 2 1004 1995-11-20 1995-12-10 2 2 0 0 2 0 3 1004 1995-12-10 1996-01-10 3 1 0 0 1 0 4 1004 1996-01-10 1996-05-09 4 2 0 0 2 0 5 1004 1996-05-09 1996-06-01 5 2 0 1 3 1 6 1004 1996-06-01 1996-07-01 6 3 0 1 4 1 7 1004 1996-07-01 1997-02-01 7 4 0 1 5 1 8 1004 1997-02-01 1997-03-01 8 3 0 1 4 1 9 1004 1997-03-01 1997-11-16 9 2 0 1 3 1 10 1004 1997-11-16 1999-01-05 10 2 0 0 2 0 11 9999 2003-01-02 2006-04-05 1 0 0 1 1 0 12 9999 2006-04-05 2009-07-08 2 0 1 1 2 0 13 9999 2009-07-08 <NA> 3 0 0 1 1 0 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com