Hi,

Being in the process of translating some of my SAS programs to R, I encountered one difficulty. I have a solution, but it is not elegant (and not pleasant to implement).

I have a large dataset with many variables needed to identify the origin of a sample, many to describe sample characteristics, others to describe site characteristics.

I want only a (shorter) list of sites and their characteristics.

If "origin", "ship_cat", "ship_nb", "trip" and "set" are needed to identify a site, in SAS you'd sort on those variables, then read the data with:

data sites;
  set alldata;
  by origin ship_cat ship_nb trip set;
  if first.set;
  keep list-of-variables-detailing-sites;
run;

In R I did this with the Lag function of Hmisc, and the original data set also needs to be sorted first:

oL <- Lag(origin)
scL <- Lag(ship_cat)
snL <- Lag(ship_nb)
tL <- Lag(trip)
sL <- Lag(set)
same <- origin==oL & ship_cat==scL & ship_nb==snL & trip==tL & set==sL
sites <- subset(alldata, !same,
                select=c(list-of-variables-detailing-sites))

Could I do better than this?

Thanks in advance,

Denis Chabot
Hi Denis,

Maybe unique() can pick out the unique entries from your data set without the need for sorting.

Cheers
Petr

On 13 Jan 2005 at 11:52, Denis Chabot wrote:

> Could I do better than this?

Petr Pikal
petr.pikal at precheza.cz
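The unique() suggestion might look something like the sketch below. It assumes a toy data frame called alldata; the values, and the columns depth (standing in for a site characteristic) and weight (standing in for a sample characteristic), are invented for illustration and are not from the original data.

```r
# Toy data: three sites, with several samples at some sites (made-up values).
alldata <- data.frame(
  origin   = c("A", "A", "A", "B", "B"),
  ship_cat = c(1, 1, 1, 2, 2),
  ship_nb  = c(10, 10, 10, 20, 20),
  trip     = c(1, 1, 1, 1, 1),
  set      = c(1, 1, 2, 1, 1),
  depth    = c(50, 50, 75, 120, 120),    # hypothetical site characteristic
  weight   = c(1.2, 3.4, 2.2, 0.9, 1.1)  # hypothetical sample characteristic
)

# Keep only the identifying and site-level columns, then drop repeated rows.
# No prior sorting is needed.
site_vars <- c("origin", "ship_cat", "ship_nb", "trip", "set", "depth")
sites <- unique(alldata[, site_vars])
print(sites)
```

This only collapses to one row per site if every retained column is constant within a site; a sample-level column such as weight left in site_vars would make the rows differ and defeat the purpose.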
I want to thank Petr Pikal, Robert Balshaw and Na Li for suggesting the use of "unique" or "!duplicated" on a subset of my data where unwanted variables have been removed. This worked perfectly.

Denis Chabot
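For the "!duplicated" variant, a minimal sketch follows, again on an invented toy data frame (alldata, with depth as a hypothetical site characteristic). Here the duplicate check is applied to the identifying columns only, which is closer in spirit to the SAS "if first.set" idiom: the first row encountered for each site is kept, even when a site-level value was recorded inconsistently.

```r
# Toy data (made-up values); note depth differs between the first two rows
# although they belong to the same site.
alldata <- data.frame(
  origin   = c("A", "A", "A", "B", "B"),
  ship_cat = c(1, 1, 1, 2, 2),
  ship_nb  = c(10, 10, 10, 20, 20),
  trip     = c(1, 1, 1, 1, 1),
  set      = c(1, 1, 2, 1, 1),
  depth    = c(50, 51, 75, 120, 120)  # hypothetical site characteristic
)

key_vars <- c("origin", "ship_cat", "ship_nb", "trip", "set")

# TRUE for the first occurrence of each key combination, FALSE for repeats.
first_of_site <- !duplicated(alldata[, key_vars])

# Keep the first row per site, with the keys and the site-level columns.
sites <- alldata[first_of_site, c(key_vars, "depth")]
print(sites)
```

Unlike the SAS BY-group approach, this does not require the data to be sorted; "first" simply means first in the current row order.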