Farrel Buchinsky
2006-Dec-31 07:17 UTC
[R] Does SQL group by have a heavy duty equivalent in R
I have hundreds of humans who have undergone SNP genotyping at hundreds of loci. Some have even undergone the procedure twice or thrice (kind of an internal control). So obviously I need to find those replications, and confirm that the results are the same. If there is discordance then I need to address it. I tried to use the aggregate function nr.attempts <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) This was simply to figure out how many times the same piece of information had been obtained. I ran out of patience. It took beyond forever and tapply did not perform much better. The reshape package did not help - it implied one was out of luck if the data was not numeric. All of my data is character or factor. Instead I used RODBC sqlSave(channel,RawSeq) to push the table into a Microsoft Access database Then a sql query, courtesy of the Microsoft Access Query Wizard a la design mode. SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count( RawSeq.rownames) AS CountOfrownames FROM RawSeq WHERE (((RawSeq.GENOTYPE_ID)<>"")) GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID ORDER BY Count(RawSeq.rownames) DESC; This way I could easily use the minimum and maximum values to see if they were discordant. Microsoft Access handled it with aplomb. I plan to use RODBC to bring the result of the SQL query back into R. This is the first time I have seen Microsoft Access outpace R. Is my observation correct or am I missing something. I would much rather perform all data manipulation and analyses in R. -- Farrel Buchinsky [[alternative HTML version deleted]]
hadley wickham
2006-Dec-31 15:58 UTC
[R] Does SQL group by have a heavy duty equivalent in R
> nr.attempts > <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) > This was simply to figure out how many times the same piece of information > had been obtained. I ran out of patience. It took beyond forever and tapply > did not perform much better. The reshape package did not help - it implied > one was out of luck if the data was not numeric. All of my data is character > or factor.The reshape package will work if all your data is numeric, or all of it is character - it doesn't work with a mix. I will try and make this more clear in the documentation. However, depending on the size and structure of your data it may not be any faster than tapply or aggregate. Hadley
Charles C. Berry
2007-Jan-01 02:02 UTC
[R] Does SQL group by have a heavy duty equivalent in R
On Sun, 31 Dec 2006, Farrel Buchinsky wrote:> I have hundreds of humans who have undergone SNP genotyping at hundreds of > loci. Some have even undergone the procedure twice or thrice (kind of an > internal control). > > So obviously I need to find those replications, and confirm that the results > are the same. If there is discordance then I need to address it.Why not use duplicated() ? For a data.frame with 200 rows of which about 50 are duplicates and 201 columns finding the (non) duplicates takes little time on my year old AMD 64 running Windows XP:> my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T))) > snp.dat <- lapply(1:200,function(x) 0:1 ) > snp.frame <- as.data.frame(do.call(cbind,snp.dat)) > my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,]) > system.time( table(duplicated(my.dat)) )[1] 0.03 0.00 0.03 NA NA>Finding the non-duplicated rows for which there is at least one replication:> system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% names(which(table(my.dat$ID)>1)) ) ))[1] 0.05 0.00 0.05 NA NA> > > I tried to use the aggregate function > > nr.attempts > <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) > This was simply to figure out how many times the same piece of information > had been obtained. I ran out of patience. It took beyond forever and tapply > did not perform much better. The reshape package did not help - it implied > one was out of luck if the data was not numeric. All of my data is character > or factor. > > Instead I used RODBC > > sqlSave(channel,RawSeq) > to push the table into a Microsoft Access database > Then a sql query, courtesy of the Microsoft Access Query Wizard a la design > mode. > > SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS > MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count( > RawSeq.rownames) AS CountOfrownames > FROM RawSeq > WHERE (((RawSeq.GENOTYPE_ID)<>"")) > GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID > ORDER BY Count(RawSeq.rownames) DESC; > > This way I could easily use the minimum and maximum values to see if they > were discordant. > Microsoft Access handled it with aplomb. I plan to use RODBC to bring the > result of the SQL query back into R. > > This is the first time I have seen Microsoft Access outpace R. > Is my observation correct or am I missing something. I would much rather > perform all data manipulation and analyses in R. > > > > -- > Farrel Buchinsky > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717
Charles C. Berry
2007-Jan-01 03:46 UTC
[R] Does SQL group by have a heavy duty equivalent in R
On Sun, 31 Dec 2006, Charles C. Berry wrote:> On Sun, 31 Dec 2006, Farrel Buchinsky wrote: > >> I have hundreds of humans who have undergone SNP genotyping at hundreds of >> loci. Some have even undergone the procedure twice or thrice (kind of an >> internal control). >> >> So obviously I need to find those replications, and confirm that the >> results >> are the same. If there is discordance then I need to address it. > > Why not use duplicated() ?More specifically: unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] ) gives a list of those IDs for which the SNPs in all replicates of an ID are not the same.> > For a data.frame with 200 rows of which about 50 are duplicates and 201 > columns finding the (non) duplicates takes little time on my year old AMD 64 > running Windows XP: > >> my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T))) >> snp.dat <- lapply(1:200,function(x) 0:1 ) >> snp.frame <- as.data.frame(do.call(cbind,snp.dat)) >> my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,]) >> system.time( table(duplicated(my.dat)) ) > [1] 0.03 0.00 0.03 NA NA >> > > Finding the non-duplicated rows for which there is at least one replication: > >> system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% >> names(which(table(my.dat$ID)>1)) ) )) > [1] 0.05 0.00 0.05 NA NA > > >> >> >> I tried to use the aggregate function >> >> nr.attempts >> <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) >> This was simply to figure out how many times the same piece of information >> had been obtained. I ran out of patience. It took beyond forever and >> tapply >> did not perform much better. The reshape package did not help - it implied >> one was out of luck if the data was not numeric. All of my data is >> character >> or factor. >> >> Instead I used RODBC >> >> sqlSave(channel,RawSeq) >> to push the table into a Microsoft Access database >> Then a sql query, courtesy of the Microsoft Access Query Wizard a la >> design >> mode. >> >> SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS >> MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count( >> RawSeq.rownames) AS CountOfrownames >> FROM RawSeq >> WHERE (((RawSeq.GENOTYPE_ID)<>"")) >> GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID >> ORDER BY Count(RawSeq.rownames) DESC; >> >> This way I could easily use the minimum and maximum values to see if they >> were discordant. >> Microsoft Access handled it with aplomb. I plan to use RODBC to bring the >> result of the SQL query back into R. >> >> This is the first time I have seen Microsoft Access outpace R. >> Is my observation correct or am I missing something. I would much rather >> perform all data manipulation and analyses in R. >> >> >> >> -- >> Farrel Buchinsky >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at stat.math.ethz.ch mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > Charles C. Berry (858) 534-2098 > Dept of Family/Preventive Medicine > E mailto:cberry at tajo.ucsd.edu UC San Diego > http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717 > > > >Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717
On 12/30/06, Farrel Buchinsky <fjbuch at gmail.com> wrote:> Instead I used RODBC > > sqlSave(channel,RawSeq) > to push the table into a Microsoft Access database > Then a sql query, courtesy of the Microsoft Access Query Wizard a la design > mode.If SQL does prove to be part of your approach you might want to have a look at SQLite. Dirk introduced it to me back when we worked on Crusher. We use it for development and some production work. Main attributes: simple, fast and light weight. Everything you for R need here: http://cran.r-project.org/src/contrib/Descriptions/RSQLite.html SQLite's home page: http://www.sqlite.org/ And a Python connector: http://initd.org/tracker/pysqlite jab -- John Bollinger, CFA, CMT www.BollingerBands.com If you advance far enough, you arrive at the beginning.