thr3ads.net - R help - [R] Does SQL group by have a heavy duty equivalent in R [Dec 2006]

If this information is useful, please help other people find it:
Share via:

Farrel Buchinsky

2006-Dec-31 07:17 UTC

[R] Does SQL group by have a heavy duty equivalent in R

I have hundreds of humans who have undergone SNP genotyping at hundreds of
loci. Some have even undergone the procedure twice or thrice (kind of an
internal control).

So obviously I need to find those replications, and confirm that the results
are the same. If there is discordance then I need to address it.

I tried to use the aggregate function

nr.attempts
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
This was simply to figure out how many times the same piece of information
had been obtained. I ran out of patience. It took beyond forever and tapply
did not perform much better. The reshape package did not help - it implied
one was out of luck if the data was not numeric. All of my data is character
or factor.

Instead I used RODBC

sqlSave(channel,RawSeq)
to push the table into a Microsoft Access database
Then a sql query, courtesy of the Microsoft Access Query Wizard a la design
mode.

SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
RawSeq.rownames) AS CountOfrownames
FROM RawSeq
WHERE (((RawSeq.GENOTYPE_ID)<>""))
GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
ORDER BY Count(RawSeq.rownames) DESC;

This way I could easily use the minimum and maximum values to see if they
were discordant.
Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
result of the SQL query back into R.

This is the first time I have seen Microsoft Access outpace R.
Is my observation correct or am I missing something. I would much rather
perform all data manipulation and analyses in R.



-- 
Farrel Buchinsky

	[[alternative HTML version deleted]]

hadley wickham

2006-Dec-31 15:58 UTC

head link

[R] Does SQL group by have a heavy duty equivalent in R

> nr.attempts
>
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
> This was simply to figure out how many times the same piece of information
> had been obtained. I ran out of patience. It took beyond forever and tapply
> did not perform much better. The reshape package did not help - it implied
> one was out of luck if the data was not numeric. All of my data is
character
> or factor.
The reshape package will work if all your data is numeric, or all of
it is character - it doesn't work with a mix.  I will try and make
this more clear in the documentation.
However, depending on the size and structure of your data it may not
be any faster than tapply or aggregate.

Hadley

Charles C. Berry

2007-Jan-01 02:02 UTC

head link

[R] Does SQL group by have a heavy duty equivalent in R

On Sun, 31 Dec 2006, Farrel Buchinsky wrote:
> I have hundreds of humans who have undergone SNP genotyping at hundreds of
> loci. Some have even undergone the procedure twice or thrice (kind of an
> internal control).
>
> So obviously I need to find those replications, and confirm that the
results
> are the same. If there is discordance then I need to address it.
Why not use  duplicated() ?

For a data.frame with 200 rows of which about 50 are duplicates and 201 
columns finding the (non) duplicates takes little time on my year old AMD 
64 running Windows XP:
> my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
> snp.dat <- lapply(1:200,function(x) 0:1 )
> snp.frame <- as.data.frame(do.call(cbind,snp.dat))
> my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
> system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03   NA   NA>
Finding the non-duplicated rows for which there is at least one 
replication:
> system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in%
names(which(table(my.dat$ID)>1)) ) ))[1] 0.05 0.00 0.05   NA   NA

> 
>
> I tried to use the aggregate function
>
> nr.attempts
>
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
> This was simply to figure out how many times the same piece of information
> had been obtained. I ran out of patience. It took beyond forever and tapply
> did not perform much better. The reshape package did not help - it implied
> one was out of luck if the data was not numeric. All of my data is
character
> or factor.
>
> Instead I used RODBC
>
> sqlSave(channel,RawSeq)
> to push the table into a Microsoft Access database
> Then a sql query, courtesy of the Microsoft Access Query Wizard a la design
> mode.
>
> SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
> MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
> RawSeq.rownames) AS CountOfrownames
> FROM RawSeq
> WHERE (((RawSeq.GENOTYPE_ID)<>""))
> GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
> ORDER BY Count(RawSeq.rownames) DESC;
>
> This way I could easily use the minimum and maximum values to see if they
> were discordant.
> Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
> result of the SQL query back into R.
>
> This is the first time I have seen Microsoft Access outpace R.
> Is my observation correct or am I missing something. I would much rather
> perform all data manipulation and analyses in R.
>
>
>
> -- 
> Farrel Buchinsky
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717

Charles C. Berry

2007-Jan-01 03:46 UTC

head link

[R] Does SQL group by have a heavy duty equivalent in R

On Sun, 31 Dec 2006, Charles C. Berry wrote:
> On Sun, 31 Dec 2006, Farrel Buchinsky wrote:
>
>>  I have hundreds of humans who have undergone SNP genotyping at
hundreds of
>>  loci. Some have even undergone the procedure twice or thrice (kind of
an
>>  internal control).
>>
>>  So obviously I need to find those replications, and confirm that the
>>  results
>>  are the same. If there is discordance then I need to address it.
>
> Why not use  duplicated() ?
More specifically:

 	unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] )

gives a list of those IDs for which the SNPs in all replicates of an ID 
are not the same.

>
> For a data.frame with 200 rows of which about 50 are duplicates and 201 
> columns finding the (non) duplicates takes little time on my year old AMD
64
> running Windows XP:
>
>>  my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
>>  snp.dat <- lapply(1:200,function(x) 0:1 )
>>  snp.frame <- as.data.frame(do.call(cbind,snp.dat))
>>  my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
>>  system.time( table(duplicated(my.dat)) )
> [1] 0.03 0.00 0.03   NA   NA
>> 
>
> Finding the non-duplicated rows for which there is at least one
replication:
>
>>  system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in%
>>  names(which(table(my.dat$ID)>1)) ) ))
> [1] 0.05 0.00 0.05   NA   NA
>
>
>> 
>>
>>  I tried to use the aggregate function
>>
>>  nr.attempts
>> 
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
>>  This was simply to figure out how many times the same piece of
information
>>  had been obtained. I ran out of patience. It took beyond forever and
>>  tapply
>>  did not perform much better. The reshape package did not help - it
implied
>>  one was out of luck if the data was not numeric. All of my data is
>>  character
>>  or factor.
>>
>>  Instead I used RODBC
>>
>>  sqlSave(channel,RawSeq)
>>  to push the table into a Microsoft Access database
>>  Then a sql query, courtesy of the Microsoft Access Query Wizard a la
>>  design
>>  mode.
>>
>>  SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
>>  MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
>>  RawSeq.rownames) AS CountOfrownames
>>  FROM RawSeq
>>  WHERE (((RawSeq.GENOTYPE_ID)<>""))
>>  GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
>>  ORDER BY Count(RawSeq.rownames) DESC;
>>
>>  This way I could easily use the minimum and maximum values to see if
they
>>  were discordant.
>>  Microsoft Access handled it with aplomb. I plan to use RODBC to bring
the
>>  result of the SQL query back into R.
>>
>>  This is the first time I have seen Microsoft Access outpace R.
>>  Is my observation correct or am I missing something. I would much
rather
>>  perform all data manipulation and analyses in R.
>> 
>> 
>>
>>  --
>>  Farrel Buchinsky
>>
>>   [[alternative HTML version deleted]]
>>
>>  ______________________________________________
>>  R-help at stat.math.ethz.ch mailing list
>>  https://stat.ethz.ch/mailman/listinfo/r-help
>>  PLEASE do read the posting guide
>>  http://www.R-project.org/posting-guide.html
>>  and provide commented, minimal, self-contained, reproducible code.
>> 
>
> Charles C. Berry                        (858) 534-2098
>                                         Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu	         UC San Diego
> http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717
>
>
>
>
Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717

BBands

2007-Jan-01 18:41 UTC

head link

[R] Does SQL group by have a heavy duty equivalent in R

On 12/30/06, Farrel Buchinsky <fjbuch at gmail.com>
wrote:> Instead I used RODBC
>
> sqlSave(channel,RawSeq)
> to push the table into a Microsoft Access database
> Then a sql query, courtesy of the Microsoft Access Query Wizard a la design
> mode.
If SQL does prove to be part of your approach you might want to have a
look at SQLite. Dirk introduced it to me back when we worked on
Crusher. We use it for development and some production work. Main
attributes: simple, fast and light weight.

Everything you for R need here:
http://cran.r-project.org/src/contrib/Descriptions/RSQLite.html

SQLite's home page:
http://www.sqlite.org/

And a Python connector:
http://initd.org/tracker/pysqlite

    jab
-- 
John Bollinger, CFA, CMT
www.BollingerBands.com

If you advance far enough, you arrive at the beginning.

Reasonably Related Threads

Search for more seemingly similar threads

R help - Dec 2006 - Does SQL group by have a heavy duty equivalent in R

[R] Does SQL group by have a heavy duty equivalent in R

[R] Does SQL group by have a heavy duty equivalent in R

[R] Does SQL group by have a heavy duty equivalent in R

[R] Does SQL group by have a heavy duty equivalent in R

[R] Does SQL group by have a heavy duty equivalent in R

Reasonably Related Threads