Dear r-help readers, why is it so much slower to query an sqlite database using RSQlite ?from the outside? using param like statement <- "SELECT * FROM gene2refseq LEFT JOIN gene_info ON gene_info.GeneID = gene2refseq.GeneID WHERE gene2refseq.`RNA_nucleotide_accession.version` LIKE ?" db <- "gene_info.sqlite" conn <- DBI::dbConnect(RSQLite::SQLite(), db) x1 <- DBI::dbGetQuery(conn=conn, statement=statement, param=list(H?kan20210914$RNANucleotideAccession)) compared to querying ?from the inside? of sqlite, by writing your search terms as a table first, and then calling it statement <- "SELECT * FROM H LEFT JOIN gene2refseq R ON R.`RNA_nucleotide_accession.version` LIKE '%' || H.RNANucleotideAccession || '%' LEFT JOIN gene_info I ON I.GeneID = R.GeneID" DBI::dbWriteTable(conn, "H", H?kan20210914) x2 <- DBI::dbGetQuery(conn=conn, statement=statement) DBI::dbDisconnect(conn) On my system (E5-2603 v4), the first query took more than an hour, while the second took only a few minutes ... Do you guys know of any faster (but also nice) way to dig around in very large tsv files like https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz and https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz ? Best, Rasmus -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20211006/671429f1/attachment.sig>
Generally, such gene-related questions are better asked on Bioconductor than here. They also might know of more efficient, purpose built tools for your efforts there. No guarantees, of course, and you might get a helpful response here. But if not ... Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Wed, Oct 6, 2021 at 9:23 AM Rasmus Liland <jral at posteo.no> wrote:> Dear r-help readers, > > why is it so much slower to query an > sqlite database using RSQlite ?from the > outside? using param like > > statement <- > "SELECT * FROM gene2refseq > LEFT JOIN gene_info ON > gene_info.GeneID = gene2refseq.GeneID > WHERE gene2refseq.`RNA_nucleotide_accession.version` > LIKE ?" > db <- "gene_info.sqlite" > conn <- DBI::dbConnect(RSQLite::SQLite(), db) > x1 <- DBI::dbGetQuery(conn=conn, > statement=statement, > param=list(H?kan20210914$RNANucleotideAccession)) > > compared to querying ?from the inside? > of sqlite, by writing your search terms > as a table first, and then calling it > > statement <- > "SELECT * FROM H > LEFT JOIN gene2refseq R ON > R.`RNA_nucleotide_accession.version` > LIKE '%' || H.RNANucleotideAccession || '%' > LEFT JOIN gene_info I ON I.GeneID = R.GeneID" > DBI::dbWriteTable(conn, "H", H?kan20210914) > x2 <- DBI::dbGetQuery(conn=conn, statement=statement) > DBI::dbDisconnect(conn) > > On my system (E5-2603 v4), the first > query took more than an hour, while the > second took only a few minutes ... > > Do you guys know of any faster (but also > nice) way to dig around in very large > tsv files like > https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz > and > https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz > ? > > Best, > Rasmus > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
On Wed, 6 Oct 2021 16:23:15 +0000 Rasmus Liland <jral at posteo.no> wrote:> "SELECT * FROM gene2refseq > LEFT JOIN gene_info ON > gene_info.GeneID = gene2refseq.GeneID > WHERE gene2refseq.`RNA_nucleotide_accession.version` > LIKE ?"<...>> x1 <- DBI::dbGetQuery(conn=conn, > statement=statement, > param=list(H?kan20210914$RNANucleotideAccession))I think that the problem here is that you pass a vector as a bound parameter to LIKE, when parameter placeholders usually expect a scalar. DBI transparently handles this:>> The elements of the `params` argument do not need to be scalars, >> vectors of arbitrary length (including length 0) are supported. For >> queries, calling dbFetch() binding such parameters returns >> concatenated results, equivalent to binding and fetching for each >> set of values and connecting via rbind().I think this means that DBI runs a SELECT for each value in H?kan20210914$RNANucleotideAccession, which is understandably slower than a single query. Unfortunately, it's hard to pass vectors of values to queries with bound parameters; the SQL engines I know don't have a syntax for "WHERE param IN (:multi_placeholder:)". SQLite comes with carray [1], but I don't know whether it's exposed by RSQLite (could be hard to do in a pointer-safe way), and you're already aware of the traditional way of doing that: create a temporary table, populate it and JOIN with the rest of the query. -- Best regards, Ivan [1] https://www.sqlite.org/carray.html