I am trying to use RSQLite for storing data and I need to create indexes on
two variables in the table. It appears from searching the web that the CREATE
INDEX operation in SQLite is relatively slow for large files, and this has
been my experience as well.

The two index variables are crossed. One has about 350,000 levels [yes, it's
genetic association data]. The other will have about 4000 levels eventually,
but is up to about 100 now. When the data were entered they were already
ordered by this second index variable.

Creating the index took about an hour on the 100-level, presorted variable
and about 12 hours on the 350,000-level unsorted variable. I'm looking for
advice on how to reduce this. Specifically:

1/ would it be faster if the variable with more levels was the presorted one?
2/ would it be faster or slower if the index were created before adding all
   the data?
3/ are there any options that can be set to speed up the indexing?

The SQLite database will not be the primary archive for the data, so
optimizations that are risky in the case of power loss or hardware failure
are still acceptable. Since Bioconductor seems to use SQLite a lot I'm hoping
there is some simple solution.

-thomas

Thomas Lumley                  Assoc. Professor, Biostatistics
tlumley at u.washington.edu    University of Washington, Seattle
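For readers unfamiliar with the workflow being described, here is a minimal sketch using Python's built-in sqlite3 module (the thread itself concerns R's RSQLite, but the underlying SQLite operations are the same; the table and column names are invented for illustration): bulk-load the table first, then build the two single-column indexes afterwards.

```python
import sqlite3

# Hypothetical miniature version of the setup described above: a table
# keyed by a high-cardinality SNP id and a lower-cardinality sample id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE geno (snp TEXT, sample INTEGER, value REAL)")

# Bulk-load first (rows arrive pre-sorted by sample, as in the original data)...
rows = [(f"rs{i}", s, 0.0) for s in range(3) for i in range(100)]
con.executemany("INSERT INTO geno VALUES (?, ?, ?)", rows)
con.commit()

# ...then create the two indexes afterwards, as in the question.
con.execute("CREATE INDEX idx_snp ON geno(snp)")
con.execute("CREATE INDEX idx_sample ON geno(sample)")

names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(sorted(names))  # ['idx_sample', 'idx_snp']
```

At the scale in the question (350,000 x ~100 crossed levels) the CREATE INDEX steps are where the hours go.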
On 10/22/07, Thomas Lumley <tlumley at u.washington.edu> wrote:
> I am trying to use RSQLite for storing data and I need to create indexes on
> two variables in the table. It appears from searching the web that the CREATE
> INDEX operation in SQLite is relatively slow for large files, and this has
> been my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels [yes, it's
> genetic association data]. The other will have about 4000 levels eventually,
> but is up to about 100 now. When the data were entered they were already
> ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted variable
> and about 12 hours on the 350,000-level unsorted variable. I'm looking for
> advice on how to reduce this. Specifically

How big is your dataset? SQLite can be slow, but I didn't think it was that
slow.

> 1/ would it be faster if the variable with more levels was the presorted one?

I didn't think this would matter (and can imagine situations where it would
be worse). Is there a particular reason you think this might help?

> 2/ would it be faster or slower if the index were created before adding all
> the data?

It's generally much faster to create the index after loading all the data.

> 3/ are there any options that can be set to speed up the indexing?

Have you tried 'vacuum'ing your database prior to indexing?

Will you be indexing on both columns simultaneously? If so, you might try
creating a single index.

Hadley

--
http://had.co.nz/
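The single composite index Hadley suggests can be sketched as follows, again in Python's sqlite3 for illustration (names are invented): one index over both columns, created after the bulk load, which SQLite's planner can use for lookups constraining both columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE geno (snp TEXT, sample INTEGER, value REAL)")
con.executemany("INSERT INTO geno VALUES (?, ?, ?)",
                [(f"rs{i}", s, 0.0) for s in range(3) for i in range(100)])

# One composite index instead of two single-column ones; SQLite can use it
# both for lookups on (snp, sample) together and on snp alone.
con.execute("CREATE INDEX idx_snp_sample ON geno(snp, sample)")

# The query plan should show a search using the composite index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM geno "
    "WHERE snp = 'rs1' AND sample = 2").fetchall()
print(plan)
```

Note that a composite index on (snp, sample) does not help queries that filter on sample alone, so whether this replaces both indexes depends on the query mix.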
Thomas Lumley wrote on 10/22/2007 04:54 PM:
> I am trying to use RSQLite for storing data and I need to create indexes on
> two variables in the table. It appears from searching the web that the CREATE
> INDEX operation in SQLite is relatively slow for large files, and this has
> been my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels [yes, it's
> genetic association data]. The other will have about 4000 levels eventually,
> but is up to about 100 now. When the data were entered they were already
> ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted variable
> and about 12 hours on the 350,000-level unsorted variable. I'm looking for
> advice on how to reduce this. Specifically
> 1/ would it be faster if the variable with more levels was the presorted one?
> 2/ would it be faster or slower if the index were created before adding all
> the data?
> 3/ are there any options that can be set to speed up the indexing?
>
> The SQLite database will not be the primary archive for the data, so
> optimizations that are risky in the case of power loss or hardware failure
> are still acceptable. Since Bioconductor seems to use SQLite a lot I'm
> hoping there is some simple solution.

Well then, while some may think it overkill to run a MySQL database server
for one to a handful of clients, the benefits of using typed columns (rather
than strings, which might have something to do with the slow indexing), plus
all the other goodies, could save you the extra day or two spent speeding up
SQLite indexing.

A project here at Vanderbilt stores (mostly read-only) data in MySQL 5.0
tables in MyISAM format. The tables have 3 to 5 million records each. Index
creation for a varchar(9) column with 1,559,100 levels takes roughly 1.5
minutes, for example.

Jeff

--
http://biostat.mc.vanderbilt.edu/JeffreyHorner
On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
> I am trying to use RSQLite for storing data and I need to create indexes on
> two variables in the table. It appears from searching the web that the CREATE
> INDEX operation in SQLite is relatively slow for large files, and this has
> been my experience as well.
>
> The two index variables are crossed. One has about 350,000 levels [yes, it's
> genetic association data]. The other will have about 4000 levels eventually,
> but is up to about 100 now. When the data were entered they were already
> ordered by this second index variable.
>
> Creating the index took about an hour on the 100-level, presorted variable
> and about 12 hours on the 350,000-level unsorted variable. I'm looking for
> advice on how to reduce this. Specifically
> 1/ would it be faster if the variable with more levels was the presorted one?
> 2/ would it be faster or slower if the index were created before adding all
> the data?
> 3/ are there any options that can be set to speed up the indexing?
>
> The SQLite database will not be the primary archive for the data, so
> optimizations that are risky in the case of power loss or hardware failure
> are still acceptable. Since Bioconductor seems to use SQLite a lot I'm
> hoping there is some simple solution.

I have not used RSQLite, but I have some experience doing this sort of thing
for big SQLite databases using the command-line client.

Every database in SQLite has a number of parameters associated with it. You
want to make sure that cache_size is at _most_ 2000 (yes, I know this is
totally counterintuitive, as it tells SQLite to use as little memory as
possible). You also tell it to be non-synchronous. In SQLite the commands are

  sqlite> pragma default_cache_size = 2000;
  sqlite> pragma synchronous = off;

You can test the setting of these parameters by just doing

  sqlite> pragma default_cache_size;

As far as I remember, cache size can only be set when you create the
database.
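For comparison, here is how those two settings look from a client library, sketched in Python's sqlite3 (an assumption for illustration; the thread uses the sqlite command-line client and RSQLite). Note that in current SQLite, default_cache_size is deprecated in favour of cache_size, which is set per connection rather than stored in the database file:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Per-connection equivalents of the pragmas above; synchronous = OFF trades
# crash-safety for speed, which the original poster said was acceptable.
con.execute("PRAGMA cache_size = 2000")
con.execute("PRAGMA synchronous = OFF")

# Reading a pragma back confirms the setting took effect
# (synchronous reports 0 when OFF).
cache = con.execute("PRAGMA cache_size").fetchone()[0]
sync = con.execute("PRAGMA synchronous").fetchone()[0]
print(cache, sync)  # 2000 0
```

A positive cache_size value is a page count, so the effective memory use also depends on the database's page_size.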
I have no idea how RSQLite handles it.

When I asked about this problem on the sqlite mailing list, the SQLite
creator said that this was a "locality of reference problem" and that it was
being "worked on", and that I could search the archives for more info (which
did not help me back then). I don't know whether or not sorting helps.

Another thing to do is to check how much SQLite sits idle while doing I/O.
It is probably impossible to avoid some idleness with such a thing, but it
should of course be kept to a minimum.

It is true that some of the other databases are probably much faster at
creating indices. But in the post-index analysis, SQLite is a really fast
database, probably amongst the fastest there is. It does not do a good job
of converting your queries into smart queries, but if you are doing
something simple, it is blazingly fast with the right user options.

Kasper