Eric Parusel
2004-Oct-15 04:42 UTC
[Xapian-discuss] Suitability of Xapian for my application?
Hello,

I'm currently using PostgreSQL to store keywords for documents in an indexed table, one row per keyword per document. I'm also using a Perl document-importing script to extract keywords from documents as they arrive and store them (no positional data) in PostgreSQL. The two components (import and PostgreSQL) are on different servers.

My problem is that certain databases have keyword tables with 30 million rows or so. A standard index on the varchar column for 30 million rows is one very large, inefficient index. Table columns: keyword (varchar, avg length 8 chars) and idnum (int4).

I would want to feed Xapian just a list of keywords, no positional data at this time.

How efficient would Xapian be if I converted my keyword search over to it?

What's important to me, in no particular order:
1) Import speeds when the tables grow (avg # of keywords per document: 150 approx)
2) Searching speed (I don't think this will be a problem from what I've heard)
3) keywords "database" size -- any rough estimates for what I'm working with?
4) Stability -- it won't corrupt, or crap out and die on me, will it? :)
5) Backups -- Is there a backup dump utility of some sort? Can I take backups of the live system? Can I use filesystem snapshots, then back up the xapian db file snapshot?

Anything else I should be concerned about?

As you can see, I have a lot of questions since I'm quite new to Xapian. Hopefully my questions are not out of line :)

Thanks for any help you can offer,
Eric
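The figures in this message give a rough idea of corpus size. The following is a minimal sketch of that arithmetic, assuming the table holds exactly one row per keyword per document and that the quoted average of about 150 keywords per document holds across the corpus. The function name is hypothetical and only illustrates the estimate.

```python
def estimate_document_count(total_rows: int, keywords_per_doc: float) -> int:
    """Estimate how many documents a keyword table covers.

    Assumes one row per keyword per document, so the row count divided
    by the average number of keywords per document approximates the
    number of documents.
    """
    if keywords_per_doc <= 0:
        raise ValueError("keywords_per_doc must be positive")
    return round(total_rows / keywords_per_doc)


if __name__ == "__main__":
    # Figures quoted in the message: ~30 million rows, ~150 keywords per document.
    print(estimate_document_count(30_000_000, 150))
```

With the quoted figures this gives a corpus on the order of 200K documents, which is the scale any indexing-time or index-size estimate should be measured against.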
Olly Betts
2004-Oct-15 06:05 UTC
[Xapian-discuss] Suitability of Xapian for my application?
On Thu, Oct 14, 2004 at 08:43:33PM -0700, Eric Parusel wrote:
> I would want to feed Xapian just a list of keywords, no positional data
> at this time.
>
> How efficient would Xapian be if I converted my keyword search over to it?

Approximately infinitely better than your current scheme, I suspect!

A friend had implemented a search in a similar way to you (except with MySQL, I think). I built a Xapian version from a SQL dump and the speed-up was startling. That was searching around 150K documents.

> What's important to me, in no particular order:
> 1) Import speeds when the tables grow (avg # of keywords per document:
> 150 approx)

So if there are about 150 keywords per document and 30 million or so rows, then the corpus is of the order of 200K documents?

It's hard to say how fast a system will be without a reference point. Indexing speed depends a lot on the hardware. CPU speed isn't too important. You want lots of RAM and fast disks.

The gmane index has an average doc length of 186 terms. It takes about 15 minutes to index 200K documents from scratch. That machine has 3GB of RAM and SATA disks.

> 2) Searching speed (I don't think this will be a problem from what I've
> heard)

Should be fractions of a second for that size of index.

> 3) keywords "database" size -- any rough estimates for what I'm working
> with?

I'd guess something like 500MB for 200K documents. There are plans in the pipeline to improve the packing and compression (which should improve both index and search speed too).

> 4) Stability -- it won't corrupt, or crap out and die on me, will it? :)

I'd hope not. We try hard to make releases stable, and there's an extensive automated test suite to assist this aim. We also indicate in the release notes when major code reworking has taken place.

But as the licence says, there's no warranty. If that bothers you (or your boss!) commercial support is available.

> 5) Backups -- Is there a backup dump utility of some sort?

There's dbtools in CVS which allows you to dump and reload databases as XML. But unless you want to process the dumped data, it's probably not the right approach. It's a lot slower to dump the contents of a database than to just copy the files comprising it.

> Can I take backups of the live system?

If you can pause updates during the backup you can. There's currently no support within Xapian for backing up while updates are happening.

> Can I use filesystem snapshots, then back up the xapian db file
> snapshot?

That's a good way to do it. Make sure that there are no updates happening and snapshot the filesystem. Then you can restart updates and back up from the snapshot to tape at your leisure.

Alternatively, if you keep the documents and can build your Xapian database in ~15 minutes, you might decide you can live without the Xapian backup (especially if it takes more than 15 minutes to restore from tape!). Of course, this decision also hinges on how critical search is to your application.

> Anything else I should be concerned about?

Nothing comes to mind.

Cheers,
Olly
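The "just copy the files" approach described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a Xapian tool: the function name and the paths are hypothetical, and it assumes the caller has already paused all updates to the database, since the copy is only consistent if nothing writes to the directory while it runs.

```python
import shutil
from pathlib import Path


def backup_database_dir(db_dir: str, backup_dir: str) -> list[str]:
    """Copy every file in a database directory to a new backup directory.

    Assumption: all updates to the database are paused for the duration
    of the copy. Returns the sorted names of the copied entries.
    """
    src = Path(db_dir)
    dst = Path(backup_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    if dst.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"{dst} already exists")
    shutil.copytree(src, dst)
    return sorted(p.name for p in dst.iterdir())


# Hypothetical usage, with updates paused by the caller:
#   backup_database_dir("/var/lib/search/db", "/backup/search-db-2004-10-15")
```

The same function could be pointed at a filesystem snapshot instead of the live directory, in which case updates only need to be paused while the snapshot is taken.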
Penz, Bernhard
2004-Oct-15 08:48 UTC
[Xapian-discuss] Suitability of Xapian for my application?
Hi all,

> > Can I take backups of the live system?
>
> If you can pause updates during the backup you can. There's
> currently no support within Xapian for backing up while
> updates are happening.

Follow-up question on this: by pausing updates, do you mean that Xapian must not perform a flush during the backup? I can continue adding documents as long as I prevent it from flushing the updated index to disk, right?

Regards,
Bernhard