Jesper Krogh
2008-Mar-24 06:07 UTC
[Xapian-discuss] Feature request: Lighten pressure on backup
Hi. This is a small feature request for Xapian. Currently I have a Xapian database with >5M records; the files fill around 124GB in the Xapian directory, with a few "quite large" files:

  # du -sh *
  0       flintlock
  4.0K    iamflint
  1000K   position.baseA
  63G     position.DB
  716K    postlist.baseA
  624K    postlist.baseB
  45G     postlist.DB
  8.0K    record.baseA
  385M    record.DB
  240K    termlist.baseA
  15G     termlist.DB
  12K     value.baseB
  696M    value.DB

(And it is my impression that I have a quite small record.DB file.)

This layout gives some "challenges" to backup systems, since the daily incremental runs now basically have to back up the complete set => 124GB, even if only a single new document has been merged.

The suggestion would be to split the files into several smaller files. The idea comes from PostgreSQL's filesystem layout, which has a (probably historic) maximum file size of 2GB, but it helps the backup significantly. I know that the algorithms for searching the binary trees would probably become a bit more complex, but it could mean that changes only touch a subset of the files, letting the backup proceed more easily.

Another solution could be to let Xapian query several databases and "merge" the result. Then I could make a new database each day and merge once a week (or another time pattern that would fit the purpose).

Other suggestions are welcome.

Thanks.
Jesper

-- 
Jesper
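The backup problem described above can be illustrated with a small sketch: with one file per table, merging even a single document rewrites blocks in every large *.DB file, so a file-granularity incremental backup degenerates into a near-full copy. The sizes below come from the `du` listing; the helper function is hypothetical, standing in for any backup tool that copies whole changed files:

```python
# Approximate sizes (GB) of the large table files from the du listing.
# A file-level incremental backup must copy every file whose contents
# changed since the last run, however small the change inside it was.
table_sizes_gb = {
    "position.DB": 63.0,
    "postlist.DB": 45.0,
    "termlist.DB": 15.0,
    "record.DB": 0.385,
    "value.DB": 0.696,
}

def incremental_backup_size(changed_files):
    """Total data a file-granularity incremental backup must copy."""
    return sum(table_sizes_gb[f] for f in changed_files)

# Merging one new document touches every table, so every big file is
# "changed" and the incremental run copies essentially everything:
size = incremental_backup_size(table_sizes_gb)
print(f"{size:.1f} GB")  # -> 124.1 GB, effectively a full backup
```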
James Aylett
2008-Mar-24 12:34 UTC
[Xapian-discuss] Feature request: Lighten pressure on backup
On Mon, Mar 24, 2008 at 07:07:38AM +0100, Jesper Krogh wrote:

> Another solution could be to let Xapian query several databases and
> "merge" the result. Then I could make a new database each day and merge
> once a week (or another time pattern that would fit the purpose).

Jesper - Xapian can query multiple databases. You have to manage yourself which database you write into, but a database per day or similar would allow this. (You could perhaps merge fully shortly before you would do a level 0 backup anyway.)

If you're doing this kind of index-to-new-then-merge strategy (which some people use for the different challenge of live indexing with high search load), then the xapian-compact(1) command will probably be helpful to you.

Note that if you ascribe external meaning to Xapian document ids (for instance referencing them in a relational database), you may need to change things a little (such as by bringing external ids into Xapian and storing them in the document data, i.e. reversing the dependency) because of the way multiple database support is implemented.

You may want to look at:

<http://xapian.org/docs/admin_notes.html#backup-strategies>
<http://xapian.org/docs/admin_notes.html#merging-databases>

for some other notes that may be of use here.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                   xapian.org
  james at tartarus.org                              uncertaintydivision.org
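The database-per-day bookkeeping described above can be sketched as pure path-and-schedule logic. The paths, the `/srv/index` root, and the Sunday compaction rule are illustrative assumptions, not from the thread; the actual Xapian calls (opening each path and combining them with `Database::add_database()`, then merging with xapian-compact) are noted only in comments, since they depend on the bindings in use:

```python
import datetime
import glob
import os

def writable_db_path(root, day):
    """Each day's new documents go into their own small database."""
    return os.path.join(root, day.strftime("db-%Y%m%d"))

def search_db_paths(root):
    """Paths to open for searching: the merged database plus every
    daily one. With the Xapian bindings you would open each path and
    combine them via Database::add_database(); the combined object is
    then searched as if it were one database."""
    return sorted(glob.glob(os.path.join(root, "db-*")) +
                  glob.glob(os.path.join(root, "merged")))

def should_compact(day):
    """Merge the daily databases weekly (e.g. with xapian-compact),
    shortly before the level 0 backup, as suggested above."""
    return day.weekday() == 6  # Sunday

print(writable_db_path("/srv/index", datetime.date(2008, 3, 24)))
# -> /srv/index/db-20080324
```

Note that with multiple databases the document ids seen by a searcher are interleaved across the subdatabases, which is why the post above suggests storing your external ids in the document data rather than relying on Xapian docids.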
Olly Betts
2008-Mar-31 04:09 UTC
[Xapian-discuss] Feature request: Lighten pressure on backup
On Mon, Mar 24, 2008 at 07:07:38AM +0100, Jesper Krogh wrote:

> The suggestion would be to split the files into several smaller files.
> I know that the algorithms for searching the binary trees would
> probably become a bit more complex, but it could mean that changes only
> touch a subset of the files, letting the backup proceed more easily.

This idea seems problematic. We'd either need to keep a lot more files open (and file handles are a limited resource, though the limit is reasonable on most modern OSes), or manage opening and closing them, which would incur system call overheads and might cause undesirable cache flushing behaviour.

And for a system which updates old records, it doesn't even relieve the backup system much - updating a single document (or term, for the postlist table) in a chunk of the table means that whole chunk needs to be backed up. It's much better for a single document update, but does progressively less well for 2, 3, 4, ... updates, unless you only ever add new documents.

I think a better way to ease the backup pain would be to build upon the database replication functionality which should be in 1.1.0 (unless a major issue is found which we can't address in time). This would allow a truly incremental backup - you'd save away a file which describes the changes since the last backup and which can be replayed to update the previous version of the database fairly efficiently. The incremental file should be proportional to the size of the changes.

Cheers,
Olly
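The argument above, that chunked files stop helping once more than a handful of documents change, can be made concrete with a back-of-the-envelope simulation. The table size, the 2GB chunk size (PostgreSQL's historical segment size), and the uniformly random update positions are all illustrative assumptions:

```python
import random

def dirty_chunks(positions, chunk_size):
    """Chunks that must be re-copied after updates at the given byte
    positions: touching a single byte dirties its whole chunk."""
    return {p // chunk_size for p in positions}

# Hypothetical numbers: a 64 GB table split into 2 GB chunks.
GB = 1 << 30
table_size, chunk_size = 64 * GB, 2 * GB
n_chunks = table_size // chunk_size  # 32

# As the number of scattered updates grows, the count of dirty chunks
# climbs quickly toward n_chunks, i.e. toward a full backup again:
random.seed(0)
for k in (1, 8, 32, 128):
    updates = [random.randrange(table_size) for _ in range(k)]
    touched = len(dirty_chunks(updates, chunk_size))
    print(f"{k:3d} random updates dirty {touched:2d} of {n_chunks} chunks")
```

A single update dirties exactly one chunk, which is where the splitting idea looks best; the replication-based changeset described above instead scales with the size of the changes themselves, independent of file layout.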