Jesper Krogh
2008-Mar-24 06:07 UTC
[Xapian-discuss] Feature request: Lighten pressure on backup
Hi. This is a small feature request for Xapian. Currently I have a Xapian database with >5M records; the files fill around 124GB in the Xapian directory, with a few "quite large" files:

  # du -sh *
  0       flintlock
  4.0K    iamflint
  1000K   position.baseA
  63G     position.DB
  716K    postlist.baseA
  624K    postlist.baseB
  45G     postlist.DB
  8.0K    record.baseA
  385M    record.DB
  240K    termlist.baseA
  15G     termlist.DB
  12K     value.baseB
  696M    value.DB

(And it is my impression that I have a quite small record.DB file.)

This layout gives some "challenges" to backup systems, since the daily incremental runs now basically have to back up the complete set => 124GB, even if only a single new document has been merged.

The suggestion would be to split the files into several smaller files. The idea comes from PostgreSQL's filesystem layout, which has a (probably historic) maximum file size of 2GB, but it helps the backup significantly. I know that the algorithms for searching the binary trees would probably become a bit more complex, but it could mean that changes only touch a subset of the files, letting the backup proceed more easily.

Another solution could be to let Xapian query several databases and "merge" the result. Then I could make a new database each day and merge once a week (or another time pattern that would fit the purpose).

Other suggestions are welcome.

Thanks.
Jesper

-- 
Jesper
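The backup problem described above can be illustrated with a small sketch: with one file per table, merging even a single document rewrites blocks in every large *.DB file, so a file-granularity incremental backup degenerates into a near-full copy. The sizes below come from the `du` listing; the helper function is hypothetical, standing in for any backup tool that copies whole changed files:

```python
# Approximate sizes (GB) of the large table files from the du listing.
# A file-level incremental backup must copy every file whose contents
# changed since the last run, however small the change inside it was.
table_sizes_gb = {
    "position.DB": 63.0,
    "postlist.DB": 45.0,
    "termlist.DB": 15.0,
    "record.DB": 0.385,
    "value.DB": 0.696,
}

def incremental_backup_size(changed_files):
    """Total data a file-granularity incremental backup must copy."""
    return sum(table_sizes_gb[f] for f in changed_files)

# Merging one new document touches every table, so every big file is
# "changed" and the incremental run copies essentially everything:
size = incremental_backup_size(table_sizes_gb)
print(f"{size:.1f} GB")  # -> 124.1 GB, effectively a full backup
```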
James Aylett
2008-Mar-24 12:34 UTC
[Xapian-discuss] Feature request: Lighten pressure on backup
On Mon, Mar 24, 2008 at 07:07:38AM +0100, Jesper Krogh wrote:

> Another solution could be to let Xapian query several databases and
> "merge" the result. Then I could make a new database each day and merge
> once a week (or another time pattern that would fit the purpose).

Jesper - Xapian can query multiple databases. You have to manage yourself which database you write into, but a database per day or similar would allow this. (You could perhaps merge fully shortly before you would do a level 0 backup anyway.)

If you're doing this kind of index-to-new-then-merge strategy (which some people use for the different challenge of live indexing with high search load), then the xapian-compact(1) command will probably be helpful to you.

Note that if you ascribe external meaning to Xapian document ids (for instance referencing them in a relational database), you may need to change things a little (such as by bringing external ids into Xapian and storing them in the document data, i.e. reversing the dependency) because of the way multiple database support is implemented.

You may want to look at:

<http://xapian.org/docs/admin_notes.html#backup-strategies>
<http://xapian.org/docs/admin_notes.html#merging-databases>

for some other notes that may be of use here.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                   xapian.org
  james at tartarus.org                              uncertaintydivision.org
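The database-per-day bookkeeping described above can be sketched as pure path-and-schedule logic. The paths, the `/srv/index` root, and the Sunday compaction rule are illustrative assumptions, not from the thread; the actual Xapian calls (opening each path and combining them with `Database::add_database()`, then merging with xapian-compact) are noted only in comments, since they depend on the bindings in use:

```python
import datetime
import glob
import os

def writable_db_path(root, day):
    """Each day's new documents go into their own small database."""
    return os.path.join(root, day.strftime("db-%Y%m%d"))

def search_db_paths(root):
    """Paths to open for searching: the merged database plus every
    daily one. With the Xapian bindings you would open each path and
    combine them via Database::add_database(); the combined object is
    then searched as if it were one database."""
    return sorted(glob.glob(os.path.join(root, "db-*")) +
                  glob.glob(os.path.join(root, "merged")))

def should_compact(day):
    """Merge the daily databases weekly (e.g. with xapian-compact),
    shortly before the level 0 backup, as suggested above."""
    return day.weekday() == 6  # Sunday

print(writable_db_path("/srv/index", datetime.date(2008, 3, 24)))
# -> /srv/index/db-20080324
```

Note that with multiple databases the document ids seen by a searcher are interleaved across the subdatabases, which is why the post above suggests storing your external ids in the document data rather than relying on Xapian docids.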
Olly Betts
2008-Mar-31 04:09 UTC
[Xapian-discuss] Feature request: Lighten pressure on backup
On Mon, Mar 24, 2008 at 07:07:38AM +0100, Jesper Krogh wrote:

> The suggestion would be to split the files into several smaller files.
> I know that the algorithms for searching the binary trees would
> probably become a bit more complex, but it could mean that changes only
> touch a subset of the files, letting the backup proceed more easily.

This idea seems problematic. We'd either need to keep a lot more files open (and file handles are a limited resource, though the limit is reasonable on most modern OSes), or manage opening and closing them, which would incur system call overheads and might cause undesirable cache flushing behaviour.

And for a system which updates old records, it doesn't even relieve the backup system much - updating a single document (or term, for the postlist table) in a chunk of the table means that whole chunk needs to be backed up. It's much better for a single document update, but does progressively less well for 2, 3, 4, ... updates, unless you only ever add new documents.

I think a better way to ease the backup pain would be to build upon the database replication functionality which should be in 1.1.0 (unless a major issue is found which we can't address in time). This would allow a truly incremental backup - you'd save away a file which describes the changes since the last backup and which can be replayed to update the previous version of the database fairly efficiently. The incremental file should be proportional to the size of the changes.

Cheers,
Olly
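The argument above, that chunked files stop helping once more than a handful of documents change, can be made concrete with a back-of-the-envelope simulation. The table size, the 2GB chunk size (PostgreSQL's historical segment size), and the uniformly random update positions are all illustrative assumptions:

```python
import random

def dirty_chunks(positions, chunk_size):
    """Chunks that must be re-copied after updates at the given byte
    positions: touching a single byte dirties its whole chunk."""
    return {p // chunk_size for p in positions}

# Hypothetical numbers: a 64 GB table split into 2 GB chunks.
GB = 1 << 30
table_size, chunk_size = 64 * GB, 2 * GB
n_chunks = table_size // chunk_size  # 32

# As the number of scattered updates grows, the count of dirty chunks
# climbs quickly toward n_chunks, i.e. toward a full backup again:
random.seed(0)
for k in (1, 8, 32, 128):
    updates = [random.randrange(table_size) for _ in range(k)]
    touched = len(dirty_chunks(updates, chunk_size))
    print(f"{k:3d} random updates dirty {touched:2d} of {n_chunks} chunks")
```

A single update dirties exactly one chunk, which is where the splitting idea looks best; the replication-based changeset described above instead scales with the size of the changes themselves, independent of file layout.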