thr3ads.net - Xapian devel - [Xapian-devel] Reducing Xapian memory usage [Mar 2014]

If this information is useful, please help other people find it:
Share via:

Vishesh Handa

2014-Mar-28 17:03 UTC

[Xapian-devel] Reducing Xapian memory usage

Hey guys

I noticed xapian using a lot of memory while indexing [1] so I decided to look 
at the bottle necks and where this can be improved.

Here are some large spots that I noticed (Chert) -

1. Every document has map<string, OmDocumentTerm> and OmDocumentTerm
contains
the same string again. This results in every term being stored in memory 
twice. Additionally multiple documents may have the same terms, and each of 
them would have their own copies to the string, even if the term is the same.

2. Spelling db - It too allocates std::strings again

3. database mod_plist - ditto

4. When fetching the terms for a document the entire term list is loaded in 
one go. This causes a huge block on memory to be loaded. Depending on the 
number of terms in a document, it can get quite bad. We might want to do this 
in smaller chunks.

--

The main way that I think this can be fixed is by introducing a new string 
class which uses reference counting. That way (1) will easily be solved.

Regarding (2), (3), it probably makes sense to have a set<XapianString> in
the
Database class. When spellings, documents, position lists are loaded into 
memory, the string is fetched from that set, this way all strings will only 
have 1 copy of themselves.

I'm not too sure how this would relate to the public API though. Opinions?

-- 
Vishesh Handa

[1] http://lists.xapian.org/pipermail/xapian-discuss/2014-March/009079.html

Olly Betts

2014-Apr-09 11:38 UTC

head link

[Xapian-devel] Reducing Xapian memory usage

On Fri, Mar 28, 2014 at 06:03:18PM +0100, Vishesh Handa
wrote:> Here are some large spots that I noticed (Chert) -
We really want to fix things in brass first.  If they prove solid, we
can then backport changes to chert.
> 1. Every document has map<string, OmDocumentTerm> and OmDocumentTerm
> contains the same string again. This results in every term being
> stored in memory twice. Additionally multiple documents may have the
> same terms, and each of them would have their own copies to the
> string, even if the term is the same.
Good spot - it looks like OmDocumentTerm's copy of the term isn't really
used (just reported in the object description and used in an exception
message), so we can probably just drop that.  I'll commit that change
when I've had a chance to test it fully.
> 2. Spelling db - It too allocates std::strings again
> 
> 3. database mod_plist - ditto
> 
> --
> 
> The main way that I think this can be fixed is by introducing a new
> string class which uses reference counting. That way (1) will easily
> be solved.
Well, it's easier to address (1) by just eliminating the second string,
but at least with current GCC, std::string is reference counted, so
the two strings are already sharing in that case anyway.
> Regarding (2), (3), it probably makes sense to have a
> set<XapianString> in the Database class. When spellings, documents,
> position lists are loaded into memory, the string is fetched from that
> set, this way all strings will only have 1 copy of themselves.
I think we need better evidence this would actually help, as this is
quite a lot of work, and looking up each string in the set will add
some overhead.

AIUI, C++11 doesn't permit std::string to be referenced counted, so
GCC will need to change (I guess they're waiting for a suitable time
to make an ABI-incompatible change).  If the new implementation uses
the Short String Optimisation (SSO), most common terms are probably
short enough to fit in the space the pointers would otherwise use (at
least for 64-bit platforms), so this probably isn't as wasteful as it
might seem.

I suspect strings aren't actually the biggest space waste when indexing
- for example, compressing the pending postlist changes when appending
documents feels like it would make more difference for typical use
cases:

http://trac.xapian.org/ticket/59
> I'm not too sure how this would relate to the public API though.
> Opinions?
I don't think we would want to inflict a home-made string class on API
users.  Most terms will probably come from TermGenerator anyway.
> 4. When fetching the terms for a document the entire term list is
> loaded in one go. This causes a huge block on memory to be loaded.
> Depending on the number of terms in a document, it can get quite bad.
> We might want to do this 
> in smaller chunks.
Are you talking about getting the compressed termlist data from the
B-tree in stages?  How big is the whole thing in the cases you are
looking at?

Cheers,
    Olly

Olly Betts

2014-Apr-09 22:33 UTC

head link

[Xapian-devel] Reducing Xapian memory usage

On Wed, Apr 09, 2014 at 12:38:12PM +0100, Olly Betts
wrote:> On Fri, Mar 28, 2014 at 06:03:18PM +0100, Vishesh Handa wrote:
> > 1. Every document has map<string, OmDocumentTerm> and
OmDocumentTerm
> > contains the same string again. This results in every term being
> > stored in memory twice. Additionally multiple documents may have the
> > same terms, and each of them would have their own copies to the
> > string, even if the term is the same.
> 
> Good spot - it looks like OmDocumentTerm's copy of the term isn't
really
> used (just reported in the object description and used in an exception
> message), so we can probably just drop that.  I'll commit that change
> when I've had a chance to test it fully.
Now committed to trunk as r17946.  I'll make sure it's backported for
1.2.18.

Cheers,
    Olly

Maybe Matching Threads

Search for more seemingly similar threads

Xapian devel - Mar 2014 - Reducing Xapian memory usage

[Xapian-devel] Reducing Xapian memory usage

[Xapian-devel] Reducing Xapian memory usage

[Xapian-devel] Reducing Xapian memory usage

Maybe Matching Threads