On Fri, Mar 28, 2014 at 06:03:18PM +0100, Vishesh Handa
wrote:> Here are some large spots that I noticed (Chert) -
We really want to fix things in brass first. If they prove solid, we
can then backport changes to chert.
> 1. Every document has map<string, OmDocumentTerm> and OmDocumentTerm
> contains the same string again. This results in every term being
> stored in memory twice. Additionally multiple documents may have the
> same terms, and each of them would have their own copies to the
> string, even if the term is the same.
Good spot - it looks like OmDocumentTerm's copy of the term isn't really
used (just reported in the object description and used in an exception
message), so we can probably just drop that. I'll commit that change
when I've had a chance to test it fully.
> 2. Spelling db - It too allocates std::strings again
>
> 3. database mod_plist - ditto
>
> --
>
> The main way that I think this can be fixed is by introducing a new
> string class which uses reference counting. That way (1) will easily
> be solved.
Well, it's easier to address (1) by just eliminating the second string,
but at least with current GCC, std::string is reference counted, so
the two strings are already sharing in that case anyway.
> Regarding (2), (3), it probably makes sense to have a
> set<XapianString> in the Database class. When spellings, documents,
> position lists are loaded into memory, the string is fetched from that
> set, this way all strings will only have 1 copy of themselves.
I think we need better evidence this would actually help, as this is
quite a lot of work, and looking up each string in the set will add
some overhead.
AIUI, C++11 doesn't permit std::string to be referenced counted, so
GCC will need to change (I guess they're waiting for a suitable time
to make an ABI-incompatible change). If the new implementation uses
the Short String Optimisation (SSO), most common terms are probably
short enough to fit in the space the pointers would otherwise use (at
least for 64-bit platforms), so this probably isn't as wasteful as it
might seem.
I suspect strings aren't actually the biggest space waste when indexing
- for example, compressing the pending postlist changes when appending
documents feels like it would make more difference for typical use
cases:
http://trac.xapian.org/ticket/59
> I'm not too sure how this would relate to the public API though.
> Opinions?
I don't think we would want to inflict a home-made string class on API
users. Most terms will probably come from TermGenerator anyway.
> 4. When fetching the terms for a document the entire term list is
> loaded in one go. This causes a huge block on memory to be loaded.
> Depending on the number of terms in a document, it can get quite bad.
> We might want to do this
> in smaller chunks.
Are you talking about getting the compressed termlist data from the
B-tree in stages? How big is the whole thing in the cases you are
looking at?
Cheers,
Olly