Hi,

I am evaluating Xapian for use in our product. I just downloaded the core and examples code from the website.

I'm puzzled about one thing, though: when I used the test program "simpleIndexer", I found that the index size is four times the size of the corpus. I indexed 4MB worth of text files and the index was 16MB, and even after compaction it still consumed 10MB. When I added an additional 4MB of text files, the original index went to 32MB.

The index size is four times the size of the corpus, which doesn't seem right. Am I doing something wrong?

Thanks,
John
On Thu, May 05, 2005 at 01:39:20PM -0400, John Paige wrote:
> Hi,
> I am evaluating to use xapian in our product. I just downloaded the
> core and examples code from the website.
> I'm puzzeled about one thing though, when I used the test program
> "simpleIndexer", I found out that the index size is four times the
> size of the corpus. I indexed 4MB worth of text files, and the index
> was 16MB to index, and even after compaction, it still consumed 10MB.
> when I added additional 4MB of text files, the original index went to 32MB.
>
> The index size is four times the size of the corpus, it doesn't seem
> right. Am I doing something wrong?

Most likely not - but tell us what you _expect_ the index size to be? Do you expect the index size to be _smaller_ than the corpus?

Cheers
Ralf Mattes

> Thanks,
> John
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss@lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
On Thu, May 05, 2005 at 01:39:20PM -0400, John Paige wrote:
> I am evaluating to use xapian in our product. I just downloaded the
> core and examples code from the website.
> I'm puzzeled about one thing though, when I used the test program
> "simpleIndexer", I found out that the index size is four times the
> size of the corpus.

I guess you mean "simpleindex" - that splits the input file into paragraphs, and indexes each paragraph by the terms in it, storing the whole paragraph as the document data.

Currently document data is stored uncompressed (I have patches to use zlib which I'll be integrating soon), so the size of an index built by simpleindex will inevitably be bigger than the text indexed, because it *contains* the entire text indexed in uncompressed form.

Typically the document data is used to store a URL or UID for a database, a document title, and a sample of text from the document.

> I indexed 4MB worth of text files, and the index was 16MB to index,
> and even after compaction, it still consumed 10MB. when I added
> additional 4MB of text files, the original index went to 32MB.

It does seem larger than I'd expect. There's scope for reducing the size of Xapian databases (this will improve in the coming months), but even so that sounds excessively large.

The output of "ls -l" on the index directory before and after compaction might be interesting. Can you post that?

> The index size is four times the size of the corpus, it doesn't seem
> right. Am I doing something wrong?

Using simpleindex, perhaps. It's really meant to show what the code for a Xapian indexer looks like without too much non-Xapian related complication.

Are you just experimenting, or trying to build an actual system?

Cheers,
Olly
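
To illustrate the point above about why an index that stores each paragraph as document data cannot be smaller than the text, here is a rough toy model in Python. It is not Xapian code and does not use the Xapian API: the function names (`split_paragraphs`, `build_toy_index`, `toy_sizes`), the blank-line paragraph rule, the recording of term positions, and the nominal 8 bytes per posting entry are all assumptions made for illustration only.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines (assumed rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def build_toy_index(text):
    """Build a toy paragraph-per-document index.

    Returns (stored_data, postings):
      stored_data maps docid -> the full paragraph, kept uncompressed
      postings    maps term  -> list of (docid, position) pairs
    """
    stored_data = {}
    postings = {}
    for docid, para in enumerate(split_paragraphs(text), start=1):
        # The whole paragraph is kept as the document data.
        stored_data[docid] = para
        # Each word becomes a term; positions are recorded as an assumption.
        for pos, term in enumerate(re.findall(r"\w+", para.lower()), start=1):
            postings.setdefault(term, []).append((docid, pos))
    return stored_data, postings


def toy_sizes(text):
    """Return (data_bytes, posting_bytes) for the toy index.

    posting_bytes counts each term once plus a nominal 8 bytes per
    posting entry; real on-disk formats differ, this is only a rough model.
    """
    stored_data, postings = build_toy_index(text)
    data_bytes = sum(len(p.encode("utf-8")) for p in stored_data.values())
    posting_bytes = sum(
        len(term.encode("utf-8")) + 8 * len(entries)
        for term, entries in postings.items()
    )
    return data_bytes, posting_bytes
```

Calling `toy_sizes` on a text file's contents returns the bytes held as document data and a rough estimate for the postings. In this model the document data alone is nearly as large as the input, so anything spent on postings pushes the total above the corpus size. Storing only a short sample or an identifier as document data, as suggested above, would shrink the first number.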