thr3ads.net - Xapian discuss - [Xapian-discuss] Size of the index [Nov 2008]


Justine Demeyer

2008-Nov-24 14:47 UTC

[Xapian-discuss] Size of the index

Hi,

I have a question about the size of the Xapian index.

I indexed a set of 200,000 documents with a total size of about 1 GB, and the
resulting index is larger than 3 GB! What could explain this
difference?

Thanks for your help.

Henry

2008-Nov-25 07:52 UTC


[Xapian-discuss] Size of the index

Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
> I have a question about the size of the Xapian index.
>
> I indexed a set of 200,000 documents with a total size of about 1 GB, and the
> resulting index is larger than 3 GB! What could explain this
> difference?

You'll find this with all indexing systems, to some degree. The index
is almost always larger than the raw text; how much larger depends on
how you've structured the index and its terms, whether you're removing
stop words, and whether you've compacted the DB.
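
As a rough illustration of why an index can outgrow the raw text, here is a toy positional inverted index in plain C++. It is only a sketch and not Xapian's actual on-disk format; `Posting`, `PositionalIndex`, `add_document` and `approx_index_bytes` are names made up for this example. Every term occurrence costs one entry holding a document id and a position, so a short word can take more room in the index than it did in the source text.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One entry per term occurrence: which document, and where in it.
struct Posting {
    unsigned doc_id;
    unsigned position;
};

// Toy positional inverted index: term -> list of postings (hypothetical layout).
using PositionalIndex = std::map<std::string, std::vector<Posting>>;

// Add one document, splitting on whitespace only (no stemming, no stop list).
void add_document(PositionalIndex& index, unsigned doc_id, const std::string& text) {
    assert(doc_id != 0 && "document ids start at 1 in this sketch");
    std::istringstream in(text);
    std::string term;
    unsigned position = 0;
    while (in >> term) {
        // Positions are 1-based and count terms, not bytes.
        index[term].push_back(Posting{doc_id, ++position});
    }
}

// Rough uncompressed payload: each distinct term's bytes once,
// plus one Posting per occurrence.
std::size_t approx_index_bytes(const PositionalIndex& index) {
    std::size_t bytes = 0;
    for (const auto& entry : index) {
        bytes += entry.first.size();
        bytes += entry.second.size() * sizeof(Posting);
    }
    return bytes;
}
```

Indexing the two documents "a b a" and "b c" this way yields three distinct terms and five postings. Real engines compress their posting lists, so the overhead is much smaller in practice, but the principle is the same.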

If you post more detail about your index then that will help to  
pinpoint why your index is so large.

Cheers
Henry

Robert Young

2008-Nov-25 14:37 UTC


[Xapian-discuss] Size of the index

Oops, xapian-discuss doesn't seem to set reply-to.

Stop words are words that appear in such a high proportion of the documents
in your corpus that they can be safely ignored: words like 'the', 'and',
'a', etc.
Removing them can improve the precision of your queries and the performance
of both querying and indexing, and it reduces the size of your index, at the
potential expense of recall.
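
A minimal sketch of the idea in plain C++, assuming a tiny hypothetical stop list (`kStopWords`) and whitespace tokenization. The helper names are invented for illustration and none of this is Xapian API; it only shows how many term occurrences a stop list can take out before anything reaches the index.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stop list for illustration; real lists are much longer.
static const std::set<std::string> kStopWords = {"the", "and", "a", "of", "to", "in"};

// Split on whitespace and lowercase each token (no stemming, no punctuation handling).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        tokens.push_back(word);
    }
    return tokens;
}

// Drop every token that appears in the stop list.
std::vector<std::string> remove_stop_words(const std::vector<std::string>& tokens) {
    std::vector<std::string> kept;
    for (const std::string& t : tokens) {
        if (kStopWords.count(t) == 0) kept.push_back(t);
    }
    return kept;
}

// Fraction of term occurrences removed by the stop list (0.0 to 1.0).
double fraction_removed(const std::string& text) {
    const std::vector<std::string> all = tokenize(text);
    assert(!all.empty() && "text must contain at least one token");
    const std::vector<std::string> kept = remove_stop_words(all);
    return 1.0 - static_cast<double>(kept.size()) / static_cast<double>(all.size());
}
```

For "The cat and the dog", three of the five tokens are stop words, so `fraction_removed` reports about 0.6. In Xapian itself a stopper can be attached to the `TermGenerator` with `set_stopper()`; check the documentation for your version to see exactly how stopped words are treated.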

Cheers
Rob

On Tue, Nov 25, 2008 at 2:23 PM, Justine Demeyer
<justine.demeyer at gmail.com> wrote:
>
> Ok, thanks!!
>
> But what is the purpose of the stop words??
>
>
> 2008/11/25 Robert Young <rob at roryoung.co.uk>
>
>> As Henry alluded to earlier, you could potentially reduce the size of your
>> index by removing stop words.
>>
>> Cheers
>> Rob
>>
>>
>> On Tue, Nov 25, 2008 at 10:32 AM, Justine Demeyer <
>> justine.demeyer at gmail.com> wrote:
>>
>>> Here is the indexing code:
>>>
>>> void Index(char* ind, char* directory)
>>> {
>>>     try
>>>     {
>>>         timeval tim;
>>>         double t1, t2, dif;
>>>
>>>         string index(ind);
>>>
>>>         // Start time of the operation
>>>         gettimeofday(&tim, NULL);
>>>         t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>>>
>>>         // Create or open the index
>>>         Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
>>>         Xapian::TermGenerator indexer;
>>>         Xapian::Stem stemmer("english");
>>>         indexer.set_stemmer(stemmer);
>>>
>>>         struct dirent *lecture;
>>>         DIR *rep;
>>>
>>>         rep = opendir(directory);
>>>         if (!rep) // readdir() must not be called on a NULL handle
>>>         {
>>>             cout << "cannot open " << directory << endl;
>>>             return;
>>>         }
>>>         while ((lecture = readdir(rep)))
>>>         {
>>>             char* name = lecture->d_name;
>>>             std::string name2(name);
>>>
>>>             string path = directory + name2;
>>>
>>>             ifstream fichier(path.c_str(), ios::in);
>>>
>>>             if (fichier) // this test fails if the file is not open
>>>             {
>>>                 string ligne; // holds each line read
>>>                 string contenu;
>>>
>>>                 // this loop stops as soon as a read error occurs
>>>                 while (std::getline(fichier, ligne))
>>>                 {
>>>                     contenu = contenu + ligne + "\n";
>>>                 }
>>>
>>>                 // Indexing
>>>                 Xapian::Document doc;
>>>                 doc.set_data(contenu);
>>>
>>>                 indexer.set_document(doc);
>>>                 indexer.index_text(contenu);
>>>
>>>                 db.add_document(doc);
>>>                 cout << "add " << path.c_str() << endl;
>>>             }
>>>         }
>>>         // Update
>>>         cout << "Optimizing" << endl;
>>>         db.flush();
>>>         closedir(rep);
>>>
>>>         // End time of the operation
>>>         gettimeofday(&tim, NULL);
>>>         t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>>>
>>>         // Compute the duration of the operation
>>>         dif = t2 - t1;
>>>         Calculate(dif);
>>>     }
>>>     catch (const Xapian::Error &e)
>>>     {
>>>         cout << e.get_description() << endl;
>>>     }
>>> }
>>>
>>> Thanks for helping me
>>>
>>>
>>> 2008/11/25 Henry <henka at cityweb.co.za>
>>>
>>> > Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
>>> > > I have a question about the size of the Xapian index.
>>> > >
>>> > > I indexed a set of 200,000 documents with a total size of about 1 GB, and the
>>> > > resulting index is larger than 3 GB! What could explain this
>>> > > difference?
>>> >
>>> > You'll find this with all indexing systems, to some degree. The index
>>> > is almost always larger than the raw text; how much larger depends on
>>> > how you've structured the index and its terms, whether you're removing
>>> > stop words, and whether you've compacted the DB.
>>> >
>>> > If you post more detail about your index then that will help to
>>> > pinpoint why your index is so large.
>>> >
>>> > Cheers
>>> > Henry
>>> >
>>> >
>>>
>>
>>
>

Olly Betts

2008-Dec-01 09:56 UTC


[Xapian-discuss] Size of the index

On Mon, Nov 24, 2008 at 03:47:27PM +0100, Justine Demeyer wrote:
> I indexed a set of 200,000 documents with a total size of about 1 GB, and the
> resulting index is larger than 3 GB! What could explain this
> difference?

I've added a new FAQ entry about database size:

http://trac.xapian.org/wiki/FAQ/DatabaseSize

Hope that helps.

Cheers,
    Olly
