Justin Finkelstein
2011-Apr-01 11:18 UTC
[Xapian-discuss] Xapian-discuss Digest, Vol 83, Issue 1
I think this is a shining example of how well Xapian works with large document collections. I was just discussing this with my colleagues here, and one issue that came up is that we'd love Xapian to become a lot more popular, but we have found the documentation a bit difficult to get into, and the API likewise. So I was wondering: do you have any thoughts on improving this, and would you like some help? I use Xapian a fair bit (mostly on www.reportbuyer.com), together with a new wrapper for our CMS, and have a bit of spare time. I'd be happy to write up examples of how to use some of the bindings, particularly PHP, as that's my area.

> Message: 1
> Date: Thu, 31 Mar 2011 11:55:32 -0700
> From: Kevin Duraj <kevinduraj at gmail.com>
> Subject: [Xapian-discuss] Xapian Index: 607GB = 219 million of unique
> documents
> To: xapian-discuss at lists.xapian.org
> Message-ID:
> <AANLkTiku6tA06=s9hmX7nTcBHWSDfxdDgnHJuLUKhRBN at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> It took approximately five days, having single process using one core
> CPU and 6GB of memory to build this giant 607GB single Xapian index,
> containing 219 million of unique documents (web sites). So far I did
> not found any other implementation that would enable me to build such
> a single index containing over 200 million documents, while testing
> Lucene, Solr, MySQL, Hadoop and Oracle. Probably that would be the
> real reason why Xapian was not approved last year, for Google's Summer
> of Code. Xapian is the type of open source that they don't want you to
> know about.
>
> Following index can be search from: http://myhealthcare.com/
>
> total 607G
> -rw-r--r-- 1 kevin kevin   28 2011-03-31 06:09 iamchert
> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:50 position.baseA
> -rw-r--r-- 1 kevin kevin 622K 2011-03-31 06:09 position.baseB
> -rw-r--r-- 1 kevin kevin 311G 2011-03-31 06:09 position.DB
> -rw-r--r-- 1 kevin kevin   14 2011-03-30 17:19 postlist.baseA
> -rw-r--r-- 1 kevin kevin 139K 2011-03-31 00:49 postlist.baseB
> -rw-r--r-- 1 kevin kevin  70G 2011-03-31 00:49 postlist.DB
> -rw-r--r-- 1 kevin kevin   14 2011-03-31 00:49 record.baseA
> -rw-r--r-- 1 kevin kevin 261K 2011-03-31 01:24 record.baseB
> -rw-r--r-- 1 kevin kevin 131G 2011-03-31 01:24 record.DB
> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:24 termlist.baseA
> -rw-r--r-- 1 kevin kevin 192K 2011-03-31 01:50 termlist.baseB
> -rw-r--r-- 1 kevin kevin  96G 2011-03-31 01:50 termlist.DB
>
> $ delve .
> number of documents = 219344757
> average document length = 28255.9
> document length lower bound = 1
> document length upper bound = 173153
> highest document id ever used = 219344757
>
> Cheers,
> Kevin Duraj
> http://myhealthcare.com
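
For anyone who wants to sanity-check the figures in the quoted listing, here is a rough sketch in Python. It only re-does the arithmetic on the numbers printed above and does not touch Xapian itself. It assumes the `G` suffixes from `ls` are binary gigabytes (1024^3 bytes) and that the printed sizes are rounded, so the results are approximate. The names `TABLE_SIZES_GIB` and `summarize` are made up for illustration.

```python
# Figures copied from the quoted `ls` listing and `delve` output.
# Assumption: "G" means GiB (1024**3 bytes) and the sizes are rounded.
TABLE_SIZES_GIB = {
    "position.DB": 311,
    "postlist.DB": 70,
    "record.DB": 131,
    "termlist.DB": 96,
}
REPORTED_TOTAL_GIB = 607   # "total 607G" line of the listing
DOC_COUNT = 219_344_757    # "number of documents" reported by delve

GIB = 1024 ** 3


def summarize():
    """Return (sum of the .DB sizes in GiB, bytes per document, share per table)."""
    table_sum = sum(TABLE_SIZES_GIB.values())
    bytes_per_doc = REPORTED_TOTAL_GIB * GIB / DOC_COUNT
    shares = {name: size / table_sum for name, size in TABLE_SIZES_GIB.items()}
    return table_sum, bytes_per_doc, shares


if __name__ == "__main__":
    table_sum, bytes_per_doc, shares = summarize()
    print(f"sum of .DB files: {table_sum} GiB (listing reports {REPORTED_TOTAL_GIB}G)")
    print(f"index bytes per document: about {bytes_per_doc:.0f}")
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name:12s} {share:6.1%}")
```

The four `.DB` sizes sum to within a gigabyte of the reported total, which is what you would expect from rounded values, and the positional table accounts for roughly half of the index.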