Kevin Duraj
2011-Mar-31 18:55 UTC
[Xapian-discuss] Xapian Index: 607GB = 219 million of unique documents
It took approximately five days, using a single process on one CPU core and 6GB of memory, to build this giant 607GB single Xapian index containing 219 million unique documents (web sites).

So far I have not found any other implementation that would let me build a single index of over 200 million documents; I tested Lucene, Solr, MySQL, Hadoop and Oracle. That is probably the real reason why Xapian was not approved for Google's Summer of Code last year. Xapian is the type of open source that they don't want you to know about.

The index can be searched at: http://myhealthcare.com/

total 607G
-rw-r--r-- 1 kevin kevin   28 2011-03-31 06:09 iamchert
-rw-r--r-- 1 kevin kevin   14 2011-03-31 01:50 position.baseA
-rw-r--r-- 1 kevin kevin 622K 2011-03-31 06:09 position.baseB
-rw-r--r-- 1 kevin kevin 311G 2011-03-31 06:09 position.DB
-rw-r--r-- 1 kevin kevin   14 2011-03-30 17:19 postlist.baseA
-rw-r--r-- 1 kevin kevin 139K 2011-03-31 00:49 postlist.baseB
-rw-r--r-- 1 kevin kevin  70G 2011-03-31 00:49 postlist.DB
-rw-r--r-- 1 kevin kevin   14 2011-03-31 00:49 record.baseA
-rw-r--r-- 1 kevin kevin 261K 2011-03-31 01:24 record.baseB
-rw-r--r-- 1 kevin kevin 131G 2011-03-31 01:24 record.DB
-rw-r--r-- 1 kevin kevin   14 2011-03-31 01:24 termlist.baseA
-rw-r--r-- 1 kevin kevin 192K 2011-03-31 01:50 termlist.baseB
-rw-r--r-- 1 kevin kevin  96G 2011-03-31 01:50 termlist.DB

$ delve .
number of documents = 219344757
average document length = 28255.9
document length lower bound = 1
document length upper bound = 173153
highest document id ever used = 219344757

Cheers,
Kevin Duraj
http://myhealthcare.com
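
For anyone who wants to put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It only does arithmetic on the numbers quoted in this post. The names (TABLE_SIZES_G, summarize) are made up for illustration, "G" is assumed to mean 2**30 bytes as in human-readable ls output, and "approximately five days" is taken as exactly five days, so the results are estimates, not measurements.

```python
# Figures copied from the post above. Sizes are the rounded values shown
# in the directory listing, so everything derived from them is approximate.
TABLE_SIZES_G = {"position": 311, "postlist": 70, "record": 131, "termlist": 96}
TOTAL_G = 607            # "total 607G" from the listing
DOC_COUNT = 219_344_757  # "number of documents" from delve
BUILD_DAYS = 5           # assumption: "approximately five days" taken as 5

BYTES_PER_G = 2 ** 30    # assumption: "G" means GiB


def summarize():
    """Derive a few rough ratios from the quoted figures."""
    table_sum = sum(TABLE_SIZES_G.values())
    return {
        # Sum of the four .DB tables, in G (differs slightly from the
        # reported total because each size is rounded).
        "table_sum_g": table_sum,
        # Fraction of the table total taken by each table.
        "share": {name: size / table_sum for name, size in TABLE_SIZES_G.items()},
        # Average on-disk index bytes per document.
        "bytes_per_doc": TOTAL_G * BYTES_PER_G / DOC_COUNT,
        # Average indexing rate over the whole build.
        "docs_per_second": DOC_COUNT / (BUILD_DAYS * 24 * 3600),
    }


if __name__ == "__main__":
    stats = summarize()
    print(f"sum of tables: {stats['table_sum_g']}G")
    for name, frac in stats["share"].items():
        print(f"  {name:9s} {frac:6.1%}")
    print(f"index bytes per document: {stats['bytes_per_doc']:.0f}")
    print(f"documents per second: {stats['docs_per_second']:.0f}")
```

On these figures the positional table is roughly half of the index, the index costs a little under 3KB of disk per document, and the build averaged around 500 documents per second.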