Kevin Duraj
2010-Dec-18 23:58 UTC
[Xapian-discuss] Xapian index size 475GB = 170 million documents (URLs)
Xapians,

I am maintaining two indexes for my search engines, each approximately the same size. I would like to share this with you, since many of you have never seen a Xapian index of this size. And of course you can search the indexes yourself at:

- http://myhealthcare.com/
- http://find1friend.com/

I need to add 2 x 100 million more documents to each index, and I hope it will fit on one 2TB hard disk. I will then soon single-handedly beat the largest Xapian implementation, BrightStation's Webtop search engine (archive.org snapshot), which offered a sub-second search over around 500 million web pages (around 1.5 terabytes of database files). Reference: http://xapian.org/history

One sample index size:

    total 475G
    -rw-r--r-- 1 kevin kevin   28 2010-12-18 15:25 iamchert
    -rw-r--r-- 1 kevin kevin   13 2010-12-18 12:19 position.baseA
    -rw-r--r-- 1 kevin kevin 3.8M 2010-12-18 15:25 position.baseB
    -rw-r--r-- 1 kevin kevin 240G 2010-12-18 15:25 position.DB
    -rw-r--r-- 1 kevin kevin   13 2010-12-18 04:31 postlist.baseA
    -rw-r--r-- 1 kevin kevin 923K 2010-12-18 11:36 postlist.baseB
    -rw-r--r-- 1 kevin kevin  58G 2010-12-18 11:36 postlist.DB
    -rw-r--r-- 1 kevin kevin   13 2010-12-18 11:36 record.baseA
    -rw-r--r-- 1 kevin kevin 1.6M 2010-12-18 12:03 record.baseB
    -rw-r--r-- 1 kevin kevin 102G 2010-12-18 12:02 record.DB
    -rw-r--r-- 1 kevin kevin   13 2010-12-18 12:03 termlist.baseA
    -rw-r--r-- 1 kevin kevin 1.2M 2010-12-18 12:19 termlist.baseB
    -rw-r--r-- 1 kevin kevin  76G 2010-12-18 12:18 termlist.DB

    $ delve .
    number of documents = 169346678
    average document length = 230970
    document length lower bound = 1
    document length upper bound = 3585385
    highest document id ever used = 169346678

Kevin Duraj
http://pacificair.com/
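As a rough check on whether the grown index would fit on a 2TB disk, here is a back-of-the-envelope sketch based on the figures in the listing above. It rests on several assumptions that are not stated in the post: that "475G" means GiB (as `ls -h` reports), that "2 x 100 million" means 200 million additional documents per index, that "2TB" means 2 x 10^12 bytes, and that index size grows linearly with document count, which a real Xapian database need not do. The function and variable names are illustrative only.

```python
GIB = 1024 ** 3  # assumption: "475G" from ls -h is in powers of 1024


def bytes_per_document(index_bytes: float, doc_count: int) -> float:
    """Average on-disk bytes consumed per indexed document."""
    return index_bytes / doc_count


def projected_index_bytes(index_bytes: float, doc_count: int, extra_docs: int) -> float:
    """Linear extrapolation of index size after adding extra_docs documents.

    Assumes the per-document cost stays constant, which is only an approximation.
    """
    return bytes_per_document(index_bytes, doc_count) * (doc_count + extra_docs)


# Figures taken from the post.
current_bytes = 475 * GIB
current_docs = 169_346_678

# Assumptions, not stated explicitly in the post.
extra_docs = 200_000_000   # reading "2 x 100 million" as 200 million more per index
disk_bytes = 2 * 10 ** 12  # a "2TB" disk in decimal bytes

per_doc = bytes_per_document(current_bytes, current_docs)
projected = projected_index_bytes(current_bytes, current_docs, extra_docs)
fits = projected <= disk_bytes

print(f"bytes per document: {per_doc:.0f}")
print(f"projected size:     {projected / GIB:.0f} GiB")
print(f"disk capacity:      {disk_bytes / GIB:.0f} GiB")
print(f"fits on one disk:   {fits}")
```

Under these assumptions one index of roughly 370 million documents comes to a little over 1 TiB, so a single index would fit on the disk with room to spare, while two indexes of that size together would not.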
Felix Antonius Wilhelm Ostmann
2010-Dec-20 10:31 UTC
[Xapian-discuss] Xapian index size 475GB = 170 million documents (URLs)
Can you give us more? I would like to see info about the CPU/RAM/HDD setup, query times (average/max), query counts (parallel/total), and anything else you can share :)

--
Kind regards

Felix Antonius Wilhelm Ostmann
-----------------------------------------------------------
Websuche Search Technology GmbH & Co. KG
Martinistraße 3, D-49080 Osnabrück, Germany
-----------------------------------------------------------
Tel.: +49 541 40666-0, Fax: +49 541 40666-22
Email: info at websuche.de, Web: www.websuche.de
-----------------------------------------------------------
AG Osnabrück - HRA 200252, VAT ID: DE814737310
-----------------------------------------------------------
General partner: Websuche Search Technology Verwaltungs GmbH
AG Osnabrück - HRB 200359, Managing director: Ansas Meyer
-----------------------------------------------------------