Xapian Terms vs. Document Partition.

In December 2007, Diego Puppin from Google gave an interesting talk about a parallel architecture that distributes the index based on terms rather than documents.
Reference: http://youtube.com/watch?v=KpZpsu2wM1s

It describes an idea similar to the one we discussed seven months earlier, in May 2007, before Diego's presentation, in the following Xapian discussion threads.
Reference: http://lists.tartarus.org/pipermail/xapian-discuss/2007-May/003889.html

My index at http://myhealthcare.com is growing to 100 million documents, and I need to implement some parallel architecture because it takes too long to update the index and add new documents to it.

I would like to encourage the Xapian community again to start looking into distributing the index based on terms rather than documents. Making each server responsible for a set of terms rather than a set of documents would enable us to scale our search engine to Google's level.

Thank you,
Kevin Duraj
http://myhealthcare.com
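For readers unfamiliar with the two schemes named in the subject line, here is a minimal pure-Python sketch of the difference. It is a toy in-memory inverted index and does not use the Xapian API. All function names (`build_index`, `shard_for_term`, `partition_by_document`, `partition_by_term`, and the two search helpers) are hypothetical and exist only for illustration. It assumes integer document ids and whitespace tokenisation.

```python
import zlib
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def shard_for_term(term, n_shards):
    """Pick a shard for a term with a hash that is stable across runs."""
    return zlib.crc32(term.encode("utf-8")) % n_shards


def partition_by_document(docs, n_shards):
    """Each shard indexes a subset of the documents, with all their terms."""
    subsets = [{} for _ in range(n_shards)]
    for doc_id, text in docs.items():
        subsets[doc_id % n_shards][doc_id] = text  # assumes integer ids
    return [build_index(subset) for subset in subsets]


def partition_by_term(docs, n_shards):
    """Each shard holds the complete posting lists for a subset of terms."""
    shards = [{} for _ in range(n_shards)]
    for term, postings in build_index(docs).items():
        shards[shard_for_term(term, n_shards)][term] = postings
    return shards


def search_doc_partitioned(shards, terms):
    """AND query: every shard answers locally and the results are merged."""
    if not terms:
        return set(), 0
    hits = set()
    for index in shards:
        hits |= set.intersection(*(index.get(t, set()) for t in terms))
    return hits, len(shards)  # all shards were contacted


def search_term_partitioned(shards, terms):
    """AND query: only shards owning a query term are contacted, and their
    posting lists are intersected by the coordinator."""
    if not terms:
        return set(), 0
    owners = [shard_for_term(t, len(shards)) for t in terms]
    postings = [shards[s].get(t, set()) for s, t in zip(owners, terms)]
    return set.intersection(*postings), len(set(owners))
```

Both searches return the same documents. What changes is how many shards a query touches and where the posting lists are intersected. In this toy model, adding a document under term partitioning can touch as many shards as the document has distinct terms, while under document partitioning it touches exactly one.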
Alex Brasetvik
2008-May-08 15:28 UTC
[Xapian-discuss] Xapian Terms vs. Document Partition.
On Tue, 6 May 2008 16:48:01 -0700, "Kevin Duraj" <kevin.softdev at gmail.com> wrote:
> Xapian Terms vs. Document Partition.
>
> On December 2007, Diego Puppin from Google had interesting talk about
> parallel architecture distributing index based on terms rather than
> documents.
> Reference:
> http://youtube.com/watch?v=KpZpsu2wM1s

[snip]

> I would like again encourage Xapian community to
> start looking into distributing index based on terms rather than
> documents. To make each server be responsible for set of terms rather
> then set of documents would enable us to scale our search engine to
> Google's level.

If you watch the talk again and read their paper[1], you'll see that the gist of the talk is *not* about either document- or term-partitioning.

Also, in their paper, they state that ``Document partitioning is the strategy usually chosen by the most popular web search engines'', citing Page and Brin's paper on Google's architecture. You may want to read it.

[1] http://scholar.google.no/scholar?hl=en&lr=&cluster=10013139656811614516

-- 
Alex Brasetvik
"Kevin Duraj" wrote:> My index is growing to 100 million of documents at > http://myhealthcare.com and I need to implement some parallel > architecture, because it takes too long to update and add new > documents into index.Kevin, what sorts of timings, document update rates and what hardware are you running on? Scaling xapian isn't too hard provided that you get your hardware and system architecture right and 100m documents wouldn't concern me greatly if I were asked to implement it. Chris
On Wed, Jun 04, 2008 at 12:53:46AM -0700, Juan Gargiulo wrote:
> I am experiencing the problem described in ticket 185 (
> http://trac.xapian.org/ticket/185), and the suggested workaround is not
> working for me.

It might be better to keep all discussion of this issue in one place (i.e. on that ticket) to make it easier to track.

> I am running Apache 2.0.x, Python 2.5, mod_python 3.3.1, xapian-core 1.0.6,
> xapian-bindings 1.0.6. Everything under Mac OS 10.5 64 bits.
>
> I am using PythonInterpreter main_interpreter but I am still experiencing
> the hang.
>
> Any help is welcome.

Have you tried using a snapshot of Xapian SVN trunk (as suggested in comment 20)?

Cheers,
Olly