I've looked over the docs on remote backends, the protocol, and a bit of the C++ for doing distributed and remote searches. I've got a couple of questions:

* The remote protocol is usable only as a Database, not as a WritableDatabase -- is this correct? So, if I don't want my application to have a copy of the database on the same machine, I'll need to write an indexer daemon on the remote machine and talk to it over TCP if I want to be able to index remotely?

* socketserver.cc and the corresponding xapian-tcpsrv look like they block, even for reads. As far as I know, Xapian currently supports "single writer, multiple readers" access to the database, which means the tcpserver could be doing more. Am I mistaken in thinking that a read will block another read with the tcp server?

I'm building an application that I'd love to have near-real-time indexing, e.g. when a user saves a document it's sent to Xapian. That's how it works now, but it's on such a small scale that issues like this don't matter. What's the easiest way to make this work?

Here's what I'm thinking: write a small xapian-daemon server in Python that listens on TCP and can index and search. Because Xapian can only do one write at a time (last I checked?), the server will keep a queue of index requests and apply them in order in a single thread to avoid blocking. Is this something that's useful, or is there a more xapiantic way to do this?

--Philip Neustrom
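For illustration, here is a minimal sketch (not from the thread) of the queue approach described above, using the Xapian Python bindings: one thread owns the WritableDatabase and drains a queue of index requests, so nothing else ever waits on Xapian's writer lock. The database path and the shape of the queued items are assumptions.

import queue
import threading
import xapian

index_queue = queue.Queue()

def writer_loop(db_path):
    # The single thread that owns the WritableDatabase; all writes are
    # serialised through the queue.
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    term_gen = xapian.TermGenerator()
    while True:
        text = index_queue.get()       # blocks until a request arrives
        doc = xapian.Document()
        doc.set_data(text)
        term_gen.set_document(doc)
        term_gen.index_text(text)
        db.add_document(doc)
        db.commit()                    # commit per document for simplicity
        index_queue.task_done()

threading.Thread(target=writer_loop, args=('/tmp/example-db',),
                 daemon=True).start()

# Whatever accepts documents over TCP then only ever does:
index_queue.put("text of the document the user just saved")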
Philip Neustrom wrote:

> Here's what I'm thinking: write a small xapian-daemon server in Python that listens on TCP and can index and search. Because Xapian can only do one write at a time (last I checked?), the server will keep a queue of index requests and apply them in order in a single thread to avoid blocking. Is this something that's useful, or is there a more xapiantic way to do this?

Here is a Twisted Perspective Broker server object that is xapwrap-specific but gives you the basic pattern for a multi-threaded reader daemon in Python. Just replace the xapwrap read-only indexes with Xapian databases, etc. xaql.query_xapwrap is a function that does the actual query, which you would also replace with your own. As for writing, a single thread and connection is all that's needed, and that part is obvious enough.

#! /usr/bin/python

from twisted.internet import threads
from twisted.spread import pb
from twisted.python import threadable
threadable.init()

from xapwrap.index import SmartIndex, SmartReadOnlyIndex
from xapwrap.document import Document, TextField, SortKey

import pool
import xaql


class Xapd(pb.Root):

    def __init__(self, path):
        self.readers = pool.Pool(pool.Constructor(SmartReadOnlyIndex, path))

    def remote_query(self, query):
        d = self.readers.get()
        r = threads.deferToThread(xaql.query_xapwrap, d, query)
        self.readers.put(d)
        return r


if __name__ == '__main__':
    from twisted.internet import reactor
    reactor.suggestThreadPoolSize(10)
    reactor.listenTCP(3333, pb.PBServerFactory(Xapd('/home/michel/tmp/index5')))
    reactor.run()
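For completeness, the client side of that Perspective Broker server might look roughly like the sketch below; the host, port and query string are assumptions, error handling is omitted, and whatever xaql.query_xapwrap returns would need to be PB-serialisable.

#! /usr/bin/python
# Rough sketch of a client for the Xapd server above.
from twisted.spread import pb
from twisted.internet import reactor

def show(results):
    print(results)
    reactor.stop()

factory = pb.PBClientFactory()
reactor.connectTCP('localhost', 3333, factory)
d = factory.getRootObject()
# remote_query on the server is reached as callRemote('query', ...)
d.addCallback(lambda root: root.callRemote('query', 'some search terms'))
d.addCallback(show)
reactor.run()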
Michel Pelletier: Your code will work as a Xapian search server that's searching a local Xapian database. This is fine, and I can easily do something like this (along with indexing), but here's where I'm stuck: how do I combine the results from the different servers into a unified result that makes sense?

My guess is to do something like this: have each server return an MSet to the client, have the client sort the MSets (by the returned search rank for each match) into a master MSet, and cut off at the number of desired results (say 11). The problem I run into when thinking about this approach is that it would be hard to get it to work with non-zero values for the "first" doccount argument of Enquire::get_mset, e.g. "next ten results" would be hard to find because results of differing relevance could arrive from different databases.

The reason I want to do this is so that I can spread my database across multiple machines without any issues. Any ideas?

--Philip Neustrom

On 3/26/06, Philip Neustrom <philipn@gmail.com> wrote:
> I've looked over the docs on remote backends, the protocol, and a bit of the C++ for doing distributed and remote searches. I've got a couple of questions:
>
> * The remote protocol is usable only as a Database, not as a WritableDatabase -- is this correct? So, if I don't want my application to have a copy of the database on the same machine, I'll need to write an indexer daemon on the remote machine and talk to it over TCP if I want to be able to index remotely?
>
> * socketserver.cc and the corresponding xapian-tcpsrv look like they block, even for reads. As far as I know, Xapian currently supports "single writer, multiple readers" access to the database, which means the tcpserver could be doing more. Am I mistaken in thinking that a read will block another read with the tcp server?
>
> I'm building an application that I'd love to have near-real-time indexing, e.g. when a user saves a document it's sent to Xapian. That's how it works now, but it's on such a small scale that issues like this don't matter. What's the easiest way to make this work?
>
> Here's what I'm thinking: write a small xapian-daemon server in Python that listens on TCP and can index and search. Because Xapian can only do one write at a time (last I checked?), the server will keep a queue of index requests and apply them in order in a single thread to avoid blocking. Is this something that's useful, or is there a more xapiantic way to do this?
>
> --Philip Neustrom
On Sun, Mar 26, 2006 at 09:42:17PM -0800, Philip Neustrom wrote:

> * The remote protocol is usable only as a Database, not as a WritableDatabase -- is this correct?

Yes. There's no inherent reason, but that's what's implemented at present.

> So, if I don't want my application to have a copy of the database on the same machine, I'll need to write an indexer daemon on the remote machine and talk to it over TCP if I want to be able to index remotely?

Or mount it via NFS or similar, perhaps.

> * socketserver.cc and the corresponding xapian-tcpsrv look like they block, even for reads.

xapian-tcpsrv forks itself for each connection, so the blocking is only within a single connection. So it can handle multiple concurrent sessions without them blocking each other.

> Here's what I'm thinking: write a small xapian-daemon server in Python that listens on TCP and can index and search. Because Xapian can only do one write at a time (last I checked?), the server will keep a queue of index requests and apply them in order in a single thread to avoid blocking. Is this something that's useful, or is there a more xapiantic way to do this?

That sounds a reasonable approach.

Cheers,
    Olly
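As a rough illustration of the read-only remote protocol described above, a client can treat a database served by xapian-tcpsrv as an ordinary Database. The host name, port, database path and query below are assumptions; remote_open is how newer Python bindings expose the remote backend, and the exact server options may differ by version (see xapian-tcpsrv --help).

# Sketch of searching a database exported read-only by xapian-tcpsrv,
# started with something like:  xapian-tcpsrv --port 33333 /srv/xapian/db
import xapian

db = xapian.remote_open('searchhost.example.com', 33333)

enquire = xapian.Enquire(db)
enquire.set_query(xapian.Query('example'))
for match in enquire.get_mset(0, 10):
    print(match.rank, match.percent, match.document.get_data())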
On Mon, Mar 27, 2006 at 08:57:45AM -0800, Philip Neustrom wrote:

> How do I combine the results from the different servers into a unified result that makes sense?

Just use Database::add_database to combine the remote Database objects for all the servers into a single Database object, and Xapian will take care of everything.

Cheers,
    Olly
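Sketched out with the Python bindings, that suggestion looks roughly like this; the shard host names, ports and query are assumptions.

# Combine several remote databases into one Database object and let Xapian
# merge and rank the results across all of them.
import xapian

db = xapian.Database()
db.add_database(xapian.remote_open('shard1.example.com', 33333))
db.add_database(xapian.remote_open('shard2.example.com', 33333))

enquire = xapian.Enquire(db)
enquire.set_query(xapian.Query('example'))
# "Next ten results" now just works: ranks 10-19 of the combined ranking.
for match in enquire.get_mset(10, 10):
    print(match.rank, match.document.get_data())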