Hi,

I've been working on a xapian implementation for the last month or so and have implemented (well, hacked until it worked) QueryParser for Perl.

Could xapian have the ability to specify docids? My system - as I'm sure many others do - maintains its own ids for people, docs etc. For the moment I've opted to rebuild the index from scratch every day, rather than maintaining a docid => myid mapping in order to perform incremental nightly changes.

The cleanest method from outside of the API would be if replace_document accepted a non-existent (to xapian) docid, in which case it adds the document rather than excepting (i.e. SQL's "REPLACE" behaviour).

Having added wrappers for QueryParser I wonder whether it would be worthwhile revising Stopper. I can't think of a situation where a stopper would need to be more intelligent than containing a list of words to stop, so it seems a little pointless distributing a class in Xapian that doesn't do this.

e.g. the class I've submitted to Alex:

    class MyStopper : public Stopper {
      public:
        // A term is stopped only if it has been added and not since deleted.
        bool operator()(const string &term) {
            return terms.find(term) != terms.end() ? terms[term] : false;
        };
        void add(string term) { terms[term] = true; };
        void del(string term) { terms[term] = false; };
      private:
        map<string,bool> terms;
    };

Of course if I could wave a magic wand I would modify QueryParser's API anyway .... :-)

    class QueryParser {
      public:
        QueryParser();
        ~QueryParser();
        set_database(&Xapian::Database);
        set_default_op(Query::op);
        set_stemmer(&Xapian::Stemmer,bool);
        add_stopword(&string);
        add_prefix(&string,&string);
        parse(&string);
        <get methods for termlist, stoplist, unstem etc>
      private:
        ...
    };

All the best,
Tim.
On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
> I've been working on a xapian implementation for the last month or so and
> have implemented (well, hacked until it worked) QueryParser for Perl.

I see these have been added to the Search::Xapian module on CPAN. Thanks for your work.

> Could xapian have the ability to specify docids? My system - as I'm sure
> many others do - maintains its own ids for people, docs etc. For the moment
> I've opted to rebuild the index from scratch every day, rather than
> maintaining a docid => myid mapping in order to perform incremental nightly
> changes.

With Xapian as it currently stands, the way to do this is to specify a unique term, and store it in each document. The unique terms would consist of a prefix followed by your document identifiers. Traditionally, "Q" has been used for the prefix, but any prefix which will avoid collisions with other terms is acceptable.

Whenever a document is modified, you would first open the postlist for the term, which gives you a list of all documents containing the term, and delete these documents (hopefully, this list would be of length 0 or 1). Then, add the new document. There is a proposal to add a new API method to delete all documents containing a given term, which would ease the implementation of this scheme (I'm not sure of the status of this proposal).

This unique-term scheme is used by "scriptindex" - see the implementation of the "uniq" command. It can be more flexible and robust than using Xapian's document identifiers: in particular, it allows you to use any string as an identifier, rather than restricting you to 32-bit numbers. Also, if you combine databases together using Xapian's multidatabase facility, the Xapian docids will change (an interleaving scheme is used to disambiguate document identifiers), which could break software which relies on the document identifiers being tied more closely to the contents of the document.

Using Xapian's built-in identifiers should be more efficient, allowing documents to be referenced without having to perform a database lookup to determine the internal identifier first, but I'm not sure how much of a cost this actually incurs.

> The cleanest method from outside of the API would be if replace_document
> accepted a non-existent (to xapian) docid, in which case it adds the
> document rather than excepting (i.e. SQL's "REPLACE" behaviour).

This would be a perfectly reasonable extension. I haven't time right now to take a look at how feasible it would be, but I can't think of any likely problem. Could you file a bug report for this feature request, so that it doesn't get lost?

-- 
Richard
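As an illustration of the delete-then-add scheme described in the message above, here is a minimal C++ sketch. It deliberately does not use Xapian's API: ToyIndex, DocId and upsert are hypothetical names for a toy in-memory index, used only to show the logic of keying updates on a "Q"-prefixed unique term.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical toy index standing in for a writable database.
// None of these names are Xapian's API.
using DocId = std::uint32_t;

class ToyIndex {
  public:
    // Store a document (here just its set of terms) and return its new docid.
    DocId add_document(const std::set<std::string>& terms) {
        DocId id = next_id_++;  // ids start at 1, so 0 is never issued
        docs_[id] = terms;
        for (const auto& t : terms) postings_[t].insert(id);
        return id;
    }

    // Remove a document and take it out of every posting list it appears in.
    void delete_document(DocId id) {
        auto it = docs_.find(id);
        if (it == docs_.end()) return;
        for (const auto& t : it->second) {
            auto p = postings_.find(t);
            p->second.erase(id);
            if (p->second.empty()) postings_.erase(p);
        }
        docs_.erase(it);
    }

    // Docids of all documents containing `term` (a copy; empty if none).
    std::set<DocId> postlist(const std::string& term) const {
        auto it = postings_.find(term);
        if (it == postings_.end()) return {};
        return it->second;
    }

    std::size_t doc_count() const { return docs_.size(); }

  private:
    DocId next_id_ = 1;
    std::map<DocId, std::set<std::string>> docs_;
    std::map<std::string, std::set<DocId>> postings_;
};

// "REPLACE"-style update keyed on the application's own identifier:
// delete whatever currently carries the unique term, then add the new document.
DocId upsert(ToyIndex& index, const std::string& external_id,
             std::set<std::string> terms) {
    assert(!external_id.empty());
    const std::string unique_term = "Q" + external_id;
    for (DocId old_id : index.postlist(unique_term)) {
        index.delete_document(old_id);  // expected to run 0 or 1 times
    }
    terms.insert(unique_term);
    return index.add_document(terms);
}
```

In this sketch, calling upsert(index, "42", ...) twice leaves a single document carrying the term Q42, but the internal docid changes on each call. That is why, under this scheme, the application refers to documents by the unique term rather than by docid.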
Tim Brody wrote:
> Having added wrappers for QueryParser I wonder whether it would be
> worthwhile revising Stopper. I can't think of a situation where a stopper
> would need to be more intelligent than containing a list of words to stop,
> so seems a little pointless distributing a class in Xapian that doesn't do
> this.

I think the actual process of stopping is always going to be this simple, but the selection of words to stop isn't necessarily so simple. In particular, it would be useful to have prebuilt lists of common stopwords for (at least) each of the languages which we provide stemmers for. The user might then create, for example, a StandardStopper object, passing the name of a language, rather than having to keep a list of words in their application.

However, there's a strong argument for providing a class such as yours as part of Xapian, since it would be useful to many users. Could you add this to the bugzilla too, so it won't get forgotten?

> Of course if I could wave a magic wand I would modify QueryParser's API
> anyway .... :-)

QueryParser is a great deal less polished than other parts of Xapian's interface - which is partly why it is separated out into a separate library. It was originally written for a specific application (omega), and then extracted into a separate library, but it is due for a good look. In other words - its API is open for discussion.

Certainly, it is weird to have "set_stemming_options()" take a stopper: I'd like to see that fixed. It also has a load of public members which really should be private...

Additionally, I'd like to see some code for indexing a chunk of text in a manner compatible with the query parser put into a library. Currently, the easiest approach for application writers is to cut and paste blocks of code from omindex...

Patches for any of these things would be most welcome - but discussion and other suggestions are also appreciated.

-- 
Richard
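To make the StandardStopper idea from the message above concrete, here is a small C++ sketch of what such a class might look like. It is only an assumption about the shape of the idea: the class body, the language names and the tiny word lists are illustrative placeholders, not an existing Xapian class or real stopword lists.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical sketch of a stopper built from a prebuilt, per-language list.
class StandardStopper {
  public:
    // Look up the list for `language`; throws if no list is available.
    explicit StandardStopper(const std::string& language)
        : words_(lookup(language)) {}

    // True if `term` is in the selected language's stopword list.
    bool operator()(const std::string& term) const {
        return words_.count(term) != 0;
    }

  private:
    static const std::set<std::string>& lookup(const std::string& language) {
        // Placeholder lists for illustration only; real lists would be
        // far longer and would ship with the library.
        static const std::map<std::string, std::set<std::string>> lists = {
            {"english", {"a", "and", "of", "the"}},
            {"french", {"de", "et", "la", "le"}},
        };
        auto it = lists.find(language);
        if (it == lists.end()) {
            throw std::invalid_argument("no stopword list for language: " +
                                        language);
        }
        return it->second;
    }

    std::set<std::string> words_;  // copy of the selected list
};
```

With this shape, an application would write something like StandardStopper stopper("english") and never carry its own word list.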
On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
> Could xapian have the ability to specify docids? My system - as I'm sure
> many others do - maintains its own ids for people, docs etc. For the moment
> I've opted to rebuild the index from scratch every day, rather than
> maintaining a docid => myid mapping in order to perform incremental nightly
> changes.
>
> The cleanest method from outside of the API would be if replace_document
> accepted a non-existent (to xapian) docid, in which case it adds the
> document rather than excepting (i.e. SQL's "REPLACE" behaviour).

This is something I'd noticed might be useful. The main caveats are that docid 0 is always invalid in Xapian, and that specifying sparse document ids would subvert the compression techniques used in the quartz backend to such an extent that you are likely to be better off using the unique term approach.

There's a comment in scriptindex.cc sketching out the idea of allowing the hash of an external UID as the Xapian docid, but upon reflection I think this is a very bad idea, especially as it risks collisions between UIDs too. I'll remove that comment shortly.

> Having added wrappers for QueryParser I wonder whether it would be
> worthwhile revising Stopper. I can't think of a situation where a stopper
> would need to be more intelligent than containing a list of words to stop,
> so seems a little pointless distributing a class in Xapian that doesn't do
> this.

Two fairly reasonable examples:

(a) you might want to unconditionally stop all terms of N or fewer characters. Your approach would require specifying all 26^N terms to stop (probably more actually since digits, etc are usually allowed in terms).

(b) you might want to stop based on term frequency - for example any term which occurs in more than M% of documents in the database could be treated as a stopword (which provides a self-tuning application specific stopword list!)

> map<string,bool> terms;

I think set<string> is probably a more appropriate data structure here.

Cheers,
Olly
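The three suggestions in the message above (a set<string>-based list, stopping by length, and stopping by term frequency) can be sketched in C++ roughly as follows. The Stopper base class here is a local stand-in defined for the example, not Xapian's header, and ListStopper, LengthStopper and FrequencyStopper are hypothetical names. The frequency table is passed in as a plain map; in a real application those numbers would presumably come from the database.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Local stand-in for a stopper interface (an assumption for this sketch).
struct Stopper {
    virtual ~Stopper() = default;
    virtual bool operator()(const std::string& term) const = 0;
};

// List-based stopper using set<string> instead of map<string,bool>.
class ListStopper : public Stopper {
  public:
    void add(const std::string& term) { terms_.insert(term); }
    void del(const std::string& term) { terms_.erase(term); }
    bool operator()(const std::string& term) const override {
        return terms_.count(term) != 0;
    }
  private:
    std::set<std::string> terms_;
};

// (a) Stop every term of max_len or fewer bytes (bytes, not characters,
// so multi-byte encodings would need extra care).
class LengthStopper : public Stopper {
  public:
    explicit LengthStopper(std::size_t max_len) : max_len_(max_len) {}
    bool operator()(const std::string& term) const override {
        return term.size() <= max_len_;
    }
  private:
    std::size_t max_len_;
};

// (b) Stop any term occurring in more than max_percent of documents.
class FrequencyStopper : public Stopper {
  public:
    FrequencyStopper(const std::map<std::string, std::size_t>& term_freqs,
                     std::size_t doc_count, double max_percent)
        : freqs_(term_freqs), doc_count_(doc_count), max_percent_(max_percent) {
        assert(max_percent >= 0.0 && max_percent <= 100.0);
    }
    bool operator()(const std::string& term) const override {
        auto it = freqs_.find(term);
        if (it == freqs_.end() || doc_count_ == 0) return false;
        return 100.0 * it->second / doc_count_ > max_percent_;
    }
  private:
    std::map<std::string, std::size_t> freqs_;  // term -> documents containing it
    std::size_t doc_count_;
    double max_percent_;
};
```

Neither LengthStopper nor FrequencyStopper can be expressed as a fixed word list of reasonable size, which is the argument for keeping the stopper as an overridable functor rather than only a list.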