thr3ads.net - Xapian discuss - [Xapian-discuss] Rqt for Features [Jul 2004]

Tim Brody

2004-Jul-09 15:59 UTC

[Xapian-discuss] Rqt for Features

Hi,

I've been working on a xapian implementation for the last month or so and
have implemented (well, hacked until it worked) QueryParser for Perl.

Could xapian have the ability to specify docids? My system - as I'm sure
many others do - maintains its own ids for people, docs etc. For the moment
I've opted to rebuild the index from scratch every day, rather than
maintaining a docid => myid mapping in order to perform incremental nightly
changes.

The cleanest method from outside of the API would be if replace_document
accepted a non-existent (to xapian) docid, in which case it adds the
document rather than excepting (i.e. SQL's "REPLACE" behaviour).
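
A minimal sketch of the "REPLACE" behaviour requested above, using a toy
in-memory store rather than Xapian itself. ToyStore and its methods are
hypothetical names chosen for illustration; docid 0 is rejected because it
is noted later in this thread as always invalid in Xapian.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Toy stand-in for a document store. This is NOT Xapian's WritableDatabase;
// it only illustrates "replace if present, otherwise add" semantics.
class ToyStore {
public:
    // Store `data` under `docid`. Returns true if an existing document was
    // overwritten, false if the document was newly added.
    bool replace_document(unsigned docid, const std::string &data) {
        if (docid == 0) {
            throw std::invalid_argument("docid 0 is not a valid document id");
        }
        const bool existed = docs_.find(docid) != docs_.end();
        docs_[docid] = data;  // overwrites or inserts, never throws for a new id
        return existed;
    }

    // Fetch a document; throws std::out_of_range if the docid is unknown.
    const std::string &get(unsigned docid) const { return docs_.at(docid); }

    std::size_t size() const { return docs_.size(); }

private:
    std::map<unsigned, std::string> docs_;
};
```

With this behaviour a nightly update can call replace_document() with the
application's own id for every changed record, without first checking
whether the record has been indexed before.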

Having added wrappers for QueryParser I wonder whether it would be
worthwhile revising Stopper. I can't think of a situation where a stopper
would need to be more intelligent than containing a list of words to stop,
so seems a little pointless distributing a class in Xapian that doesn't do
this.

e.g. the class I've submitted to Alex:
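class MyStopper : public Stopper
{
    public:
        // Returns true if the term is currently marked as a stopword.
        bool operator()(const string &term) {
            map<string,bool>::const_iterator it = terms.find(term);
            return it != terms.end() && it->second;
        }
        // Mark a term as a stopword.
        void add(const string &term) {
            terms[term] = true;
        }
        // Unmark a term; the entry is kept but flagged false.
        void del(const string &term) {
            terms[term] = false;
        }

    private:
        map<string,bool> terms;
};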

Of course if I could wave a magic wand I would modify QueryParser's API
anyway .... :-)

class QueryParser {
public:
    QueryParser();
    ~QueryParser();
    set_database(&Xapian::Database);
    set_default_op(Query::op);
    set_stemmer(&Xapian::Stemmer,bool);
    add_stopword(&string);
    add_prefix(&string,&string);
    parse(&string);
    <get methods for termlist, stoplist, unstem etc>
private:
    ...
};

All the best,
Tim.

Richard Boulton

2004-Jul-09 16:52 UTC

On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
> I've been working on a xapian implementation for the last month or so and
> have implemented (well, hacked until it worked) QueryParser for Perl.

I see these have been added to the Search::Xapian module on CPAN.  Thanks
for your work.

> Could xapian have the ability to specify docids? My system - as I'm sure
> many others do - maintains its own ids for people, docs etc. For the moment
> I've opted to rebuild the index from scratch every day, rather than
> maintaining a docid => myid mapping in order to perform incremental nightly
> changes.

With Xapian as it currently stands, the way to do this is to specify
a unique term, and store it in each document.  The unique terms would
consist of a prefix followed by your document identifiers.
Traditionally, "Q" has been used for the prefix, but any prefix which
will avoid collisions with other terms is acceptable.

Whenever a document is modified, you would first open the postlist for the
term, which gives you a list of all documents containing the term, and
delete these documents (hopefully, this list would be of length 0 or 1).
Then, add the new document.

There is a proposal to add a new API method to delete all documents
containing a given term, which would ease the implementation of this scheme
(I'm not sure of the status of this proposal).
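
The delete-then-add scheme described above can be sketched with a toy
inverted index. Nothing here is Xapian's API: ToyIndex, its methods and
upsert_by_external_id() are hypothetical names, and the "Q" prefix simply
follows the convention mentioned above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

typedef unsigned DocId;

// Toy inverted index, just enough to show the unique-term scheme.
class ToyIndex {
public:
    // Add a document made of `terms`; returns its freshly allocated docid.
    DocId add_document(const std::set<std::string> &terms) {
        const DocId id = ++last_docid_;
        docs_[id] = terms;
        for (const auto &t : terms) postlists_[t].insert(id);
        return id;
    }

    // Remove a document and its entries from every postlist it appears in.
    void delete_document(DocId id) {
        auto doc = docs_.find(id);
        if (doc == docs_.end()) return;
        for (const auto &t : doc->second) {
            auto p = postlists_.find(t);
            p->second.erase(id);
            if (p->second.empty()) postlists_.erase(p);
        }
        docs_.erase(doc);
    }

    // Docids of all documents containing `term` (empty if there are none).
    std::vector<DocId> postlist(const std::string &term) const {
        auto it = postlists_.find(term);
        if (it == postlists_.end()) return {};
        return std::vector<DocId>(it->second.begin(), it->second.end());
    }

    std::size_t doc_count() const { return docs_.size(); }

private:
    DocId last_docid_ = 0;
    std::map<DocId, std::set<std::string>> docs_;
    std::map<std::string, std::set<DocId>> postlists_;
};

// Delete every document carrying the unique term, then add the new version.
DocId upsert_by_external_id(ToyIndex &index, const std::string &external_id,
                            std::set<std::string> terms) {
    const std::string unique_term = "Q" + external_id;
    // postlist() returns a copy, so deleting while looping is safe.
    for (DocId old_id : index.postlist(unique_term)) {
        index.delete_document(old_id);
    }
    terms.insert(unique_term);
    return index.add_document(terms);
}
```

Note that the internal docid changes on every update; only the unique term
stays stable, which is why the application should key on it.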

This method is used by "scriptindex" - see the implementation of the "uniq"
command.  It can be more flexible and robust than using Xapian's document
identifiers: in particular, it allows you to use any string as an
identifier, rather than restricting you to 32 bit numbers.  Also, if you
combine databases together using Xapian's multidatabase facility, the
Xapian docids will change (an interleaving scheme is used to disambiguate
document identifiers), which could break software which relies on
the document identifiers being tied more closely to the contents of the
document.
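
To illustrate why docids shift when databases are combined, here is a
sketch of an interleaving scheme. The formula is my understanding of how
Xapian maps sub-database docids into a combined database; treat it as an
assumption rather than a guarantee, and the function names as hypothetical.

```cpp
#include <cassert>

// Docid seen through a combined database, given the docid inside one
// sub-database, that sub-database's 0-based position, and the number of
// sub-databases n. Assumed scheme: ids from the n databases are interleaved.
unsigned combined_docid(unsigned sub_docid, unsigned db_index, unsigned n) {
    assert(sub_docid > 0 && db_index < n);
    return (sub_docid - 1) * n + db_index + 1;
}

// Which sub-database a combined docid came from (0-based).
unsigned db_index_of(unsigned combined, unsigned n) {
    assert(combined > 0 && n > 0);
    return (combined - 1) % n;
}

// The docid that document has inside its own sub-database.
unsigned sub_docid_of(unsigned combined, unsigned n) {
    assert(combined > 0 && n > 0);
    return (combined - 1) / n + 1;
}
```

Under this scheme, document 2 of the first of two databases appears as
docid 3 in the combined view, so any application data keyed on the original
docid would no longer match.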

Using Xapian's built-in identifiers should be more efficient, allowing
documents to be referenced without having to perform a database lookup to
determine the internal identifier first, but I'm not sure how much of a
cost this actually incurs.

> The cleanest method from outside of the API would be if replace_document
> accepted a non-existent (to xapian) docid, in which case it adds the
> document rather than excepting (i.e. SQL's "REPLACE" behaviour).

This would be a perfectly reasonable extension.  I haven't time right now
to take a look at how feasible it would be, but I can't think of any likely
problem.

Could you file a bug report for this feature request, so that it doesn't
get lost?

-- 
Richard

Richard Boulton

2004-Jul-09 17:14 UTC

Tim Brody wrote:
> Having added wrappers for QueryParser I wonder whether it would be
> worthwhile revising Stopper. I can't think of a situation where a stopper
> would need to be more intelligent than containing a list of words to stop,
> so seems a little pointless distributing a class in Xapian that doesn't do
> this.

I think the actual process of stopping is always going to be this
simple, but the selection of words to stop isn't necessarily so simple.
In particular, it would be useful to have prebuilt lists of common
stopwords for (at least) each of the languages which we provide stemmers
for.  The user might then create, for example, a StandardStopper object,
passing the name of a language, rather than having to keep a list of
words in their application.

However, there's a strong argument for providing a class such as yours
as part of Xapian, since it would be useful to many users.  Could you
add this to the bugzilla too, so it won't get forgotten?

> Of course if I could wave a magic wand I would modify QueryParser's API
> anyway .... :-)

QueryParser is a great deal less polished than other parts of Xapian's
interface - which is partly why it is separated out into a separate 
library.  It was originally written for a specific application (omega), 
and then extracted into a separate library, but it is due for a good 
look.  In other words - its API is open for discussion.

Certainly, it is weird to have "set_stemming_options()" take a stopper:
I'd like to see that fixed.  It also has a load of public members which
really should be private...

Additionally, I'd like to see some code for indexing a chunk of text in 
a manner compatible with the query parser put into a library. 
Currently, the easiest approach for application writers is to cut and 
paste blocks of code from omindex...

Patches for any of these things would be most welcome - but discussion 
and other suggestions are also appreciated.

-- 
Richard

Olly Betts

2004-Aug-09 12:17 UTC

On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
> Could xapian have the ability to specify docids? My system - as I'm sure
> many others do - maintains its own ids for people, docs etc. For the moment
> I've opted to rebuild the index from scratch every day, rather than
> maintaining a docid => myid mapping in order to perform incremental nightly
> changes.
>
> The cleanest method from outside of the API would be if replace_document
> accepted a non-existent (to xapian) docid, in which case it adds the
> document rather than excepting (i.e. SQL's "REPLACE" behaviour).

This is something I'd noticed might be useful.  The main caveats are that
docid 0 is always invalid in Xapian, and that specifying sparse document
ids would subvert the compression techniques used in the quartz backend
to such an extent that you are likely to be better off using the unique
term approach.
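
To give a feel for why sparse docids hurt compression, here is a sketch
using a generic gap (delta) encoding with variable-length integers. This is
an assumption for illustration only, not the actual on-disk format of the
quartz backend; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed to store `value` as a variable-length integer that carries
// 7 payload bits per byte.
std::size_t varint_size(unsigned value) {
    std::size_t bytes = 1;
    while (value >= 128) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Total bytes for a sorted docid list stored as gaps between neighbours.
std::size_t delta_encoded_size(const std::vector<unsigned> &sorted_docids) {
    std::size_t total = 0;
    unsigned previous = 0;
    for (unsigned id : sorted_docids) {
        assert(id > previous);  // strictly increasing; docid 0 never appears
        total += varint_size(id - previous);
        previous = id;
    }
    return total;
}
```

In this model four consecutive docids cost one byte each, while three widely
scattered docids such as 1000, 1000000 and 1000000000 cost ten bytes between
them, because every gap needs several bytes.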

There's a comment in scriptindex.cc sketching out the idea of using
the hash of an external UID as the Xapian docid, but upon reflection I
think this is a very bad idea, especially as it risks collisions between
UIDs too.  I'll remove that comment shortly.

> Having added wrappers for QueryParser I wonder whether it would be
> worthwhile revising Stopper. I can't think of a situation where a stopper
> would need to be more intelligent than containing a list of words to stop,
> so seems a little pointless distributing a class in Xapian that doesn't do
> this.

Two fairly reasonable examples:

(a) you might want to unconditionally stop all terms of N or fewer
characters.  Your approach would require specifying all 26^N terms to
stop (probably more actually since digits, etc are usually allowed in
terms).

(b) you might want to stop based on term frequency - for example any
term which occurs in more than M% of documents in the database could
be treated as a stopword (which provides a self-tuning application
specific stopword list!)
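
Both examples could be sketched as follows. The Stopper base class here is
a stand-in written for this sketch, not Xapian's own header, and the
frequency lookup is passed in as a callback so the code stays independent
of any particular database API; all class names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Stand-in for the Stopper interface discussed in this thread.
struct Stopper {
    virtual ~Stopper() {}
    virtual bool operator()(const std::string &term) = 0;
};

// (a) Stop every term of max_length or fewer characters (bytes, strictly).
class LengthStopper : public Stopper {
public:
    explicit LengthStopper(std::size_t max_length) : max_length_(max_length) {}
    bool operator()(const std::string &term) override {
        return term.size() <= max_length_;
    }

private:
    std::size_t max_length_;
};

// (b) Stop any term that occurs in more than max_percent% of documents.
class FrequencyStopper : public Stopper {
public:
    // Callback returning the number of documents that contain a term.
    typedef std::function<unsigned(const std::string &)> TermFreqLookup;

    FrequencyStopper(TermFreqLookup lookup, unsigned doc_count,
                     double max_percent)
        : lookup_(lookup), doc_count_(doc_count), max_percent_(max_percent) {}

    bool operator()(const std::string &term) override {
        if (doc_count_ == 0) return false;  // empty database: stop nothing
        return 100.0 * lookup_(term) > max_percent_ * doc_count_;
    }

private:
    TermFreqLookup lookup_;
    unsigned doc_count_;
    double max_percent_;
};
```

Neither rule can be expressed by a fixed word list, which is the argument
for keeping Stopper as an overridable interface.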
>         map<string,bool> terms;
I think set<string> is probably a more appropriate data structure here.
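
A sketch of the set-based variant suggested above, again written against a
stand-in Stopper base class rather than Xapian's real header. SetStopper is
a hypothetical name.

```cpp
#include <set>
#include <string>

// Stand-in for the Stopper interface discussed in this thread.
struct Stopper {
    virtual ~Stopper() {}
    virtual bool operator()(const std::string &term) = 0;
};

// Word-list stopper backed by std::set: membership alone means "stop this
// term", so no boolean flag is needed and del() really removes the entry.
class SetStopper : public Stopper {
public:
    void add(const std::string &term) { terms_.insert(term); }
    void del(const std::string &term) { terms_.erase(term); }
    bool operator()(const std::string &term) override {
        return terms_.count(term) != 0;
    }

private:
    std::set<std::string> terms_;
};
```

Compared with map<string,bool>, a deleted word leaves nothing behind, and a
lookup cannot accidentally insert an entry.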

Cheers,
    Olly
