Alessandro Pasotti
2008-May-21 07:23 UTC
[Xapian-discuss] check for blacklisted words (and thanks)
Hi again,

first of all I wish to thank everybody here on this list (Olly, Richard etc.) for the very helpful answers (I prefer not to add noise to this list by thanking people every time).

Now the question: I must check whether a particular document contains blacklisted words (which are in a text file, unstemmed, one per line). Is there a way to restrict a query to a single document and return a boolean value if one of the terms in the query is contained in the checked document?

--
Alessandro Pasotti
w3: www.itopen.it
James Aylett
2008-May-21 09:17 UTC
[Xapian-discuss] check for blacklisted words (and thanks)
On Wed, May 21, 2008 at 09:23:28AM +0200, Alessandro Pasotti wrote:
> Now the question: I must check if a particular document contains
> blacklisted words (which are in a textfile, unstemmed one per line),
> is there a way to restrict a query to a single document and return a
> boolean value if one of the terms in the query are contained in the
> checked document?

If you want the blacklist to work unstemmed, and are using the QueryParser, you can construct a new Query using QueryParser::unstem_begin() and QueryParser::unstem_end(), OP_OR them all together, and then OP_FILTER with a special (probably prefixed) term that's only in the blacklist document. You'll get back nothing, or the blacklist document.

If you want to employ stemming, instead use Query::get_terms_begin() to get out the stemmed terms.

There are going to be other ways, possibly more efficient, than doing this (for instance, if you're not using a stopper, you could write a custom one and check if it's fired on any of your words; however I suspect the above will scale to lots of blacklisted words better, if that's an issue for you).

J

--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org
Olly Betts
2008-May-21 13:46 UTC
[Xapian-discuss] check for blacklisted words (and thanks)
On Wed, May 21, 2008 at 09:23:28AM +0200, Alessandro Pasotti wrote:
> Now the question: I must check if a particular document contains
> blacklisted words (which are in a textfile, unstemmed one per line),
> is there a way to restrict a query to a single document and return a
> boolean value if one of the terms in the query are contained in the
> checked document?

Rather than running a query in this case, I'd suggest you just take the Document object (before you've even added it to the database if you like) and iterate its termlist.

If the blacklist is long, you could stick its entries in a C++ std::set (or Perl hash, Python dict, etc.) at start-up, and test each document term against it. Or, if the blacklist is short, you can use skip_to() on the Document's termlist to check for blacklist terms in sorted order.

If the blacklist is to prevent indexing, this has the added benefit that you don't need to delete the document from the database if it fails the blacklist test.

Cheers,
    Olly
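
The two strategies described in this last reply (a set lookup for a long blacklist, skipping through the sorted termlist for a short one) can be sketched roughly as follows. This is only an illustration and does not call Xapian: the document's terms are assumed to be available as a plain sorted list of strings (with the bindings they would come from iterating the Document's termlist), the function names hits_by_set and hits_by_skip are hypothetical, and bisect_left stands in for skip_to().

```python
from bisect import bisect_left


def hits_by_set(doc_terms, blacklist):
    """Long blacklist: load it into a set once, then test each document term."""
    blocked = set(blacklist)
    return any(term in blocked for term in doc_terms)


def hits_by_skip(sorted_doc_terms, blacklist):
    """Short blacklist: visit the blacklist in sorted order and skip forward
    through the sorted term list (a stand-in for skip_to() on a termlist)."""
    pos = 0
    for word in sorted(blacklist):
        # Jump to the first term >= word, never moving backwards.
        pos = bisect_left(sorted_doc_terms, word, pos)
        if pos == len(sorted_doc_terms):
            return False  # past the last term; no later word can match
        if sorted_doc_terms[pos] == word:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical data: terms of one document, and a small blacklist.
    terms = sorted(["apple", "banana", "cherry"])
    print(hits_by_set(terms, ["banana", "zebra"]))
    print(hits_by_skip(terms, ["aardvark", "blueberry"]))
```

Either function can be run on a document's terms before the document is added to the database, which is what makes the "no need to delete it afterwards" point above work. Note that the comparison is exact, so the blacklist and the terms need to be in the same form (both unstemmed, for instance).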