To Olly Betts:

Thank you very much for any feedback.

I apologise for this belated reply and also for the fact that the text of
the previous posting appeared fragmented, due to its fixed chars/line format.

With reference to:

> Can, or could, one construct a query so that Omega (Xapian) can handle
> this ?
>
> ... perhaps with some type of Regex ?
>
> It would seem that Wild Cards fall short here.
> If it is possible but not immediately available what would one have to do
> to enable this option ? Are there any working examples, HowTos, Faqs?

and

........................................................................

I have a branch which adds support for arbitrary glob-style wildcard
patterns (where * matches 0 or more characters and ? a single character):

https://github.com/ojwb/xapian/tree/extended-wildcards

The code there works, but is waiting for some benchmarking and profiling
before being merged.

........................................................................

Regarding the above questions and comments:

I looked at the link you suggested with interest; however, unfortunately I
failed to find detailed information on the new possibilities of your
extended wild cards (that is, in comparison to the basic well-known
options), or on how one can make use of the support for arbitrary
glob-style wildcard patterns.

a. What other types of extended wild card(s) options are there ?
   Or is this still currently limited to these two characters '*?' ?

b. Apart from the 0-or-more and single-char options, are there any other
   options, either via Omega (formulating an appropriate query for CGI)
   or via Xapian ?

Regarding the question in relation to Text Patterns:

The reference to ISBNs was of course merely a simple example, but it could
be any other typical pattern of letters, numbers and separator characters.
Were you suggesting that one possibility would be trying something
similar to:

    isbn:?-???-?????-?

as a very loose general query for ISBNs ? (so long as the option is
enabled).

1 Could you mention how one enables and can take advantage of your
  extended option in Omega and/or Xapian ? (working example ?)

2 The ? Wild Char is for general characters, is it not ?
  i.e. it cannot distinguish between digits and letters and thus cannot
  act as a RE \d or [0-9] ?

>> $match{REGEX,STRING[,OPTIONS]},
>> $transform{REGEXP,SUBST,STRING[,OPTIONS]}
> These are for use in the templating language - they're not search options.

Yes, I mentioned that it seemed from reading that these were Post Query
Options acting on the result set.

........................................................................

>> If none of the above are possible for Omega, can one manage this with
>> Xapian, or do something similar ?
>> and
>> Again any links to working examples etc. would be most appreciated.
> If you have particular "code" patterns which are important in your domain,
> I'd consider pulling them out at index time and adding them as a filter term

........................................................................

It is no doubt due to my lack of understanding, but how would this
interesting option 'pulling them out at index time ...' be implemented ?

It would be very useful if there were some working examples in relation to
these themes (at least for those less expert than the xapian developer
level).

Xapian-Omega appears to be a very interesting solution and with an RE
option it would be one of the most flexible and versatile SEs currently
available on the net.

Thank you again for your follow-up.

Best wishes,
Giulio
On Thu, Dec 29, 2016 at 05:44:50PM +0100, Giulio Teslano wrote:
> a. What other types of extended wild card(s) options are there ?
>
> or is this still currently limited to these two characters '*?' ?

As I said, the branch "adds support for arbitrary glob-style wildcard
patterns (where * matches 0 or more characters and ? a single character)".

> b. Apart from 0 or more and single char options are there any other
> options ?

Not that are currently implemented on that branch.

> Were you suggesting that one possibility would be trying something
> similar to :
>
> isbn:?-???-?????-? as a very loose general query for ISBNs ?
>
> (so long as the option is enabled).

It seems you must be talking about the query the user would write here,
but then I'm not sure what the "isbn:" prefix would map to.

But yes, that's the sort of pattern you'd have to use.

One wrinkle with this is that (assuming you use the Xapian::TermGenerator
class) "-" is a word separator character at index time - i.e. you'll get
terms from 1-234-56789-0 and OP_WILDCARD only matches within a term.

So you need different word splitting behaviour for this to work, which
currently means you'll need to do it yourself instead of using
TermGenerator as that isn't currently configurable.

> 1 Could you mention how one enables and can take advantage of your
> extended option in Omega and/or Xapian ? (working example ?)

Currently you need to use one of the WILDCARD_PATTERN_* constants when
constructing an OP_WILDCARD Query object, e.g.:

    Xapian::Query wild(Xapian::Query::OP_WILDCARD, "?-???-?????-?", 0,
                       Xapian::Query::WILDCARD_PATTERN_GLOB);

There isn't yet any integration into omega (or even into
Xapian::QueryParser). Such are the pitfalls of using code from unmerged
branches I'm afraid.

> 2 The ? Wild Char is for general characters, is it not ?
>
> ie. It cannot distinguish between digits and letters and thus cannot
> act as a RE \d or [0-9] ?

"?" matches any single character.
The project this branch is for only required allowing "*" anywhere in a
term (rather than it only being supported at the end) and adding support
for "?", so there's not currently a plan to support pattern styles other
than globbing, or additional glob-style patterns. The flags to control
this were picked such that either could be done in the future.

> > If you have particular "code" patterns which are important in your domain,
> > I'd consider pulling them out at index time and adding them as a filter
> > term
>
> It is no doubt due to my lack of understanding but how would this
> interesting option 'pulling them out at index time ...' be implemented
> ?

For example in Perl, at index time:

    while ($text =~ /(\b\d-\d{3}-\d{5}-[\dX]\b)/g) {
        $doc->add_boolean_term("XISBN$1");
    }

With this approach, you could also easily do additional validation (such
as checking the check digit for codes which have one, as ISBNs do).

Then at query time:

    $queryparser->add_boolean_prefix("isbn", "XISBN");

Then the user can use isbn:1-234-56789-0 to filter only documents
mentioning that ISBN.

Or if you want to be able to find documents which mention any ISBN (or
anything which looks like one) then:

    if ($text =~ /(\b\d-\d{3}-\d{5}-[\dX]\b)/) {
        $doc->add_boolean_term("XHASisbn");
    }

Then at query time:

    $queryparser->add_boolean_prefix("has", "XHAS");

And then the user can filter a search by: has:isbn

> It would be very useful if there were some working examples in
> relation to these themes, (at least for those less expert than the
> xapian developer level).
> Xapian-Omega appears to be a very interesting
> solution and with an RE option it would be one of the most flexible
> and versatile SEs currently available on the net

I suspect that most end users wanting "regexp search" don't just want to
search for terms matching a specified regexp (which is how OP_WILDCARD
inherently works), but rather to perform regexp matches over the whole
document (like https://codesearch.debian.net/ does for source code). To do
that efficiently you need a different index structure.

Cheers,
Olly
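The index-time extraction described in the reply above mentions that extra validation, such as checking the ISBN check digit, is easy to add at that stage. A minimal standalone sketch of that idea in Python follows. It does not call Xapian; it only builds the list of term strings one might then pass to add_boolean_term. The function names are hypothetical, and the pattern is restricted to the same d-ddd-ddddd-c layout as the Perl example, which real ISBN-10s do not always follow.

```python
import re

# Same shape as the Perl pattern in the thread: d-ddd-ddddd-c, where the
# final check character is a digit or "X". re.ASCII keeps \d to 0-9.
ISBN10_RE = re.compile(r"\b\d-\d{3}-\d{5}-[\dX]\b", re.ASCII)


def isbn10_is_valid(isbn):
    """Return True if the ISBN-10 check digit is consistent.

    The characters are weighted 10, 9, ..., 1 and the weighted sum must
    be divisible by 11; "X" stands for 10 and is only allowed last.
    """
    chars = isbn.replace("-", "")
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10
        elif "0" <= char <= "9":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def isbn_filter_terms(text, prefix="XISBN"):
    """Return one prefixed term per distinct, valid ISBN found in text."""
    terms = []
    for match in ISBN10_RE.finditer(text):
        term = prefix + match.group(0)
        if isbn10_is_valid(match.group(0)) and term not in terms:
            terms.append(term)
    return terms
```

Note that under this check the thread's placeholder 1-234-56789-0 is rejected, because its check digit is inconsistent, so whether to reject or merely flag such codes is a choice for the indexer.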
On Wed, Jan 04, 2017 at 02:08:29AM +0000, Olly Betts wrote:
> One wrinkle with this is that (assuming you use the Xapian::TermGenerator
> class) "-" is a word separator character at index time - i.e. you'll get
> terms from 1-234-56789-0 and OP_WILDCARD only matches within a term.

Sorry, lost a key word there - I meant to say:

"you'll get FOUR terms from 1-234-56789-0"

Cheers,
Olly
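To illustrate why the four-term split matters, here is a small standalone Python sketch. It assumes the glob semantics stated in the thread (* matches 0 or more characters, ? exactly one) and uses a deliberately crude stand-in for word splitting. Both function names are hypothetical, and the real TermGenerator does considerably more than split on "-". How the branch counts a "character" (byte or Unicode code point) is not stated in the thread; this sketch works on Python characters.

```python
import re


def glob_matches(pattern, term):
    """Match term against a glob where * is 0+ characters and ? is one."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")
        elif char == "?":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    # fullmatch: the pattern must cover the whole term, since a wildcard
    # is matched against individual terms.
    return re.fullmatch("".join(parts), term, re.DOTALL) is not None


def split_on_hyphen(text):
    """Crude stand-in for index-time word splitting on "-" (assumption)."""
    return [piece for piece in text.split("-") if piece]


terms = split_on_hyphen("1-234-56789-0")  # four separate terms
pattern = "?-???-?????-?"
matches_split_terms = any(glob_matches(pattern, t) for t in terms)
matches_whole_code = glob_matches(pattern, "1-234-56789-0")
```

With the split terms the pattern finds nothing, while it does match the code kept as a single term. It also matches a-bcd-efghi-j, which reflects the point made in the thread that "?" cannot distinguish digits from letters.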