thr3ads.net - Xapian discuss - Formulating Advanced Queries with Xapian-Omega [Dec 2016]

Giulio Teslano

2016-Dec-22 10:51 UTC

Formulating Advanced Queries with Xapian-Omega

Hello,

We have Xapian-Omega installed (Linux) and working in default mode and
have browsed several interesting pages on the main site, at
trac.xapian.org (the wiki) and in the mailing list. Having tested
various search options (up to now only for Omega) we would like to ask a
couple of questions.

1. Is it possible to search for Patterns of Text with Omega and/or
Xapian Queries ?

For example: given a file archive containing a series of documents
(Business Office documents of various formats), one would like to find
all documents which contain text of a certain 'format', such as some
type of ID Code. An example might be ISBNs for Libraries or some other
custom Contact ID etc.

Can, or could, one construct a query so that Omega (Xapian) can handle
this ?

... perhaps with some type of Regex ?

It would seem that Wild Cards fall short here.

If it is possible but not immediately available what would one have to
do to enable this option ? Are there any working examples, HowTos,
Faqs ?

We read about a couple of Omega options hinting at this:

$match{REGEX,STRING[,OPTIONS]},
$transform{REGEXP,SUBST,STRING[,OPTIONS]}

but it is not immediately clear (to us at least) how to implement them,
and we have not seen any examples from which to learn.

Is $transform{} only a post-query option acting on the result set ?

If none of the above are possible for Omega, can one manage this with
Xapian, or do something similar ?

Again, any links to working examples etc. would be most appreciated.

2. Suggestions for Indexing files of Miscellaneous Types

On the Xapian site several pages put emphasis on the importance of the
way in which the database and index are created with custom fields, in
particular using (semi)structured files (files with a regular,
recognizable common format, such as .csv), which are well disposed to
this type of field indexing.

Could someone offer any comments regarding the best way to prepare
Omega and/or Xapian for File Archives (where the files are of
miscellaneous type) and where internal fields are not always obvious and
metadata/tags etc. are quite often lacking in homogeneity, if indeed
present at all ?

Thank you very much for any feedback.

Best wishes,

Giulio

Olly Betts

2016-Dec-23 20:46 UTC

On Thu, Dec 22, 2016 at 11:51:35AM +0100, Giulio Teslano wrote:
> find all documents which contain text of a certain 'format', such as
> some type of ID Code, an example might be ISBNs for Libraries or some
> other custom Contact ID etc.
> 
> Can, or could, one construct a query so that Omega (Xapian) can handle
> this ?
> 
> ... perhaps with some type of Regex ?
> 
> It would seem that Wild Cards fall short here.
> 
> If it is possible but not immediately available what would one have to do 
> to enable this option ? Are there any working examples, HowTos, Faqs ?

I have a branch which adds support for arbitrary glob-style wildcard
patterns (where * matches 0 or more characters and ? a single character):

https://github.com/ojwb/xapian/tree/extended-wildcards

The code there works, but is waiting for some benchmarking and profiling
before being merged.
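
To make the pattern semantics concrete, here is a minimal Python sketch
of glob-style matching against a list of terms. It only illustrates the
meaning of * and ? as described above; it is not the code from the
branch, and the function names (glob_to_regex, expand) and the sample
term list are made up for illustration.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a glob pattern into a compiled regex.

    Only the two metacharacters described above are special:
    '*' matches zero or more characters, '?' matches exactly one.
    Everything else is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def expand(pattern: str, terms: list) -> list:
    """Return the terms that match the whole pattern, in input order."""
    rx = glob_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]


# Hypothetical term list standing in for the terms of an index.
terms = ["color", "colour", "colors", "collar", "cooler"]
print(expand("colo*r", terms))
print(expand("co?l*", terms))
```

A pattern with a wildcard in the middle or at the start has to be
checked against many candidate terms, which is one reason such queries
tend to cost more than an exact term lookup.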

> $match{REGEX,STRING[,OPTIONS]},
> $transform{REGEXP,SUBST,STRING[,OPTIONS]}

These are for use in the templating language - they're not search options.

> If none of the above are possible for Omega, can one manage this with
> Xapian, or do something similar ?

If you have particular "code" patterns which are important in your domain,
I'd consider pulling them out at index time and adding them as a filter
term.  Then you can support a search query containing something like
isbn:0-201-03801-3 and it'll be faster than a wildcard pattern.
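
As a rough sketch of that index-time step, the Python below pulls
ISBN-like codes out of extracted document text, checks their check
digits, and builds prefixed filter terms. Several things here are
assumptions rather than anything stated in this thread: the XISBN
prefix (Xapian's convention is that user-defined prefixes start with
"X", but the name is made up), the choice to strip hyphens and spaces
before indexing, and the helper names. With the Xapian bindings each
resulting term could then be attached to the document with
Document.add_boolean_term().

```python
import re

# Hypothetical term prefix for ISBN filter terms (an assumption).
ISBN_PREFIX = "XISBN"

# Loose candidate pattern: 13 or 10 digits, each optionally followed by
# a hyphen or space; an ISBN-10 may end in X. The 13-digit alternative
# comes first so a long code is not cut short as a 10-digit one.
ISBN_CANDIDATE = re.compile(
    r"\b(?:\d[- ]?){12}\d\b|\b(?:\d[- ]?){9}[\dXx]\b", re.ASCII
)


def normalise(candidate: str) -> str:
    """Drop separators and upper-case a trailing 'x'."""
    return re.sub(r"[- ]", "", candidate).upper()


def is_valid_isbn(code: str) -> bool:
    """Check the ISBN-10 or ISBN-13 check digit of a normalised code."""
    if len(code) == 10:
        digits = [10 if c == "X" else int(c) for c in code]
        # ISBN-10: weights 10 down to 1, total divisible by 11.
        total = sum(w * d for w, d in zip(range(10, 0, -1), digits))
        return total % 11 == 0
    if len(code) == 13:
        # ISBN-13: weights alternate 1, 3; total divisible by 10.
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(code))
        return total % 10 == 0
    return False


def isbn_terms(text: str) -> list:
    """Return de-duplicated filter terms for valid ISBNs found in text."""
    terms = []
    for match in ISBN_CANDIDATE.finditer(text):
        code = normalise(match.group())
        term = ISBN_PREFIX + code
        if is_valid_isbn(code) and term not in terms:
            terms.append(term)
    return terms


sample = "Knuth, ISBN 0-201-03801-3; also 978-0-306-40615-7, not 0-201-03801-4."
print(isbn_terms(sample))
```

If separators are stripped at index time as in this sketch, the query
side needs the same normalisation for a search like isbn:0-201-03801-3
to hit the stored term; keeping the code exactly as written in the
document is the other option, at the cost of one term per spelling.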

> Could someone offer any comments regarding the best way to prepare
> Omega and/or Xapian for File Archives (where the files are of
> miscellaneous type) and where internal fields are not always obvious and
> metadata/tags etc. are quite often lacking in homogeneity, if indeed
> present at all ?

To put it briefly, do the best you can.

In the situation you describe it's not going to be possible to match up
and index every bit of metadata perfectly, but you don't have to for it to
be useful in searches - in most search applications over large datasets,
precision matters more than recall (see
https://en.wikipedia.org/wiki/Precision_and_recall for definitions, but in
simple terms it generally matters more that the top returned results are
relevant than that every possible relevant result is returned).
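
For a concrete feel for the two measures, here is a small Python sketch
using the standard set-based definitions from the page linked above. The
function name and the document IDs are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    returned_set, relevant_set = set(returned), set(relevant)
    hits = len(returned_set & relevant_set)
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical document IDs: 4 results returned, 5 relevant documents
# exist, and 2 of the returned results are among the relevant ones.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
print(precision, recall)
```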

And folksonomies (https://en.wikipedia.org/wiki/Folksonomy) show that 
you don't need a rigidly defined tag hierarchy for tagging to be useful
in search.

Omega's omindex indexer already handles some common metadata for formats
where the filters make it available (title, keywords, author, etc).

I'd suggest looking at the distribution of file types you have and focusing
effort on the common ones first.  Also look at what metadata searches
your users will find most useful and prioritise indexing those well.

Cheers,
    Olly
