On Thu, Dec 22, 2016 at 11:51:35AM +0100, Giulio Teslano wrote:
> find all documents which contain text of a certain 'format', such as some
> type of ID Code, an example might be ISBNs for Libraries or some other
> custom Contact ID etc.
>
> Can, or could, one construct a query so that Omega (Xapian) can handle
> this ?
>
> ... perhaps with some type of Regex ?
>
> It would seem that Wild Cards fall short here.
>
> If it is possible but not immediately available what would one have to do
> to enable this option ? Are there any working examples, HowTos, Faqs ?
I have a branch which adds support for arbitrary glob-style wildcard
patterns (where * matches 0 or more characters and ? a single character):
https://github.com/ojwb/xapian/tree/extended-wildcards
The code there works, but is waiting for some benchmarking and profiling
before being merged.
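
To illustrate just the pattern semantics described above (* matches 0 or
more characters, ? matches exactly one), here is a minimal Python sketch.
It is not the code from the branch: glob_to_regex and expand_glob are
hypothetical helper names, and the list of terms stands in for the terms
that a real database would supply.

```python
import re

def glob_to_regex(pattern):
    """Translate a glob pattern ('*' and '?' only) into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")   # 0 or more characters
        elif char == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)

def expand_glob(pattern, terms):
    """Return the terms which match the whole pattern, in input order."""
    regex = glob_to_regex(pattern)
    return [term for term in terms if regex.fullmatch(term)]

# Illustrative terms only; a real index would provide these.
terms = ["978-0-3", "97-0-3", "979-12-3", "978-0-4"]
matches = expand_glob("97?-*-3", terms)
```

Every candidate term has to be tested against the pattern, which hints at
why a pattern with a leading wildcard can be expensive on a large index.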
> $match{REGEX,STRING[,OPTIONS]},
> $transform{REGEXP,SUBST,STRING[,OPTIONS]}
These are for use in the templating language - they're not search options.
> If none of the above are possible for Omega, can one manage this with
> Xapian, or do something similar ?
If you have particular "code" patterns which are important in your domain,
I'd consider pulling them out at index time and adding them as a filter
term. Then you can support a search query containing something like
isbn:0-201-03801-3 and it'll be faster than a wildcard pattern.
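
A rough sketch of the index-time extraction in Python, using only the
standard library. The XISBN term prefix, the function names and the
restriction to ISBN-10 are all assumptions made for illustration. With the
Xapian Python bindings, each resulting term could then be attached to the
document with Document.add_boolean_term(), and the isbn: prefix mapped at
query time with QueryParser.add_boolean_prefix("isbn", "XISBN").

```python
import re

# Hypothetical term prefix; by convention user-defined prefixes start with "X".
ISBN_PREFIX = "XISBN"

# Candidate ISBN-10: ten digits (the last may be X), each optionally
# followed by a "-" or " " separator.
ISBN10_RE = re.compile(r"\b(?:\d[- ]?){9}[\dXx]\b")

def normalise_isbn(code):
    """Drop separators and upper-case the check character."""
    return re.sub(r"[- ]", "", code).upper()

def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum (weights 10 down to 1, modulo 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", isbn):
        return False
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0

def extract_isbn_terms(text):
    """Return the sorted, de-duplicated filter terms for valid ISBN-10s."""
    terms = set()
    for match in ISBN10_RE.finditer(text):
        isbn = normalise_isbn(match.group(0))
        if is_valid_isbn10(isbn):
            terms.add(ISBN_PREFIX + isbn)
    return sorted(terms)

def isbn_query_term(user_input):
    """Normalise the value from an isbn:... query the same way."""
    return ISBN_PREFIX + normalise_isbn(user_input)

terms = extract_isbn_terms("See ISBN 0-201-03801-3, not 0-201-03801-4.")
```

Normalising the code identically at index time and at query time means
that 0-201-03801-3 and 0201038013 both end up as the same term.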
> Could someone offer any comments regarding the best way to prepare
> Omega and/or Xapian for File Archives (where the files are of
> miscellaneous type) and where internal fields are not always obvious and
> metadata/tags etc. are quite often lacking in homogeneity, if indeed
> present at all ?
To put it briefly, do the best you can.
In the situation you describe it's not going to be possible to match up
and index every bit of metadata perfectly, but you don't have to for it to
be useful in searches - in most search applications over large datasets,
precision matters more than recall (see
https://en.wikipedia.org/wiki/Precision_and_recall for definitions, but in
simple terms it generally matters more that the top returned results are
relevant than that every possible relevant result is returned).
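
For concreteness, the two measures can be computed as below. This is only
a sketch of the standard definitions; the function name and the document
ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids.

    precision = fraction of the retrieved documents which are relevant
    recall    = fraction of the relevant documents which were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 results returned, 5 documents actually relevant,
# 2 of which are among the results.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
```

Here half of what was returned is relevant (precision 0.5), while only two
of the five relevant documents were found (recall 0.4).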
And folksonomies (https://en.wikipedia.org/wiki/Folksonomy) show that
you don't need a rigidly defined tag hierarchy for tagging to be useful
in search.
Omega's omindex indexer already handles some common metadata for formats
where the filters make it available (title, keywords, author, etc).
I'd suggest looking at the distribution of file types you have and focusing
effort on the common ones first. Also look at what metadata searches
your users will find most useful and prioritise indexing those well.
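
One quick way to get a picture of that distribution is to count file
extensions across the archive. A minimal sketch follows; count_file_types
is a hypothetical helper, and the extension is only a rough proxy for the
real file type.

```python
import os
from collections import Counter

def count_file_types(root):
    """Count files under root by lower-cased extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            # Files without an extension are grouped under "(none)".
            counts[extension or "(none)"] += 1
    return counts
```

Calling count_file_types(root).most_common(10) on the top-level directory
of the archive then lists the ten most frequent extensions, which is a
reasonable starting point for deciding which formats to handle first.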
Cheers,
Olly