thr3ads.net - Ferret talk - [Ferret-talk] List of terms matched by a query (and their position/offset) [Oct 2008]

Jens Krämer

2008-Oct-28 19:03 UTC

[Ferret-talk] List of terms matched by a query (and their position/offset)

Hi,

First of all, please don't use the web forum to ask questions, but use 
the mailing list (ferret-talk at rubyforge.org). Unfortunately it seems 
that not every message posted here makes it to the mailing list, and I 
don't check the forum here very often... The other way around (messages
posted via email) works reliably, so in the end you'll reach more 
people...

Karl Meisterheim wrote:
> Hi,
> 
> I have some xml that represents a document.  I parse the xml and place
> specific parts (like the title) into the appropriate fields in my
> document.  The xml  contains the normal document elements like a title,
> body etc.  It also contains illustrations, of which there may be 0 or
> many for a given document.  Each illustration also has a title and
> caption text.
> 
> I'm struggling to figure out how to index this data, since there are
> many documents in my xml dataset and each document may have a random
> number of illustrations.  Therefore, I can't just add several fields to
> my index like illustration1, illustration2, etc.
> 
> Instead, the only way I can think to do it is grab all of the
> illustration / caption text for a given document and glob it together
> into one field, :illustration.
> 
> This will work fine, searches will match terms in that field.  The
> problem comes when wanting to distinguish which illustration the term
> belonged to.
The answer is simple: whatever is the smallest unit you want to get as 
a search result is what you have to index. So if you want to find out 
which illustration a query matches, you'll have to index each 
illustration as a separate document (in the Ferret sense of the word).

You should then index the document's id along with each illustration, 
and maybe even shared information like the document title. Or build a 
separate index for global document data to avoid that redundancy. 
However, you would then have to run each query twice: against the 
document index and against the illustrations index. It is a trade-off 
between indexing speed (two indexes, and therefore no indexing of 
redundant information, means faster indexing) and search speed 
(searching once vs. searching twice for each user query)...
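
As a rough sketch of the one-entry-per-illustration idea (plain Ruby, no
Ferret calls): the ParsedDoc and Illustration structs, the field names, and
the illustration_entries helper are all hypothetical and only stand in for
whatever the XML parser produces.

```ruby
# Hypothetical shapes for the parsed XML; names are illustrative only.
Illustration = Struct.new(:title, :caption)
ParsedDoc    = Struct.new(:id, :title, :body, :illustrations)

# Build one index entry (a plain Hash) per illustration. Each entry carries
# the parent document's id and, redundantly, its title, so a hit on an
# illustration can be traced back to its document.
def illustration_entries(doc)
  doc.illustrations.each_with_index.map do |ill, i|
    {
      :doc_id    => doc.id,
      :doc_title => doc.title,
      :position  => i,           # which illustration within the document
      :title     => ill.title,
      :caption   => ill.caption
    }
  end
end

doc = ParsedDoc.new(42, "Rome", "body text", [
  Illustration.new("Map",  "The empire at its height"),
  Illustration.new("Coin", "A silver denarius")
])

entries = illustration_entries(doc)
# Each Hash would then be added to the index as its own Ferret document.
```

A document with no illustrations simply yields an empty list, so the
varying number of illustrations needs no special handling.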

Does that sound like it might work?

Cheers,
Jens

-- 
Posted via http://www.ruby-forum.com/.

Karl Meisterheim

2008-Oct-31 14:59 UTC

[Ferret-talk] List of terms matched by a query (and their position/offset)

Hi Jens,

Thanks for the reply.

What you say makes sense, but I'm hoping for a simpler implementation.
I guess it boils down to this:
When I conduct a search in ferret / AAF, I get an array of documents
back.  Somehow, the highlight method knows where the terms that were
matched by the search exist in those documents / fields.  Is there
any way that I can get that information?  I looked through the API and
even the source and, unfortunately, couldn't quite grok how it was
happening.

This would allow me to do the following: I can chunk together several
distinct pieces of information into one field for the purpose of
indexing.  Then, if I know which terms were matched and their position
in the field, I can use that information to figure out which piece of
information in that single field each term came from.

For example, if my document has six figures, I take the caption text
of those six figures and concatenate them all together and index them
in one field, captions.  Then, when the document comes back as a
match, if I knew that the term "empire" matched whatever search query
was used, and that the offset was 100 in my captions field, I could
piece together which of the original illustrations it belongs to.
Again, this is all necessary because I don't know in advance how many
figures a given document may contain.
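
To make the bookkeeping concrete, here is a minimal sketch in plain Ruby of
the offset-to-caption mapping. It assumes the match offset is already known
and is a character offset into the concatenated field. How to obtain that
offset from Ferret is exactly the open question and is not shown. The
helper names join_captions and caption_index_for are made up for this
example.

```ruby
# Join the captions with a separator and record the character offset at
# which each caption starts in the joined text.
def join_captions(captions, sep = "\n")
  starts = []
  pos = 0
  captions.each do |caption|
    starts << pos
    pos += caption.length + sep.length
  end
  [captions.join(sep), starts]
end

# Return the index of the caption that contains the given offset, i.e. the
# last caption whose start offset is <= offset. Returns nil for a negative
# offset. An offset that lands on a separator maps to the caption before it.
def caption_index_for(starts, offset)
  starts.rindex { |start| start <= offset }
end

captions = ["The Roman empire", "A silver coin", "Map of Gaul"]
text, starts = join_captions(captions)

# Simulate a known match offset by locating the term in the joined text.
offset = text.index("Gaul")
figure = caption_index_for(starts, offset)
puts "match belongs to caption ##{figure}: #{captions[figure]}"
```

The start offsets would have to be stored alongside the document (or
recomputed from the original captions) so that the lookup can be done at
search time.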

I know this sounds overly complicated, but I think it's easier than
creating a new model, a separate index and then having to search
multiple times etc.

Does this make sense, or am I going at this completely wrong?

Thank you,

-km
