Deron Meranda
2007-Aug-30 22:31 UTC
[Xapian-discuss] Clarification of values, data, fields, and prefixed terms
I'm fairly new to Xapian, and one of the more confusing hurdles is understanding the different ways to attach meta-data to documents. There seem to be several:

* values
* data (which can then by convention be formatted into fields)
* prefixed terms

I have not really seen a clear description of these in one place, or of why you would use one over another. Here's my understanding; please fill in or correct me...

Values are user-defined discrete strings (identified by a "slot" number). A document can have either zero or exactly one value for any given slot number. Xapian does not interpret the meaning of the value string, nor does it predefine any slots, but it does allow for filtering queries based upon a simple lexicographical "range" of values that matched documents should possess.

Prefixed terms index documents just like ordinary terms/words and thus are used in probabilistic searches, and can carry positional information if desired. Prefixed terms are really just a convention (not part of Xapian core) of prepending some letters to the front of terms before they are put in the index. This usually means that even normal terms derived from the words in a document need to be prefixed as well, so everything remains unambiguous.

It is common for prefixed terms to describe additional information about the document (such as document id, URL, etc.) other than just the actual words appearing in the document text. Prefixed terms essentially attach type or semantic meaning to terms through the choice of prefix letter(s). Within the Omega application framework many predefined prefixes are standardized. The core's QueryParser has limited support for prefixed terms by mapping a "field name" to a prefix, but otherwise the core does not distinguish a prefixed term from a non-prefixed one.

Prefixed terms can also be used when you want to index both a stemmed word and the original unmodified word, without inhibiting the ability to do phrase (near) searching.
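The prefix convention described above can be sketched in a few lines of plain Python. This does not call Xapian at all; the helper `make_terms`, the mapping `FIELD_PREFIXES`, and the letters `S` and `A` are assumptions chosen purely for illustration, and a real index plan would define its own.

```python
import re

# Hypothetical field-to-prefix mapping; the letters are illustrative only.
FIELD_PREFIXES = {"title": "S", "author": "A", "body": ""}

def make_terms(field, text):
    """Return (term, position) pairs for one field of a document.

    Each word is lowercased and the field's prefix is prepended, so the
    same word taken from two different fields yields two distinct terms.
    """
    prefix = FIELD_PREFIXES[field]
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(prefix + word, pos) for pos, word in enumerate(words, start=1)]

title_terms = make_terms("title", "Xapian Values")
body_terms = make_terms("body", "Values are stored per document")
```

In this sketch the words are lowercased and the prefixes are uppercase, so an unprefixed body term such as `values` cannot collide with the title term `Svalues`. A scheme that keeps capital letters in its terms would need to prefix the ordinary terms as well.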
Finally, document Data is just an opaque bunch of data attached to the document. It cannot be used as part of a query (although applications built on top of the core can use it for processing and displaying the search results). By a convention of the Omega application (not the core), the data is formatted as a multiline text supplement, where each line is like "field=value", which allows one to define fields for capturing meta-data on a document.

Is my understanding essentially correct? Also, why would one ever use the data fields rather than values?

Deron Meranda
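For illustration, the "field=value" convention for document data mentioned above can be sketched in plain Python. This is not Omega's actual code; the helpers `format_fields` and `parse_fields` and the sample field names are hypothetical, and the sketch assumes that values contain no newlines.

```python
def format_fields(fields):
    """Serialise a dict as newline-separated "field=value" lines."""
    return "\n".join(f"{name}={value}" for name, value in fields.items())

def parse_fields(data):
    """Parse "field=value" lines back into a dict.

    Splits on the first "=" only, so a value may itself contain "=".
    Lines without "=" are skipped; a repeated field keeps its last value.
    """
    fields = {}
    for line in data.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields

# The resulting string is what an application would store as the
# document's opaque data and parse again when displaying results.
data = format_fields({"url": "http://example.org/a?x=1", "caption": "Example page"})
fields = parse_fields(data)
```

The search core never looks inside this string, which is why the layout is purely an application-level choice.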
James Aylett
2007-Sep-02 13:46 UTC
[Xapian-discuss] Clarification of values, data, fields, and prefixed terms
On Thu, Aug 30, 2007 at 05:31:23PM -0400, Deron Meranda wrote:

> I'm fairly new to Xapian and one of the more confusing hurdles to understand is the different ways to attach meta-data to documents. It seems like there are several different ways:
>
> * values
> * data (which can then by convention be formatted into fields)
> * prefixed terms

Each of these has a distinct use in Xapian. Two (values and prefixed terms) provide different types of metadata that Xapian itself can use; the other (data) is for application metadata that Xapian can happily ignore.

> Values are user-defined discrete strings (identified by a "slot" number). A document can have either zero or exactly one value for any given slot number. Xapian does not interpret the meaning of the value string nor does it predefine any slots, but it does allow for filtering queries based upon a simple lexicographical "range" of values that matched documents should possess.

Values are used for filtering in the match process. So collapsing can be done on a value; you can use them in a MatchDecider and so on. Range filtering is another example, as you point out.

> Prefixed terms index documents just like ordinary terms/words and thus are used in probabilistic searches, and can carry positional information if desired. Prefix terms are really just a convention (not part of Xapian core) by prepending some letters to the front of terms before they are put in the index.

Right. As far as Xapian is concerned, you just have a bunch of terms. How you create those terms, and your convention for term construction, is an important part of your index plan. Prefixes are a useful convention for reflecting document data/metadata structure in the terms you generate.

> Finally, document Data is just an opaque bunch of data attached to the document. It can not be used as part of a query (although applications built on top of the core can use them for processing and displaying the search results). By a convention of the Omega application (not the core), the data is formatted as a multiline text supplement, where each line is like "field=value", and thus allows one to define fields for capturing meta data on a document.

Yes. Data is just somewhere to shove stuff that Xapian doesn't have to care about. This could be as simple as the id of the document somewhere else, or it could contain summary metadata (or in theory the entire thing, although often that's not going to be a great idea).

> Is my understanding essentially correct? Also, why would one ever use the data fields rather than values?

I'm not certain that it is actually true right now, but in theory you'll get better performance in some cases by using values as they're intended (to be looked up and used during the match process), and data as it's intended (to store additional metadata that Xapian doesn't care about, for display/whatever in your application).

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
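Since the thread describes value ranges as purely lexicographical, a small plain-Python sketch may help show what that implies for numeric values. It is a toy in-memory model, not Xapian's API; the `docs` layout, the helpers `value_range` and `pad`, and the slot assignments (slot 0 for a date, slot 1 for a count) are all assumptions made for illustration.

```python
# Toy model: each document id maps to {slot number: value string}.
docs = {
    1: {0: "20070830", 1: "9"},
    2: {0: "20070902", 1: "10"},
    3: {0: "20071115"},  # no value in slot 1
}

def value_range(docs, slot, low, high):
    """Return ids of documents whose value in `slot` lies in [low, high].

    Comparison is plain string ordering; documents with no value in the
    slot never match.
    """
    return sorted(
        doc_id
        for doc_id, values in docs.items()
        if slot in values and low <= values[slot] <= high
    )

def pad(number, width=6):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    return str(number).zfill(width)

# Fixed-width dates (YYYYMMDD) already sort correctly as strings.
dates = value_range(docs, 0, "20070801", "20070930")

# Unpadded numbers do not: as strings, "9" > "20" and "10" < "5".
naive = value_range(docs, 1, "5", "20")

# Padding the stored values and the range ends restores numeric order.
padded_docs = {
    doc_id: {slot: (pad(int(v)) if slot == 1 else v) for slot, v in values.items()}
    for doc_id, values in docs.items()
}
padded = value_range(padded_docs, 1, pad(5), pad(20))
```

Here `dates` and `padded` both select documents 1 and 2, while `naive` selects nothing, which is the practical consequence of the value strings not being interpreted.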