I just learned probabilistic retrieval (PR) in my IR course, and I find
Xapian a fantastic open source project for understanding how PR can be
implemented in practice. I tried to build a local article PR system with
Xapian and Omega, with an eye on extending it into a larger text
categorization project. My index script is below:

****************************************************
UID: field=UID boolean=XUID unique=Q
JN: boolean=XJN field=JN
PY: boolean=XPY field=PY
TI: index=XTA truncate=200 field=TI
AU: boolean=XAU truncate=200 field=AU
CA: boolean=XCA field=CA
AB: index=XTA truncate=200 field=AB
****************************************************

(UID: unique ID for each article; JN: journal; PY: publication year;
TI: title; AU: author; CA: classification; AB: abstract. I hope to search
TI and AB, and probably use JN, PY, and CA as BOOLEAN filters on some
occasions.)

Here are my questions. (To clarify: when I say field with no quotation
marks, I mean a field in the db. "field=" refers to the action that adds
a field to the Xapian record.)

(1) Do I really need "field=" for each? Isn't "field=" just for
displaying web search results (as Sam described in earlier messages:
"Fields are used to retrieve per-record text for summaries and things
like for Omega.")? Can't I get these values with
"get_document().get_data()" using MSetIterator in my local system even
without "field=", say, to output search results into a text file?

(2) How does "truncate=" work? Does it work for both probabilistic
fields and BOOLEAN fields? Does it truncate each word while indexing,
e.g. truncate a term if it's longer than 200 characters? Or does it
truncate the whole field while doing the action "field="?

(3) In the indexing process, I got the following error message:
"Exception: Key too long: length was 264 bytes, maximum length of a key
is Btree::max_key_len bytes". I understand it means a single term is too
long. But a term in which field: the primary field UID, or any field
such as JN, CA, and AB? If it's a term in a probabilistic field, which
I'd like to keep as it is and searchable, what should I do to circumvent
this problem?

Any idea/suggestion is highly appreciated.

Sabrina

On Sun, Mar 27, 2005 at 07:45:21PM +0000, Sabrina Shen wrote:
> UID: field=UID boolean=XUID unique=Q

Unless you're trying to do something clever, you want the same prefix for
boolean and unique.

> (1) Do I really need "field=" for each?

You do if you want Xapian to store them.

> Isn't "field=" just for displaying web search results (As Sam
> described in earlier messages: "Fields are used to retrieve per-record
> text for summaries and things like for Omega." ) ?

There's nothing special about web search results. Sam meant "for Omega"
simply as an example of a program which might use them.

> Can't I get these values with "get_document().get_data()" using
> MSetiterator in my local system even without "field="? say, output
> search results into a text file?

The document data is built from the values processed with "field=". So
if you don't have a field action, the value won't be stored in the
document data. Sometimes that's what you want...

> (2) How does "truncate=" work?

The "input field" from the dump file is fed through each action in turn.
The "truncate" action simply truncates the value to the given length, so
actions on the same line after the "truncate" see the truncated text.

> Does it work for both probabilistic field and BOOLEAN field?

For *ANY* action after it.

> Does it truncate each word while indexing, e.g. truncate a term
> if it's longer than 200 characters while indexing?

No - "index" after "truncate" means the text will be truncated before
word splitting. But "index" will discard any word of more than 64
characters anyway.

> (3) In the indexing process, I got an error message as following:
> "Exception: Key too long: length was 264 bytes, maximum length of a key is
> Btree::max_key_len bytes". I understand it means a single term is too
> long. But a term in which field: the primary field UID? or any field
> such as JN, CA, and AB?

It'll be in one of the boolean fields (unless you passed "index" a prefix
of 200 or so characters!)

This should be reported better. We need to check term length explicitly
up front (at present this exception comes from a lower level which is
handling keys built from terms and document ids).

Cheers,
Olly
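
The ordering rule described above (each action sees whatever the earlier actions on the same line left behind) can be illustrated with a small toy model. This is only a sketch of the behaviour as explained in this thread, not Xapian's scriptindex code: `run_actions` is a hypothetical helper, "truncate" is modelled as a plain cut at a character count, "boolean" is assumed to turn the whole value into one prefixed term, and "index" is reduced to lowercasing and whitespace splitting with the 64-character word limit mentioned above.

```python
def run_actions(value, actions, max_word_len=64):
    """Toy model: feed one input value through (action, argument) pairs in order.

    Returns (stored, terms): what "field=" would keep in the document data,
    and the terms that "boolean=" / "index=" would add.
    """
    stored = {}
    terms = []
    for action, arg in actions:
        if action == "truncate":
            # Later actions on the same line only see the shortened text.
            value = value[:int(arg)]
        elif action == "field":
            stored[arg] = value
        elif action == "boolean":
            # Assumption: the whole value becomes a single prefixed term.
            terms.append(arg + value)
        elif action == "index":
            # Simplified word splitting; overlong words are dropped.
            for word in value.lower().split():
                if len(word) <= max_word_len:
                    terms.append(arg + word)
        else:
            raise ValueError("unknown action: " + action)
    return stored, terms


if __name__ == "__main__":
    # Same shape as the AU line in the script: boolean BEFORE truncate.
    stored, terms = run_actions(
        "Alpha Beta Gamma",
        [("boolean", "XAU"), ("truncate", "5"), ("field", "AU")],
    )
    print(stored)
    print(terms)
```

In this model the boolean term is built from the full, untruncated value because "boolean" comes before "truncate" on the line; only "field" sees the shortened text. Moving "truncate" to the front of the line would shorten the value for every action after it.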

Thanks a lot! Now I have a much better understanding.

--- Olly Betts <olly@survex.com> wrote:
> On Sun, Mar 27, 2005 at 07:45:21PM +0000, Sabrina Shen wrote:
> > UID: field=UID boolean=XUID unique=Q
>
> Unless you're trying to do something clever, you want the same prefix
> for boolean and unique.

Yes, you're right. I'll change it.

> > (1) Do I really need "field=" for each?
>
> You do if you want Xapian to store them.
>
> > Isn't "field=" just for displaying web search results (As Sam
> > described in earlier messages: "Fields are used to retrieve
> > per-record text for summaries and things like for Omega." ) ?
>
> There's nothing special about web search results. Sam meant "for
> Omega" simply as an example of a program which might use them.
>
> > Can't I get these values with "get_document().get_data()" using
> > MSetiterator in my local system even without "field="? say, output
> > search results into a text file?
>
> The document data is built from the values processed with "field=". So
> if you don't have a field action, the value won't be stored in the
> document data. Sometimes that's what you want...

Oh, I see. I have to keep the "field=" action.

> > (2) How does "truncate=" work?
>
> The "input field" from the dump file is fed through each action in
> turn. The "truncate" action simply truncates the value to the given
> length, so actions on the same line after the "truncate" see the
> truncated text.
>
> > Does it work for both probabilistic field and BOOLEAN field?
>
> For *ANY* action after it.
>
> > Does it truncate each word while indexing, e.g. truncate a term
> > if it's longer than 200 characters while indexing?
>
> No - "index" after "truncate" means the text will be truncated before
> word splitting. But "index" will discard any word of more than 64
> characters anyway.

I got it. That's also why the "key too long" error is probably not
coming from the "index" fields.

> > (3) In the indexing process, I got an error message as following:
> > "Exception: Key too long: length was 264 bytes, maximum length of a
> > key is Btree::max_key_len bytes". I understand it means a single
> > term is too long. But a term in which field: the primary field UID?
> > or any field such as JN, CA, and AB?
>
> It'll be in one of the boolean fields (unless you passed "index" a
> prefix of 200 or so characters!)

This is somewhat unexpected. It seems to me that there shouldn't be a
single term longer than 200 characters in the boolean fields. JN
(journal name) is separated by spaces. Publication year is a 4-digit
number. Classification is a two-character code. I assigned multiple
values for articles with multiple authors (AU). Anyway, I'll check
whether there is such a long term in a single value.

> This should be reported better. We need to check term length
> explicitly up front (at present this exception comes from a lower
> level which is handling keys built from terms and document ids).
>
> Cheers,
> Olly

Is there a way that I can check exactly where this error happened, say,
with which term and which document?

Thanks!
Sabrina
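
One way to narrow this down without touching Xapian at all is to scan the dump file before indexing and flag any boolean field value that would produce an overlong term. The sketch below rests on several assumptions that are not confirmed in this thread: that the dump file holds one "name=value" pair per line with blank lines between records and continuation lines starting with "=", that "boolean" makes a single term from the prefix plus the whole value (so spaces inside a journal name do not split it), and that 200 bytes is a useful warning threshold. The function names and the threshold are hypothetical; the prefixes are copied from the index script in the first message.

```python
# Prefixes taken from the boolean= actions in the index script above.
BOOLEAN_PREFIXES = {"UID": "XUID", "JN": "XJN", "PY": "XPY", "AU": "XAU", "CA": "XCA"}


def read_records(lines):
    """Yield one dict (field name -> list of values) per record.

    Assumed format: "name=value" lines, blank line between records,
    a line starting with "=" continues the previous value.
    """
    record, last = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if record:
                yield record
            record, last = {}, None
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line; ignored in this sketch
        if name == "":
            if last is not None:
                record[last][-1] += "\n" + value
        else:
            record.setdefault(name, []).append(value)
            last = name
    if record:
        yield record


def find_long_terms(lines, max_len=200):
    """Return (record number, UID, field name, term size in bytes) for suspects."""
    hits = []
    for number, record in enumerate(read_records(lines), start=1):
        uid = record.get("UID", ["?"])[0]
        for name, prefix in BOOLEAN_PREFIXES.items():
            for value in record.get(name, []):
                size = len((prefix + value).encode("utf-8"))
                if size > max_len:
                    hits.append((number, uid, name, size))
    return hits


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as dump:
        for number, uid, name, size in find_long_terms(dump):
            print(f"record {number} (UID {uid}): field {name} gives a {size}-byte term")
```

Run it with the dump file path as the only argument. Since the AU line in the script applies "boolean" before "truncate", and JN and CA have no "truncate" at all, those three fields are the natural places to look first.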