thr3ads.net - Xapian discuss - [Xapian-discuss] some questions with scriptindex [Mar 2005]

Sabrina Shen

2005-Mar-27 21:01 UTC

[Xapian-discuss] some questions with scriptindex

I just learned about probabilistic retrieval (PR) in my IR course, and I find Xapian
a fantastic open-source project for understanding how PR can be implemented in
practice. I am trying to build a local article PR system with Xapian and Omega,
with an eye on extending it into a larger text categorization project. My index
script is below:

****************************************************
UID: field=UID boolean=XUID unique=Q 
JN:  boolean=XJN field=JN
PY:  boolean=XPY field=PY
TI:  index=XTA truncate=200 field=TI
AU:  boolean=XAU truncate=200 field=AU
CA:  boolean=XCA field=CA
AB: index=XTA truncate=200  field=AB
****************************************************
(UID: unique ID for each article; JN: journal; PY: publication year; TI: title;
AU: author; CA: classification; AB: abstract. I hope to search TI and AB, and
probably use JN, PY, and CA as boolean filters on some occasions.)

Here are my questions. (To clarify: when I say field with no quotation marks, I
mean a field in the db. "field=" refers to the action that adds a field to the
Xapian record.)

(1) Do I really need "field=" for each? Isn't "field=" just for displaying web
search results (as Sam described in earlier messages: "Fields are used to
retrieve per-record text for summaries and things like for Omega.")? Can't I
get these values with "get_document().get_data()" using MSetiterator in my local
system even without "field=", say, to output search results into a text file?

(2) How does "truncate=" work? Does it work for both probabilistic fields and
boolean fields? Does it truncate each word while indexing, e.g. truncate a term
if it's longer than 200 characters? Or does it truncate the whole field while
doing the action "field="?

(3) During indexing, I got the following error message:
"Exception: Key too long: length was 264 bytes, maximum length of a key is
Btree::max_key_len bytes". I understand it means a single term is too long. But
a term in which field: the primary field UID, or any field such as JN, CA, and
AB? If it's a term in a probabilistic field, which I'd like to keep as it is and
searchable, what shall I do?

Any idea/suggestion is highly appreciated.

Sabrina

Olly Betts

2005-Mar-28 00:05 UTC

[Xapian-discuss] some questions with scriptindex

On Sun, Mar 27, 2005 at 07:45:21PM +0000, Sabrina Shen wrote:
> UID: field=UID boolean=XUID unique=Q
Unless you're trying to do something clever, you want the same prefix for
boolean and unique.
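
One way to apply that advice to the UID line, as a sketch: reuse the Q prefix that the original script already passes to "unique" for the "boolean" action too. Q is an assumption about which of the two prefixes to keep; using XUID for both would satisfy the same rule.

```
UID: field=UID boolean=Q unique=Q
```
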
> (1) Do I really need "field=" for each?
You do if you want Xapian to store them.
> Isn't "field=" just for displaying web search results (As Sam
> described in earlier messages: "Fields are used to retrieve per-record
> text for summaries and things like for Omega." ) ?
There's nothing special about web search results.  Sam meant "for Omega"
simply as an example of a program which might use them.
> Can't I get these values with "get_document().get_data()" using
> MSetiterator in my local system even without "field="? say, output
> search results into a text file?
The document data is built from the values processed with "field=".  So
if you don't have a field action, the value won't be stored in the
document data.  Sometimes that's what you want...
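
To make the point about document data concrete, here is a minimal Python sketch of reading stored fields back out. It assumes the document data holds one "name=value" line per "field=" action, which is how scriptindex is understood to lay it out; the function name parse_document_data and the sample string are hypothetical, and the string stands in for whatever get_data() would return for one record.

```python
def parse_document_data(data):
    """Split 'name=value' lines into a dict of lists (a field may repeat)."""
    fields = {}
    for line in data.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that do not look like name=value
        fields.setdefault(name, []).append(value)
    return fields


# Hypothetical stand-in for the string returned by get_data() for one hit.
# Only fields that had a "field=" action appear here.
sample = "UID=A0001\nTI=An example title\nAU=Author One\nAU=Author Two"

record = parse_document_data(sample)
print(record["TI"][0])
print(record.get("AB", []))  # AB had no "field=" action in this sample
```

A field that was indexed without a "field=" action simply never shows up in the parsed record, so a script writing results to a text file can only print what was stored this way.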
> (2) How does "truncate=" work?
The "input field" from the dump file is fed through each action in turn.
The "truncate" action simply truncates the value to the given length, so
actions on the same line after the "truncate" see the truncated text.
> Does it work for both probabilistic field and BOOLEAN field?
For *ANY* action after it.
> Does it truncate each word while indexing, e.g. truncate a term
> if it's longer than 200 characters while indexing?
No - "index" after "truncate" means the text will be truncated before
word splitting.  But "index" will discard any word of more than 64
characters anyway.
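
The left-to-right behaviour described above can be illustrated with a toy model in Python. This is only a sketch of the ordering rule, not scriptindex's implementation: run_actions is a hypothetical name, word splitting is reduced to a simple regex with lowercasing, and the truncation is a plain slice (the real action may treat word boundaries differently).

```python
import re

MAX_WORD_LEN = 64  # per the reply above: "index" discards longer words


def run_actions(value, actions):
    """Feed one input field through its actions, left to right (toy model)."""
    terms, stored = [], {}
    for name, arg in actions:
        if name == "truncate":
            value = value[:int(arg)]  # later actions see the shorter text
        elif name == "index":
            for word in re.findall(r"\w+", value.lower()):
                if len(word) <= MAX_WORD_LEN:
                    terms.append(arg + word)
        elif name == "boolean":
            terms.append(arg + value)  # the whole value becomes one term
        elif name == "field":
            stored[arg] = value
    return terms, stored


# "truncate" first: both "index" and "field" see the truncated text.
terms, stored = run_actions(
    "alpha beta gamma", [("truncate", "10"), ("index", "XTA"), ("field", "TI")]
)

# Same order as the AU line in the script: "boolean" runs before "truncate",
# so the boolean term is built from the full, untruncated value.
au_terms, au_stored = run_actions(
    "x" * 300, [("boolean", "XAU"), ("truncate", "200"), ("field", "AU")]
)
print(len(au_terms[0]), len(au_stored["AU"]))
```

In this model, a line such as "AU: boolean=XAU truncate=200 field=AU" stores a truncated field but still generates a boolean term from the whole value, which is one way an over-long term could arise.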
> (3) In the indexing process, I got an error message as following: 
> "Exception: Key too long: length was 264 bytes, maximum length of a key is
> Btree::max_key_len bytes". I understand it means a single term is too
> long. But a term in which field: the primary field UID? or any field
> such as JN, CA, and AB?
It'll be in one of the boolean fields (unless you passed "index" a prefix
of 200 or so characters!)

This should be reported better.  We need to check term length explicitly
up front (at present this exception comes from a lower level which is
handling keys built from terms and document ids).

Cheers,
    Olly

Sabrina Shen

2005-Mar-28 01:08 UTC

[Xapian-discuss] some questions with scriptindex

Thanks a lot! Now I have a much better understanding.

--- Olly Betts <olly@survex.com> wrote:
> On Sun, Mar 27, 2005 at 07:45:21PM +0000, Sabrina Shen wrote:
> > UID: field=UID boolean=XUID unique=Q
>
> Unless you're trying to do something clever, you want the same prefix
> for boolean and unique.
>
Yes, you're right. I'll change it.
> > (1) Do I really need "field=" for each?
> 
> You do if you want Xapian to store them.
> 
> > Isn't "field=" just for displaying web search results (As Sam
> > described in earlier messages: "Fields are used to retrieve per-record
> > text for summaries and things like for Omega." ) ?
>
> There's nothing special about web search results.  Sam meant "for Omega"
> simply as an example of a program which might use them.
>
> > Can't I get these values with "get_document().get_data()" using
> > MSetiterator in my local system even without "field="? say, output
> > search results into a text file?
>
> The document data is built from the values processed with "field=".  So
> if you don't have a field action, the value won't be stored in the
> document data.  Sometimes that's what you want...
> 
Oh, I see. I have to keep the "field=" action.
> > (2) How does "truncate=" work?
> 
> The "input field" from the dump file is fed through each action in turn.
> The "truncate" action simply truncates the value to the given length, so
> actions on the same line after the "truncate" see the truncated text.
>
> > Does it work for both probabilistic field and BOOLEAN field?
>
> For *ANY* action after it.
>
> > Does it truncate each word while indexing, e.g. truncate a term
> > if it's longer than 200 characters while indexing?
>
> No - "index" after "truncate" means the text will be truncated before
> word splitting.  But "index" will discard any word of more than 64
> characters anyway.
I got it. That's also why the key too long error is
probably not from the "index" field.
> > (3) In the indexing process, I got an error message as following:
> > "Exception: Key too long: length was 264 bytes, maximum length of a key is
> > Btree::max_key_len bytes". I understand it means a single term is too
> > long. But a term in which field: the primary field UID? or any field
> > such as JN, CA, and AB?
>
> It'll be in one of the boolean fields (unless you passed "index" a prefix
> of 200 or so characters!)
This is somewhat unexpected.  It seems to me that
there shouldn't be a single term longer than 200 in
the boolean fields. JN (journal name) is separated by
spaces. Publication Year is a 4-digit number.
Classification is a code with two chars. I assigned
multiple values for articles with multiple authors
(AU). Anyway, I'll check whether there is such a long 
term in a single value. 
> This should be reported better.  We need to check term length explicitly
> up front (at present this exception comes from a lower level which is
> handling keys built from terms and document ids).
> 
> Cheers,
>     Olly
Is there a way that I can check exactly where this error happened, say,
with which term and which document?

Thanks!

Sabrina
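
One way to narrow down the offending record before indexing is to scan the dump file directly. The Python sketch below makes several assumptions: the dump uses "FIELD=value" lines with blank lines between records and lines starting with "=" as continuations; each "boolean=" action turns the whole field value into a single prefixed term; and the 200-byte limit is a conservative guess, not the backend's actual maximum. The names read_records, find_long_boolean_terms, and PREFIXES are hypothetical.

```python
def read_records(lines):
    """Group dump lines into records: lists of [field, value] pairs."""
    record = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:  # a blank line ends the current record
            if record:
                yield record
            record = []
        elif line.startswith("=") and record:
            record[-1][1] += "\n" + line[1:]  # continuation of previous field
        elif "=" in line:
            field, value = line.split("=", 1)
            record.append([field, value])
    if record:
        yield record


def find_long_boolean_terms(lines, prefixes, limit):
    """Yield (record number, field, term size in bytes) for oversized terms."""
    for number, record in enumerate(read_records(lines), start=1):
        for field, value in record:
            if field in prefixes:
                size = len((prefixes[field] + value).encode("utf-8"))
                if size > limit:
                    yield number, field, size


# Boolean prefixes taken from the index script in the first message.
PREFIXES = {"UID": "XUID", "JN": "XJN", "PY": "XPY", "AU": "XAU", "CA": "XCA"}

# Made-up dump content; a real run would pass an open file object instead.
dump = ["UID=1", "JN=Short Journal", "", "UID=2",
        "JN=" + "Very Long Journal Name " * 12]

for number, field, size in find_long_boolean_terms(dump, PREFIXES, limit=200):
    print(f"record {number}: field {field} gives a {size}-byte term")
```

Because the functions only iterate over lines, an open dump file can be passed in place of the list, and the reported record number can then be matched against the UID of that record.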


		