thr3ads.net - Xapian discuss - Xapian 1.4.0 released [Jul 2016]

James Aylett

2016-Jul-24 14:16 UTC

Xapian 1.4.0 released

On Fri, Jul 22, 2016 at 07:19:43PM -0700, Kevin Duraj wrote:
> I would like to propose to change the following code while indexing a
> term that is larger than 245 characters and then crashing and aborting
> the entire index, we could rather truncate the term to 245 characters
> and continue with indexing.

Kevin -- I wonder what others are currently doing when this comes up
(or if they're just ignoring it). Another approach, which I've
mentioned on the PR, might be to auto-truncate terms earlier in the
process, using a convenience function wrapped inside a call to
`add_term()` and similar. This would allow people who find use for the
exception to continue using things that way.
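
As a rough illustration of the convenience-function idea, here is a minimal Python sketch. The helper name `truncate_term` and the constant `MAX_TERM_BYTES` are hypothetical and not part of Xapian; the value 245 is the limit quoted in this thread, and treating it as a byte count on UTF-8 data is an assumption. The caller opts in by wrapping the term before passing it to `add_term()`, so anyone who relies on the exception keeps the existing behaviour.

```python
MAX_TERM_BYTES = 245  # limit quoted in this thread; assumed to be counted in bytes


def truncate_term(term: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Return term cut down to at most max_bytes of UTF-8 (hypothetical helper)."""
    encoded = term.encode("utf-8")
    if len(encoded) <= max_bytes:
        return term
    # Cut at the byte limit; errors="ignore" drops a trailing partial
    # multi-byte sequence so the result is still valid UTF-8.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Usage would look something like `doc.add_term(truncate_term(term))`, which keeps the truncation a deliberate choice at the call site rather than a silent default.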

Alternatively, maybe we could find a way of configuring this
behaviour. I certainly see the benefit in some situations of being
able to just fling data at an indexer and not worry over-much about
long terms, which are mostly flotsam anyway in a lot of applications.

Anyone else have any thoughts? Now is a good time to think about
things like this.

(I'm not a fan of silent truncation; it's bitten me on too many other
EIS in the past. Choosing it deliberately is of course another matter.)

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Kevin Duraj

2016-Jul-25 20:48 UTC

Xapian 1.4.0 released

Now imagine my situation, and probably that of others too, when working
with big data. I select 1 billion YouTube videos and index them with
Xapian. Now a kid uploads a Pokemon video and, for some reason, keeps
pressing a single key on the keyboard until the term becomes 500
characters long (e.g., EEEEEEE).

The Xapian indexer is running and, after it has indexed 500 million
documents, it suddenly comes to the kid's Pokemon video with a
500-character term in the description, and Xapian stops the entire
indexing run, saying "Term too long > 245."

I think a log file with a warning stating the document id and the term
that is too long would be sufficient. Of course, I can fix it myself
by checking every term's length, but that will add more overhead to
big data computing.
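
For reference, a caller-side version of the "warn and continue" behaviour could look like the Python sketch below. Everything here is illustrative: `add_terms_skipping_long` is a hypothetical helper, `add_term` stands for whatever callable actually adds a term to the document, and the 245 limit is the figure from this thread, assumed to be in bytes.

```python
import logging

logger = logging.getLogger("indexer")  # hypothetical logger name

MAX_TERM_BYTES = 245  # limit quoted in this thread; assumed to be counted in bytes


def add_terms_skipping_long(doc_id, terms, add_term, max_bytes=MAX_TERM_BYTES):
    """Add each term via add_term(), skipping and logging overlong ones.

    Returns the number of terms that were skipped.
    """
    skipped = 0
    for term in terms:
        size = len(term.encode("utf-8"))
        if size > max_bytes:
            # Log the document id and the start of the offending term,
            # then carry on with the remaining terms.
            logger.warning(
                "doc %s: skipping %d-byte term starting %r", doc_id, size, term[:20]
            )
            skipped += 1
            continue
        add_term(term)
    return skipped
```

The check is a single length comparison per term, done before the term is handed to the indexer.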


Adam Sjøgren

2016-Jul-25 21:14 UTC

Xapian 1.4.0 released

Kevin writes:
> Of course, I can fix it by myself and check every terms length, but
> that will add more overhead to big data computing.

How is the overhead different whether your code checks it or Xapian does?


  Best regards,

    Adam

-- 
 "Oh, we all like motorcycles, to some degree."      Adam Sjøgren
                                                 asjo at koldfront.dk

Olly Betts

2016-Sep-12 04:51 UTC

Xapian 1.4.0 released

On Mon, Jul 25, 2016 at 01:48:02PM -0700, Kevin Duraj wrote:
> Now imagine my situation and probably others too, when we are working
> with big data. I select 1 billion of YouTube videos, and then I index
> it with Xapian. Now a kid uploads Pokemon video and for some reason,
> the kid keeps pressing a single key on the keyboard until the term
> become 500 characters long (e.g., EEEEEEE).
> 
> Xapian index is running and after it has indexed 500 million
> documents, suddenly come to the kid Pokemon video with 500 characters
> long term in the description and Xapian will stop the entire index,
> saying that "Term too long > 245."

No, TermGenerator will skip the term because it is longer than
max_word_length (which you can set through the API but defaults to 64
bytes).

You'll only hit this exception if you set max_word_length much higher
than the default, or if you call add_term() and/or
add_posting() directly instead of using TermGenerator, or when
calling add_boolean_term().  So with modern API use, you will only need
to check boolean terms fit in the length limit (and those are the case
where blindly truncating is most problematic).
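
To illustrate why blind truncation is a poor fit for boolean terms, and one possible way around it, here is a minimal Python sketch. It is not something Xapian does for you: `make_boolean_term` is a hypothetical helper, the 245 limit is the figure from this thread (assumed to be in bytes), and replacing the overlong tail with a SHA-1 hex digest is just one technique chosen for the example. The point is that two long values sharing a 245-byte prefix would collapse to the same term if truncated, whereas a digest of the full value keeps them distinct.

```python
import hashlib

MAX_TERM_BYTES = 245  # limit quoted in this thread; assumed to be counted in bytes


def make_boolean_term(prefix: str, value: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Build prefix+value, hashing the tail if the result would be too long."""
    term = prefix + value
    encoded = term.encode("utf-8")
    if len(encoded) <= max_bytes:
        return term
    # 40 hex characters derived from the full, untruncated term.
    digest = hashlib.sha1(encoded).hexdigest()
    keep = max_bytes - len(digest)
    # Keep a readable head; errors="ignore" drops any split multi-byte character.
    head = encoded[:keep].decode("utf-8", errors="ignore")
    return head + digest
```

The returned string would then be passed to `add_boolean_term()`, and the same helper has to be applied when building the filter term at query time so that both sides agree.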

> I think, a log file with a warning would be sufficient stating the
> document id, the term that is too long. Of course, I can fix it by
> myself and check every terms length, but that will add more overhead
> to big data computing.

There's not currently a log file to log this to.

Cheers,
    Olly
