On Fri, Jul 22, 2016 at 07:19:43PM -0700, Kevin Duraj wrote:
> I would like to propose to change the following code while indexing a
> term that is larger than 245 characters and then crashing and aborting
> the entire index, we could rather truncate the term to 245 characters
> and continue with indexing.

Kevin -- I wonder what others are currently doing when this comes up
(or if they're just ignoring it). Another approach, which I've
mentioned on the PR, might be to auto-truncate terms earlier in the
process, using a convenience function wrapped inside a call to
`add_term()` and similar. This would allow people who find use for the
exception to continue using things that way.

Alternatively, maybe we could find a way of configuring this
behaviour. I certainly see the benefit in some situations of being
able to just fling data at an indexer and not worry over-much about
long terms, which are mostly flotsam anyway in a lot of applications.

Anyone else have any thoughts? Now is a good time to think about
things like this.

(I'm not a fan of silent truncation; it's bitten me on too many other
EIS in the past. Choosing it deliberately is of course another matter.)

J

-- 
James Aylett, occasional trouble-maker
xapian.org
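
[Editorial sketch, not part of the original message.] The "convenience function" idea above could look roughly like the following. The name `truncate_term` is hypothetical and is not part of Xapian. The 245 figure is the one quoted in this thread and is treated here as a byte limit, which is an assumption. The helper cuts on a UTF-8 character boundary so that a multi-byte character is never split.

```python
def truncate_term(term: str, max_bytes: int = 245) -> str:
    """Return `term` cut to at most `max_bytes` bytes of UTF-8.

    Hypothetical helper: 245 is the limit quoted in the thread and is
    assumed here to be measured in bytes.
    """
    encoded = term.encode("utf-8")
    if len(encoded) <= max_bytes:
        return term
    # Slicing bytes may cut a multi-byte character in half;
    # errors="ignore" drops that trailing partial sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Intended use (doc would be a document object from the indexing library):
#     doc.add_term(truncate_term(term))
```

Because the caller wraps the term explicitly, truncation is chosen deliberately rather than applied silently, and code that relies on the exception is unaffected.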
Now imagine my situation, and probably that of others too, when
working with big data. I select 1 billion YouTube videos and index
them with Xapian. Now a kid uploads a Pokemon video and, for some
reason, keeps pressing a single key on the keyboard until the term
becomes 500 characters long (e.g., EEEEEEE).

The Xapian index is running and, after it has indexed 500 million
documents, it suddenly comes to the kid's Pokemon video with a
500-character term in the description, and Xapian stops the entire
index, saying "Term too long > 245."

I think a log file with a warning stating the document id and the
term that is too long would be sufficient. Of course, I can fix it
myself and check every term's length, but that will add more overhead
to big data computing.

On Sun, Jul 24, 2016 at 7:16 AM, James Aylett <james-xapian at tartarus.org> wrote:
> On Fri, Jul 22, 2016 at 07:19:43PM -0700, Kevin Duraj wrote:
>
>> I would like to propose to change the following code while indexing a
>> term that is larger than 245 characters and then crashing and aborting
>> the entire index, we could rather truncate the term to 245 characters
>> and continue with indexing.
>
> Kevin -- I wonder what others are currently doing when this comes up
> (or if they're just ignoring it). Another approach, which I've
> mentioned on the PR, might be to auto-truncate terms earlier in the
> process, using a convenience function wrapped inside a call to
> `add_term()` and similar. This would allow people who find use for the
> exception to continue using things that way.
>
> Alternatively, maybe we could find a way of configuring this
> behaviour. I certainly see the benefit in some situations of being
> able to just fling data at an indexer and not worry over-much about
> long terms, which are mostly flotsam anyway in a lot of applications.
>
> Anyone else have any thoughts? Now is a good time to think about
> things like this.
>
> (I'm not a fan of silent truncation; it's bitten me on too many other
> EIS in the past. Choosing it deliberately is of course another matter.)
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
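
[Editorial sketch, not part of the original message.] The "warn and carry on" behaviour asked for above can be approximated on the application side. This is only an illustration. `add_terms_safely` and the `add_term` callback are hypothetical names, and the 245-byte limit is the figure quoted in the thread, not a value read from the library.

```python
import logging

logger = logging.getLogger("indexer")

MAX_TERM_BYTES = 245  # assumption: the limit quoted in the thread, in bytes


def add_terms_safely(doc_id, terms, add_term, max_bytes=MAX_TERM_BYTES):
    """Pass each term that fits to `add_term`; log and skip the rest.

    `add_term` is any callable taking one term, for example the bound
    add_term method of a document object. Returns the skipped terms.
    """
    skipped = []
    for term in terms:
        size = len(term.encode("utf-8"))
        if size > max_bytes:
            # Log only the first 40 characters to keep the log readable.
            logger.warning(
                "doc %s: skipping term of %d bytes (limit %d): %.40s...",
                doc_id, size, max_bytes, term,
            )
            skipped.append(term)
            continue
        add_term(term)
    return skipped
```

The per-term cost here is one UTF-8 encode and one integer comparison, so the indexing run continues past the offending document while still leaving a record of what was dropped.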
Kevin writes:
> Of course, I can fix it by myself and check every term's length, but
> that will add more overhead to big data computing.

How is the overhead different whether your code checks it or Xapian does?

Best regards,

Adam

-- 
"Oh, we all like motorcycles, to some degree."
Adam Sjøgren
asjo at koldfront.dk
On Mon, Jul 25, 2016 at 01:48:02PM -0700, Kevin Duraj wrote:
> Now imagine my situation, and probably that of others too, when
> working with big data. I select 1 billion YouTube videos and index
> them with Xapian. Now a kid uploads a Pokemon video and, for some
> reason, keeps pressing a single key on the keyboard until the term
> becomes 500 characters long (e.g., EEEEEEE).
>
> The Xapian index is running and, after it has indexed 500 million
> documents, it suddenly comes to the kid's Pokemon video with a
> 500-character term in the description, and Xapian stops the entire
> index, saying "Term too long > 245."

No, TermGenerator will skip the term because it is longer than
max_word_length (which you can set through the API but defaults to 64
bytes).

You'll only hit this exception if you set max_word_length much higher
than the default, or if you call add_term() and/or add_posting()
directly instead of using TermGenerator, or when calling
add_boolean_term().

So with modern API use, you will only need to check that boolean terms
fit in the length limit (and those are the cases where blindly
truncating is most problematic).

> I think a log file with a warning stating the document id and the
> term that is too long would be sufficient. Of course, I can fix it
> myself and check every term's length, but that will add more overhead
> to big data computing.

There's not currently a log file to log this to.

Cheers,
    Olly
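
[Editorial sketch, not part of the original message.] The remaining check on boolean terms could be a small validation step that raises instead of truncating. One plausible reason truncation is risky for these terms is that two distinct long identifiers can collapse into the same term. The function `checked_boolean_term` is hypothetical, the 245-byte limit is the figure quoted in the thread, and the `Q` and `U` prefixes are illustrative only.

```python
def checked_boolean_term(prefix: str, value: str, max_bytes: int = 245) -> str:
    """Build prefix + value, raising if it exceeds `max_bytes` of UTF-8.

    Hypothetical helper: the limit is an assumption taken from the thread.
    """
    term = prefix + value
    size = len(term.encode("utf-8"))
    if size > max_bytes:
        raise ValueError(
            f"boolean term is {size} bytes, limit is {max_bytes}: {term[:40]}..."
        )
    return term


# Two different long identifiers that would become identical if cut at 245.
url_a = "U" + "x" * 300 + "/a"
url_b = "U" + "x" * 300 + "/b"
truncation_collides = url_a != url_b and url_a[:245] == url_b[:245]
```

A caller would pass the returned term to add_boolean_term() and decide explicitly what to do on ValueError, for example by skipping the document or mapping the identifier to a shorter key.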