On Mon, Jul 11, 2016 at 02:02:56PM -0700, Kevin Duraj wrote:
> You are saying that when I search for "delve Xapian 1.4" on Google, a
> company worth of 491 Billion of Dollars and you saying that their top
> of the search result has nothing to do with Xapian.
>
> https://www.google.com/search?q=xapian+delve&ie=utf-8&oe=utf-8#q=delve+xapian+1.4

Well, I'm not saying that it's nothing to do with Xapian: clearly the
page is about Xapian. However since it wasn't _created_ by the Xapian
project, it's not authoritative.

Google search results are personalised; for me, the top hit for that
search is the Xapian Administrator's Guide, which correctly names the
command for 1.4 as xapian-delve (and the history of the change).

If I'm not logged into Google, I still get that document top, but the
Linux From Scratch one you found is in the list further down. I'd
guess the difference in ordering is down to your own personalisation
from Google, although of course we can never be certain with Google.

> Whooo, that is strong. I don't think Google will ever call you for an
> Interview now *LOL*

Kevin, I don't think sarcasm was called for here. Let's try to keep
our mailing lists and forums welcoming places that everyone can feel
comfortable!

J

--
James Aylett, occasional trouble-maker
xapian.org

James,

I would like to propose a change to the following code. At present,
indexing a term longer than 245 characters throws an exception, which
aborts the entire indexing run. Instead, we could truncate the term to
245 characters and continue indexing.

    if (tname.size() > MAX_SAFE_TERM_LENGTH)
        throw Xapian::InvalidArgumentError("Term too long (> " STRINGIZE(MAX_SAFE_TERM_LENGTH) "): " + tname);

Reference:
https://github.com/xapian/xapian/blob/e3692bff7b7c25c8e09536889d5884d033199f36/xapian-core/backends/glass/glass_database.cc#L1083-L1084

On Tue, Jul 12, 2016 at 10:13 AM, James Aylett <james-xapian at tartarus.org> wrote:
> On Mon, Jul 11, 2016 at 02:02:56PM -0700, Kevin Duraj wrote:
>
>> You are saying that when I search for "delve Xapian 1.4" on Google, a
>> company worth of 491 Billion of Dollars and you saying that their top
>> of the search result has nothing to do with Xapian.
>>
>> https://www.google.com/search?q=xapian+delve&ie=utf-8&oe=utf-8#q=delve+xapian+1.4
>
> Well, I'm not saying that it's nothing to do with Xapian: clearly the
> page is about Xapian. However since it wasn't _created_ by the Xapian
> project, it's not authoritative.
>
> Google search results are personalised; for me, the top hit for that
> search is the Xapian Administrator's Guide, which correctly names the
> command for 1.4 as xapian-delve (and the history of the
> change).
>
> If I'm not logged into Google, I still get that document top, but the
> Linux From Scratch one you found is in the list further down. I'd
> guess the difference in ordering is down to your own personalisation
> from Google, although of course we can never be certain with Google.
>
>> Whooo, that is strong. I don't think Google will ever call you for an
>> Interview now *LOL*
>
> Kevin, I don't think sarcasm was called for here. Let's try to keep
> our mailing lists and forums welcoming places that everyone can feel
> comfortable!
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org

James,

I have submitted pull request 113: if the term is too long, it is
resized to MAX_SAFE_TERM_LENGTH instead of throwing an exception.

https://github.com/xapian/xapian/pull/113

     if (tname.size() > MAX_SAFE_TERM_LENGTH)
    -    throw Xapian::InvalidArgumentError("Term too long (> " STRINGIZE(MAX_SAFE_TERM_LENGTH) "): " + tname);
    +    tname.resize(MAX_SAFE_TERM_LENGTH);

- Kevin Duraj

On Fri, Jul 22, 2016 at 07:19:43PM -0700, Kevin Duraj wrote:
> I would like to propose to change the following code while indexing a
> term that is larger than 245 characters and then crashing and aborting
> the entire index, we could rather truncate the term to 245 characters
> and continue with indexing.

Kevin -- I wonder what others are currently doing when this comes up
(or if they're just ignoring it).

Another approach, which I've mentioned on the PR, might be to
auto-truncate terms earlier in the process, using a convenience
function wrapped inside a call to `add_term()` and similar. This would
allow people who find use for the exception to continue using things
that way. Alternatively, maybe we could find a way of configuring this
behaviour.

I certainly see the benefit in some situations of being able to just
fling data at an indexer and not worry over-much about long terms,
which are mostly flotsam anyway in a lot of applications.

Anyone else have any thoughts? Now is a good time to think about
things like this.

(I'm not a fan of silent truncation; it's bitten me on too many other
EIS in the past. Choosing it deliberately is of course another matter.)

J

--
James Aylett, occasional trouble-maker
xapian.org
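
For illustration, the "convenience function" idea above might look
roughly like the sketch below. This is not code from the PR or from
Xapian itself: `truncate_term` and `add_term_truncated` are hypothetical
names, the default of 245 is simply the limit quoted earlier in the
thread, and the sketch assumes the limit applies to the raw byte length
of a UTF-8 term. The wrapper is a template so that it only relies on the
document type having an `add_term(std::string)` method.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the Xapian API): shorten `term` to at
// most `max_bytes` bytes. If the cut would land inside a multi-byte UTF-8
// sequence, back up to the start of that sequence so no character is split.
std::string truncate_term(const std::string& term, std::size_t max_bytes) {
    if (term.size() <= max_bytes) return term;
    std::size_t end = max_bytes;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    // byte we would drop is one, the cut point is mid-character.
    while (end > 0 &&
           (static_cast<unsigned char>(term[end]) & 0xC0) == 0x80) {
        --end;
    }
    return term.substr(0, end);
}

// Hypothetical opt-in wrapper: callers who want truncation use this, while
// callers who prefer the exception keep calling add_term() directly.
// The default limit of 245 is an assumption taken from this thread.
template <typename Document>
void add_term_truncated(Document& doc, const std::string& term,
                        std::size_t max_bytes = 245) {
    doc.add_term(truncate_term(term, max_bytes));
}
```

Because truncation only happens when the caller asks for it, the
existing exception stays the default, which keeps the choice deliberate
rather than silent.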