Francis Irving
2010-Jun-07 18:38 UTC
[Xapian-discuss] Is there a 64 character term size limit? In Ruby bindings?
I've just found some items in my Xapian database which aren't being indexed, when the terms are quite long.

Example term:
Frotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust

It represents that the Freedom of Information request was made to a particular public body. It results in pages like this not correctly showing results:

http://www.whatdotheyknow.com/body/rotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust

As far as I can tell the terms aren't being indexed when they are longer than 64 characters. They don't get put in the Xapian database at all.

I'm just quickly emailing to see if this is something people know about. I can try and make a self-contained test case if it isn't.

I'm using the Xapian ruby bindings, everything version 1.0.7-3.1, on Debian.

Francis
Richard Boulton
2010-Jun-07 21:24 UTC
[Xapian-discuss] Is there a 64 character term size limit? In Ruby bindings?
Xapian has a term length limit, and the exact limit depends on the backend in use, but no backend has one as short as 64 bytes: with flint, the limit is 245 bytes, and it was close to that with quartz (I can't remember the exact details now).

If a document containing a term which is too long is added, the add_document() call will throw an exception (Xapian::InvalidArgumentError), so you shouldn't be getting terms silently missing from documents.

Are you using the Xapian ruby bindings directly, or some intermediate layer? If you're using Xapian directly, I'm not sure what can be going on, and a test case would be very welcome.

--
Richard
Olly Betts
2010-Jun-08 01:32 UTC
[Xapian-discuss] Is there a 64 character term size limit? In Ruby bindings?
On Mon, Jun 07, 2010 at 07:38:08PM +0100, Francis Irving wrote:
> I've just found some items in my Xapian database which aren't being
> indexed, when the terms are quite long.
>
> Example term:
> Frotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust
>
> It represents that the Freedom of Information request was made to a
> particular public body. It results in pages like this not correctly
> showing results:
>
> http://www.whatdotheyknow.com/body/rotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust
>
> As far as I can tell the terms aren't being indexed when they are
> longer than 64 characters. They don't get put in the Xapian database
> at all.

TermGenerator ignores terms over that size to avoid indexing a lot of junk terms if it gets fed things like base64 or uuencoded data.

This term looks like a filtering term, in which case it would make more sense to add it with Document::add_term(). That doesn't have a limit on term size itself, though the backends have a limit of around 245 bytes.

Cheers,
Olly
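As a rough illustration of the suggestion above, a filter term can be added straight to the document with add_term instead of being passed through TermGenerator. This is only a sketch: the helper name add_filter_term and the constant MAX_TERM_BYTES are made up for the example, and the 245-byte figure is simply the flint limit quoted earlier in the thread, so treat it as approximate for other backends. The helper only needs an object that responds to add_term, which a Xapian::Document does.

```ruby
# Sketch only: add_filter_term and MAX_TERM_BYTES are hypothetical names,
# not part of Xapian. 245 bytes is the flint limit quoted in this thread.
MAX_TERM_BYTES = 245

# Adds prefix + value to doc as one unsplit term via add_term, so the
# 64-character cut-off applied by TermGenerator never comes into play.
# doc is anything responding to add_term(term), e.g. a Xapian::Document.
def add_filter_term(doc, prefix, value)
  term = "#{prefix}#{value}"
  if term.bytesize > MAX_TERM_BYTES
    # Fail early with a clear message rather than at add_document time.
    raise ArgumentError,
          "term is #{term.bytesize} bytes; backend limit is about #{MAX_TERM_BYTES}"
  end
  doc.add_term(term)
  term
end

# With the Xapian bindings installed, usage would look roughly like:
#   require 'xapian'
#   doc = Xapian::Document.new
#   add_filter_term(doc, 'F',
#     'rotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust')
#   db.add_document(doc)   # db: an already opened Xapian::WritableDatabase
```

Free text for the same document can still go through TermGenerator; only the exact-match filter terms need to bypass it.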