Rafael SDM Sierra
2006-Nov-28 02:39 UTC
[Xapian-discuss] Get the FLINT_BTREE_MAX_KEY_LEN variable in Python
How can I get this variable? It raises an error when I try to replace the document, rather than when I add the posting. To avoid this kind of error, I could do something like "if len(word) > max_length_allowed: continue" and carry on. I think this value varies between machines; here it is 987 bytes... Yes, I know nobody can remember a word that long, so for now I will limit it manually (200 chars is plenty; I don't know any word that long). Look:

Traceback (most recent call last):
  File "bot.py", line 47, in ?
    f.save()
  File "/home/sdm/workspace/gandalf/index/../../gandalf/__init__.py", line 128, in save
    self.index.replace_document(unique_key, doc)
ValueError: InvalidArgumentError: Key too long: length was 987 bytes, maximum length of a key is FLINT_BTREE_MAX_KEY_LEN bytes

--
SDM
Underlinux
http://stiod.wordpress.com
UnderLinux team member
--
PEP-8
There are only two kinds of people in the world: those who know English, and me. oO
Olly Betts
2006-Nov-28 03:20 UTC
[Xapian-discuss] Get the FLINT_BTREE_MAX_KEY_LEN variable in Python
On Sat, Nov 25, 2006 at 02:24:36PM -0200, Rafael SDM Sierra wrote:
> How can I get this variable?

It's actually a constant, not a variable, and its value isn't currently available via the API (the constant itself isn't directly useful, but a "maximum term length" would be). This is on my list of things to sort out, but it's complicated by zero bytes in terms being treated specially in this area. My plan is to fix that, and then we can have a "max term length" constant or API call.

Here's a bit of background: the Btree manager which Quartz and Flint both use versions of has a maximum key length of 252 bytes. But because the keys contain more than just term names, the maximum safe length for a term is 240 bytes (or perhaps a few more, but 240 is certainly safe). There's one further wrinkle - any zero bytes in a term require 2 bytes in the quartz key. Another oddity is that the key for some of the Btree tables has the document id encoded using a variable length encoding, so bigger document ids need more bytes. So the absolute maximum term length varies by document by a byte or two!

Currently I recommend imposing a sane threshold when tokenising text to produce terms, which is also wise as otherwise things like uuencoded text and ASCII art can generate lots of useless junk terms which just bloat the database! Omega uses a threshold of 64 for this. Also ensure that boolean terms can't be too long - for these, 240 bytes is a safe limit. If you need to make a term from something which can be longer, like a URL, you might want to look at how Omega handles this by hashing the tail of long URLs. The code is in omindex.cc, function make_url_term.

> To avoid this kind of error, I could do something like
> "if len(word) > max_length_allowed: continue" and carry on. I think
> this value varies between machines; here it is 987 bytes...

987 is the length of the key generated from a term you tried to add, not the threshold for key length you exceeded (which is 252 in both flint and quartz). This limit isn't variable, though the effective limit on term length is slightly, as noted above.

Cheers,
    Olly
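P.S. To sketch both ideas in Python - this is only illustrative, not the actual Omega code; the constants and helper names here are made up for the example, lengths are counted in characters although the real limits are in bytes, and the genuine URL hashing scheme is the one in make_url_term in omindex.cc:

    import hashlib
    import xapian

    MAX_WORD_LEN = 64        # tokenising threshold, like Omega uses
    MAX_SAFE_TERM_LEN = 240  # safe term length for flint and quartz

    def index_words(db, unique_term, words):
        # Skip over-long "words" (uuencoded text, ASCII art, etc.) rather
        # than letting them hit the key length limit in replace_document().
        doc = xapian.Document()
        pos = 0
        for word in words:
            if len(word) > MAX_WORD_LEN:
                continue
            doc.add_posting(word, pos)
            pos += 1
        # The unique term is a term too, so the length limit applies to it.
        doc.add_term(unique_term)
        db.replace_document(unique_term, doc)

    def make_url_term(url):
        # Boolean term for a URL; hash the tail if it would be too long.
        term = 'U' + url
        if len(term) <= MAX_SAFE_TERM_LEN:
            return term
        keep = MAX_SAFE_TERM_LEN - 32  # an MD5 hex digest is 32 characters
        return term[:keep] + hashlib.md5(term[keep:].encode('utf-8')).hexdigest()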