jarrod roberson
2006-Mar-01 18:58 UTC
[Xapian-discuss] Need a suggestion on implementation.
I am still working on my filesystem indexer. Thanks for all the previous help; Xapian is awesome!

On to my dilemma: I want to index arbitrarily long logical paths, and I have run up against the ~240 character term limit more than once so far. So I am trying to decide the best way to index path information.

My ideas are as follows, taking this path as an example:

/usr/jarrod/very/long/path/to/a/file.txt

One idea is to use prefixes like P000:usr, P001:jarrod, P002:very, P003:long ... you get the idea.

The other idea is to use positional information using add_posting( usr, 0 ), add_posting( jarrod, 1 ), add_posting( very, 2 ), add_posting( long, 3 ).

Which way, or which other way, would you guys suggest starting out with?
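For illustration, the per-level prefix idea could be sketched as below. This is only a sketch: the function name `path_to_level_terms` and the three-digit index width are assumptions, and it only builds the term strings without calling Xapian. A single path component longer than the term limit would still overflow with this scheme.

```python
def path_to_level_terms(path, prefix="P"):
    """Return one term per path component, e.g. 'P000:usr'.

    Hypothetical helper: the name and the 3-digit level index are
    assumptions made for this example.
    """
    # Drop empty components produced by leading or doubled slashes.
    components = [c for c in path.split("/") if c]
    return ["%s%03d:%s" % (prefix, level, name)
            for level, name in enumerate(components)]


terms = path_to_level_terms("/usr/jarrod/very/long/path/to/a/file.txt")
for term in terms:
    print(term)
```

Each resulting string would then be added to the document as an ordinary term.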
On Wed, Mar 01, 2006 at 01:58:50PM -0500, jarrod roberson wrote:
> On to my dilemma, I want to index arbitrarily long logical paths. And I have
> run up on the ~240 character term limit way more than once so far.
> So I am trying to decide the best way to index path information.
>
> My ideas are as follows:
>
> /usr/jarrod/very/long/path/to/a/file.txt
>
> use prefixes like P000:usr, P001:jarrod, P002:very P003:long . . . you get
> the idea

There's no need for each term to correspond to a directory level. You could instead make the terms a fixed number of characters long. That reduces the number of terms needed, which should make finding a particular existing entry more efficient: if you make the length 240 characters, then many files will only need a single term. This also works even if you have a directory name which is 300 characters long.

> the other idea is to use positional information using add_posting( usr, 0 ),
> add_posting( jarrod, 1 ), add_posting( very, 2 ), add_posting( long, 3 )

That'll be less efficient than encoding the position into the term.

You could hash the overlong part of the path like omindex does, but that carries a small chance that two paths may collide and you'll only index one of them, which you may find unacceptable.

Or you could use an external database of some sort to track the pathname -> Xapian docid mapping.

Cheers,
Olly
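The fixed-length and hashing suggestions might look roughly like the sketch below. Everything named here is an assumption made for illustration: the 240-byte limit is taken from the "~240" figure in the thread, the `P` and `U` prefixes and the three-digit chunk index are arbitrary, and MD5 merely stands in for whatever hash omindex actually uses. The functions only build term byte strings and do not call Xapian.

```python
import hashlib

MAX_TERM_LEN = 240  # assumed limit in bytes, from the "~240" in the thread


def path_to_chunk_terms(path, prefix="P", max_len=MAX_TERM_LEN):
    """Split a path into fixed-size terms such as b'P000:<chunk>'."""
    data = path.encode("utf-8")
    # Reserve room for the prefix, a 3-digit chunk index and ':'.
    chunk_len = max_len - (len(prefix.encode("utf-8")) + 3 + 1)
    if chunk_len <= 0:
        raise ValueError("max_len too small for the term header")
    if len(data) > chunk_len * 1000:
        raise ValueError("path needs more than 1000 chunks")
    terms = []
    for index, start in enumerate(range(0, len(data), chunk_len)):
        header = ("%s%03d:" % (prefix, index)).encode("utf-8")
        # Chunks are cut on byte boundaries, so a multi-byte UTF-8
        # character may be split across two terms.
        terms.append(header + data[start:start + chunk_len])
    return terms


def hashed_term(path, prefix="U", max_len=MAX_TERM_LEN):
    """Return one term; hash the overlong tail if it would not fit."""
    data = prefix.encode("utf-8") + path.encode("utf-8")
    if len(data) <= max_len:
        return data
    keep = max_len - 32  # an MD5 hex digest is 32 characters long
    digest = hashlib.md5(data[keep:]).hexdigest().encode("ascii")
    # Two different long paths can, with small probability, collide here.
    return data[:keep] + digest


long_path = "/" + "d" * 300 + "/file.txt"
print(len(path_to_chunk_terms("/usr/jarrod/file.txt")))  # short path
print(len(path_to_chunk_terms(long_path)))               # overlong path
print(len(hashed_term(long_path)))
```

With chunking, the original path can be rebuilt by concatenating the chunk payloads in index order, so no information is lost. The hashed form keeps a single term per path at the cost of the collision risk mentioned above.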