Emmanuel Eckard
2007-Sep-13 14:08 UTC
[Xapian-discuss] Re: Xapian-discuss Digest, Vol 40, Issue 15
These doubles will often amount to probabilities, though to stay generic, it would be safer to assume that they are not necessarily probabilities. I understand that fitting Xapian with a generic interface for research goes somewhat against its optimisations for retrieval speed. I don't know in which measure it could be possible to offer these features as supplementary packages or as configure options.

> If the doubles are only used to store probabilities and thus all
> the doubles are in the range 0.0 <= x <= 1.0, then you may be
> able to re-use integers and interpret them as fixed-point
> rather than floating-point.
>
> Even a 32-bit integer gets you somewhere around 9+ digits of
> precision as long as you're in the 0 to 1 range.
>
> That may be a space/time tradeoff though, unless you could
> implement all the algorithms to use integer math rather than
> floating point.

--
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC
1014 Ecublens, Suisse
+41 21 693 66 97

()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
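The fixed-point idea in the quoted text can be illustrated with a minimal Python sketch. It is not Xapian code: the names SCALE, to_fixed, from_fixed and mul_fixed are hypothetical, and the choice of 2**32 - 1 as the scale (so that 1.0 maps to the largest unsigned 32-bit value) is an assumption made for the example.

```python
# Illustrative only: probabilities in [0.0, 1.0] stored as unsigned 32-bit
# integers and interpreted as fixed-point. All names here are hypothetical.

SCALE = 2**32 - 1  # assumed scale: 1.0 maps to the largest unsigned 32-bit value


def to_fixed(p: float) -> int:
    """Map a probability in [0.0, 1.0] to an integer in [0, SCALE]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("this encoding only covers 0.0 <= p <= 1.0")
    return round(p * SCALE)


def from_fixed(n: int) -> float:
    """Map an integer in [0, SCALE] back to a floating-point probability."""
    return n / SCALE


def mul_fixed(a: int, b: int) -> int:
    """Multiply two fixed-point probabilities using integer arithmetic only.

    The product of two scaled values carries the scale twice, so it is
    divided by SCALE once; adding SCALE // 2 first rounds to nearest.
    """
    return (a * b + SCALE // 2) // SCALE
```

The resolution is about 1 / 2**32, roughly 2.3e-10, which is where the "9+ digits" figure comes from. Values outside the 0 to 1 range are rejected rather than clamped, since the doubles are not guaranteed to be probabilities.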
Olly Betts
2007-Sep-16 20:03 UTC
[Xapian-discuss] Re: Xapian and research in IR: a few suggestions from experience
On Thu, Sep 13, 2007 at 03:08:03PM +0200, Emmanuel Eckard wrote:
> I understand that fitting Xapian with a generic interface for research goes
> somewhat against its optimisations for retrieval speed.

I think it only starts to get difficult if you want to try to optimise the generic interface a lot. For research work, speed often isn't a key issue (it isn't for TREC-style evaluation, but it might be if you're building a prototype to see how users interact).

> I don't know in which measure it could be possible to offer these
> features as supplementary packages or as configure options.

The problem with compile-time options is that adding lots of them rapidly increases the number of combinations and it becomes unfeasible to regularly test all combinations. Plus people building binary packages have to choose a set of options, and if the one you need isn't in that set then you can't use the packages.

I've committed a reworked version of Richard's user metadata idea (see http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=143 for the original patches) which essentially allows arbitrary tag data (in the form of a std::string) to be associated with short key strings. This data is committed at the same time as other database changes, so it's easy to ensure it stays in step with the rest of the Xapian database.

By using a suitable scheme for generating key names, this could be used to store extra data associated with termnames, docids, etc as you originally suggested - you just need to serialise it to a string.

I'd be interested to hear how well this works if anybody tries it.

Cheers,
    Olly
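As a rough illustration of the key-naming and serialisation idea above, here is a minimal Python sketch. It does not call Xapian: a plain dict stands in for the database's metadata store, and the key prefixes "term:" and "doc:", the helper names, and the little-endian packed layout are all assumptions made for the example, not part of the committed feature.

```python
import struct

# Illustrative only: a dict stands in for the key -> string metadata store.
# The key scheme and the helper names below are hypothetical.


def metadata_key(kind: str, ident) -> str:
    """Build a short key such as 'term:retrieval' or 'doc:42'."""
    return f"{kind}:{ident}"


def pack_doubles(values) -> bytes:
    """Serialise a sequence of doubles as little-endian 8-byte IEEE 754 values."""
    return struct.pack(f"<{len(values)}d", *values)


def unpack_doubles(blob: bytes) -> list:
    """Reverse pack_doubles; the blob length must be a multiple of 8 bytes."""
    if len(blob) % 8 != 0:
        raise ValueError("blob length is not a multiple of 8 bytes")
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))


# Stand-in for the metadata store (assumption: string keys, byte-string values).
store = {}
store[metadata_key("term", "retrieval")] = pack_doubles([0.125, 0.5])
store[metadata_key("doc", 42)] = pack_doubles([0.75])
```

Prefixing the key with the kind of object keeps term-level and document-level data from colliding, and a fixed byte order keeps the stored values readable on a machine with different endianness.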