The main thing to be aware of is that there are uses for sort keys (and indeed
values in general) where the value isn't a number to start off with. For
instance, a catalogue of paintings might have stored values of the title or the
artist, both of which could be usefully sorted on.
Similarly, one of the examples in the getting started guide [1] stores a date as
a string in YYYYMMDD format. Although it would be possible to convert this into
a number, that has some complexity in the general case (particularly around
calendar changes). For most Western uses, YYYYMMDD is easy to calculate and to
debug, and works both as a sort key and for range queries.
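To make that concrete, here is a minimal sketch in plain Python (no Xapian calls) of why a YYYYMMDD string works without any numeric conversion: plain string comparison already gives chronological order, and a range check is just two string comparisons. The helper names date_key and in_range are made up for illustration, and the dates are arbitrary sample data.

```python
from datetime import date


def date_key(d: date) -> str:
    """Format a date as a fixed-width YYYYMMDD string (hypothetical helper)."""
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


def in_range(key: str, start: str, end: str) -> bool:
    """Inclusive range test using ordinary string comparison."""
    return start <= key <= end


# Arbitrary sample dates, deliberately out of order.
dates = [
    date(2019, 1, 22),
    date(1999, 12, 31),
    date(2019, 1, 3),
    date(2008, 7, 15),
]

keys = [date_key(d) for d in dates]

# Sorting the strings gives the same order as sorting the dates themselves,
# because every field is zero-padded and the most significant field is first.
sorted_keys = sorted(keys)
chronological = [date_key(d) for d in sorted(dates)]

# A range query for January 2019 needs no parsing at all.
january_2019 = [k for k in sorted_keys if in_range(k, "20190101", "20190131")]

if __name__ == "__main__":
    print(sorted_keys)
    print(january_2019)
```

The same property is what makes the stored string easy to debug: what you see in the value is what gets compared.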
Just a slight point as well: you talk about "sort_key related fields",
but they aren't fields in the way most people would use the word: from the
database's point of view there are just values, which have some specific use
cases (fields tend to be serialised into document data). Values only become
sort_key related at query time (although you will probably have designed them
for one or more of their intended uses).
When you say "sort_keys will be unserialized when user needs to read its
real float/double values…", that's not really an anticipated way of
working, because for display or further processing you'd usually pull things
out of the document data at this point. (Values are designed to be fast to
access during matching, and aren't necessarily performant in other
situations.)
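As an aside on why comparing serialised keys as strings is cheap at match time: a number can be encoded so that byte-wise comparison agrees with numeric order, which means nothing has to be decoded while sorting. The sketch below shows one well-known way to do that for IEEE-754 doubles in plain Python. It is only an illustration of the idea and is NOT the format that Xapian's sortable_serialise() produces; the function names here are hypothetical, and NaN and negative zero are ignored.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = 0xFFFFFFFFFFFFFFFF


def sortable_bytes(x: float) -> bytes:
    """Encode a double so that bytes compare in the same order as the numbers.

    Illustrative encoding only; not Xapian's actual serialisation.
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits & SIGN_BIT:
        # Negative: invert every bit so larger magnitudes sort earlier.
        bits ^= ALL_BITS
    else:
        # Non-negative: set the top bit so it sorts after all negatives.
        bits |= SIGN_BIT
    return struct.pack(">Q", bits)


def from_sortable_bytes(b: bytes) -> float:
    """Reverse sortable_bytes()."""
    (bits,) = struct.unpack(">Q", b)
    if bits & SIGN_BIT:
        bits ^= SIGN_BIT  # was non-negative: clear the top bit again
    else:
        bits ^= ALL_BITS  # was negative: undo the inversion
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x


# Arbitrary sample numbers.
numbers = [3.5, -2.0, 0.0, 1e10, -1e-3, 42.0]

# Sorting by the encoded bytes matches sorting by numeric value.
by_bytes = sorted(numbers, key=sortable_bytes)

if __name__ == "__main__":
    print(by_bytes)
```

Decoding is then only needed if you actually want the number back, which, as above, isn't the usual reason to read a value.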
Hope that helps a little!
[1] https://getting-started-with-xapian.readthedocs.io/en/latest/howtos/sorting.html
J
> On 22 Jan 2019, at 03:37, Miao LIU <miaoliu95 at acm.org> wrote:
>
> Dear Members of Xapian Project,
>
> Sorry for troubling you this time. It can be witnessed that xapian will
> store Document values with serialization approach when given value types meet
> float/double.
>
> Such an approach is deployed on sort_key related fields as well, where
> the xapian requires KeyMaker::operator() must return a serialized float/double
> variable. Then heap sort comes and ranks the vector<MSetItem> items
> (multimatch.cc MultiMatch::get_mset()) by comparing serialized sort_keys
> (std::string) straightforwardly according to <IEEE-754 doubles>.
> Subsequently sort_keys will be unserialized when user needs to read its real
> float/double values during iterations of result MSet.
>
> Obviously, serialization and unserialization are time-consuming
> operations. Compared with defining and using sort_key as float/double type
> directly, it is complicated to understand benefits of such serialization above
> in both performance and coding aspects.
>
> It will be very kind of you if you could give a short illustration.
> Looking forward to your early reply.
>
> Best Regards,
> Miao LIU
>
--
James Aylett
devfort.com — spacelog.org — tartarus.org/james/