Shane Spencer
2012-Mar-09 23:39 UTC
[Xapian-discuss] 128 bit Document IDs (Please don't hurt me)
I apologize for what may be a sore subject. 4 billion documents is a heck of a lot. 64 bit vs 32 bit would be an incredibly large database with an average document and term size. Why 128 bit? Simply for address space.

Mapping a UUID (128 bit) or MongoDB ObjectID (96 bit) directly into the Xapian document space removes the need for referencing one or the other from one or both. I see a common tendency to write a document to Xapian, return the document ID, and then write to the database backing the document in some way.

This is nothing new, but I really would like to remove that extra write and optionally throw away the Xapian response by specifying the document ID as the UUID associated with the document. This is starting to become much more important as people are walking away from auto-increment fields and aiming more toward universal identification which, from a sparseness standpoint, is amazingly wasteful but incredibly useful.

I have no idea how complicated it would be to make this change to Xapian, however I'd imagine migrating the document ID into a binary-like value rather than an integer value would allow for very large document ID widths. This probably means adding a 16 bit length to every document ID, which is pretty wasteful.

For now I'm just storing the UUID as a serialized large integer through python-xapian and then writing the Xapian document ID to my database documents as they become indexed.

Thanks for your consideration,

Shane Spencer
James Aylett
2012-Mar-10 15:02 UTC
[Xapian-discuss] 128 bit Document IDs (Please don't hurt me)
On 9 Mar 2012, at 23:39, Shane Spencer wrote:

> Mapping a UUID (128 bit) or MongoDB ObjectID (96 bit) directly into
> the Xapian document space removes the need for referencing one or the
> other from one or both. I see a common tendency to write a document
> to the Xapian, return the document ID, and then write to the database
> backing the document in some way.

Shane, have you considered adding your internal object id as a unique term? We have an FAQ on this <http://trac.xapian.org/wiki/FAQ/UniqueIds>.

J

--
James Aylett
talktorex.co.uk - xapian.org - devfort.com - spacelog.org
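The unique-term approach might look roughly like the following sketch. The helper names `unique_id_term` and `index_by_uuid` are hypothetical, and the "Q" prefix is assumed here as the conventional prefix for unique-ID terms. The only Xapian calls relied on are `add_boolean_term()` on the document and `replace_document(unique_term, document)` on the writable database; the objects are passed in so the helpers themselves do not import `xapian`.

```python
import uuid


def unique_id_term(doc_uuid: uuid.UUID, prefix: str = "Q") -> str:
    """Build the boolean term that identifies exactly one document."""
    # One prefix character plus 32 hex digits.
    return prefix + doc_uuid.hex


def index_by_uuid(db, doc, doc_uuid: uuid.UUID) -> str:
    """Add or replace the document identified by doc_uuid.

    `db` is expected to behave like xapian.WritableDatabase and `doc`
    like xapian.Document; only add_boolean_term() and
    replace_document() are called on them.
    """
    term = unique_id_term(doc_uuid)
    doc.add_boolean_term(term)
    # replace_document(term, doc) replaces the document indexed by this
    # term if one exists and adds a new document otherwise, so the
    # caller does not need to record the numeric docid anywhere.
    db.replace_document(term, doc)
    return term
```

With the Python bindings, `db` would typically be a `xapian.WritableDatabase` and `doc` a `xapian.Document`; the same term can later be passed to `db.delete_document(term)` to remove the document. The external database then only needs to know the UUID, which it already has.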
Kevin Duraj
2012-Mar-11 04:59 UTC
[Xapian-discuss] 128 bit Document IDs (Please don't hurt me)
We need Xapian document IDs to be sequential, because then we can browse through the index (e.g., 1 ... 10000) and retrieve every document by its ID. If we switched to UUIDs, we would lose this ability to retrieve each document from the Xapian index. With sequential IDs we can use Xapian both for searching and as a backup. Please do not change anything in the Xapian index data structure, thank you.

- Kevin Duraj
http://MyHealthcare.com
Olly Betts
2012-Mar-12 13:12 UTC
[Xapian-discuss] 128 bit Document IDs (Please don't hurt me)
On Fri, Mar 09, 2012 at 02:39:48PM -0900, Shane Spencer wrote:

> I apologize for what may be a sore subject. 4 billion documents is a
> heck of a lot. 64 bit vs 32 bit would be an incredibly large database
> with an average document and term size. Why 128 bit? Simply for
> address space.
>
> Mapping a UUID (128 bit) or MongoDB ObjectID (96 bit) directly into
> the Xapian document space removes the need for referencing one or the
> other from one or both. I see a common tendency to write a document
> to the Xapian, return the document ID, and then write to the database
> backing the document in some way.

As James notes, you can store the ID as a term in Xapian.

> This is nothing new.. but I really would like to remove that extra
> write and optionally throw a way the Xapian response by specifying the
> document ID as the UUID associated to the document. This is starting
> to become much more important as people are walking away from
> auto-increment fields and aiming more toward universal identification
> which, from a sparseness standpoint, is amazingly wasteful but
> incredibly useful.
>
> Thanks for your consideration. I have no idea how complicated it
> would be to make this change to Xapian, however I'd imagine migrating
> the document ID into a binary like value rather than an integer value
> would allow for very large document ID widths. This probably means
> adding a 16 bit length to every document ID which is pretty wasteful.

You're making incorrect assumptions about how Xapian stores document IDs. They're stored as variable length integers, and the encoding naturally extends to any size of integer.

At least conceptually, it's fairly easy to make the change you are suggesting. People have looked at making the change to 64 bit docids:

http://trac.xapian.org/ticket/385

It's mostly just a matter of changing the type used to "long long", but assumptions creep in so there are probably a few other fixes needed.

Changing to 128-bit docids isn't much harder. Most platforms don't have a 128-bit integer type, but you can make one with a C++ class and operator overloading. Then just plug that in instead of "long long" (and probably fix a few assumptions).

The only limitation I can see is that this reduces the maximum term length a bit (since we need to build Btree keys from a term and a docid, so if the docid can be wider, the term can't be quite as long).

However, Xapian stores deltas between document ids a lot, and if you create this ultra-sparse space of document ids, these deltas will tend to be billions rather than being small integers. That means everything takes more space to store - probably much more than it would take to just store each document's UUID as a term.

Cheers,
    Olly
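To illustrate the point about deltas, here is a rough sketch using a generic 7-bits-per-byte variable-length integer. This is an assumption made for illustration and is not claimed to be Xapian's actual on-disk encoding; `encode_varint` and `delta_bytes` are hypothetical helper names. It compares the bytes needed to store the gaps between 1000 sequential docids with the gaps between 1000 random 128-bit IDs.

```python
import random


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)  # high bit clear: last byte
            return bytes(out)


def delta_bytes(sorted_ids) -> int:
    """Bytes needed to store the first ID and then each gap as a varint."""
    total = 0
    previous = 0
    for docid in sorted_ids:
        total += len(encode_varint(docid - previous))
        previous = docid
    return total


# Sequential docids: every gap is 1, so each costs a single byte.
dense = list(range(1, 1001))

# UUID-like docids: random 128-bit values, so the gaps are huge.
rng = random.Random(0)
sparse = sorted(rng.getrandbits(128) for _ in range(1000))

print(delta_bytes(dense))   # prints 1000
print(delta_bytes(sparse))  # many times larger than the dense case
```

The encoding itself has no trouble with 128-bit values, which matches the observation that widening the docid type is conceptually easy. The cost shows up in the size of each stored gap, which grows with the number of bits in the gap rather than staying at one byte.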