Marius Tibeica
2013-Apr-02 13:37 UTC
[Xapian-devel] GSOC - Posting list encoding improvements
> It would be better to have this discussion on the mailing list really.Done! Tried to get as much information in here as possible.> > On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote: > > I've looked through the code of BitReader and BitWriter. > > > > From what I understood, we could keep state and allow repeated calls, but > > we can't get the list in order without extra O(n) memory (how it is now) > > unless we make that an O(log n) operation for each item, meaning O(n*log n) > > total. > > I don't think we want to make the decode time complexity worse. The > current space complexity might be unavoidable (though actually I think > we only need O(log(n)) space - see below), but I think the key win to > be had is to avoid doing a full decode just to get the first few entries > (in terms of time, the first entry is O(1) to decode, the second > O(log(n)), but the third isn't an additive O(log(n)) on top of that). > > > To do this we must: > > - initialize the interpolative decode > > - keep the state information from the BitReader(idx, n_bits, acc) for > > future requests - probably returned as a struct? > > Yes, I think you'd want to wrap up the state in a class. > > > - make a full decode (without keeping the termpos array) to update > > the BitWriter so that other future decodings are properly done (initial > > O(n) when creating the state). > > We're trying to avoid the work of doing the full decode though, so doing > it first and throwing the results away seems a step backwards. We are > unlikely to be able to store this data in a persistent enough way for > this to be useful (for a phrase search, we potentially decode > #terms_in_phrase*#docs_matching_all_those_terms position lists to check > phrase matches, though we try hard to avoid doing that many). > > > - for each requested item of the list, use the stored state. The code > > will be very similar to decode_interpolative, but only calculate the "half" > > where the item is each time (hence the log n). > > > > Does this sound like a good approach? > > My thinking is that the code runs much like it currently does, except > that the state is in an object (rather than on the stack from recursive > calls to decode_interpolative and in the passed in pos vector), and > that it pauses its work when it knows the next position to return. > > I think we may actually only need to store O(log(n)) data to keep the > state - the depth of recursion looks to be O(log(n)), and if you > adjust the decode_interpolative like so then the pos vector doesn't > seem to be needed: > > void > BitReader::decode_interpolative(int j, int k, > Xapian::termpos pos_j, > Xapian::termpos pos_k) > { > while (j + 1 < k) { > const size_t mid = (j + k) / 2; > // Decode one out of (pos_k - pos_j + 1) values > // (less some at either end because we must be able to fit > // all the intervening pos in) > const size_t outof = pos_k - pos_j + j - k + 1; > Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j); > decode_interpolative(j, mid, pos_j, pos_mid); > j = mid; > } > } > > I believe if you check mid before the recursive call to > decode_interpolative and k after it, then you'll see all the elements in > order (plus out of order repeats before and/or after the useful > occurrence - it looks like if run to completion, you end up considering > about n*2 values to find the n elements, which is OK). >Guess I understood the request wrong. Using a binary search tree in order traversal modified version we get this solution, which is pretty close to what you said: extra items in the BitReader (or a derived class) stack<DIState> di_stack; //will get to a maximum of O(logn) space size indeed DIState di_current; struct DIState { DIState() { set(0, 0, 0, 0); } DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos pos_k) { set(j, k, pos_j, pos_k); } void set(int j, int k) { this.j = j; this.k = k;, this.pos_j pos_j; this.pos_k = pos_k; } bool is_nextterm() { return j + 1 < k; }; bool is_uninitialized() { return j == 0 && k == 0 && pos_j == 0 && pos_k == 0; } Xapian::termpos pos_j, pos_k; int j, k; }; // after calling this method, decode_interpolative_next will sequentially return the termpos from j + 1 to k - 1 // on all the decoding methods, including decode_interpolative, we must also check if a decode_interpolative has been started and not finished, and signal an error or do all the reads from the buffer. Xapian::termpos BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j, Xapian::termpos pos_k) { di_current(j, k, pos_j, pos_k); } Xapian::termpos BitReader::decode_interpolative_next() { if (di_current.is_uninitialized()) { //signal an error } while (!di_stack.empty() || di_current.is_nextterm()) { if (di_current.is_nextterm()) { di_current.push(di_current); int mid = (j + k) / 2; int pos_mid = decode(pos[k] - pos[j] + j - k + 1) + (pos[j] + mid - j); di_current.set(di_current.j, mid, di_current.pos_j, pos_mid); } else { Xapian::termpos pos_ret = di_current.pos_k; if (di_stack.empty() && !di_current.is_nextterm()) { // signal an error. we have called this method too many times. } di_current = di_stack.top(); di_stack.pop(); int mid = (j + k) / 2; int pos_mid = pos_ret; di_current.set(mid, di_current.k, pos_mid, di_current.pos_k); return pos_ret; } } } I may be able to improve this even further by keeping less information in each DIState (if I can get it otherwise). Will focus on this if the approach is ok. Hope the code is self explanatory. Will add more comments if necessary.> Cheers, > Olly> > > One thing comes to mind - we currently have an interpolative coding > > > encoder and decoder, which are used for positional information. It > > > would probably be good to have these as one option for posting lists > > > - for something like a boolean filter term, interpolative coding > > > is very compact, especially for dense cases. > > > > > > However our current decoder can only do a full unpack, so you need > > > to unpack to a temporary vector and then iterate that. It would > > > be better if it could keep state and allow repeated calls to iterate > > > through the list, returning one entry at a time. This would likely > > > benefit phrase searches as well as helping when used for postlists. > > > > > > The code is in common/bitstream.{cc,h} if you want to take a look.
Marius Tibeica
2013-Apr-06 10:22 UTC
[Xapian-devel] GSOC - Posting list encoding improvements
Implemented it on https://github.com/mtibeica/xapian I could also find the size to which the stack will get (the order of the highest bit in k - j) and initialize it to avoit reallocations? I tested by using it in BrassPositionList::read_data. All the unit tests passed. I will focus on finding a way to reduce the size of DIState. On Tue, Apr 2, 2013 at 4:37 PM, Marius Tibeica <mtibeica at gmail.com> wrote:>> It would be better to have this discussion on the mailing list really. > > Done! Tried to get as much information in here as possible. > >> >> On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote: >> > I've looked through the code of BitReader and BitWriter. >> > >> > From what I understood, we could keep state and allow repeated calls, but >> > we can't get the list in order without extra O(n) memory (how it is now) >> > unless we make that an O(log n) operation for each item, meaning O(n*log n) >> > total. >> >> I don't think we want to make the decode time complexity worse. The >> current space complexity might be unavoidable (though actually I think >> we only need O(log(n)) space - see below), but I think the key win to >> be had is to avoid doing a full decode just to get the first few entries >> (in terms of time, the first entry is O(1) to decode, the second >> O(log(n)), but the third isn't an additive O(log(n)) on top of that). >> >> > To do this we must: >> > - initialize the interpolative decode >> > - keep the state information from the BitReader(idx, n_bits, acc) for >> > future requests - probably returned as a struct? >> >> Yes, I think you'd want to wrap up the state in a class. >> >> > - make a full decode (without keeping the termpos array) to update >> > the BitWriter so that other future decodings are properly done (initial >> > O(n) when creating the state). >> >> We're trying to avoid the work of doing the full decode though, so doing >> it first and throwing the results away seems a step backwards. We are >> unlikely to be able to store this data in a persistent enough way for >> this to be useful (for a phrase search, we potentially decode >> #terms_in_phrase*#docs_matching_all_those_terms position lists to check >> phrase matches, though we try hard to avoid doing that many). >> >> > - for each requested item of the list, use the stored state. The code >> > will be very similar to decode_interpolative, but only calculate the "half" >> > where the item is each time (hence the log n). >> > >> > Does this sound like a good approach? >> >> My thinking is that the code runs much like it currently does, except >> that the state is in an object (rather than on the stack from recursive >> calls to decode_interpolative and in the passed in pos vector), and >> that it pauses its work when it knows the next position to return. >> >> I think we may actually only need to store O(log(n)) data to keep the >> state - the depth of recursion looks to be O(log(n)), and if you >> adjust the decode_interpolative like so then the pos vector doesn't >> seem to be needed: >> >> void >> BitReader::decode_interpolative(int j, int k, >> Xapian::termpos pos_j, >> Xapian::termpos pos_k) >> { >> while (j + 1 < k) { >> const size_t mid = (j + k) / 2; >> // Decode one out of (pos_k - pos_j + 1) values >> // (less some at either end because we must be able to fit >> // all the intervening pos in) >> const size_t outof = pos_k - pos_j + j - k + 1; >> Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j); >> decode_interpolative(j, mid, pos_j, pos_mid); >> j = mid; >> } >> } >> >> I believe if you check mid before the recursive call to >> decode_interpolative and k after it, then you'll see all the elements in >> order (plus out of order repeats before and/or after the useful >> occurrence - it looks like if run to completion, you end up considering >> about n*2 values to find the n elements, which is OK). >> > > Guess I understood the request wrong. > Using a binary search tree in order traversal modified version we get > this solution, which is pretty close to what you said: > > extra items in the BitReader (or a derived class) > stack<DIState> di_stack; //will get to a maximum of O(logn) space size indeed > DIState di_current; > > struct DIState { > DIState() { set(0, 0, 0, 0); } > DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos > pos_k) { set(j, k, pos_j, pos_k); } > void set(int j, int k) { this.j = j; this.k = k;, this.pos_j > pos_j; this.pos_k = pos_k; } > bool is_nextterm() { return j + 1 < k; }; > bool is_uninitialized() { return j == 0 && k == 0 && pos_j == 0 && > pos_k == 0; } > Xapian::termpos pos_j, pos_k; > int j, k; > }; > > // after calling this method, decode_interpolative_next will > sequentially return the termpos from j + 1 to k - 1 > // on all the decoding methods, including decode_interpolative, we > must also check if a decode_interpolative has been started and not > finished, and signal an error or do all the reads from the buffer. > Xapian::termpos > BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j, > Xapian::termpos pos_k) > { > di_current(j, k, pos_j, pos_k); > } > > Xapian::termpos > BitReader::decode_interpolative_next() > { > if (di_current.is_uninitialized()) { > //signal an error > } > while (!di_stack.empty() || di_current.is_nextterm()) { > if (di_current.is_nextterm()) { > di_current.push(di_current); > int mid = (j + k) / 2; > int pos_mid = decode(pos[k] - pos[j] + j - k + 1) + > (pos[j] + mid - j); > di_current.set(di_current.j, mid, di_current.pos_j, pos_mid); > } else { > Xapian::termpos pos_ret = di_current.pos_k; > if (di_stack.empty() && !di_current.is_nextterm()) { > // signal an error. we have called this method too many times. > } > di_current = di_stack.top(); > di_stack.pop(); > int mid = (j + k) / 2; > int pos_mid = pos_ret; > di_current.set(mid, di_current.k, pos_mid, di_current.pos_k); > return pos_ret; > } > } > } > > I may be able to improve this even further by keeping less information > in each DIState (if I can get it otherwise). Will focus on this if the > approach is ok. > Hope the code is self explanatory. Will add more comments if necessary. > > >> Cheers, >> Olly > > >> > > One thing comes to mind - we currently have an interpolative coding >> > > encoder and decoder, which are used for positional information. It >> > > would probably be good to have these as one option for posting lists >> > > - for something like a boolean filter term, interpolative coding >> > > is very compact, especially for dense cases. >> > > >> > > However our current decoder can only do a full unpack, so you need >> > > to unpack to a temporary vector and then iterate that. It would >> > > be better if it could keep state and allow repeated calls to iterate >> > > through the list, returning one entry at a time. This would likely >> > > benefit phrase searches as well as helping when used for postlists. >> > > >> > > The code is in common/bitstream.{cc,h} if you want to take a look.