thr3ads.net - Xapian devel - [Xapian-devel] GSOC - Posting list encoding improvements [Apr 2013]

If this information is useful, please help other people find it:
Share via:

Marius Tibeica

2013-Apr-02 13:37 UTC

[Xapian-devel] GSOC - Posting list encoding improvements

> It would be better to have this discussion on the mailing list really.
Done! Tried to get as much information in here as possible.
>
> On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote:
> > I've looked through the code of BitReader and BitWriter.
> >
> > From what I understood, we could keep state and allow repeated calls,
but
> > we can't get the list in order without extra O(n) memory (how it
is now)
> > unless we make that an O(log n) operation for each item, meaning
O(n*log n)
> > total.
>
> I don't think we want to make the decode time complexity worse.  The
> current space complexity might be unavoidable (though actually I think
> we only need O(log(n)) space - see below), but I think the key win to
> be had is to avoid doing a full decode just to get the first few entries
> (in terms of time, the first entry is O(1) to decode, the second
> O(log(n)), but the third isn't an additive O(log(n)) on top of that).
>
> > To do this we must:
> >   - initialize the interpolative decode
> >       - keep the state information from the BitReader(idx, n_bits,
acc) for
> > future requests - probably returned as a struct?
>
> Yes, I think you'd want to wrap up the state in a class.
>
> >       - make a full decode (without keeping the termpos array) to
update
> > the BitWriter so that other future decodings are properly done
(initial
> > O(n) when creating the state).
>
> We're trying to avoid the work of doing the full decode though, so
doing
> it first and throwing the results away seems a step backwards.  We are
> unlikely to be able to store this data in a persistent enough way for
> this to be useful (for a phrase search, we potentially decode
> #terms_in_phrase*#docs_matching_all_those_terms position lists to check
> phrase matches, though we try hard to avoid doing that many).
>
> >   - for each requested item of the list, use the stored state. The
code
> > will be very similar to decode_interpolative, but only calculate the
"half"
> > where the item is each time (hence the log n).
> >
> > Does this sound like a good approach?
>
> My thinking is that the code runs much like it currently does, except
> that the state is in an object (rather than on the stack from recursive
> calls to decode_interpolative and in the passed in pos vector), and
> that it pauses its work when it knows the next position to return.
>
> I think we may actually only need to store O(log(n)) data to keep the
> state - the depth of recursion looks to be O(log(n)), and if you
> adjust the decode_interpolative like so then the pos vector doesn't
> seem to be needed:
>
> void
> BitReader::decode_interpolative(int j, int k,
>                                 Xapian::termpos pos_j,
>                                 Xapian::termpos pos_k)
> {
>     while (j + 1 < k) {
>         const size_t mid = (j + k) / 2;
>         // Decode one out of (pos_k - pos_j + 1) values
>         // (less some at either end because we must be able to fit
>         // all the intervening pos in)
>         const size_t outof = pos_k - pos_j + j - k + 1;
>         Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j);
>         decode_interpolative(j, mid, pos_j, pos_mid);
>         j = mid;
>     }
> }
>
> I believe if you check mid before the recursive call to
> decode_interpolative and k after it, then you'll see all the elements
in
> order (plus out of order repeats before and/or after the useful
> occurrence - it looks like if run to completion, you end up considering
> about n*2 values to find the n elements, which is OK).
>
Guess I understood the request wrong.
Using a binary search tree in order traversal modified version we get
this solution, which is pretty close to what you said:

extra items in the BitReader (or a derived class)
stack<DIState> di_stack; //will get to a maximum of O(logn) space size
indeed
DIState di_current;

struct DIState {
  DIState() { set(0, 0, 0, 0); }
  DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos
pos_k) { set(j, k, pos_j, pos_k); }
  void set(int j, int k) { this.j = j; this.k = k;, this.pos_j pos_j; this.pos_k
= pos_k; }
  bool is_nextterm() { return j + 1 < k; };
  bool is_uninitialized() { return j == 0 && k == 0 && pos_j ==
0 &&
pos_k == 0; }
  Xapian::termpos pos_j, pos_k;
  int j, k;
};

// after calling this method, decode_interpolative_next will
sequentially return the termpos from j + 1 to k - 1
// on all the decoding methods, including decode_interpolative, we
must also check if a decode_interpolative has been started and not
finished, and signal an error or do all the reads from the buffer.
Xapian::termpos
BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j,
Xapian::termpos pos_k)
{
    di_current(j, k, pos_j, pos_k);
}

Xapian::termpos
BitReader::decode_interpolative_next()
{
    if (di_current.is_uninitialized()) {
        //signal an error
    }
    while (!di_stack.empty() || di_current.is_nextterm()) {
        if (di_current.is_nextterm()) {
            di_current.push(di_current);
            int mid = (j + k) / 2;
            int pos_mid = decode(pos[k] - pos[j] + j - k + 1) +
(pos[j] + mid - j);
            di_current.set(di_current.j, mid, di_current.pos_j, pos_mid);
        } else {
            Xapian::termpos pos_ret = di_current.pos_k;
            if (di_stack.empty() && !di_current.is_nextterm()) {
                // signal an error. we have called this method too many times.
            }
            di_current = di_stack.top();
            di_stack.pop();
            int mid = (j + k) / 2;
            int pos_mid = pos_ret;
            di_current.set(mid, di_current.k, pos_mid, di_current.pos_k);
            return pos_ret;
        }
    }
}

I may be able to improve this even further by keeping less information
in each DIState (if I can get it otherwise). Will focus on this if the
approach is ok.
Hope the code is self explanatory. Will add more comments if necessary.

> Cheers,
>     Olly
> > > One thing comes to mind - we currently have an interpolative
coding
> > > encoder and decoder, which are used for positional information. 
It
> > > would probably be good to have these as one option for posting
lists
> > > - for something like a boolean filter term, interpolative coding
> > > is very compact, especially for dense cases.
> > >
> > > However our current decoder can only do a full unpack, so you
need
> > > to unpack to a temporary vector and then iterate that.  It would
> > > be better if it could keep state and allow repeated calls to
iterate
> > > through the list, returning one entry at a time.  This would
likely
> > > benefit phrase searches as well as helping when used for
postlists.
> > >
> > > The code is in common/bitstream.{cc,h} if you want to take a
look.

Marius Tibeica

2013-Apr-06 10:22 UTC

head link

[Xapian-devel] GSOC - Posting list encoding improvements

Implemented it on https://github.com/mtibeica/xapian

I could also find the size to which the stack will get (the order of
the highest bit in k - j) and initialize it to avoit reallocations?
I tested by using it in BrassPositionList::read_data. All the unit tests passed.

I will focus on finding a way to reduce the size of DIState.


On Tue, Apr 2, 2013 at 4:37 PM, Marius Tibeica <mtibeica at gmail.com>
wrote:>> It would be better to have this discussion on the mailing list really.
>
> Done! Tried to get as much information in here as possible.
>
>>
>> On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote:
>> > I've looked through the code of BitReader and BitWriter.
>> >
>> > From what I understood, we could keep state and allow repeated
calls, but
>> > we can't get the list in order without extra O(n) memory (how
it is now)
>> > unless we make that an O(log n) operation for each item, meaning
O(n*log n)
>> > total.
>>
>> I don't think we want to make the decode time complexity worse. 
The
>> current space complexity might be unavoidable (though actually I think
>> we only need O(log(n)) space - see below), but I think the key win to
>> be had is to avoid doing a full decode just to get the first few
entries
>> (in terms of time, the first entry is O(1) to decode, the second
>> O(log(n)), but the third isn't an additive O(log(n)) on top of
that).
>>
>> > To do this we must:
>> >   - initialize the interpolative decode
>> >       - keep the state information from the BitReader(idx, n_bits,
acc) for
>> > future requests - probably returned as a struct?
>>
>> Yes, I think you'd want to wrap up the state in a class.
>>
>> >       - make a full decode (without keeping the termpos array) to
update
>> > the BitWriter so that other future decodings are properly done
(initial
>> > O(n) when creating the state).
>>
>> We're trying to avoid the work of doing the full decode though, so
doing
>> it first and throwing the results away seems a step backwards.  We are
>> unlikely to be able to store this data in a persistent enough way for
>> this to be useful (for a phrase search, we potentially decode
>> #terms_in_phrase*#docs_matching_all_those_terms position lists to check
>> phrase matches, though we try hard to avoid doing that many).
>>
>> >   - for each requested item of the list, use the stored state. The
code
>> > will be very similar to decode_interpolative, but only calculate
the "half"
>> > where the item is each time (hence the log n).
>> >
>> > Does this sound like a good approach?
>>
>> My thinking is that the code runs much like it currently does, except
>> that the state is in an object (rather than on the stack from recursive
>> calls to decode_interpolative and in the passed in pos vector), and
>> that it pauses its work when it knows the next position to return.
>>
>> I think we may actually only need to store O(log(n)) data to keep the
>> state - the depth of recursion looks to be O(log(n)), and if you
>> adjust the decode_interpolative like so then the pos vector doesn't
>> seem to be needed:
>>
>> void
>> BitReader::decode_interpolative(int j, int k,
>>                                 Xapian::termpos pos_j,
>>                                 Xapian::termpos pos_k)
>> {
>>     while (j + 1 < k) {
>>         const size_t mid = (j + k) / 2;
>>         // Decode one out of (pos_k - pos_j + 1) values
>>         // (less some at either end because we must be able to fit
>>         // all the intervening pos in)
>>         const size_t outof = pos_k - pos_j + j - k + 1;
>>         Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j);
>>         decode_interpolative(j, mid, pos_j, pos_mid);
>>         j = mid;
>>     }
>> }
>>
>> I believe if you check mid before the recursive call to
>> decode_interpolative and k after it, then you'll see all the
elements in
>> order (plus out of order repeats before and/or after the useful
>> occurrence - it looks like if run to completion, you end up considering
>> about n*2 values to find the n elements, which is OK).
>>
>
> Guess I understood the request wrong.
> Using a binary search tree in order traversal modified version we get
> this solution, which is pretty close to what you said:
>
> extra items in the BitReader (or a derived class)
> stack<DIState> di_stack; //will get to a maximum of O(logn) space
size indeed
> DIState di_current;
>
> struct DIState {
>   DIState() { set(0, 0, 0, 0); }
>   DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos
> pos_k) { set(j, k, pos_j, pos_k); }
>   void set(int j, int k) { this.j = j; this.k = k;, this.pos_j > pos_j;
this.pos_k = pos_k; }
>   bool is_nextterm() { return j + 1 < k; };
>   bool is_uninitialized() { return j == 0 && k == 0 &&
pos_j == 0 &&
> pos_k == 0; }
>   Xapian::termpos pos_j, pos_k;
>   int j, k;
> };
>
> // after calling this method, decode_interpolative_next will
> sequentially return the termpos from j + 1 to k - 1
> // on all the decoding methods, including decode_interpolative, we
> must also check if a decode_interpolative has been started and not
> finished, and signal an error or do all the reads from the buffer.
> Xapian::termpos
> BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j,
> Xapian::termpos pos_k)
> {
>     di_current(j, k, pos_j, pos_k);
> }
>
> Xapian::termpos
> BitReader::decode_interpolative_next()
> {
>     if (di_current.is_uninitialized()) {
>         //signal an error
>     }
>     while (!di_stack.empty() || di_current.is_nextterm()) {
>         if (di_current.is_nextterm()) {
>             di_current.push(di_current);
>             int mid = (j + k) / 2;
>             int pos_mid = decode(pos[k] - pos[j] + j - k + 1) +
> (pos[j] + mid - j);
>             di_current.set(di_current.j, mid, di_current.pos_j, pos_mid);
>         } else {
>             Xapian::termpos pos_ret = di_current.pos_k;
>             if (di_stack.empty() && !di_current.is_nextterm()) {
>                 // signal an error. we have called this method too many
times.
>             }
>             di_current = di_stack.top();
>             di_stack.pop();
>             int mid = (j + k) / 2;
>             int pos_mid = pos_ret;
>             di_current.set(mid, di_current.k, pos_mid, di_current.pos_k);
>             return pos_ret;
>         }
>     }
> }
>
> I may be able to improve this even further by keeping less information
> in each DIState (if I can get it otherwise). Will focus on this if the
> approach is ok.
> Hope the code is self explanatory. Will add more comments if necessary.
>
>
>> Cheers,
>>     Olly
>
>
>> > > One thing comes to mind - we currently have an interpolative
coding
>> > > encoder and decoder, which are used for positional
information.  It
>> > > would probably be good to have these as one option for
posting lists
>> > > - for something like a boolean filter term, interpolative
coding
>> > > is very compact, especially for dense cases.
>> > >
>> > > However our current decoder can only do a full unpack, so you
need
>> > > to unpack to a temporary vector and then iterate that.  It
would
>> > > be better if it could keep state and allow repeated calls to
iterate
>> > > through the list, returning one entry at a time.  This would
likely
>> > > benefit phrase searches as well as helping when used for
postlists.
>> > >
>> > > The code is in common/bitstream.{cc,h} if you want to take a
look.

Xapian devel - Apr 2013 - GSOC - Posting list encoding improvements

[Xapian-devel] GSOC - Posting list encoding improvements

[Xapian-devel] GSOC - Posting list encoding improvements