thr3ads.net - Xapian discuss - [Xapian-discuss] getting involved [Jan 2014]

Előd Biszak

2014-Jan-22 14:30 UTC

[Xapian-discuss] getting involved

Hi,

I am a young developer from Hungary. I have been using Xapian for quite a
while and am quite familiar with the basic concepts. I have tested various
search engines in the past few years, and for now Xapian seems to be the
best choice for my purposes. I am willing to get deeply involved in the
project.

For getting started my main concerns are:

1. Where to get the current developer repository
2. What version control should I use to be able to pull it from your server
3. How to share my code?

I am interested in the following tasks:

1. I want to add support for non-term sub-queries in positional
queries. For positional sub-queries I think it would make sense (e.g.
"Thomas Jefferson" NEAR "King George").
2. Possible improvements to positional queries. The docs say that
"Queries which use positional information can be significantly slower to
process [...] This will be improved in the future". Are there any thoughts
on how to improve them?
3. Adding support for efficient ranking based on positional information.

What do you guys think, are these possible improvements? I have the time
and the motivation.

best regards,
Biszak Előd

James Aylett

2014-Jan-22 14:56 UTC

[Xapian-discuss] getting involved

On 22 Jan 2014, at 14:30, Előd Biszak <biszakelod at gmail.com> wrote:
> I am a young developer from Hungary. I have been using Xapian for quite a
> while and am quite familiar with the basic concepts. I have tested various
> search engines in the past few years, and for now Xapian seems to be the
> best choice for my purposes. I am willing to get deeply involved in the
> project.
Hi! Great to have more people involved. For the repo, check out
<http://xapian.org/bleeding> which has all the information you should
need. (Git is recommended, as we're moving away from Subversion; from GSoCs
past we have experience working with GitHub pull requests, but if you'd
rather work off git.xapian.org and use patchsets I'm sure we can manage that
too.)
> I am interested in the following tasks[snip]

The only thing I'd say about the tasks you're interested in is that
they're all quite "deep" in terms of needing to understand
something hefty inside Xapian: matching, the query parser, weighting - and
usually more than one of those will have to be touched. I of course won't
dissuade you from working on those, but I'd recommend getting to grips with
the codebase before diving into one of these larger projects. There are some
"bite size" pieces on the wiki
<http://trac.xapian.org/wiki/ProjectIdeas> which will enable you to get to
grips with how the pieces fit together "under the hood", or you could
see if any existing bugs or feature requests
<http://trac.xapian.org/report/1> take your fancy. Spending a bit of time
understanding how Xapian fits together at the code level should make working
with it subsequently a lot easier.

There's also some internals documentation that you'll want to at least
skim <http://xapian.org/docs/internals.html>, and you'll want to look
at our HACKING file
<https://github.com/xapian/xapian/blob/master/xapian-core/HACKING> so you
know what support the build system provides for development, how to use more
sophisticated features of the test framework, some coding standards and other
useful info (it's a bit long, but worth reading right the way through).

Obviously, send any questions either to the mailing list (xapian-devel is more
appropriate when you're working on the Xapian code itself) or to IRC. It may
take a few days for someone to get back to you, particularly if it's about
part of the code which fewer people have worked on, but we're here to help.

Best,
James

-- 
 James Aylett, occasional trouble-maker
 xapian.org

Olly Betts

2014-Jan-23 11:23 UTC

[Xapian-discuss] getting involved

On Wed, Jan 22, 2014 at 03:30:52PM +0100, Előd Biszak wrote:
> I am interested in the following tasks:
> 
> 1. I want to add support for non-term sub-queries in positional
> queries. For positional sub-queries I think it would make sense (e.g.
> "Thomas Jefferson" NEAR "King George").
Not only positional sub-queries - e.g. these have natural
interpretations too:

    A NEAR (B OR C)  equivalent to  (A NEAR B) OR (A NEAR C)

    (X AND Y) NEAR Z  equivalent to  (X NEAR Z) AND (Y NEAR Z)

In 1.2.x, simple cases like these are handled by expanding them as
shown, but that no longer happens on trunk - the internals of Query
objects were reimplemented, and making this work again is the remaining
thing to do for that.  And at least for the OR case it seems better to
handle it with an OrPositionList class.
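
As a rough illustration of the OR case (a sketch only, not Xapian's
code), the snippet below compares the expanded form with a merged
position list for a single document. The helper names `near` and
`or_positions`, the example positions, and the simplified window test
(absolute distance <= window) are all assumptions made for this example;
Xapian's own definition of the NEAR window may differ.

```python
from heapq import merge


def near(pos_a, pos_b, window):
    """True if some position in pos_a lies within `window` of one in pos_b."""
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)


def or_positions(*position_lists):
    """Merge sorted position lists into one sorted list, dropping duplicates."""
    merged = []
    for p in merge(*position_lists):
        if not merged or merged[-1] != p:
            merged.append(p)
    return merged


# Hypothetical positions of terms A, B and C within one document.
doc = {"A": [3, 40], "B": [90], "C": [5, 70]}
window = 2

# Expanded form: (A NEAR B) OR (A NEAR C)
expanded = near(doc["A"], doc["B"], window) or near(doc["A"], doc["C"], window)

# Merged form: A NEAR (B OR C), using one combined position list
combined = near(doc["A"], or_positions(doc["B"], doc["C"]), window)
```

Both forms give the same answer for a document, but the merged form
walks A's positions once rather than once per OR branch.
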
> 2. Possible improvements to positional queries. The docs say that
> "Queries which use positional information can be significantly slower to
> process [...] This will be improved in the future". Are there any thoughts
> on how to improve them?
There have been some substantial improvements recently.  1.2.14 added
an optimisation to check weight before positional conditions, which
helps a lot.  There are more major changes on trunk - the decoding of
positional data is now done lazily, and the position table key order has
changed to improve locality of access at search time.
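
The general idea of checking weight before positional conditions can be
sketched as follows. This is not how Xapian implements it; the function
`top_k_phrase`, the `(doc_id, weight)` candidate format and the
`phrase_matches` callback are hypothetical names for this example. The
expensive positional check is skipped whenever the document's weight is
already too low to enter the current top k.

```python
import heapq


def top_k_phrase(candidates, k, phrase_matches):
    """Return (results, checks) for candidates given as (doc_id, weight).

    phrase_matches(doc_id) stands in for the expensive positional check.
    """
    heap = []  # min-heap of (weight, doc_id); heap[0] is the weakest result
    checks = 0
    for doc_id, weight in candidates:
        # Cheap test first: a document that cannot beat the weakest of the
        # current top k is dropped without reading its positional data.
        if len(heap) == k and weight <= heap[0][0]:
            continue
        checks += 1
        if phrase_matches(doc_id):
            if len(heap) < k:
                heapq.heappush(heap, (weight, doc_id))
            else:
                heapq.heapreplace(heap, (weight, doc_id))
    return sorted(heap, reverse=True), checks


# Hypothetical candidates; pretend document 4 fails the phrase check.
candidates = [(1, 9.0), (2, 8.0), (3, 1.0), (4, 7.5), (5, 0.5)]
results, checks = top_k_phrase(candidates, 2, lambda doc_id: doc_id != 4)
```

In this run only two of the five candidates need the positional check,
because the other three cannot beat the weakest result already held.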

There's likely still scope for further improvement though.

There's a patch in ticket #394 which is promising:

http://trac.xapian.org/ticket/394

This originally made a huge difference to the worst cases, but the
mechanism is hooked in rather crudely, which made us uncomfortable with
merging it.  Since then, the weight optimisation which was added in
1.2.14 has reduced the impact this patch would have - the timings in the
most recent comment show a 10% improvement, but that was before the most
recent changes on trunk.  Also, the current default pond size was an
arbitrary choice and nobody has tried tuning it and seeing what
difference that makes.
> 3. Adding support for efficient ranking based on positional information.
> 
> What do you guys think, are these possible improvements? I have the time
> and the motivation.
The trick for (3) is to have a model which bounds the contribution which
the positional information can make to the weight.  With that, you can
feed that bound into Xapian's weighting model, and that will help to
eliminate many documents without having to actually look at their
positional data at all.

E.g. say you're looking for 10 results for the query:

  hello world

If you know that the weight bonus for the two words appearing together
is <= 6 (for example), and you have already found 10 results, and the
lowest scoring of these has a total weight of 50, then any document
which matches hello AND world but scores < 44 can't score enough to make
it into the final top 10.
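
The arithmetic in that example can be written as a small test. This is
only a sketch; `needs_positional_check` and its parameter names are made
up for illustration and are not part of Xapian.

```python
def needs_positional_check(base_weight, max_bonus, min_weight_in_top_k):
    """False if even the maximum positional bonus cannot lift the document
    to the weight of the weakest result currently in the top k."""
    return base_weight + max_bonus >= min_weight_in_top_k


# Numbers from the example: bonus bounded by 6, weakest top-10 weight 50,
# so the cut-off is 50 - 6 = 44.
skip_43 = not needs_positional_check(43, 6, 50)  # 43 + 6 = 49 < 50
check_44 = needs_positional_check(44, 6, 50)     # 44 + 6 = 50, could tie
```

A document scoring 43 is skipped without touching its positional data,
while one scoring 44 still has to be examined.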

Cheers,
    Olly
