Hi,

I am a young developer from Hungary. I have been using Xapian for quite
a while, and I am quite familiar with the basic concepts. I have tested
various search engines in the past few years, and for now Xapian seems
to be the best choice for my purposes. I am willing to get deeply
involved in the project.

To get started, my main concerns are:

1. Where can I get the current developer repository?
2. What version control system should I use to pull it from your
   server?
3. How should I share my code?

I am interested in the following tasks:

1. I want to add support for non-term sub-queries in positional
   queries. For positional sub-queries I think it would make sense
   (e.g. "Thomas Jefferson" NEAR "King George").
2. Possible improvements to positional queries. The docs say that
   "Queries which use positional information can be significantly
   slower to process [...] This will be improved in the future". Are
   there any thoughts on how to improve them?
3. Adding support for efficient ranking based on positional
   information.

What do you guys think, are these possible improvements? I have the
time and the motivation.

Best regards,
Biszak Előd

On 22 Jan 2014, at 14:30, Előd Biszak <biszakelod at gmail.com> wrote:

> I am a young developer from Hungary. I have been using Xapian for quite a
> while, and I am quite familiar with the basic concepts. I have tested
> various search engines in the past few years, and for now Xapian seems to
> be the best choice for my purposes. I am willing to get deeply involved in
> the project.

Hi! Great to have more people involved. For the repo, check out
<http://xapian.org/bleeding>, which has all the information you should
need. (git is recommended, as we're moving away from subversion; from
GSoCs past we have experience working with github pull requests, but if
you'd rather work off git.xapian.org and use patchsets I'm sure we can
manage that too.)

> I am interested in the following tasks

[snip]

The only thing I'd say about the tasks you're interested in is that
they're all quite "deep" in terms of needing to understand something
hefty inside Xapian: matching, the query parser, weighting... and
usually more than one of those will have to be touched. I of course
won't dissuade you from working on those, but I'd recommend getting to
grips with the codebase before diving into one of these larger projects.

There are some "bite size" pieces on the wiki
<http://trac.xapian.org/wiki/ProjectIdeas> which will enable you to get
to grips with how the pieces fit together "under the hood", or you could
see if any existing bugs or feature requests
<http://trac.xapian.org/report/1> take your fancy. Spending a bit of
time understanding how Xapian fits together at the code level should
make working with it subsequently a lot easier.

There's also some internals documentation that you'll want to at least
skim <http://xapian.org/docs/internals.html>, and you'll want to look at
our HACKING file
<https://github.com/xapian/xapian/blob/master/xapian-core/HACKING> so
you know what support the build system provides for development, how to
use more sophisticated features of the test framework, some coding
standards and other useful info (it's a bit long, but worth reading
right the way through).

Obviously, send any questions either to the mailing list (xapian-devel
is more appropriate when you're working on the Xapian code itself) or
ask on IRC. It may take a few days for someone to get back to you,
particularly if it's about part of the code which fewer people have
worked on, but we're here to help.

Best,
James

-- 
James Aylett, occasional trouble-maker
xapian.org

On Wed, Jan 22, 2014 at 03:30:52PM +0100, Előd Biszak wrote:
> I am interested in the following tasks:
>
> 1. I want to add support for non-term sub-queries in positional
> queries. For positional sub-queries I think it would make sense
> (e.g. "Thomas Jefferson" NEAR "King George").

Not only positional sub-queries - e.g. these have natural
interpretations too:

    A NEAR (B OR C)

equivalent to

    (A NEAR B) OR (A NEAR C)

and

    (X AND Y) NEAR Z

equivalent to

    (X NEAR Z) AND (Y NEAR Z)

In 1.2.x, we handle simple cases like these by expanding them as shown,
but that no longer happens on trunk - the internals of Query objects
were reimplemented, and making this work again is the remaining thing to
do for that. And at least for the OR case it seems better to handle it
with an OrPositionList class.

> 2. Possible improvements to positional queries. The docs say that
> "Queries which use positional information can be significantly slower to
> process [...] This will be improved in the future". Are there any thoughts
> on how to improve them?

There have been some substantial improvements recently. 1.2.14 added an
optimisation to check weight before positional conditions, which helps a
lot.

There are more major changes on trunk - the decoding of positional data
is now done lazily, and the position table key order has changed to
improve locality of access at search time.

There's likely still scope for further improvement though. There's a
patch in ticket #394 which is promising:

http://trac.xapian.org/ticket/394

This originally made a huge difference to the worst cases, but the
mechanism is hooked in rather crudely, which made us uncomfortable with
merging it. Since then, the weight optimisation which was added in
1.2.14 has reduced the impact this patch would have - the timings in the
most recent comment show a 10% improvement, but that was before the most
recent changes on trunk.
Also, the current default pond size was an arbitrary choice and nobody
has tried tuning it and seeing what difference that makes.

> 3. Adding support for efficient ranking based on positional information.
>
> What do you guys think, are these possible improvements? I have the time
> and the motivation.

The trick for (3) is to have a model which bounds the contribution which
the positional information can make to the weight. With that, you can
feed that bound into Xapian's weighting model, and that will help to
eliminate many documents without having to actually look at their
positional data at all.

E.g. say you're looking for 10 results for the query:

    hello world

If you know that the weight bonus for the two words appearing together
is <= 6 (for example), and you have already found 10 results, and the
lowest scoring of these has a total weight of 50, then any document
which matches hello AND world but scores < 44 before the positional
bonus can't score enough to make it into the final top 10.

Cheers,
    Olly
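
The bound argument in the "hello world" example above can be sketched in
a few lines of Python. This is only an illustration of the arithmetic,
not Xapian code: the function names, the (doc_id, base_weight) candidate
format and the positional_bonus callback are all hypothetical, and the
sketch assumes the bonus always lies between 0 and max_bonus.

```python
import heapq


def cannot_reach_top_k(base_weight, max_bonus, kth_best_weight):
    """True if even the full positional bonus cannot lift this document
    to the weight of the current k-th best result."""
    return base_weight + max_bonus < kth_best_weight


def top_k_with_positional_bound(candidates, k, max_bonus, positional_bonus):
    """Hypothetical top-k loop that skips positional checks where possible.

    candidates: iterable of (doc_id, base_weight) pairs.
    positional_bonus: callable(doc_id) -> bonus in [0, max_bonus]; this
        stands in for the expensive step of reading positional data.
    Returns (results sorted by descending weight, number of bonus calls).
    """
    heap = []  # min-heap of (total_weight, doc_id); heap[0] is the k-th best
    checks = 0
    for doc_id, base in candidates:
        if len(heap) == k and cannot_reach_top_k(base, max_bonus, heap[0][0]):
            continue  # eliminated without looking at positional data
        checks += 1
        total = base + positional_bonus(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (total, doc_id))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, doc_id))
    return sorted(heap, reverse=True), checks
```

With max_bonus set to 6 and a current k-th best weight of 50, a document
with a base weight of 43 is skipped outright, while one with a base
weight of 44 or more still has to have its positional data examined. The
tighter the bound, the more documents can be skipped this way.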