Hi,

I'm Michael. I would like to participate in this year's Google Summer of Code, and I picked Xapian as the project to code for. Before writing a full proposal, I want to get in contact with the community, introduce myself, and discuss my ideas for contributing to Xapian.

First of all, I'd like to talk about my motivation. I'm currently working on a webapp for document classification, and I came across Xapian while researching open source search engine alternatives to Lucene. It seemed like a fairly good, lightweight search engine library, so I decided to use it rather than implement one myself, and to push its development further by adding features I could use for my own project.

I checked the ideas wiki and the source code, and noticed that currently only BM25 and a traditional probabilistic approach are implemented as weighting schemes, so I'm interested in working on improving the ranking by implementing more weighting / ranking schemes:

* other statistical schemes such as DFR (Divergence from Randomness) and tf-idf based term weighting schemes
* word-distance weighting: documents which contain the query terms close to each other get higher scores
* location-based weighting: terms that appear near the top of a document are generally more important
* size-based weighting: longer documents tend to be more important than shorter ones, as they contain more words
* a neural network (MLP) for learning: if the user decides on (clicks) a certain document, the network learns to associate the query string with that document. This could be interesting for improving ranking quality.

Another interesting project is the query parser. I could imagine improving it by adding some natural language processing to avoid the need for boolean keywords, as well as using semantic databases like WordNet to improve matching by normalizing the semantics of a query term.
Of course, this is not an easy task, and multilanguage support is quite hard, but I still think it's worth working on, starting with some special cases. A natural language date parser, like Chronic for Ruby, could be a useful feature when searching for date ranges in a chronological document archive such as blogs. This is just one example.

A few words about myself: I'm a computer science student from Berlin. I've worked for four years at a research center, developing software for 2D/3D image processing, mainly in C/C++. I participated in two minor projects for the open source software OCRopus (http://code.google.com/p/ocropus/), implementing some classification algorithms for character recognition / segmentation. I have fairly good knowledge of C/C++, Python, Linux, algorithms, data structures, mathematics, artificial intelligence concepts, and more.

I hope to get some feedback from you on my ideas soon. I will then start writing a full proposal to apply formally by 08.04.11.

Best regards,
Michael
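As a rough illustration of the tf-idf style term weighting proposed above, here is a minimal, self-contained sketch (the toy documents, helper names, and the exact damping/IDF variants are invented for illustration; this is not Xapian's API or implementation):

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """Classic tf-idf: term frequency damped by a log, scaled by
    inverse document frequency (rarer terms weigh more)."""
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    tf = 1.0 + math.log(term_freq)
    idf = math.log(num_docs / doc_freq)
    return tf * idf

def score(query_terms, doc, doc_freqs, num_docs):
    """Sum the tf-idf contributions of each query term in the document."""
    return sum(
        tf_idf(doc.count(t), doc_freqs.get(t, 0), num_docs)
        for t in query_terms
    )

# Toy collection of three "documents" (lists of tokens).
docs = [
    ["open", "source", "search", "engine"],
    ["search", "engine", "library", "search"],
    ["document", "classification", "webapp"],
]

# Document frequency: in how many documents does each term occur?
doc_freqs = {}
for d in docs:
    for t in set(d):
        doc_freqs[t] = doc_freqs.get(t, 0) + 1

scores = [score(["search", "engine"], d, doc_freqs, len(docs)) for d in docs]
# Document 1 repeats "search", so it outranks document 0 for this query;
# document 2 shares no query terms and scores 0.
```

Real implementations differ in how they damp tf and define idf, but the shape (per-term tf × idf, summed over the query) is the common core of these schemes.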
On Tuesday, March 29, 2011 at 3:07 PM, Michael Thomas wrote:

> Hi,
>
> I'm Michael. I would like to participate in this year's Google Summer
> of Code, and I picked Xapian as the project to code for.
>
> Before writing a full proposal, I want to get in contact with the
> community, introduce myself, and discuss my ideas for contributing to
> Xapian.

Awesome, it's great to hear from you!

> First of all, I'd like to talk about my motivation.
> I'm currently working on a webapp for document classification, and I
> came across Xapian while researching open source search engine
> alternatives to Lucene. It seemed like a fairly good, lightweight
> search engine library, so I decided to use it rather than implement
> one myself, and to push its development further by adding features I
> could use for my own project.
>
> I checked the ideas wiki and the source code, and noticed that
> currently only BM25 and a traditional probabilistic approach are
> implemented as weighting schemes, so I'm interested in working on
> improving the ranking by implementing more weighting / ranking
> schemes:
>
> * other statistical schemes such as DFR and tf-idf based term
>   weighting schemes
> * word-distance weighting: documents which contain the query terms
>   close to each other get higher scores
> * location-based weighting: terms that appear near the top of a
>   document are generally more important
> * size-based weighting: longer documents tend to be more important
>   than shorter ones, as they contain more words
> * a neural network (MLP) for learning: if the user decides on
>   (clicks) a certain document, the network learns to associate the
>   query string with that document. This could be interesting for
>   improving ranking quality.

There's been a lot of discussion on our mailing list about these topics. I'd really recommend reading through those comments.
You can find our searchable archives here: http://dir.gmane.org/gmane.comp.search.xapian.devel

> Another interesting project is the query parser. I could imagine
> improving it by adding some natural language processing to avoid the
> need for boolean keywords, as well as using semantic databases like
> WordNet to improve matching by normalizing the semantics of a query
> term. Of course, this is not an easy task, and multilanguage support
> is quite hard, but I still think it's worth working on, starting with
> some special cases. A natural language date parser, like Chronic for
> Ruby, could be a useful feature when searching for date ranges in a
> chronological document archive such as blogs. This is just one
> example.

This has also been discussed a bit on that list. Essentially, we still want to have a boolean query language similar to the specification here: http://xapian.org/docs/queryparser.html. WordNet sounds very similar to our synonym support. There is a project to support additional languages, but that is a different scope from the queryparser. We also provide range query support already, so there is the possibility of extending those range classes to support other data range types.

One other note: we're trying to keep all GSoC discussion on the devel list, so I've moved this thread there.

Good luck with your project proposal!

--Dan
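For readers unfamiliar with the synonym support mentioned above, the core idea of query-time synonym expansion can be sketched in a few lines of pure Python (the synonym table here is invented; Xapian itself stores synonyms in the database and the QueryParser can expand them into OR-groups, which is what the nested lists below stand in for):

```python
# Hand-made synonym table mapping a term to its alternatives.
synonyms = {
    "car": ["automobile", "auto"],
    "film": ["movie"],
}

def expand(query_terms):
    """Expand each query term with its synonyms. Each inner list is an
    OR-group: any of its members may match in place of the original term."""
    expanded = []
    for term in query_terms:
        alternatives = [term] + synonyms.get(term, [])
        expanded.append(alternatives)
    return expanded

groups = expand(["car", "chase"])
# groups == [['car', 'automobile', 'auto'], ['chase']]
```

A WordNet-backed approach would essentially populate such a table (or the search engine's synonym store) from WordNet's synsets instead of by hand.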
Dan's given a good general answer, but to pick up on a few details of your suggestions:

On Wed, Mar 30, 2011 at 12:07:26AM +0200, Michael Thomas wrote:

> * word-distance weighting: documents which contain the query terms
>   close to each other get higher scores

The tricky part of this is doing it efficiently - if you have to read all the positional data for every term in the query for every potential match, this isn't likely to scale to really large databases. So you want to be able to cull as many candidates as you can based on other factors before considering this. There are similar issues for phrase searches.

> * location-based weighting: terms that appear near the top of a
>   document are generally more important

This is already possible by giving terms at the start a wdf boost - like in the second example here:

http://trac.xapian.org/wiki/FAQ/ExtraWeight

It's pretty common to apply this technique to the title, and (with a smaller boost) to any summary or abstract.

> * size-based weighting: longer documents tend to be more important
>   than shorter ones, as they contain more words

Document length is already factored in - the b parameter of BM25Weight tunes this. You don't need to explicitly give extra weight to longer documents, as they get it already by virtue of being longer - a 12 page document will naturally have a higher wdf for relevant terms than a 1 page document. So in fact you want to counter this effect, if anything. In BM25Weight, b=0 means "no adjustment", while b=1 scales wdf down in proportion to document length. The default is 0.5.

Cheers,
Olly
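To make the effect of the b parameter concrete, here is a sketch of the textbook per-term BM25 weight (the formula is the standard one from the BM25 literature; the document lengths and term frequencies below are invented, and this is not Xapian's actual implementation):

```python
def bm25_term_weight(tf, doc_len, avg_len, k1=1.2, b=0.5, idf=1.0):
    """One term's BM25 contribution. b controls document-length
    normalisation: b=0 ignores length entirely, b=1 scales tf down
    fully in proportion to (doc_len / avg_len)."""
    norm = (1.0 - b) + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# A short document vs. a 12x longer one with proportionally higher tf:
short = bm25_term_weight(tf=2, doc_len=500, avg_len=2000)
long_ = bm25_term_weight(tf=24, doc_len=6000, avg_len=2000)

# With b=0 ("no adjustment"), the long document's raw tf advantage
# is left completely uncorrected:
long_raw = bm25_term_weight(tf=24, doc_len=6000, avg_len=2000, b=0.0)
```

Two effects are visible here: tf saturates (the weight approaches idf * (k1 + 1) however often the term repeats), and raising b shrinks the long document's advantage, which is exactly the "counter this effect" behaviour described above.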