Hi,

I'm Michael. I would like to participate in this year's Google Summer of Code, and I picked Xapian as the project to code for. Before writing a full proposal, I want to get in contact with the community, introduce myself, and discuss my ideas for contributing to Xapian.

First of all, I'd like to talk about my motivation. I'm currently working on a webapp for document classification, and I came across Xapian while researching open source search engine alternatives to Lucene. It seemed like a fairly good, lightweight search engine library, so I decided to use it rather than implement one myself, and to push its development further by adding features I could use for my own project.

I checked the ideas wiki and the source code, and noticed that currently only BM25 and a traditional probabilistic approach are implemented as weighting schemes, so I'm interested in working on improving the ranking by implementing more weighting / ranking schemes:

* other statistical schemes such as DFR (Divergence from Randomness) and tf-idf based term weighting schemes
* word-distance weighting: documents which contain the query terms close to each other get higher scores
* location-based weighting: terms that appear near the top of a document are generally more important
* size-based weighting: longer documents tend to be more important than shorter ones, as they contain more words
* a neural network (MLP) for learning: if the user decides on (clicks) a certain document, the network learns to associate the query string with that document. This could be interesting for improving ranking quality.

Another interesting project is the query parser. I could imagine improving it by adding some natural language processing to avoid the need for boolean keywords, as well as using semantic databases like WordNet to improve matching by normalizing the semantics of a query term.
Of course, this is not an easy task, and multilanguage support is quite hard, but I still think it's worth working on, starting with some special cases. A natural language date parser, like Chronic for Ruby, could be a useful feature when searching for date ranges in a chronological document archive such as blogs. This is just one example.

A few words about myself: I'm a computer science student from Berlin. I've worked for four years at a research center, developing software for 2D/3D image processing, mainly in C/C++. I participated in two minor projects for the open source software OCRopus (http://code.google.com/p/ocropus/), implementing some classification algorithms for character recognition / segmentation. I have fairly good knowledge of C/C++, Python, Linux, algorithms, data structures, mathematics, artificial intelligence concepts, and more.

I hope to get some feedback from you on my ideas soon. I will then start writing a full proposal to apply formally by 08.04.11.

Best regards,
Michael
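As a rough illustration of the tf-idf style term weighting proposed above, here is a minimal, self-contained sketch (the toy documents, helper names, and the exact damping/IDF variants are invented for illustration; this is not Xapian's API or implementation):

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """Classic tf-idf: term frequency damped by a log, scaled by
    inverse document frequency (rarer terms weigh more)."""
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    tf = 1.0 + math.log(term_freq)
    idf = math.log(num_docs / doc_freq)
    return tf * idf

def score(query_terms, doc, doc_freqs, num_docs):
    """Sum the tf-idf contributions of each query term in the document."""
    return sum(
        tf_idf(doc.count(t), doc_freqs.get(t, 0), num_docs)
        for t in query_terms
    )

# Toy collection of three "documents" (lists of tokens).
docs = [
    ["open", "source", "search", "engine"],
    ["search", "engine", "library", "search"],
    ["document", "classification", "webapp"],
]

# Document frequency: in how many documents does each term occur?
doc_freqs = {}
for d in docs:
    for t in set(d):
        doc_freqs[t] = doc_freqs.get(t, 0) + 1

scores = [score(["search", "engine"], d, doc_freqs, len(docs)) for d in docs]
# Document 1 repeats "search", so it outranks document 0 for this query;
# document 2 shares no query terms and scores 0.
```

Real implementations differ in how they damp tf and define idf, but the shape (per-term tf × idf, summed over the query) is the common core of these schemes.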
On Tuesday, March 29, 2011 at 3:07 PM, Michael Thomas wrote:

> Hi,
>
> I'm Michael. I would like to participate in this year's Google Summer
> of Code, and I picked Xapian as the project to code for.
>
> Before writing a full proposal, I want to get in contact with the
> community, introduce myself, and discuss my ideas for contributing to
> Xapian.

Awesome, it's great to hear from you!

> First of all, I'd like to talk about my motivation.
> I'm currently working on a webapp for document classification, and I
> came across Xapian while researching open source search engine
> alternatives to Lucene. It seemed like a fairly good, lightweight
> search engine library, so I decided to use it rather than implement
> one myself, and to push its development further by adding features I
> could use for my own project.
>
> I checked the ideas wiki and the source code, and noticed that
> currently only BM25 and a traditional probabilistic approach are
> implemented as weighting schemes, so I'm interested in working on
> improving the ranking by implementing more weighting / ranking
> schemes:
>
> * other statistical schemes such as DFR and tf-idf based term
>   weighting schemes
> * word-distance weighting: documents which contain the query terms
>   close to each other get higher scores
> * location-based weighting: terms that appear near the top of a
>   document are generally more important
> * size-based weighting: longer documents tend to be more important
>   than shorter ones, as they contain more words
> * a neural network (MLP) for learning: if the user decides on
>   (clicks) a certain document, the network learns to associate the
>   query string with that document. This could be interesting for
>   improving ranking quality.

There's been a lot of discussion on our mailing list about these topics. I'd really recommend reading through those comments.
You can find our searchable archives here: http://dir.gmane.org/gmane.comp.search.xapian.devel

> Another interesting project is the query parser. I could imagine
> improving it by adding some natural language processing to avoid the
> need for boolean keywords, as well as using semantic databases like
> WordNet to improve matching by normalizing the semantics of a query
> term. Of course, this is not an easy task, and multilanguage support
> is quite hard, but I still think it's worth working on, starting with
> some special cases. A natural language date parser, like Chronic for
> Ruby, could be a useful feature when searching for date ranges in a
> chronological document archive such as blogs. This is just one
> example.

This has also been discussed a bit on that list. Essentially, we still want to have a boolean query language similar to the specification here: http://xapian.org/docs/queryparser.html. WordNet sounds very similar to our synonym support. There is a project to support additional languages, but that is a different scope from the queryparser. We also provide range query support already, so there is the possibility of extending those range classes to support other data range types.

One other note: we're trying to keep all GSoC discussion on the devel list, so I've moved this thread there.

Good luck with your project proposal!

--Dan
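For readers unfamiliar with the synonym support mentioned above, the core idea of query-time synonym expansion can be sketched in a few lines of pure Python (the synonym table here is invented; Xapian itself stores synonyms in the database and the QueryParser can expand them into OR-groups, which is what the nested lists below stand in for):

```python
# Hand-made synonym table mapping a term to its alternatives.
synonyms = {
    "car": ["automobile", "auto"],
    "film": ["movie"],
}

def expand(query_terms):
    """Expand each query term with its synonyms. Each inner list is an
    OR-group: any of its members may match in place of the original term."""
    expanded = []
    for term in query_terms:
        alternatives = [term] + synonyms.get(term, [])
        expanded.append(alternatives)
    return expanded

groups = expand(["car", "chase"])
# groups == [['car', 'automobile', 'auto'], ['chase']]
```

A WordNet-backed approach would essentially populate such a table (or the search engine's synonym store) from WordNet's synsets instead of by hand.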
Dan's given a good general answer, but to pick up on a few details of your suggestions:

On Wed, Mar 30, 2011 at 12:07:26AM +0200, Michael Thomas wrote:

> * word-distance weighting: documents which contain the query terms
>   close to each other get higher scores

The tricky part of this is doing it efficiently - if you have to read all the positional data for every term in the query for every potential match, this isn't likely to scale to really large databases. So you want to be able to cull as many candidates as you can based on other factors before considering this. There are similar issues for phrase searches.

> * location-based weighting: terms that appear near the top of a
>   document are generally more important

This is already possible by giving terms at the start a wdf boost - like in the second example here:

http://trac.xapian.org/wiki/FAQ/ExtraWeight

It's pretty common to apply this technique to the title, and (with a smaller boost) to any summary or abstract.

> * size-based weighting: longer documents tend to be more important
>   than shorter ones, as they contain more words

Document length is already factored in - the b parameter of BM25Weight tunes this. You don't need to explicitly give extra weight to longer documents, as they get it already by virtue of being longer - a 12 page document will naturally have a higher wdf for relevant terms than a 1 page document. So in fact you want to counter this effect, if anything. In BM25Weight, b=0 means "no adjustment", while b=1 scales wdf down in proportion to document length. The default is 0.5.

Cheers,
Olly
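To make the effect of the b parameter concrete, here is a sketch of the textbook per-term BM25 weight (the formula is the standard one from the BM25 literature; the document lengths and term frequencies below are invented, and this is not Xapian's actual implementation):

```python
def bm25_term_weight(tf, doc_len, avg_len, k1=1.2, b=0.5, idf=1.0):
    """One term's BM25 contribution. b controls document-length
    normalisation: b=0 ignores length entirely, b=1 scales tf down
    fully in proportion to (doc_len / avg_len)."""
    norm = (1.0 - b) + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# A short document vs. a 12x longer one with proportionally higher tf:
short = bm25_term_weight(tf=2, doc_len=500, avg_len=2000)
long_ = bm25_term_weight(tf=24, doc_len=6000, avg_len=2000)

# With b=0 ("no adjustment"), the long document's raw tf advantage
# is left completely uncorrected:
long_raw = bm25_term_weight(tf=24, doc_len=6000, avg_len=2000, b=0.0)
```

Two effects are visible here: tf saturates (the weight approaches idf * (k1 + 1) however often the term repeats), and raising b shrinks the long document's advantage, which is exactly the "counter this effect" behaviour described above.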