Hurricane Tong
2014-Feb-18 14:08 UTC
[Xapian-devel] contribution to "Add more stemming algorithms"
Hi,

I am trying to contribute to the "bite-size" project "Add more stemming algorithms". I have implemented the Lancaster (Paice/Husk) stemming algorithm as a class named StemLancaster, which extends StemImplementation, following the guide at http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm. I think this class could be added to the default API for users who are interested in this algorithm.

The source code is at https://github.com/HurricaneTong/Xapian. Could you give me some suggestions about it, and could it be added to the Xapian source after any necessary modifications?

Besides, I indexed about 5000 documents from Wikipedia with Brass and Chert, and executed about 40000 single-term searches. With the Brass database it took 5.66s, and with the Chert database it took 5.57s (in a VBox virtual machine). It seems that Brass is slower in this case.

------------------
HurricaneTong,
Second Year Undergraduate, School of Computer Science, Fudan University, China.
James Aylett
2014-Feb-18 14:32 UTC
[Xapian-devel] contribution to "Add more stemming algorithms"
On 18 Feb 2014, at 14:08, "Hurricane Tong" <zhangshangtong.cpp at qq.com> wrote:

> I am trying to contribute to the "bite-size" project, "Add more stemming algorithms".
> I implement the Lancaster (Paice/Husk) stemming algorithm by building a class named StemLancaster extending
> the StemImplementation, with the guide in http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm.
> I think this class can be added to the default API for the potential users who are interested in this algorithm.

Hi, that sounds like a good approach to getting familiar with Xapian, the build system &c.

> There is the source code, https://github.com/HurricaneTong/Xapian, would you like to give me some suggestions about the source code, and can this code be added to the source code of Xapian after necessary modifying ?

Either this will want integrating into the Xapian codebase, or it will need its own build system and tests. For something this size, I'd think that integrating it is reasonable. For this, you'll want to fork Xapian on github, integrate your code into it, and then issue a pull request (which provides ways for us to comment directly on the code line by line).

Before you do that, please read:

https://github.com/xapian/xapian/blob/master/xapian-core/HACKING

which talks about coding style (there are some changes you'll want to make), licensing statements and other pieces that we like to see for submissions.

Crucially, we don't want to merge changes that do not have supporting tests, or that are not documented. It looks like you have some API documentation for your code, but there will need to be something in docs/stemming.rst; tests should be added to tests/api_stem.cc and tests/stemtest.cc: you want to ensure that constructing a Lancaster stemmer by name, such as Xapian::Stem st("lancaster"), will work, but also that running the stemmer produces the expected results. We do this for existing stemmers using xapian-data/stemming (which is used by tests/stemtest.cc); you'll need a word list and expected output, which the Lancaster stemmer may provide as a reference?

Also check out <http://xapian.org/docs/tests.html>, which talks about how to write tests.

J

--
James Aylett, occasional trouble-maker
xapian.org
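
The second kind of check described above (run the stemmer over a word list and compare against expected output) can be sketched roughly as follows. This is only an illustration and not the code in tests/stemtest.cc: StemFn, Mismatch, check_stemmer and toy_stem are hypothetical names invented for this sketch, and toy_stem is a deliberately trivial stand-in, not the Paice/Husk algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a stemmer: any callable mapping a word to its stem.
using StemFn = std::function<std::string(const std::string&)>;

// One word whose stem did not match the reference output.
struct Mismatch {
    std::string word;
    std::string expected;
    std::string actual;
};

// Stem every word and collect the cases where the result differs from the
// reference output. The two lists are assumed to be parallel (same length,
// same order).
std::vector<Mismatch> check_stemmer(const StemFn& stem,
                                    const std::vector<std::string>& words,
                                    const std::vector<std::string>& expected)
{
    assert(words.size() == expected.size());
    std::vector<Mismatch> bad;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::string got = stem(words[i]);
        if (got != expected[i]) {
            bad.push_back({words[i], expected[i], got});
        }
    }
    return bad;
}

// Toy stemmer for illustration only: strips a single trailing "s".
// This is NOT Paice/Husk; it just gives check_stemmer something to call.
std::string toy_stem(const std::string& w)
{
    if (w.size() > 1 && w.back() == 's') {
        return w.substr(0, w.size() - 1);
    }
    return w;
}
```

An empty result from check_stemmer means every word matched its reference stem. With Xapian itself, the callable could presumably be a small lambda wrapping a Xapian::Stem object, for example `[&st](const std::string& w) { return st(w); }`, with the word list and expected output read from the reference data.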
Olly Betts
2014-Feb-18 22:41 UTC
[Xapian-devel] contribution to "Add more stemming algorithms"
On Tue, Feb 18, 2014 at 10:08:20PM +0800, Hurricane Tong wrote:

> I am trying to contribute to the "bite-size" project, "Add more
> stemming algorithms".
> I implement the Lancaster (Paice/Husk) stemming algorithm by building
> a class named StemLancaster extending
> the StemImplementation, with the guide in
> http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm.
> I think this class can be added to the default API for the potential
> users who are interested in this algorithm.
> There is the source code, https://github.com/HurricaneTong/Xapian,
> would you like to give me some suggestions about the source code, and
> can this code be added to the source code of Xapian after necessary
> modifying ?

| This class is implemented based on an ANSI C implementation by Andy
| Stark

Unfortunately there's no licence provided for that implementation, which sadly means we can't use it in Xapian. I had a quick look and I think your code is pretty clearly a derivative work of Andy Stark's.

Last year another student provided a Paice/Husk implementation based on this same code, so I think we need to add a warning to the project idea that we can't use this code unless someone is able to contact Andy Stark and get an explicit licence (which looks hard as there are no contact details for him on the download, and it's a relatively common name).

> Besides, I indexed about 5000 documents from wikipedia with Brass and
> Chert, and execute about 40000 single term search.
> With the brass database, it costs 5.66s, and with the chert database,
> it costs 5.57s, ( In virtual machine VBox ). it seems that brass is
> slower in this condition.

It's expected that brass is currently slower to index, due to the positional data storage changes. I'm hopeful we can regain that loss (and more) by optimising how data is stored in memory while indexing.

Cheers,
Olly