Hurricane Tong
2014-Feb-11 14:36 UTC
[Xapian-devel] A beginner in "Posting list encoding improvements"
Greetings everyone,

I'm an undergraduate Computer Science student at Fudan University, China. I have been looking for projects for Google Summer of Code 2014, and I discovered Xapian. Among the many projects listed at http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:Postinglistencodingimprovements, I'm very interested in "Posting list encoding improvements". When I took the Data Structures and Discrete Maths courses, Huffman coding got me interested in encoding and decoding, but for various reasons I haven't had an opportunity to learn more about encoding. As soon as I found this project, I saw it as a good chance to learn more.

Following the guide for beginners in Xapian, I started building Xapian on my computer. I'm used to working on Windows, with MS Visual Studio 2012, but I ran into many problems when building. Some source code doesn't support Chinese well, such as xapian-core-1.2.8\win32\xapdep\xapdep.c, and I needed to modify some code to fit a Chinese environment. Some code also seems not to work well with the new C++ features in VS2012. If anyone else uses Xapian on Windows, I think it would be helpful for us to discuss issues with building on Windows together. In the end I succeeded in running some demo code in Release mode, but it still fails in Debug mode.

I have finished reading the paper provided, about VSEncoding, and I plan to read the source code relevant to this project. Then I will try to put together my own proposal. I would appreciate it very much if you could give me some extra advice on getting started with the "Posting list encoding improvements" project. I'm looking forward to participating in this project.

Thanks for reading.
Olly Betts
2014-Feb-12 08:55 UTC
[Xapian-devel] A beginner in "Posting list encoding improvements"
On Tue, Feb 11, 2014 at 10:36:07PM +0800, Hurricane Tong wrote:
> According to the guideline for beginners in Xapian, I started to build
> Xapian in my computer. I used to work in Windows, with MS Visual
> Studio 2012. But I was faced with many problems when building. Some
> source code doesn't support Chinese well, such as
> xapian-core-1.2.8\win32\xapdep\xapdep.c. I need to modify some code to
> fit Chinese environment. And some code seem not to fit new C++
> features in VS2012 well. If there is someone who also uses Xapian in
> Windows, I think it will be helpful for us to talk about some issues
> in building in Windows together.

Unfortunately those makefiles haven't been actively maintained for a while. If you can figure out what needs changing, I'm happy to apply patches to improve things.

For GSoC projects, I'd recommend developing on Linux, or another Unix-like platform. I think everyone who has so far expressed an interest in mentoring uses Linux or Mac OS X, so we're much better placed to help with development on such platforms.

You'll also want to use trunk for GSoC projects - the 1.2 release series only gets bug fixes and non-invasive new features.

> I have finished reading the paper provided, about VSEncoding. And plan
> to read some source code concerning about this project. Then I will
> try to put up some my own proposal. And I will appreciate it much if
> you can give me some extra advice for beginning with the project
> "posting list encoding improvements". I'm looking forward to
> participating in this project.

VSEncoding seems a good all-round candidate for encoding posting lists compactly, but it would also be good to have some other encodings available. For example, document lengths (which are essentially stored as a posting list) would benefit from being encoded in a way that lets us skip ahead in a chunk quickly. E.g. a fixed width per chunk would give O(1) access to any entry in the chunk.

http://trac.xapian.org/ticket/326 has some discussion about that.

Cheers,
    Olly
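To make the fixed-width suggestion above concrete, here is a minimal C++ sketch. It is an illustration under assumptions and not Xapian's actual chunk format: the class name FixedWidthChunk, the 32-bit value type, and the 64-bit word layout are all hypothetical. The idea is that each chunk picks one bit width large enough for its biggest value, so entry i always starts at bit i * width and can be read without decoding the entries before it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: this is not Xapian's on-disk format.
// Every value in the chunk is stored with the same number of bits, so the
// i-th value starts at bit offset i * width and can be fetched directly.
class FixedWidthChunk {
    std::vector<std::uint64_t> words_;  // packed bits, lowest bit first
    unsigned width_ = 1;                // bits per value, in the range 1..32
    std::size_t count_ = 0;             // number of values stored

  public:
    explicit FixedWidthChunk(const std::vector<std::uint32_t>& values)
        : count_(values.size()) {
        // Choose the smallest width that can hold the largest value.
        std::uint32_t max_value = 0;
        for (std::uint32_t v : values) {
            if (v > max_value) max_value = v;
        }
        while (width_ < 32 && (max_value >> width_) != 0) ++width_;

        words_.assign((count_ * width_ + 63) / 64, 0);
        for (std::size_t i = 0; i < count_; ++i) {
            std::size_t bit = i * width_;
            std::size_t word = bit / 64;
            unsigned shift = static_cast<unsigned>(bit % 64);
            words_[word] |= std::uint64_t(values[i]) << shift;
            // A value may straddle two words; store its high bits too.
            if (shift + width_ > 64) {
                words_[word + 1] |= std::uint64_t(values[i]) >> (64 - shift);
            }
        }
    }

    unsigned width() const { return width_; }
    std::size_t size() const { return count_; }

    // O(1) random access: no earlier entries need to be decoded.
    std::uint32_t get(std::size_t i) const {
        assert(i < count_);
        std::size_t bit = i * width_;
        std::size_t word = bit / 64;
        unsigned shift = static_cast<unsigned>(bit % 64);
        std::uint64_t v = words_[word] >> shift;
        if (shift + width_ > 64) {
            v |= words_[word + 1] << (64 - shift);
        }
        std::uint64_t mask = (std::uint64_t(1) << width_) - 1;
        return static_cast<std::uint32_t>(v & mask);
    }
};
```

The trade-off in this sketch is that one unusually large value widens every entry in its chunk, which is why a fixed width would sit alongside a compact encoding such as VSEncoding instead of replacing it.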
James Aylett
2014-Feb-12 12:18 UTC
[Xapian-devel] A beginner in "Posting list encoding improvements"
On 12 Feb 2014, at 08:55, Olly Betts <olly at survex.com> wrote:
> For GSoC projects, I'd recommend developing on Linux, or another
> Unix-like platform. I think everyone who has so far expressed an
> interest in mentoring uses Linux or Mac OS X, so we're much better
> placed to help with development on such platforms.

If you're on Windows, the way of doing this is to virtualise (either VirtualBox, which is free, or VMware, which isn't but you can sometimes find prepackaged VMs which work with their free version). I'd recommend using a recent Ubuntu (13.04 or 13.10) for preference, although 12.04 (which is the Long Term Support version) shouldn't present too many pain points.

If you need help, I can probably be of assistance. Keeping track of which commands you type (e.g. by using script(1)) is a good idea if you're unfamiliar with Linux and think you might run into problems.

J

-- 
James Aylett, occasional trouble-maker
xapian.org