Andrew Betts
2011-Apr-02 11:47 UTC
[Xapian-discuss] Xapian docs (was Re: Xapian-discuss Digest, Vol 83, Issue 2)
> I think this is a shining example of how well Xapian works with large
> document collections. I was just discussing this with my colleagues here
> and one of the issues that came up is that we'd love Xapian to become
> really lot more popular but have found that the documentation's a bit
> difficult to get into, as is the API.

I agree. There are a few gotchas, as well as branch stuff like matchspy that is phenomenally useful, but largely undocumented and therefore underused (though by the looks of it matchspy is now in core). I actually find the API docs to be a comprehensive reference, which is a great start - I've recently been trying to use various RabbitMQ wrappers for PHP and it's incredibly frustrating not being able to look up the syntax for something even when you know what you want. Xapian isn't like that - if I know what I'm looking for, I can find it easily and the docs are comprehensive on the subject. What's missing is a well organised resource on how to implement Xapian at a more strategic level, and how to achieve various common use cases well in each of the supported languages.

> So I was wondering: do you have any thoughts on improving this and would
> you like some help? I use Xapian a fair bit (mostly on
> www.reportbuyer.com) together with a new wrapper for our CMS and have a
> bit of spare time. I'd be happy to write up examples of how to use some
> of the bindings, particularly PHP as that's my area.

I'd also be happy to contribute. A cookbook type format could be worth considering, like http://diveintogreasemonkey.org/patterns/index.html (though note that they haven't kept this up to date). To a degree Xapian suffers from the same problem as RabbitMQ on high level docs - there's just a list of independently authored, inconsistently formatted articles, many of which cover the same ground. See http://trac.xapian.org/wiki/Articles and http://www.rabbitmq.com/devtools.html.
Since Xapian has its own bindings for lots of languages, it should be a relatively straightforward matter to provide consistent, high level documentation that can include examples in multiple languages.

Anyway, happy to support this kind of project. Can only be a good thing to get more people introduced to Xapian.

Andrew

>> Message: 1
>> Date: Thu, 31 Mar 2011 11:55:32 -0700
>> From: Kevin Duraj <kevinduraj at gmail.com>
>> Subject: [Xapian-discuss] Xapian Index: 607GB = 219 million of unique
>> documents
>> To: xapian-discuss at lists.xapian.org
>> Message-ID:
>> <AANLkTiku6tA06=s9hmX7nTcBHWSDfxdDgnHJuLUKhRBN at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> It took approximately five days, having single process using one core
>> CPU and 6GB of memory to build this giant 607GB single Xapian index,
>> containing 219 million of unique documents (web sites). So far I did
>> not found any other implementation that would enable me to build such
>> a single index containing over 200 million documents, while testing
>> Lucene, Solr, MySQL, Hadoop and Oracle. Probably that would be the
>> real reason why Xapian was not approved last year, for Google's Summer
>> of Code. Xapian is the type of open source that they don't want you to
>> know about.
>>
>> Following index can be search from: http://myhealthcare.com/
>>
>> total 607G
>> -rw-r--r-- 1 kevin kevin   28 2011-03-31 06:09 iamchert
>> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:50 position.baseA
>> -rw-r--r-- 1 kevin kevin 622K 2011-03-31 06:09 position.baseB
>> -rw-r--r-- 1 kevin kevin 311G 2011-03-31 06:09 position.DB
>> -rw-r--r-- 1 kevin kevin   14 2011-03-30 17:19 postlist.baseA
>> -rw-r--r-- 1 kevin kevin 139K 2011-03-31 00:49 postlist.baseB
>> -rw-r--r-- 1 kevin kevin  70G 2011-03-31 00:49 postlist.DB
>> -rw-r--r-- 1 kevin kevin   14 2011-03-31 00:49 record.baseA
>> -rw-r--r-- 1 kevin kevin 261K 2011-03-31 01:24 record.baseB
>> -rw-r--r-- 1 kevin kevin 131G 2011-03-31 01:24 record.DB
>> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:24 termlist.baseA
>> -rw-r--r-- 1 kevin kevin 192K 2011-03-31 01:50 termlist.baseB
>> -rw-r--r-- 1 kevin kevin  96G 2011-03-31 01:50 termlist.DB
>>
>> $ delve .
>> number of documents = 219344757
>> average document length = 28255.9
>> document length lower bound = 1
>> document length upper bound = 173153
>> highest document id ever used = 219344757
>>
>> Cheers,
>> Kevin Duraj
>> http://myhealthcare.com
On Sat, Apr 02, 2011 at 12:47:21PM +0100, Andrew Betts wrote:
> > I think this is a shining example of how well Xapian works with large
> > document collections. I was just discussing this with my colleagues here
> > and one of the issues that came up is that we'd love Xapian to become
> > really lot more popular but have found that the documentation's a bit
> > difficult to get into, as is the API.
>
> I agree. There are a few gotchas, as well as branch stuff like
> matchspy that is phenomenally useful, but largely undocumented and
> therefore underused.

I don't think it's reasonable to expect complete user level documentation for features under development on a branch. Having access to features being developed on a branch is like a sneak preview. If they were ready for prime time, the branch would probably have been merged.

> (though by the looks of it matchspy is now in core).

Most of it was merged in 1.1.3. There's still some stuff on the branch:

http://trac.xapian.org/ticket/199

> I actually find the API docs to be a comprehensive reference,
> which is a great start - I've recently been trying to use various
> RabbitMQ wrappers for PHP and its incredibly frustrating not being
> able to look up the syntax for something even when you know what you
> want. Xapian isn't like that - if I know what I'm looking for, I can
> find it easily and the docs are comprehensive on the subject. What's
> missing is a well organised resource on how to implement Xapian at a
> more strategic level, and how to achieve various common use cases well
> in each of the supported languages.

Yes, there are definitely gaps in the documentation. Partly I think it depends on what you're looking for - for example, some people like tutorial-style documents for getting into something new, others would rather just have some working code to poke at, etc. Ideally we should have documentation and related materials to suit all learning styles.
We do have a "topic document" for most of the newer features, but these are missing for most older stuff. And in particular there's not really a good overall overview.

> > So I was wondering: do you have any thoughts on improving this and would
> > you like some help? I use Xapian a fair bit (mostly on
> > www.reportbuyer.com) together with a new wrapper for our CMS and have a
> > bit of spare time. I'd be happy to write up examples of how to use some
> > of the bindings, particularly PHP as that's my area.
>
> I'd also be happy to contribute. A cookbook type format could be
> worth considering, like
> http://diveintogreasemonkey.org/patterns/index.html (though note that
> they haven't kept this up to date). To a degree Xapian suffers from
> the same problem as RabbitMQ on high level docs - there's just a list
> of independently authored, inconsistently formatted articles many of
> which cover the same ground. See http://trac.xapian.org/wiki/Articles
> and http://www.rabbitmq.com/devtools.html.

Hmm, well, we already have higher level documents, written by the development team and consistently formatted, which cover most of the newer features - for example, facets are covered here:

http://xapian.org/docs/facets.html

These are all linked to in a list on the doc index page:

http://xapian.org/docs/

Did you just not find these (in which case we partly have a navigational problem)? Or are you looking for something else? Or just missing these for the older features (which naturally tend to be the more fundamental ones which users will want to get to grips with first)?

> Since Xapian has its own bindings for lots of languages, it should be
> a relatively straightforward matter to provide consistent, high level
> documentation that can include examples in multiple languages.

Yes, it would be nice to provide alternative versions of inline code for different languages.

> Anyway, happy to support this kind of project. Can only be a good
> thing to get more people introduced to Xapian.

Contributions are certainly most welcome. Feel free to use the wiki as a tool for collaborating. We are trying to standardise on reStructuredText as the markup to use for authored documents, but converting Trac markup to that isn't hard, and I think Trac may be able to handle in-line reStructuredText anyway.

We're already trying to keep track of documentation omissions here, so please use that if it's useful:

http://trac.xapian.org/wiki/MissingDocumentation

If there's sufficient interest in working on a "Xapian book", I'd be happy to be involved. I worked on the GSoC mentoring manual using the FLOSS Manuals tools and methodology, which seems a good approach to getting a book written in a remarkably short time:

http://flossmanuals.net/

If we have a few people interested in a writing sprint, we could give it a go.

Cheers,
Olly