Hello, I am beginning work on the perf test module. The initial steps I aim to accomplish are:

-> Download the Wikipedia dumps for multiple languages.
-> Write Python scripts to tokenize the dumps (probably using something like NLTK, which has powerful built-in tokenizers).
-> Discuss and finalize the design of the search and query expansion perf tests, as I want to complete them before working on the indexing perf test.

*Questions*

-> If anyone has experience with downloading Wikipedia dumps, could I get some advice on how to go about it and the best place to get them?
-> For the search and query expansion perf tests, I need a query log based on the test documents I'll be using (the INEX data set, as per the recent discussion with Olly on IRC). Could I get some advice on how to use the INEX data sets and the corresponding query logs?

Regards,
Aarsh
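P.S. For the tokenization step, this is roughly what I have in mind. Just a minimal sketch assuming NLTK with the punkt model downloaded; "articles.txt" is only a placeholder for whatever plain-text output I end up producing from the dumps.

    # Minimal sketch: tokenize plain text extracted from a dump with NLTK.
    # Assumes `pip install nltk` and that the punkt tokenizer data is
    # available (nltk.download('punkt')). "articles.txt" is a placeholder.
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt', quiet=True)

    with open('articles.txt', encoding='utf-8') as f:
        text = f.read()

    tokens = word_tokenize(text)
    print(tokens[:20])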
On 14 May 2014, at 05:38, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:

> Questions
> -> If anyone has experience with downloading Wikipedia dumps, could I get some advice on how to go about it and the best place to get them?

https://en.wikipedia.org/wiki/Wikipedia:Database_download

Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with bittorrent for wikipedia dumps.

Note that you'll probably then need to run each article through a Mediawiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at http://www.mediawiki.org/wiki/Alternative_parsers which includes some python ones, although I haven't used them. (I've used a couple of ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)

J

-- 
James Aylett, occasional trouble-maker
xapian.org
I should have thought about this complexity while writing my proposal :) But thanks for the advice, I'll look into it, probably starting with one of the Python parsers to strip the markup before tokenizing (rough sketch at the bottom of this mail). I think we need the Wikipedia dumps for the stemming; the INEX data set will be much more helpful for all the other tests such as querying and searching.

Regards,
Aarsh

On 14/05/2014 2:32 pm, "James Aylett" <james-xapian at tartarus.org> wrote:

> On 14 May 2014, at 05:38, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:
>
> > Questions
> > -> If anyone has experience with downloading Wikipedia dumps, could I get some advice on how to go about it and the best place to get them?
>
> https://en.wikipedia.org/wiki/Wikipedia:Database_download
>
> Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with bittorrent for wikipedia dumps.
>
> Note that you'll probably then need to run each article through a Mediawiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at http://www.mediawiki.org/wiki/Alternative_parsers which includes some python ones, although I haven't used them. (I've used a couple of ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
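P.S. Here is the kind of thing I have in mind for streaming articles out of a pages-articles XML dump and stripping the markup before tokenization. This is only a rough, untested sketch: it assumes the mwparserfromhell Python library (which I think is one of the Python parsers on that list) plus NLTK, the dump file name is a placeholder, and the export namespace version may differ between dumps.

    # Rough sketch: stream pages out of a pages-articles XML dump and strip
    # MediaWiki markup before tokenization.
    # Assumes: pip install mwparserfromhell nltk; the dump path is a placeholder.
    import bz2
    import xml.etree.ElementTree as ET

    import mwparserfromhell
    from nltk.tokenize import word_tokenize

    DUMP = 'enwiki-latest-pages-articles.xml.bz2'  # placeholder path
    NS = '{http://www.mediawiki.org/xml/export-0.8/}'  # namespace version may differ

    def iter_plain_text(path):
        """Yield (title, plain_text) for each page in the dump."""
        with bz2.BZ2File(path) as f:
            for event, elem in ET.iterparse(f):
                if elem.tag == NS + 'page':
                    title = elem.findtext(NS + 'title')
                    wikitext = elem.findtext('{0}revision/{0}text'.format(NS)) or ''
                    # strip_code() drops templates, links and formatting, keeping the text
                    plain = mwparserfromhell.parse(wikitext).strip_code()
                    yield title, plain
                    elem.clear()  # release the processed page subtree

    for title, plain in iter_plain_text(DUMP):
        tokens = word_tokenize(plain)
        print(title, len(tokens))
        break  # just the first article while experimenting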