Hello, I am beginning work on the perf test module. The initial steps I aim to accomplish are:

-> Download the Wikipedia dumps for multiple languages.
-> Write Python scripts to tokenize the dumps (probably using something like NLTK, which has powerful built-in tokenizers).
-> Discuss and finalize the design of the search and query expansion perf tests, as I want to complete them before working on the indexing perf test.

*Questions*

-> If anyone has experience with downloading Wikipedia dumps, could I get some advice on how to go about it and the best place to get them?
-> For the search and query expansion perf tests, I need a query log based on the test documents I'll be using (the INEX data set, as per the recent discussion with Olly on IRC). Could I get some advice on how to use the INEX data sets and the corresponding query logs?

Regards,
Aarsh
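P.S. For the tokenization step, this is roughly what I have in mind. Just a minimal sketch assuming NLTK with the punkt model downloaded; "articles.txt" is only a placeholder for whatever plain-text output I end up producing from the dumps.

    # Minimal sketch: tokenize plain text extracted from a dump with NLTK.
    # Assumes `pip install nltk` and that the punkt tokenizer data is
    # available (nltk.download('punkt')). "articles.txt" is a placeholder.
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt', quiet=True)

    with open('articles.txt', encoding='utf-8') as f:
        text = f.read()

    tokens = word_tokenize(text)
    print(tokens[:20])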
On 14 May 2014, at 05:38, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:

> Questions
> -> If anyone has experience with downloading Wikipedia dumps, could I get some advice on how to go about it and the best place to get them?

https://en.wikipedia.org/wiki/Wikipedia:Database_download

Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with bittorrent for wikipedia dumps.

Note that you'll probably then need to run each article through a Mediawiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at http://www.mediawiki.org/wiki/Alternative_parsers which includes some python ones, although I haven't used them. (I've used a couple of ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)

J

-- 
James Aylett, occasional trouble-maker
xapian.org
I should have thought about this complexity while writing my proposal :) But thanks for the advice, I'll look into it, probably starting with one of the Python parsers to strip the markup before tokenizing (rough sketch at the bottom of this mail). I think we need the Wikipedia dumps for the stemming; the INEX data set will be much more helpful for all the other tests such as querying and searching.

Regards,
Aarsh

On 14/05/2014 2:32 pm, "James Aylett" <james-xapian at tartarus.org> wrote:

> On 14 May 2014, at 05:38, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:
>
> > Questions
> > -> If anyone has experience with downloading Wikipedia dumps, could I get some advice on how to go about it and the best place to get them?
>
> https://en.wikipedia.org/wiki/Wikipedia:Database_download
>
> Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with bittorrent for wikipedia dumps.
>
> Note that you'll probably then need to run each article through a Mediawiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at http://www.mediawiki.org/wiki/Alternative_parsers which includes some python ones, although I haven't used them. (I've used a couple of ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
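P.S. Here is the kind of thing I have in mind for streaming articles out of a pages-articles XML dump and stripping the markup before tokenization. This is only a rough, untested sketch: it assumes the mwparserfromhell Python library (which I think is one of the Python parsers on that list) plus NLTK, the dump file name is a placeholder, and the export namespace version may differ between dumps.

    # Rough sketch: stream pages out of a pages-articles XML dump and strip
    # MediaWiki markup before tokenization.
    # Assumes: pip install mwparserfromhell nltk; the dump path is a placeholder.
    import bz2
    import xml.etree.ElementTree as ET

    import mwparserfromhell
    from nltk.tokenize import word_tokenize

    DUMP = 'enwiki-latest-pages-articles.xml.bz2'  # placeholder path
    NS = '{http://www.mediawiki.org/xml/export-0.8/}'  # namespace version may differ

    def iter_plain_text(path):
        """Yield (title, plain_text) for each page in the dump."""
        with bz2.BZ2File(path) as f:
            for event, elem in ET.iterparse(f):
                if elem.tag == NS + 'page':
                    title = elem.findtext(NS + 'title')
                    wikitext = elem.findtext('{0}revision/{0}text'.format(NS)) or ''
                    # strip_code() drops templates, links and formatting, keeping the text
                    plain = mwparserfromhell.parse(wikitext).strip_code()
                    yield title, plain
                    elem.clear()  # release the processed page subtree

    for title, plain in iter_plain_text(DUMP):
        tokens = word_tokenize(plain)
        print(title, len(tokens))
        break  # just the first article while experimenting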