thr3ads.net - Xapian discuss - [Xapian-discuss] performance on document.get_data()

Tong Liu

2013-Oct-23 05:30 UTC

[Xapian-discuss] performance on document.get_data()

I have a performance problem with document.get_data() and
enquire.get_mset(). The call enquire.get_mset(0,200) takes 35 seconds, and
iterating over all the documents in the matches to call get_data() takes
3 seconds. Is that normal? My index contains 30 million documents. I use
the Python bindings for Xapian. Below is my index structure:
# value: 0:date, 1:site
# data: JSON message which contains: author, url, message (30 words)


Do you have any ideas for improving search performance, especially
doc.get_data()?

My code snippet:

database = xapian.Database("%s/athena" % DATA_PATH)
enquire = xapian.Enquire(database)
enquire.set_weighting_scheme(xapian.BM25Weight())
query = parse(keywords)
enquire.set_query(query)
matches = enquire.get_mset(start, 200)
matches.fetch()
result = [json.loads(match.document.get_data()) for match in matches]

Olly Betts

2013-Oct-30 00:02 UTC

On Wed, Oct 23, 2013 at 01:30:51PM +0800, Tong Liu wrote:
> I have a performance problem with document.get_data() and
> enquire.get_mset(). The call enquire.get_mset(0,200) takes 35 seconds, and
> iterating over all the documents in the matches to call get_data() takes
> 3 seconds. Is that normal? My index contains 30 million documents. I use
> the Python bindings for Xapian. Below is my index structure:
> # value: 0:date, 1:site
> # data: JSON message which contains: author, url, message (30 words)

That sounds much slower than I'd expect.  Is that the cold cache time?
If so, does rerunning the same query take much less time?

> Do you have any ideas for improving search performance, especially
> doc.get_data()?
>
> My code snippet:
>
> database = xapian.Database("%s/athena" % DATA_PATH)
> enquire = xapian.Enquire(database)
> enquire.set_weighting_scheme(xapian.BM25Weight())
> query = parse(keywords)

What are you passing in for keywords here?

> enquire.set_query(query)
> matches = enquire.get_mset(start, 200)

Is start 0 here?

> matches.fetch()

With a local database, it probably won't help to call fetch().

> result = [json.loads(match.document.get_data()) for match in matches]

So your time includes parsing the JSON - try changing that to this to
focus on the time actually taken by Xapian and its python bindings:

  result = [match.document.get_data() for match in matches]

Also, what Xapian version are you using?

Cheers,
    Olly
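
Editor's sketch: one way to follow the suggestion above is to time each stage on its own, so the cost of get_mset(), get_data() and json.loads() can be compared. This is a minimal illustration and not code from the thread. The `timed` helper is a hypothetical name, and the `raw` list is made-up stand-in data so the example runs without a Xapian database.

```python
import json
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0


# Stand-in for the strings that match.document.get_data() would return.
# With a real database the first two stages would look something like:
#   matches, t_mset = timed(enquire.get_mset, 0, 200)
#   raw, t_data = timed(lambda: [m.document.get_data() for m in matches])
raw = ['{"author": "a", "url": "http://example.com", "message": "hi"}'] * 200

# Time only the JSON decoding stage.
docs, t_json = timed(lambda: [json.loads(s) for s in raw])
print("json.loads: %.6f s for %d documents" % (t_json, len(docs)))
```

Printing the three timings side by side should show which stage dominates before any tuning is attempted.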

Tong Liu

2013-Oct-30 00:58 UTC

1. > That sounds much slower than I'd expect.  Is that the cold cache time?
   > If so, does rerunning the same query take much less time?

Yes, that is the cold cache time. Running the same query again is faster.

2. > > query = parse(keywords)
   > What are you passing in for keywords here?

Keywords such as BMW, iPhone5s...

3. > > matches = enquire.get_mset(start, 200)
   > Is start 0 here?

Yes, start=0.

4. > So your time includes parsing the JSON - try changing that to this to
   > focus on the time actually taken by Xapian and its python bindings:
   >
   >   result = [match.document.get_data() for match in matches]

I have tried running without decoding the JSON, but it makes no difference.

Xapian version is 1.2.15.
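
Editor's sketch: the cold versus warm cache comparison discussed above can be made by running the same query several times in one process and recording each run separately. The code below is a hedged illustration, not something posted in the thread. `time_repeats` and `run_query` are hypothetical names, and the body of `run_query` is a placeholder computation standing in for the real search call.

```python
import time


def time_repeats(fn, repeats=3):
    """Run fn `repeats` times and return a list of elapsed seconds per run."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - t0)
    return timings


def run_query():
    # Placeholder work so the example runs anywhere; with Xapian this
    # would be a call such as enquire.get_mset(0, 200).
    sum(range(100000))


timings = time_repeats(run_query, repeats=3)
for i, t in enumerate(timings):
    label = "first run" if i == 0 else "repeat %d" % i
    print("%s: %.6f s" % (label, t))
```

If the first run is far slower than the repeats, much of the time is probably spent reading the database from disk rather than in the match itself. The first run only counts as cold if the operating system had not already cached the database files.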

