thr3ads.net - Xapian devel - Is there a large variance in xapian searching? [Jul 2018]

morefreeze

2018-Jul-02 10:08 UTC

Is there a large variance in xapian searching?

Dear Xapian developers,

I am using Xapian to index more than 13 million question-and-answer
documents (similar to Quora). I will share some performance data about
indexing and searching, and I would like some help improving search
performance.

My machine has 8 i7 cores at 3.4 GHz and 16 GB of memory, running
Ubuntu 16.04. The dataset contains about 13M documents, and each
document is segmented into 35 terms (Chinese words) on average.

I used a split-and-merge approach: I built a series of smaller
databases, each containing 500K documents, and then merged them into a
single database. Building each smaller database took 2 min 40 s on
average. Compacting them took about 2 hr 12 min.


I found that the first query after booting the computer, and
occasionally a query at other times, takes several seconds (around 5
seconds) against this database (I use QueryParser), although a query
takes 0.8 seconds on average. What is the reason for this? And how can
I debug it? That is, where could I add log lines to measure where the
time goes?

If I want to shorten this query time, what should I do or try? By the
way, I don't think splitting the data into more databases and querying
them in parallel is a good idea, unless Xapian can ensure that each
query finishes within an expected time (actually this 13M database is
'small', :P).


-- 
One of my most productive days was throwing away 1000 lines of code.

Olly Betts

2018-Jul-03 06:21 UTC


On Mon, Jul 02, 2018 at 06:08:40PM +0800, morefreeze wrote:
> I found every first time(like after booting computer) or
> sometime(occasional) to query(use QueryParse) this databases will cost
> significant seconds (like 5 seconds), although it cost 0.8 seconds on
> average. What is the reason of this?

If you've just rebooted, none of the database will be cached, so
everything has to be fetched from disk and that takes more time.

The second query will be faster even if it's for entirely different
terms, because at least the root blocks will be read from cache.
And pretty quickly the cache ends up with all the frequently read
blocks.
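
To see where the time goes on a cold versus a warm cache, one option is to wrap each phase of a search in a timer and log the result. The following is only a minimal sketch using the Python standard library: `timed`, `timings` and `run_search` are illustrative names, and `parse` and `match` are hypothetical callables standing in for your own code (for example wrappers around QueryParser's parse_query() and Enquire's get_mset()).

```python
import time
from contextlib import contextmanager

# (label, seconds) pairs, appended in the order the timed blocks finish.
timings = []


@contextmanager
def timed(label):
    """Record how long the enclosed block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((label, time.perf_counter() - start))


def run_search(parse, match, query_string):
    """Time the parse and match phases of one search separately.

    `parse` and `match` are hypothetical stand-ins for your own code.
    """
    with timed("parse"):
        query = parse(query_string)
    with timed("match"):
        return match(query)
```

Running the same query right after a reboot and then again a few times, and comparing the recorded "match" times, should show whether the slow cases line up with a cold cache.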

This can also happen without a reboot if another process reads a lot
of data which ends up in cache instead of the database blocks.  If
the machine has cronjobs making backups, updating the db used by the
"locate" tool, or doing other things which read a lot of files, you
might want to consider carefully when they run, or run them under
something which minimises cache effects such as "nocache".

> If I want to shorten this query time what should I do or try? BTW, I
> think splitting more databases and query them parallelly is not a good
> idea, unless xapian ensure each query is less than a expected
> time(Actually this 13M database is 'small', :P).

I'd think searching more databases would if anything make this "cold
cache" effect worse.

You don't say what version you're using, but make sure it's a recent
Xapian 1.4.x and that you're using the glass backend.  If you're still
using 1.2.x, or 1.4.x with chert databases then switching to 1.4.x+glass
is likely to help.

You can warm the cache usefully just by running a few queries (if
you make them for commonly searched terms that will be more effective).
So if you have a cluster of search machines and want to add a new
member to it, you can automate running a few "warm up" queries after
spinning up the new instance but before actually adding it to the
cluster.
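
A warm-up step along these lines could be sketched as below. This is an assumption-laden illustration rather than anything Xapian provides: `warm_up` is a made-up helper, `search` is a hypothetical callable wrapping your own query code, and `max_seconds` is an arbitrary cap so that a slow disk does not hold up startup indefinitely.

```python
import time


def warm_up(search, terms, max_seconds=60.0):
    """Run one query per term so the blocks it touches get read into cache.

    `search` is a hypothetical callable wrapping your own query code and
    `terms` should be commonly searched terms. Returns a list of
    (term, seconds) pairs for the queries that were actually run.
    """
    results = []
    deadline = time.monotonic() + max_seconds
    for term in terms:
        if time.monotonic() >= deadline:
            break  # time budget used up; stop warming
        start = time.perf_counter()
        search(term)
        results.append((term, time.perf_counter() - start))
    return results
```

The returned timings give a rough signal of whether the cache is warm yet: if the last few queries are still taking seconds rather than fractions of a second, it may be worth running more before adding the instance to the cluster.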

1.4.x will issue prefetch hints if posix_fadvise() is available, which
helps when the cache is cold.  These are done automatically for
postlists, but you can call MSet::fetch() to issue prefetch hints for
fetching document data.  This ticket is about the prefetching changes:

https://trac.xapian.org/ticket/671
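
For anyone unfamiliar with what a prefetch hint is at the OS level, here is a small standalone sketch of the posix_fadvise() call itself, using Python's os module. It is not Xapian's code: as far as this example is concerned, hinting a whole file is a simplification chosen for illustration, and `hint_willneed` is a made-up helper name.

```python
import os


def hint_willneed(path):
    """Ask the kernel to start reading `path` into the page cache.

    Returns True if the hint was issued, or False on platforms where
    posix_fadvise() is not available. The call is only advisory: it
    returns without waiting for the data to be read.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True
```

The point of such a hint is that the read can proceed in the background while the program does other work, so a later read of the same data is less likely to block on disk.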

If you want to profile what database blocks are being read, then the
strace-analyse script may be useful:

https://trac.xapian.org/browser/git/xapian-maintainer-tools/profiling/strace-analyse

See the comments in the script for how to use it.

Cheers,
    Olly

morefreeze

2018-Jul-03 07:15 UTC


Awesome, thanks!
I use Xapian 1.4.5, and congratulations on the release of 1.4.6. I am
reading the links you gave me. I will start another thread if I get stuck.


-- 
One of my most productive days was throwing away 1000 lines of code.
