Dear Xapian developers,

I am using Xapian to index more than 13 million documents from a Q&A site (similar to Quora). I'd like to share some performance data for indexing and searching, and to ask for help with improving search performance.

My computer has an 8-CPU i7 at 3.4GHz and 16GB of memory, and runs Ubuntu 16.04. The dataset contains about 13M documents, and each document is split into 35 terms (Chinese words) on average.

I used a split-merge approach: I built separate indexes of 500K documents each and then merged them into a single database. Building each of the smaller databases took 2 min 40 s on average. Compacting them took about 2 hr 12 min.

I found that the first query after booting the computer, and occasionally a query at other times, takes several seconds (around 5 seconds) against this database (I use QueryParser), although a query takes 0.8 seconds on average. What is the reason for this? How can I debug it, that is, where can I add log lines to measure the time? And what should I try if I want to shorten the query time?

BTW, I think splitting into more databases and querying them in parallel is not a good idea, unless Xapian can ensure each query finishes within an expected time (actually this 13M database is 'small', :P).

-- 
One of my most productive days was throwing away 1000 lines of code.
On Mon, Jul 02, 2018 at 06:08:40PM +0800, morefreeze wrote:
> I found every first time(like after booting computer) or
> sometime(occasional) to query(use QueryParse) this databases will cost
> significant seconds (like 5 seconds), although it cost 0.8 seconds on
> average. What is the reason of this?

If you've just rebooted, none of the database will be cached, so
everything has to be fetched from disk and that takes more time.

The second query will be faster even if it's for entirely different
terms, because at least the root blocks will be read from cache.
And pretty quickly the cache ends up with all the frequently read
blocks.

This can also happen without a reboot if another process reads a lot
of data which ends up in cache instead of the database blocks. If
the machine has cronjobs making backups, update the db used by the
"locate" tool, or doing other things which read a lot of files, you
might want to consider carefully when they run, or run them under
something which minimises cache effects such as "nocache".

> If I want to shorten this query time what should I do or try? BTW, I
> think splitting more databases and query them parallelly is not a good
> idea, unless xapian ensure each query is less than a expected
> time(Actually this 13M database is 'small', :P).

I'd think searching more databases would if anything make this "cold
cache" effect worse.

You don't say what version you're using, but make sure it's a recent
Xapian 1.4.x and that you're using the glass backend. If you're still
using 1.2.x, or 1.4.x with chert databases then switching to 1.4.x+glass
is likely to help.

You can warm the cache usefully just by running a few queries (if
you make them for commonly searched terms that will be more effective).
So if you have a cluster of search machines and want to add a new
member to it, you can automate running a few "warm up" queries after
spinning up the new instance but before actually adding it to the
cluster.
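The warm-up step described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `warm_up` and `make_xapian_search` are hypothetical helper names, the query list is whatever terms are commonly searched, and the Xapian part assumes the Python bindings are installed. Timing each call with `time.perf_counter()` also gives a simple log line that makes slow cold-cache queries visible.

```python
import time


def warm_up(search, queries, log=print):
    """Run each query once through `search` and log how long it took.

    `search` is any callable taking a query string; it stands in for
    whatever function the application uses to query the database.
    Returns a list of (query, seconds) pairs.
    """
    timings = []
    for text in queries:
        start = time.perf_counter()
        search(text)
        elapsed = time.perf_counter() - start
        timings.append((text, elapsed))
        log(f"warm-up query {text!r} took {elapsed:.3f}s")
    return timings


def make_xapian_search(db_path, limit=10):
    """Build a search callable backed by a Xapian database.

    Assumes the Xapian Python bindings are installed; `db_path` and
    `limit` are illustrative parameters, not taken from the thread.
    """
    import xapian  # imported lazily so warm_up() works without it

    db = xapian.Database(db_path)
    parser = xapian.QueryParser()
    parser.set_database(db)

    def search(text):
        enquire = xapian.Enquire(db)
        enquire.set_query(parser.parse_query(text))
        return enquire.get_mset(0, limit)

    return search
```

A start-up script might call `warm_up(make_xapian_search("/path/to/db"), common_terms)` before the machine joins the cluster; the path and the term list here are placeholders.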
1.4.x will issue prefetch hints if posix_fadvise() is available, which
helps when the cache is cold. These are done automatically for
postlists, but you can call MSet::fetch() to issue prefetch hints for
fetching document data. This ticket is about the prefetching changes:

https://trac.xapian.org/ticket/671

If you want to profile what database blocks are being read, then the
strace-analyse script may be useful:

https://trac.xapian.org/browser/git/xapian-maintainer-tools/profiling/strace-analyse

See the comments in the script for how to use it.

Cheers,
    Olly
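To illustrate what a prefetch hint is, here is a small Python sketch which issues posix_fadvise() with POSIX_FADV_WILLNEED on an ordinary file. It is not Xapian's internal code: `prefetch_file` is a hypothetical helper, and the hint can only be issued on platforms which provide posix_fadvise().

```python
import os


def prefetch_file(path):
    """Ask the kernel to start reading `path` into the page cache.

    Returns True if the hint was issued, False if this platform has
    no posix_fadvise(). The hint is advisory: the kernel may ignore it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 means "from the start to end of file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True
```

The call returns straight away and any reading happens in the background, which is why such hints help most when the cache is cold. For document data, the equivalent step in application code would be calling fetch() on the MSet before reading each document, as described above.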
Awesome, thanks!

I'm using Xapian 1.4.5, and congratulations on the 1.4.6 release. I'm reading the links you gave me, and I'll start another thread if I get stuck.

On Tue, Jul 3, 2018 at 2:21 PM Olly Betts <olly at survex.com> wrote:
> [...]