YuLun Cai
2017-Apr-20 17:52 UTC
Question about the ticket #743 omindex: delay libmagic checks
Hi,

I'm working on ticket #743, "omindex: delay libmagic checks" <https://trac.xapian.org/ticket/743>. As the ticket's description mentions, calling libmagic is more expensive than calling stat(), so we can check the file size via stat() before calling libmagic to get the MIME type.

But what about the timestamp check? It has to consult the database to see whether the file has already been indexed and hasn't changed (in the `index_check_existing` function in omega/index_file.cc), so it is expensive too. Should we call libmagic before or after the timestamp check, or is there another way to check the timestamps?

Also, how should we write tests to show that omindex works correctly? Should we generate some realistic directories, index them with omindex, and then check what ends up in the database?

Thanks.
Olly Betts
2017-Apr-21 05:37 UTC
Question about the ticket #743 omindex: delay libmagic checks
On Fri, Apr 21, 2017 at 01:52:38AM +0800, YuLun Cai wrote:
> I'm working on the ticket #743 omindex: delay libmagic checks
> <https://trac.xapian.org/ticket/743>. As the ticket's
> Description mention, the call to libmagic is expensive than call the stat,
> so we can check the size by call the stat to get size before call
> libmagic to get a mime type.

Yes.

> But how about the timestamps check? since timestamps check need to iterate
> the DB to check if the file has been indexed and hasn't changed(in
> `index_check_existing` function in omega\index_file.cc), so it is expensive
> too. Should we call the libmagic before or after the timestamps, or do we
> have another way to check the timestamps?

We also have an upper bound on the newest timestamp in the database at the start of the run, so we can often avoid this check for new files (at least if they were created since the end of the previous index run).

But that just quickly tells us "yes" for such files (at least on the basis of timestamp) so we'd need to check them with libmagic anyway. To get a "no" based on timestamp we need to check against the database.

I'd suggest to start with you just look at moving the libmagic check after the filesize checks, so you don't need to get into whether libmagic or the database check is cheaper on average.

> What's more, how should we write tests to prove the omindex works
> correctly, to generate some practical directories and use omindex to index
> it then check the things in DB?

We don't (sadly) have any tests of omindex behaviour currently, but having some would be great.

You'd need to work out what cases you're aiming to test and then script up suitable changes to the directory between the omindex runs.

Cheers,
Olly
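To make the ordering discussed above concrete, here is a minimal C++ sketch. It is not omindex's actual code: `FileInfo`, `Decision`, `check_file` and `needs_db_lookup` are hypothetical names, the MIME detector is passed in as a callback standing in for libmagic, and treating the size checks as "empty" and "over a size limit" (with 0 meaning no limit) is an assumption made for illustration.

```cpp
// Hypothetical sketch: none of these names come from omindex itself.
#include <cassert>
#include <cstdint>
#include <ctime>
#include <functional>
#include <string>

enum class Decision { SkipEmpty, SkipTooBig, Proceed };

// The fields we would already have from a single stat() call.
struct FileInfo {
    std::uint64_t size;
    std::time_t mtime;
};

// Run the cheap size checks first; only call the expensive MIME detector
// (libmagic in omindex's case) for files that survive them.
Decision check_file(const FileInfo& info,
                    std::uint64_t max_size,  // assumed convention: 0 = no limit
                    const std::function<std::string()>& detect_mime,
                    std::string& mime_out)
{
    assert(detect_mime);  // caller must supply a detector
    if (info.size == 0) return Decision::SkipEmpty;
    if (max_size != 0 && info.size > max_size) return Decision::SkipTooBig;
    mime_out = detect_mime();  // expensive call, reached only when needed
    return Decision::Proceed;
}

// The upper bound on the newest timestamp in the database can only answer
// "yes, this file is newer than anything indexed". A file at or below the
// bound still needs the per-file database lookup to get a "no".
bool needs_db_lookup(std::time_t mtime, std::time_t newest_in_db)
{
    return mtime <= newest_in_db;
}
```

With this shape, empty and oversized files never reach the detector at all, which is the saving the ticket is after. Whether the database lookup should come before or after the detector for the remaining files is deliberately left open here, as in the reply above.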
YuLun Cai
2017-Apr-23 16:22 UTC
Question about the ticket #743 omindex: delay libmagic checks
> I'd suggest to start with you just look at moving the libmagic check after
> the filesize checks, so you don't need to get into whether libmagic or
> the database check is cheaper on average.

Hi Olly,

I have moved the libmagic check after the filesize check directly:

https://github.com/caiyulun/xapian/commit/3a97d9ee5397fa900a473aa9b3d8eeb720177a4e

Could you comment on it and give some advice about the next steps? I think it is hard to say which is cheaper, the libmagic check or the database check.

Thanks
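One way to read the suggestion to "script up suitable changes to the directory between the omindex runs" is as a list of steps, each of which mutates a scratch directory and is followed by an indexing run. The C++17 sketch below is only an illustration under that assumption: `Step`, `write_file`, `run_scenario` and `example_steps` are hypothetical names, and `run_indexer` is a callback standing in for whatever would actually invoke omindex and inspect the resulting database.

```cpp
// Hypothetical test-harness sketch (C++17); not part of Xapian.
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// One step of a scenario: a named change to apply to the directory.
struct Step {
    std::string name;
    std::function<void(const fs::path&)> mutate;
};

// Create or overwrite a file with the given content.
void write_file(const fs::path& p, const std::string& content)
{
    std::ofstream out(p, std::ios::binary | std::ios::trunc);
    out << content;
}

// Start from an empty directory, then apply each step in order and call
// run_indexer after every change. Returns the number of indexing runs.
std::size_t run_scenario(
    const fs::path& root,
    const std::vector<Step>& steps,
    const std::function<void(const fs::path&, const std::string&)>& run_indexer)
{
    assert(run_indexer);  // caller must supply an indexer stand-in
    fs::remove_all(root);
    fs::create_directories(root);
    std::size_t runs = 0;
    for (const Step& step : steps) {
        step.mutate(root);
        run_indexer(root, step.name);
        ++runs;
    }
    return runs;
}

// Example cases: a new file, a changed file, an empty file, a deleted file.
std::vector<Step> example_steps()
{
    return {
        {"add",    [](const fs::path& d) { write_file(d / "a.txt", "hello"); }},
        {"modify", [](const fs::path& d) { write_file(d / "a.txt", "hello again"); }},
        {"empty",  [](const fs::path& d) { write_file(d / "empty.txt", ""); }},
        {"delete", [](const fs::path& d) { fs::remove(d / "a.txt"); }},
    };
}
```

A real test would probably also need to control modification times explicitly (for example with `std::filesystem::last_write_time`), since two writes within the same second may leave the timestamp unchanged and so would not exercise the timestamp check.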