On Tue, Apr 24, 2007 at 09:55:13AM +0000, iX Gamerz wrote:
> 1) I use omindex successfully with some options, like this:
>
> omindex --db /var/lib/xapian-omega/data/pdftagged/ --url /pdftagged
> /var/www/xapian/pdftagged_list/
>
> Is it possible to index only the recently copied new files, without
> reindexing everything from the beginning?
--duplicates ignore
should do what you want, provided you never update existing files: it
makes omindex ignore anything that's already in the database, so
modified files won't be reindexed either. That may not be quite what
you want, however.
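For example, reusing your command from above with the extra option added:

omindex --duplicates ignore --db /var/lib/xapian-omega/data/pdftagged/ \
    --url /pdftagged /var/www/xapian/pdftagged_list/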
> 2) These files are copied into different folders where old files were
> already indexed.
>
> Is it possible to reindex only some of the folders?
>
> I can use a MySQL database to keep track of the newly added files, and I
> can keep all the recently modified locations, but I don't understand how
> to use this information to index only a small part of the global database,
> to keep the index up to date as quickly as possible...
Currently there isn't a way of doing this. What we need is a small
change to omindex so it can take a list of files to index/reindex.
It's actually quite easy; there'd be two steps (both sketched below):
(1) change the command line parameters to accept DIRECTORY... rather
    than a single DIRECTORY
(2) go indirectly through index_fs_object() instead of index_directory(),
    which can stat each path first (but only at the top level, so costing
    us almost nothing in the current usage)
I don't have time to do this myself right now, but (1) is a change to
the test at omindex.cc:793 followed by making omindex.cc:825 into a
loop; (2) is changing the omindex.cc:825 call (which will be a little
later in the file by then) into a call to something like this
(completely untested; some refactoring would be needed, and I might
have got some details completely wrong :-):
----------------------------------------------------------------------
static void
index_fs_object(size_t depth_limit, const string &path,
                map<string, string> &mime_map)
{
    // Stat the object so we can tell directories from regular files.
    struct stat st;
    string file = root + indexroot + path;
    if (stat(file.c_str(), &st)) {
        cout << "Could not work with " << path << ", skipping." << endl;
        return;
    }
    if (S_ISDIR(st.st_mode)) {
        // Directories are walked recursively, as before.
        index_directory(depth_limit, path, mime_map);
    } else if (S_ISREG(st.st_mode)) {
        // Regular files are indexed directly, based on their extension.
        string ext;
        string::size_type dot = path.find_last_of('.');
        if (dot != string::npos) ext = path.substr(dot + 1);

        map<string, string>::iterator mt = mime_map.find(ext);
        if (mt != mime_map.end()) {
            // It's in our MIME map so we know how to index it.
            const string &mimetype = mt->second;
            try {
                index_file(indexroot + path, mimetype, st.st_mtime,
                           st.st_size);
            } catch (NoSuchFilter) {
                // FIXME: we ought to ignore by mime-type not extension.
                cout << "Filter for \"" << mimetype
                     << "\" not installed - ignoring extension \""
                     << ext << "\"" << endl;
                mime_map.erase(mt);
            }
        }
    }
}
----------------------------------------------------------------------
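For completeness, here's an equally untested sketch of the shape of (1).
Names other than the ones in the snippet above (depth_limit, mime_map,
index_fs_object) are placeholders, and I'm assuming the usual
getopt-style handling so the non-option arguments start at optind:
----------------------------------------------------------------------
    // Sketch of (1): accept FILE_OR_DIRECTORY... and loop, calling
    // index_fs_object() once per argument.  How each argument is split
    // into root / indexroot / path is glossed over here.
    for (int i = optind; i < argc; ++i) {
        index_fs_object(depth_limit, argv[i], mime_map);
    }
----------------------------------------------------------------------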
You'd need to use the extended base URI / base directory syntax, but I
think everyone should do that anyway, because it stops people thinking
that URIs and files are the same thing ;-)
Alternatively you could run pdf2txt yourself directly and feed the
output to scriptindex, but I suspect that's more work than is sensible
in your case.
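If you did want to go that way, the rough shape would be: extract the
text yourself, write it out as blank-line-separated records of
field=value lines (url=..., title=..., text=..., say), and give
scriptindex an index script telling it what to do with each field. The
field names and term prefixes below are only illustrative (scriptindex
doesn't impose any), and pdf.script / dump.txt are placeholder names:
----------------------------------------------------------------------
url : field boolean=Q unique=Q
title : field weight=3 index
text : index truncate=200 field=sample
----------------------------------------------------------------------
scriptindex /var/lib/xapian-omega/data/pdftagged/ pdf.script dump.txt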
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james@tartarus.org uncertaintydivision.org