James Aylett
2005-Jul-20 18:17 UTC
[james@tartarus.org: Re: [Xapian-discuss] incremental indexing]
Sigh. I can't remember the list address ... J

----- Forwarded message from James Aylett <james@tartarus.org> -----

Date: Wed, 20 Jul 2005 18:05:53 +0100
To: Arshavir Grigorian <ag@m-cam.com>, xapian-discuss@tartarus.org
From: James Aylett <james@tartarus.org>
Subject: Re: [Xapian-discuss] incremental indexing
Mail-Followup-To: Arshavir Grigorian <ag@m-cam.com>, xapian-discuss@tartarus.org

On Wed, Jul 20, 2005 at 10:55:30AM -0400, Arshavir Grigorian wrote:

> I am very new to Xapian and would not have been able to generate the
> dump. So thanks for the script.

Note that it's not the preferred way of looping over all documents, but
it works well enough in this case (a rough sketch of such a loop appears
after the forwarded message).

> I installed the bindings and ran the script after the first command,
> after the second (on a clean database), as well as after running the
> first and second commands in sequence. With a quick diff it looks like
> running the second command both on a clean db and after the first
> command generated the same results.

This would be the case if it's getting rid of any documents created by
the first run (well, the document ids will be different).

> Attached are the results from your script as well as the exact
> commands I used:
>
> omindex --db /var/lib/omega/data/default --url '/pdf' /[path]/ 3915
>
> omindex --db /var/lib/omega/data/default --url '/pdf' /[path]/ 3916
> (after cleaning the db path - rm /var/lib/omega/data/default/*).

Cool, thanks. It wasn't what I thought it might be, but that did push me
to check the source code :-). Basically, there's a bug in omindex
(effectively) that means that incremental operation currently isn't
supported, despite what the overview document says.

I've attached an untested patch which should help here. You have to pass
a new command line switch, -p, to get the behaviour you want (docs
updated in the patch also).

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org

Index: omindex.cc
===================================================================
--- omindex.cc	(revision 6331)
+++ omindex.cc	(working copy)
@@ -2,7 +2,7 @@
  *
  * ----START-LICENCE----
  * Copyright 1999,2000,2001 BrightStation PLC
- * Copyright 2001 James Aylett
+ * Copyright 2001,2005 James Aylett
  * Copyright 2001,2002 Ananova Ltd
  * Copyright 2002,2003,2004,2005 Olly Betts
  *
@@ -579,6 +579,9 @@
     // If overwrite is true, the database will be created anew even if it
     // already exists.
     bool overwrite = false;
+    // If preserve_unupdated is false, delete any documents we don't
+    // replace (if in replace duplicates mode)
+    bool preserve_unupdated = false;
     size_t depth_limit = 0;

     static const struct option longopts[] = {
@@ -586,6 +589,7 @@
 	{ "version",	no_argument,	NULL, 'v' },
 	{ "overwrite",	no_argument,	NULL, 'o' },
 	{ "duplicates",	required_argument,	NULL, 'd' },
+	{ "preserve-nonduplicates",	no_argument,	NULL, 'p' },
 	{ "db",		required_argument,	NULL, 'D' },
 	{ "url",	required_argument,	NULL, 'U' },
 	{ "mime-type",	required_argument,	NULL, 'M' },
@@ -631,7 +635,7 @@
     mime_map["pm"] = "text/x-perl";
     mime_map["pod"] = "text/x-perl";

-    while ((getopt_ret = gnu_getopt_long(argc, argv, "hvd:D:U:M:l", longopts, NULL))!=EOF) {
+    while ((getopt_ret = gnu_getopt_long(argc, argv, "hvd:D:U:M:lp", longopts, NULL))!=EOF) {
 	switch (getopt_ret) {
 	case 'h':
 	    cout << OMINDEX << endl
@@ -639,6 +643,9 @@
 		 << "\t--url BASEURL [BASEDIRECTORY] DIRECTORY\n\n"
 		 << "Index static website data via the filesystem.\n"
 		 << "  -d, --duplicates\tset duplicate handling ('ignore' or 'replace')\n"
+		 << "  -p, --preserve-nonduplicates\n"
+		    "\t\t\tdon't delete unupdated documents in\n"
+		    "\t\t\tduplicate replace mode\n"
 		 << "  -D, --db\t\tpath to database to use\n"
 		 << "  -U, --url\t\tbase url DIRECTORY represents\n"
 		 << "  -M, --mime-type\tadditional MIME mapping ext:type\n"
@@ -654,7 +661,7 @@
 	case 'v':
 	    cout << OMINDEX << " (" << PACKAGE << ") " << VERSION << "\n"
 		 << "Copyright (c) 1999,2000,2001 BrightStation PLC.\n"
-		 << "Copyright (c) 2001 James Aylett\n"
+		 << "Copyright (c) 2001,2005 James Aylett\n"
 		 << "Copyright (c) 2001,2002 Ananova Ltd\n"
 		 << "Copyright (c) 2002,2003,2004,2005 Olly Betts\n\n"
 		 << "This is free software, and may be redistributed under\n"
@@ -670,6 +677,9 @@
 		break;
 	    }
 	    break;
+	case 'p': // don't delete unupdated documents
+	    preserve_unupdated = true;
+	    break;
 	case 'l': { // Set recursion limit
 	    int arg = atoi(optarg);
 	    if (arg < 0) arg = 0;
@@ -757,7 +767,7 @@
 	db = Xapian::WritableDatabase(dbpath, Xapian::DB_CREATE_OR_OVERWRITE);
     }
     index_directory(depth_limit, "/", mime_map);
-    if (!skip_duplicates) {
+    if (!skip_duplicates && !preserve_unupdated) {
 	for (Xapian::docid did = 1; did < updated.size(); ++did) {
 	    if (!updated[did]) {
 		try {

Index: docs/overview.txt
===================================================================
--- docs/overview.txt	(revision 6331)
+++ docs/overview.txt	(working copy)
@@ -117,6 +117,9 @@
 Index static website data via the filesystem.
   -d, --duplicates	set duplicate handling ('ignore' or 'replace')
+  -p, --preserve-nonduplicates
+			don't delete documents not updated during
+			replace duplicate operation
   -D, --db		path to database to use
   -U, --url		base url DIRECTORY represents
   -M, --mime-type	additional MIME mapping ext:type
@@ -154,17 +157,17 @@
 passes, one for the '/press' site and one for the '/product' site.
 You might use the following commands:

-$ omindex --db /www/omega --url '/press' /www/example/press
-$ omindex --db /www/omega --url '/product' /www/example/product
+$ omindex -p --db /www/omega --url '/press' /www/example/press
+$ omindex -p --db /www/omega --url '/product' /www/example/product

 If you add a new large product, but don't want to reindex the whole of
 the product section, you could do:

-$ omindex --db /www/omega --url '/product' /www/example/product large
+$ omindex -p --db /www/omega --url '/product' /www/example/product large

 and just the large products will be reindexed.
 You need to do it like that, and not as:

-$ omindex --db /www/omega --url '/product/large' /www/example/product/large
+$ omindex -p --db /www/omega --url '/product/large' /www/example/product/large

 because that would make the large products part of a new site,
 '/product/large', which is unlikely to be what you want, as large
@@ -228,6 +231,17 @@
 completely static documents (eg: archive sites), while 'replace' is
 the most generally useful.

+With 'replace', omindex will remove any document it finds in the
+database that it did not update - in other words, it will clear out
+everything that doesn't exist any more. However, if you are building up
+an omega database with several runs of omindex, this is not
+appropriate (as each run would delete the data from the previous run),
+so you should use the --preserve-nonduplicates option. Note that if you
+choose to work like this, it is impossible to prune old documents from
+the database using omindex. If this is a problem for you, an
+alternative is to index each subsite into a different database, and
+merge all the databases together when searching.
+
 --depth-limit allows you to prevent omindex from descending more than
 a certain number of directories. If you wish to replicate the old
 --no-recurse option, use --depth-limit=1.

----- End forwarded message -----

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
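For reference, the "loop over all documents" script discussed above is not
reproduced in the thread. A minimal sketch of that kind of dump, assuming the
Python bindings and reusing the database path from the commands quoted above,
could look like the following; walking docids directly is exactly the "not
preferred" approach mentioned, but it works fine for a quick dump:

    import xapian

    # Example path taken from the commands quoted above; adjust as needed.
    db = xapian.Database("/var/lib/omega/data/default")

    # Walk every possible docid up to the highest one ever allocated.
    # Not every id necessarily exists, hence the DocNotFoundError handling.
    for did in range(1, db.get_lastdocid() + 1):
        try:
            doc = db.get_document(did)
        except xapian.DocNotFoundError:
            continue
        print(did, doc.get_data())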
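The patch's overview.txt addition also suggests an alternative to
--preserve-nonduplicates: index each subsite into its own database and combine
the databases at search time. A rough sketch of that combination with the
Python bindings (the paths and the query string below are purely illustrative)
might be:

    import xapian

    # Hypothetical per-subsite databases built by separate omindex runs.
    db = xapian.Database("/www/omega-press")
    db.add_database(xapian.Database("/www/omega-product"))

    # Search the combined view exactly as if it were a single database.
    enquire = xapian.Enquire(db)
    qp = xapian.QueryParser()
    qp.set_database(db)
    enquire.set_query(qp.parse_query("widget"))

    for match in enquire.get_mset(0, 10):
        print(match.rank + 1, match.document.get_data())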