Hi all,

I am trying to implement an incremental indexing scheme. The problem is that the modified documents are usually large, but the modifications are limited. Ideally, I would like to reindex only the modified parts of these documents. If I am not mistaken, Xapian cannot do that. Are there any other approaches?

It would be nice if Xapian supported something like the SQL "group by". If it did, it would be possible to break large documents into several pieces that could be indexed independently. At query time, these pieces would then be combined again using an aggregate function similar to the SQL function sum.

Thanks
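
One way to picture the "reindex only the modified parts" idea is to split each document into pieces, fingerprint every piece, and compare fingerprints between versions. The following is only a minimal sketch under stated assumptions: splitting on blank lines, the SHA-256 fingerprint, and the function names are all illustrative choices, not anything Xapian provides.

```python
import hashlib


def split_into_pieces(text):
    """Split a document into pieces on blank lines (an assumed chunking rule)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def piece_digests(text):
    """Map the SHA-256 digest of each piece to the piece text."""
    return {hashlib.sha256(p.encode("utf-8")).hexdigest(): p
            for p in split_into_pieces(text)}


def plan_reindex(old_text, new_text):
    """Compare two versions; return (pieces to index, pieces to delete)."""
    old = piece_digests(old_text)
    new = piece_digests(new_text)
    # Pieces whose digest is new must be indexed.
    to_add = [new[d] for d in new if d not in old]
    # Pieces whose digest disappeared must be removed from the index.
    to_delete = [old[d] for d in old if d not in new]
    return to_add, to_delete


if __name__ == "__main__":
    old = "alpha\n\nbeta\n\ngamma"
    new = "alpha\n\nbeta changed\n\ngamma"
    print(plan_reindex(old, new))
```

Only the changed piece shows up in either list, so the untouched pieces would never be re-tokenised. Identical pieces share one digest in this sketch, which a real scheme would need to handle.
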
Marios Titas writes:
> Hi all,
>
> I am trying to implement an Incremental indexing scheme. The problem
> is that usually the modified documents are large but the modifications
> are limited. Ideally, I would like to reindex only the modified parts
> of these documents. If I am not mistaken, xapian cannot do that. Are
> there any other approaches?
>
> It would be nice if xapian supported something like the SQL "group
> by". If it did, then it would be possible to break large documents
> into several pieces which could be indexed independently. When
> querying, these pieces would be then combined again using some
> aggregate function similar to the SQL function sum.

Hi,

The Recoll Xapian-based desktop indexer implements the "break into pieces" part for big text files. This is done so that the appropriate section of the document can be loaded for previewing (useful for, e.g., big log files). It doesn't implement independent incremental re-indexing, though, because it has no way to know which parts may have changed.

The document parts are linked by a common parent identifier, which can be used to get to the whole document. There is both an entry in the document data record, used to get to the parent of a result document, and a "parent" unique term for each part, used to find all the parts of a given parent document (useful for deleting, for example).

In Recoll, this is just one use of the general mechanism for describing document embedding, and it is a bit complicated. I imagine that this could be implemented in different ways.

Cheers,
J.F. Dockes
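
The two links described above (a parent entry in each part's data record, plus a shared "parent" term on every part) can be sketched with a toy in-memory index. This is only an illustration of the idea: the `PieceIndex` class, its method names, and the `"P"` term prefix are hypothetical and are neither the Recoll implementation nor the Xapian API.

```python
from collections import defaultdict


class PieceIndex:
    """Toy in-memory stand-in for an index (hypothetical, not Xapian)."""

    def __init__(self):
        self.data = {}                   # piece id -> data record
        self.by_term = defaultdict(set)  # term -> set of piece ids
        self._next_id = 1

    def add_piece(self, parent_id, text):
        piece_id = self._next_id
        self._next_id += 1
        # Data record entry: lets a result piece lead back to its parent.
        self.data[piece_id] = {"parent": parent_id, "text": text}
        # Shared "parent" term: lets us find every piece of one parent.
        self.by_term["P" + parent_id].add(piece_id)
        return piece_id

    def parent_of(self, piece_id):
        return self.data[piece_id]["parent"]

    def pieces_of(self, parent_id):
        return sorted(self.by_term.get("P" + parent_id, ()))

    def delete_parent(self, parent_id):
        # Remove every piece carrying the parent term, then the term itself.
        for piece_id in self.by_term.pop("P" + parent_id, set()):
            del self.data[piece_id]


if __name__ == "__main__":
    idx = PieceIndex()
    idx.add_piece("big.log", "first part")
    idx.add_piece("big.log", "second part")
    print(idx.pieces_of("big.log"))
    idx.delete_parent("big.log")
    print(idx.pieces_of("big.log"))
```

The data record answers "which document does this hit belong to?", while the term answers "which pieces make up this document?", which is what deletion needs.
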
On Tue, Mar 20, 2012 at 6:01 AM, Marios Titas <redneb8888 at gmail.com> wrote:
> Hi all,
>
> I am trying to implement an Incremental indexing scheme. The problem
> is that usually the modified documents are large but the modifications
> are limited. Ideally, I would like to reindex only the modified parts
> of these documents. If I am not mistaken, xapian cannot do that. Are
> there any other approaches?

I don't know for sure, but I expect Xapian is quite good at only updating the updated parts of a document (in terms of what it writes to disk).

> It would be nice if xapian supported something like the SQL "group
> by". If it did, then it would be possible to break large documents
> into several pieces which could be indexed independently. When
> querying, these pieces would be then combined again using some
> aggregate function similar to the SQL function sum.

Xapian::Enquire has set_collapse_key:
http://xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html#117ee547f5908e952e2e72d5a986d3bb

-- 
E: sym.roe at talusdesign.co.uk
M: 07742079314
@symroe
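
It may be worth noting how collapsing differs from the "sum" aggregate asked about. As far as I understand the Xapian documentation, collapsing on a key keeps the best-ranked document for each key value rather than adding scores together. The sketch below simulates both behaviours in plain Python on made-up (parent key, score) hits; it does not call Xapian, and the function names are illustrative only.

```python
def collapse(hits):
    """Keep only the highest score seen for each key (collapse-style)."""
    best = {}
    for key, score in hits:
        if key not in best or score > best[key]:
            best[key] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


def group_sum(hits):
    """Add up the scores of all hits sharing a key (SQL-style GROUP BY + SUM)."""
    totals = {}
    for key, score in hits:
        totals[key] = totals.get(key, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical hits: two pieces of document "A", one piece of "B".
    hits = [("A", 0.5), ("B", 0.75), ("A", 0.5)]
    print(collapse(hits))
    print(group_sum(hits))
```

With these made-up scores the two strategies rank the parents in opposite order, so collapsing gives one result per parent document but is not a drop-in replacement for a summing aggregate.
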