Hi all,

I am trying to implement an incremental indexing scheme. The problem is that the modified documents are usually large, but the modifications are limited. Ideally, I would like to reindex only the modified parts of these documents. If I am not mistaken, Xapian cannot do that. Are there any other approaches?

It would be nice if Xapian supported something like the SQL "group by". If it did, it would be possible to break large documents into several pieces that could be indexed independently. When querying, these pieces would then be combined again using an aggregate function similar to the SQL function sum.

Thanks
If you have a way to break up the documents, you could use the "collapse key" functionality in Xapian to do what you want.

/cco

On Mar 17, 2013, at 10:52 PM, ?? <chenzhen_java at 126.com> wrote:

> Hi all,
>
> I am trying to implement an Incremental indexing scheme. The problem
> is that usually the modified documents are large but the modifications
> are limited. Ideally, I would like to reindex only the modified parts
> of these documents. If I am not mistaken, xapian cannot do that. Are
> there any other approaches?
>
> It would be nice if xapian supported something like the SQL "group
> by". If it did, then it would be possible to break large documents
> into several pieces which could be indexed independently. When
> querying, these pieces would be then combined again using some
> aggregate function similar to the SQL function sum.
>
> Thanks
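To picture the collapse key suggestion, here is a minimal pure-Python sketch. It is a toy model and does not use Xapian's API: the class PieceIndex, its methods, and the piece ids are all hypothetical names made up for illustration. Each piece is indexed as its own record and carries its parent document id as a collapse key, so changing one piece only touches that piece's postings. In Xapian itself the key would be stored in a document value slot and collapsing requested with Enquire::set_collapse_key(). Note that collapsing keeps the best-ranked piece per key; it does not sum scores the way the SQL aggregate in the question would.

```python
from collections import defaultdict


class PieceIndex:
    """Toy inverted index over document pieces (illustrative only, not Xapian)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of pieces containing it
        self.terms = {}                   # piece id -> set of terms in that piece
        self.parent = {}                  # piece id -> collapse key (parent doc id)

    def replace_piece(self, piece_id, parent_id, text):
        """Index or reindex one piece, touching only that piece's postings."""
        new_terms = set(text.lower().split())
        old_terms = self.terms.get(piece_id, set())
        for term in old_terms - new_terms:
            self.postings[term].discard(piece_id)
        for term in new_terms - old_terms:
            self.postings[term].add(piece_id)
        self.terms[piece_id] = new_terms
        self.parent[piece_id] = parent_id

    def search(self, query):
        """Score pieces by matching query terms, then keep the best piece per parent."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for piece_id in self.postings.get(term, ()):
                scores[piece_id] += 1
        best = {}  # collapse key -> (piece id, score)
        for piece_id, score in scores.items():
            key = self.parent[piece_id]
            if key not in best or score > best[key][1]:
                best[key] = (piece_id, score)
        rows = [(key, piece, score) for key, (piece, score) in best.items()]
        return sorted(rows, key=lambda row: (-row[2], row[0]))


index = PieceIndex()
index.replace_piece("report#1", "report", "quarterly revenue summary")
index.replace_piece("report#2", "report", "revenue forecast and revenue risks")
index.replace_piece("memo#1", "memo", "office relocation notice")
# Only the second piece of "report" changed, so only it is reindexed.
index.replace_piece("report#2", "report", "hiring forecast")
print(index.search("hiring forecast"))
```

Both pieces of "report" share one collapse key, so a query that matched both of them would still return a single row for that document.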
On Mon, Mar 18, 2013 at 01:52:01PM +0800, ???? wrote:
> I am trying to implement an Incremental indexing scheme. The problem
> is that usually the modified documents are large but the modifications
> are limited. Ideally, I would like to reindex only the modified parts
> of these documents. If I am not mistaken, xapian cannot do that.

Xapian does try to be lazy here. In particular, you can get a document from the database, make some changes (e.g. add or remove some terms), and call replace_document() to update it in the database. Only the posting lists for those terms will then be updated, plus the termlist for the document itself and (if the document length changes) the document length pseudo-posting list.

> It would be nice if xapian supported something like the SQL "group
> by". If it did, then it would be possible to break large documents
> into several pieces which could be indexed independently. When
> querying, these pieces would be then combined again using some
> aggregate function similar to the SQL function sum.

As Chris points out, collapsing would allow you to achieve something like this, though such an approach inherently restricts the queries you can perform. For example, if you split the title and body, a search for title:foo AND body:bar is hard to do (but title:foo OR body:bar is easy).

> Are there any other approaches?

Depending on exactly what you're trying to do, using a PostingSource to feed in the more frequently changing information might be suitable.

Cheers,
Olly
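The title:foo AND body:bar restriction can be seen in a small pure-Python sketch. This is a toy model under stated assumptions, not Xapian code: the records, the clause format, and all function names are hypothetical. Each piece of a split document is matched on its own, and the matching pieces are then reduced to their parent document ids, which is roughly what collapsing gives you.

```python
# Two documents, each split into a title piece and a body piece.
pieces = [
    {"doc": "A", "part": "title", "terms": {"foo"}},
    {"doc": "A", "part": "body", "terms": {"bar", "baz"}},
    {"doc": "B", "part": "title", "terms": {"foo"}},
    {"doc": "B", "part": "body", "terms": {"qux"}},
]

# The query from the thread, as (part, term) clauses.
clauses = [("title", "foo"), ("body", "bar")]


def piece_has(piece, part, term):
    """True if this piece is the named part and contains the term."""
    return piece["part"] == part and term in piece["terms"]


def match_or(pieces, clauses):
    """OR per piece: any single clause is enough, so splitting does no harm."""
    return sorted({p["doc"] for p in pieces
                   if any(piece_has(p, *c) for c in clauses)})


def match_and_per_piece(pieces, clauses):
    """AND per piece: one piece would have to satisfy every clause."""
    return sorted({p["doc"] for p in pieces
                   if all(piece_has(p, *c) for c in clauses)})


def match_and_joined(pieces, clauses):
    """AND after regrouping pieces by parent, which needs an extra join step."""
    seen = {}
    for p in pieces:
        for c in clauses:
            if piece_has(p, *c):
                seen.setdefault(p["doc"], set()).add(c)
    return sorted(doc for doc, hit in seen.items() if hit == set(clauses))


print(match_or(pieces, clauses))
print(match_and_per_piece(pieces, clauses))
print(match_and_joined(pieces, clauses))
```

The per-piece AND finds nothing because no single piece is both a title and a body. Getting document A back requires regrouping the pieces by parent before applying the AND, and that is the step a plain collapse does not provide.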