Henry
2009-Jan-12 08:26 UTC
[Xapian-discuss] Returning "fresh" results only from multiple DBs
Greetings, Let's say you have the following scenario: DB1: large corpus with rarely changing data (typically split across a cluster). DB2: small corpus with frequently changing data (to update pages in DB1). DBn: ditto. Since DB1 is so large and heavily accessed, we want to keep things simple and foolproof, so its contents are rarely changed, with newer, fresher pages for the same DB1 pages going into DB2..n. Each duplicate page (fresher, so preferred) has a numeric field which increments for each refresh (1,2,3...), which identifies the most up-to-date page across all DBs. How can I perform an enquiry, collapsing on a key (as currently done) to remove duplicate pages, but yielding the freshest of those duplicate pages? Similar to SQL: SELECT MAX(freshness_num),* FROM table... I know we can perform updates on DB1, but I don't want to go down that path because of the volumes/sizes involved. Any ideas? Thanks Henry
Henry
2009-Jan-14 08:02 UTC
[Xapian-discuss] Returning "fresh" results only from multiple DBs
Crikey, my new webmail (atmail), which I've been testing, doesn't word-wrap at 80... apologies for that. Here's a repost with nice, fresh newlines:

Let's say you have the following scenario:

DB1: large corpus with rarely changing data (typically split across a cluster).
DB2: small corpus with frequently changing data (to update pages in DB1).
DBn: ditto.

Since DB1 is so large and heavily accessed, we want to keep things simple and foolproof, so its contents are rarely changed, with newer, fresher pages for the same DB1 pages going into DB2..n. Each duplicate page (fresher, so preferred) has a numeric field which increments for each refresh (1,2,3...), which identifies the most up-to-date page across all DBs.

How can I perform an enquiry, collapsing on a key (as currently done) to remove duplicate pages, but yielding the freshest of those duplicate pages? Similar to SQL: SELECT MAX(freshness_num),* FROM table...

I know we can perform updates on DB1, but I don't want to go down that path because of the volumes/sizes involved.

Any ideas?

Thanks
Henry
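To make the desired behaviour concrete, here is a minimal sketch in plain Python of "collapse on a key, but keep the freshest duplicate", applied to hits already gathered from all the databases. It does not use the Xapian API; the `Hit` fields (`page_key`, `freshness`, `weight`, `source_db`) and the function name are assumptions made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit; the field names are illustrative only."""
    page_key: str    # the value currently used as the collapse key
    freshness: int   # numeric field incremented on each refresh
    weight: float    # relevance score reported by the search
    source_db: str   # which database the hit came from


def freshest_per_key(hits: Iterable[Hit]) -> List[Hit]:
    """Keep one hit per page_key, choosing the one with the highest freshness.

    Surviving hits are returned in descending weight order.
    """
    best = {}
    for hit in hits:
        current = best.get(hit.page_key)
        if current is None or hit.freshness > current.freshness:
            best[hit.page_key] = hit
    return sorted(best.values(), key=lambda h: h.weight, reverse=True)


hits = [
    Hit("page-a", 1, 0.90, "DB1"),
    Hit("page-a", 3, 0.70, "DB2"),
    Hit("page-b", 1, 0.80, "DB1"),
    Hit("page-a", 2, 0.95, "DB3"),
]
result = freshest_per_key(hits)
```

Note that in this sketch the stale copy of a page can outscore the fresh one (as "page-a" from DB3 does here), so the choice of which duplicate survives is made on freshness alone and ranking is applied afterwards.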
Henry
2009-Jan-16 06:26 UTC
[Xapian-discuss] Returning "fresh" results only from multiple DBs
How about extending set_collapse_key() to accept two or more arguments (à la MultiValueSorter)?

Or, more cleanly I suppose, code a new method to create a "collapse_key" object which is a composite list of keys to collapse on, and which is passed as an argument to set_collapse_key()?

Which one would require the least amount of coding (and API disruption)?

Thoughts?

Cheers
Henry
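As a rough illustration of the "composite list of keys" idea, the sketch below builds a single collapse key from several document values in plain Python. This is not Xapian's API and not how MultiValueSorter encodes its keys; the function name and the length-prefix encoding are assumptions chosen only so that different value combinations cannot collide.

```python
from typing import Sequence


def composite_collapse_key(values: Sequence[bytes]) -> bytes:
    """Join several values into one key (hypothetical helper).

    Each component is prefixed with its length as 4 big-endian bytes, so
    (b"ab", b"c") and (b"a", b"bc") produce different keys.
    """
    return b"".join(len(v).to_bytes(4, "big") + v for v in values)


# Two documents collapse together only if every component matches.
key_1 = composite_collapse_key([b"example.com", b"/index.html"])
key_2 = composite_collapse_key([b"example.com", b"/index.html"])
key_3 = composite_collapse_key([b"example.com/", b"index.html"])
```

Here `key_1` and `key_2` are equal, while `key_3` differs even though the concatenated text is the same, which is the property a composite key needs.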
Henry
2009-Jan-18 13:17 UTC
[Xapian-discuss] Returning "fresh" results only from multiple DBs
On Sun 18/01/09 1:06 PM, Olly Betts <olly at survex.com> wrote:
> We probably should support something like this.

We'll talk about sponsoring something along these lines later.

> > Which one would require the least amount of coding (and API
> > disruption)?
>
> Xapian::Sorter could just be used as-is to build a key for collapsing
> too. It's a shame that we didn't think about this possible reuse before
> adding it to the API as the name seems rather less good now.
>
> But this still won't change that you couldn't implement the "fresh
> results" thing using collapsing. Or is your question not related to
> that?

Not following you here, so I'll study Xapian::Sorter.