similar to: prioritizing aggregated DBs

Displaying 20 results from an estimated 1000 matches similar to: "prioritizing aggregated DBs"

2020 Feb 08
2
prioritizing aggregated DBs
Olly Betts <olly at survex.com> wrote: > On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote: > > Hey all, I've been using ->add_database for a few years > > to tie sharded DBs together and it works great. > > > > Now, I want to be able to search across several DBs > > which aren't sharded, say: linux-DB, glibc-DB, freebsd-DB. > >
2020 Feb 07
0
prioritizing aggregated DBs
On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote: > Hey all, I've been using ->add_database for a few years > to tie sharded DBs together and it works great. > > Now, I want to be able to search across several DBs > which aren't sharded, say: linux-DB, glibc-DB, freebsd-DB. > > I want to search for something across all of them, but > prioritize results
2020 Feb 19
2
prioritizing aggregated DBs
Olly Betts <olly at survex.com> wrote: > On Sat, Feb 08, 2020 at 06:04:42PM +0000, Eric Wong wrote: > > Olly Betts <olly at survex.com> wrote: > > > On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote: > > > > Or would I fiddle with wdf_inc for all ->index_text and ->add_term > > > > calls on a per-DB basis? > > > >
2020 Feb 09
0
prioritizing aggregated DBs
On Sat, Feb 08, 2020 at 06:04:42PM +0000, Eric Wong wrote: > Olly Betts <olly at survex.com> wrote: > > On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote: > > > Or would I fiddle with wdf_inc for all ->index_text and ->add_term > > > calls on a per-DB basis? > > > > That would probably work if you don't want to be able to vary the
2020 Feb 19
0
prioritizing aggregated DBs
On Wed, Feb 19, 2020 at 10:23:09AM +0000, Eric Wong wrote: > Btw, is there a way to quickly figure out which sub-DB a retrieved > document or mset item belongs to? Yes: https://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID 1.4.12 added a Database::size() method which reports the number of shards - for older versions you have to keep track of that yourself (which needs a little care as
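The docid arithmetic behind that FAQ entry can be sketched in a few lines. This is a minimal illustration and not code from the thread. It assumes the combined database interleaves docids across shards, so that shard i holds combined docids i+1, i+1+n, i+1+2n, and so on, and that n_shards is the total shard count (the value Database::size() reports in 1.4.12 and later). The function names locate_in_shards and owning_database are hypothetical, as is the shards_per_db list, which assumes shards appear in the order the databases were passed to add_database.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_in_shards(docid: int, n_shards: int) -> tuple[int, int]:
    """Map a docid in the combined database to (shard index, docid in that shard).

    Assumes the interleaved numbering described in the lead-in; shard
    indexes are 0-based, docids are 1-based.
    """
    if docid < 1 or n_shards < 1:
        raise ValueError("docid and n_shards must both be >= 1")
    shard_index = (docid - 1) % n_shards
    shard_docid = (docid - 1) // n_shards + 1
    return shard_index, shard_docid


def owning_database(shard_index: int, shards_per_db: list[int]) -> int:
    """Return the 0-based position of the logical DB that owns shard_index.

    shards_per_db is a hypothetical bookkeeping list: entry k is the number
    of shards contributed by the k-th database that was added.
    """
    ends = list(accumulate(shards_per_db))  # exclusive end index of each DB
    if not ends or not 0 <= shard_index < ends[-1]:
        raise ValueError("shard_index is outside the known shards")
    return bisect_right(ends, shard_index)
```

For example, with three single-shard databases added in the order linux-DB, glibc-DB, freebsd-DB, locate_in_shards(mset_docid, 3) gives a shard index that is also the position of the owning database. When each logical database is itself sharded, owning_database converts the shard index back to that position.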
2020 Feb 21
1
prioritizing aggregated DBs
Olly Betts <olly at survex.com> wrote: > On Wed, Feb 19, 2020 at 10:23:09AM +0000, Eric Wong wrote: > > Btw, is there a way to quickly figure out which sub-DB a retrieved > > document or mset item belongs to? > > Yes: https://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID > > 1.4.12 added a Database::size() method which reports the number of > shards - for
2011 May 23
1
More relevance for recent documents
Good afternoon, I would like to ask whether it is possible to give more relevance to recent documents in search results. I don't want to sort results by date; I still prefer relevance, but I would like recent documents to score better. I tried adding a search query using AND_MAYBE, which should use relevance from both subqueries, but it didn't add any benefit to the
2010 Jun 09
1
TermGenerator incorrectly tokenizes German text which contains special characters
Dear Xapian users, I am trying to index some German text with Xapian using the xapian_php bindings. I run Apache 2.2 on Windows using PHP 5.2.13 with the prebuilt Xapian bindings from Flax (Xapian Support: enabled; Xapian Compiled Version: @PACKAGE_VERSION@; Xapian Linked Version: 1.2.0). The problem is that after indexing text which contains special characters like ?, ?, ? and ?, using
2016 May 03
2
Weighting recent results
On 5/2/2016 9:03 PM, Olly Betts wrote: > On Fri, Apr 22, 2016 at 12:23:15PM -0400, Alex Aminoff wrote: >> I did some digging and found a thread from 2011 talking about how to >> subclass Xapian::PostingSource in order to incorporate the date or >> recency of a document in its weighting: >> >> http://thread.gmane.org/gmane.comp.search.xapian.general/8849/focus=8856
2008 Jan 15
7
PHP indexing, what's the PHP method for indexscript
Currently I have the following indexscript:
    pid : unique=Q boolean=Q field=pid
    postdate : field=startdate
    author_name: unhtml boolean=XAUTHORNAME field=author
    author_id: boolean=XAUTHORID field=authorid
    url : field=url
    sample : weight=1 index field=sample
How can I create the same indexing using PHP? With this, I can get a searchable index, but I have no idea how to set the fields, so that I
2016 May 16
2
Weighting recent results
I was thinking about this some more: Is there a reason I can't just weight by some function of recency at indexing time? $weight = get_weight_based_on_recency(...); $tg->index_text($txt,$weight); If I wanted to allow the user the option of searching either in recency-weighted mode or not, I could index each document into 2 different databases, one with and one without. This avoids
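One way the get_weight_based_on_recency(...) placeholder above might be filled in is an exponential decay on document age, rounded to the integer wdf increment that index_text takes as its second argument. This is only a sketch under assumptions: the function name recency_wdf_inc, the half-life of 365 days and the cap of 8 are all hypothetical choices and do not come from the thread.

```python
def recency_wdf_inc(age_days: float,
                    half_life_days: float = 365.0,
                    max_inc: int = 8) -> int:
    """Return an integer wdf increment that halves every half_life_days.

    All parameters are illustrative assumptions. The result is clamped to
    at least 1 so that old documents are still indexed with a normal wdf.
    """
    if half_life_days <= 0 or max_inc < 1:
        raise ValueError("half_life_days must be > 0 and max_inc >= 1")
    age_days = max(age_days, 0.0)  # treat future-dated documents as new
    decayed = max_inc * 0.5 ** (age_days / half_life_days)
    return max(1, round(decayed))
```

The returned value would be passed where $weight appears in the snippet above. Because the increment is baked in at indexing time, changing the half-life later means reindexing, which is the trade-off the two-database idea in the message works around.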
2011 Sep 10
1
DBS to R
Hello, I have a bunch of data files all with "dbs" file extensions. They are generated via a SQL query from another program and source. Does anyone know (or have ideas) how to get the data from a dbs file type into R (or into some other format that can be imported to R)? I've searched online for 4 hours now... Thanks! Ben
2020 Aug 21
2
MultiDatabase shard count limitations
Going back to the "prioritizing aggregated DBs" thread from February 2020, I've got 390 Xapian shards for 130 public inboxes I want to search against(*). There's more on the horizon (we're expecting tens of thousands of public inboxes). After bumping RLIMIT_NOFILE and running ->add_database a bunch, the actual queries seem to be taking ~30s (not good :x). Now I'm
2018 Jan 22
2
How to get the serialise score returned in Xapian::KeyMaker->operator().
>A possible workaround (and perhaps a better approach) would be to >set BoolWeight as the weighting scheme, then feed in your score as >a weight using a PostingSource. Then it's available via get_weight() >on the MSetIterator object: > >https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/postingsource.html > >You may find that's faster because
2017 Dec 15
5
How to get the serialise score returned in Xapian::KeyMaker->operator().
Hi all, I am a user of Xapian, and now I have a problem in using it. After using boolean terms to get some candidate documents (still too many), we want to sort them by a self-defined function which is used in Xapian::KeyMaker->operator(). But how can I get the serialised score from the Xapian::MSetIterator object? The C++ code looks like this: class SortKeyMaker : public Xapian::KeyMaker { std::string
2016 Apr 22
2
Weighting recent results
I did some digging and found a thread from 2011 talking about how to subclass Xapian::PostingSource in order to incorporate the date or recency of a document in its weighting: http://thread.gmane.org/gmane.comp.search.xapian.general/8849/focus=8856 As in that thread, I want to be clear that I don't want to sort by date, but rather incorporate date information into the score by which I
2008 Jul 12
1
add_term
I used to use document.add_term("term"); to associate a document with a term that did not appear in the HTML, but the add_term function might have changed, as I no longer get results for associated terms. What would be the new way to do it? Thank you
2014 Mar 17
2
[GSOC 2014] Indexing INEX dataset
Hi Olly, Wouldn't setting the weight of terms in the title back to normal (e.g. 5 to 1) with the line below automatically adjust the wdfs and field lengths? indexer.index_text(title, 5, "S"); -> indexer.index_text(title, 1, "S"); If it does not, then we should include that part in the patch too. I would like to create a patch for xapian-letor for resolving common code of xapian.
2014 Mar 11
2
[GSOC 2014] Indexing INEX dataset
On Tue, Mar 11, 2014 at 03:20:31PM +0100, Parth Gupta wrote: > > > > On current trunk, we index the title with prefix "S" by default in > > omindex, though with a wdf inc of 5 rather than 1: > > > > indexer.index_text(title, 5, "S"); > > > > So I don't think you need that change to omindex now. > > Yes, but please
2013 Oct 30
2
Lucene 3.6.2 backend for xapian (#25)
[Replying to xapian-devel, as I think a wider audience would be useful] On Mon, Oct 21, 2013 at 11:24:51PM +0800, jiangwen jiang wrote: > yes, it's less efficient. Lucene database has multiple segments, each > segment can be treated as an independent database. The same term may exist in >= > 1 segments. Sorry for taking a while to respond - I've been both busy and mulling this