Displaying 20 results from an estimated 1000 matches similar to: "prioritizing aggregated DBs"
2020 Feb 08
2
prioritizing aggregated DBs
Olly Betts <olly at survex.com> wrote:
> On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote:
> > Hey all, I've been using ->add_database for a few years
> > to tie sharded DBs together and it works great.
> >
> > Now, I want to be able to search across several DBs
> > which aren't sharded, say: linux-DB, glibc-DB, freebsd-DB.
> >
2020 Feb 07
0
prioritizing aggregated DBs
On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote:
> Hey all, I've been using ->add_database for a few years
> to tie sharded DBs together and it works great.
>
> Now, I want to be able to search across several DBs
> which aren't sharded, say: linux-DB, glibc-DB, freebsd-DB.
>
> I want to search for something across all of them, but
> prioritize results
2020 Feb 19
2
prioritizing aggregated DBs
Olly Betts <olly at survex.com> wrote:
> On Sat, Feb 08, 2020 at 06:04:42PM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote:
> > > > Or would I fiddle with wdf_inc for all ->index_text and ->add_term
> > > > calls on a per-DB basis?
> > >
>
2020 Feb 09
0
prioritizing aggregated DBs
On Sat, Feb 08, 2020 at 06:04:42PM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > On Fri, Feb 07, 2020 at 09:33:08PM +0000, Eric Wong wrote:
> > > Or would I fiddle with wdf_inc for all ->index_text and ->add_term
> > > calls on a per-DB basis?
> >
> > That would probably work if you don't want to be able to vary the
2020 Feb 19
0
prioritizing aggregated DBs
On Wed, Feb 19, 2020 at 10:23:09AM +0000, Eric Wong wrote:
> Btw, is there a way to quickly figure out which sub-DB a retrieved
> document or mset item belongs to?
Yes: https://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID
1.4.12 added a Database::size() method which reports the number of
shards - for older versions you have to keep track of that yourself
(which needs a little care as
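
A minimal sketch of the arithmetic behind the FAQ entry linked above, assuming the interleaved docid numbering Xapian uses when databases are combined with add_database. The helper names split_docid and join_docid are hypothetical and not part of the Xapian API; n_shards is the value Database::size() would report on 1.4.12 or later, or a count you track yourself on older versions.

```python
def split_docid(docid, n_shards):
    """Map a docid from a combined database to (shard index, docid within that shard).

    Assumes docids start at 1 and are interleaved across shards in the
    order the shards were added. Shard indexes are 0-based.
    """
    if docid < 1 or n_shards < 1:
        raise ValueError("docid and n_shards must both be >= 1")
    shard = (docid - 1) % n_shards
    local_docid = (docid - 1) // n_shards + 1
    return shard, local_docid


def join_docid(shard, local_docid, n_shards):
    """Inverse of split_docid: rebuild the combined docid."""
    return (local_docid - 1) * n_shards + shard + 1
```

With three databases added in the order linux-DB, glibc-DB, freebsd-DB, a shard index of 1 would refer to glibc-DB.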
2020 Feb 21
1
prioritizing aggregated DBs
Olly Betts <olly at survex.com> wrote:
> On Wed, Feb 19, 2020 at 10:23:09AM +0000, Eric Wong wrote:
> > Btw, is there a way to quickly figure out which sub-DB a retrieved
> > document or mset item belongs to?
>
> Yes: https://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID
>
> 1.4.12 added a Database::size() method which reports the number of
> shards - for
2011 May 23
1
More relevance for recent documents
Good afternoon
I would like to ask whether it is possible to give more relevance to
recent documents in search results.
I don't want to sort results by date; I still prefer
relevance, but I would like recent documents to score better.
I tried adding a search query using AND_MAYBE, which should use
relevance from both subqueries, but it didn't add any benefit to the
2010 Jun 09
1
TermGenerator incorrectly tokenizes German text which contains special characters
Dear Xapian users,
I am trying to index some German text with Xapian using the xapian_php bindings. I
run Apache 2.2 on Windows using PHP 5.2.13 with the prebuilt xapian
bindings from Flax:
Xapian Support enabled Xapian
Compiled Version @PACKAGE_VERSION@
Xapian Linked Version 1.2.0
The problem is that after indexing text which contains special characters
like ä, ö, ü and ß, using
2016 May 03
2
Weighting recent results
On 5/2/2016 9:03 PM, Olly Betts wrote:
> On Fri, Apr 22, 2016 at 12:23:15PM -0400, Alex Aminoff wrote:
>> I did some digging and found a thread from 2011 talking about how to
>> subclass Xapian::PostingSource in order to incorporate the date or
>> recency of a document in its weighting:
>>
>> http://thread.gmane.org/gmane.comp.search.xapian.general/8849/focus=8856
2008 Jan 15
7
PHP indexing, what's the PHP method for indexscript
Currently I have the following indexscript:
pid : unique=Q boolean=Q field=pid
postdate : field=startdate
author_name: unhtml boolean=XAUTHORNAME field=author
author_id: boolean=XAUTHORID field=authorid
url : field=url
sample : weight=1 index field=sample
How can I create the same indexing using PHP?
With this, I can get a searchable index, but I have no idea how to set the fields, so that I
2016 May 16
2
Weighting recent results
I was thinking about this some more: Is there a reason I can't just
weight by some function of recency at indexing time?
$weight = get_weight_based_on_recency(...);
$tg->index_text($txt,$weight);
If I wanted to allow the user the option of searching either in
recency-weighted mode or not, I could index each document into 2
different databases, one with and one without.
This avoids
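
A rough sketch of what a helper like get_weight_based_on_recency might look like, written in Python rather than the Perl-style snippet above. The exponential half-life decay, the function name recency_wdf_inc, and the parameters and their defaults are all assumptions for illustration. The only constraint taken from Xapian is that the wdf increment passed to index_text is a non-negative integer.

```python
def recency_wdf_inc(age_days, half_life_days=365.0, max_inc=10):
    """Return an integer wdf increment that shrinks as a document ages.

    Hypothetical scheme: a brand-new document gets max_inc, the value
    halves every half_life_days, and it never drops below 1 so that
    old documents remain searchable.
    """
    age_days = max(0.0, float(age_days))
    decayed = max_inc * 0.5 ** (age_days / half_life_days)
    return max(1, round(decayed))


# At indexing time the result would be passed as the wdf increment,
# e.g. with the Python bindings (not executed here):
#   termgenerator.index_text(text, recency_wdf_inc(age_days))
```

Because the weighting is baked in at indexing time, changing half_life_days or max_inc later would mean reindexing.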
2011 Sep 10
1
DBS to R
Hello,
I have a bunch of data files all with "dbs" file extensions. They are
generated via a SQL query from another program and source. Does anyone know
(or have ideas) how to get the data from a dbs file type into R (or into
some other format that can be imported to R)? I've searched online for 4 hours
now...
Thanks!
Ben
2020 Aug 21
2
MultiDatabase shard count limitations
Going back to the "prioritizing aggregated DBs" thread from
February 2020, I've got 390 Xapian shards for 130 public inboxes
I want to search against(*). There's more on the horizon (we're
expecting tens of thousands of public inboxes).
After bumping RLIMIT_NOFILE and running ->add_database a bunch,
the actual queries seem to be taking ~30s (not good :x).
Now I'm
2018 Jan 22
2
How to get the serialise score returned in Xapian::KeyMaker->operator().
>A possible workaround (and perhaps a better approach) would be to
>set BoolWeight as the weighting scheme, then feed in your score as
>a weight using a PostingSource. Then it's available via get_weight()
>on the MSetIterator object:
>
>https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/postingsource.html
>
>You may find that's faster because
2017 Dec 15
5
How to get the serialise score returned in Xapian::KeyMaker->operator().
Hi all,
I am a user of Xapian, and I have a problem using it.
After using boolean terms to get some candidate documents (still too many), we want to sort them by a self-defined function which is used in Xapian::KeyMaker->operator(). But how can I get the serialised score from the Xapian::MSetIterator object?
The C++ code looks like this:
class SortKeyMaker : public Xapian::KeyMaker {
std::string
2016 Apr 22
2
Weighting recent results
I did some digging and found a thread from 2011 talking about how to
subclass Xapian::PostingSource in order to incorporate the date or
recency of a document in its weighting:
http://thread.gmane.org/gmane.comp.search.xapian.general/8849/focus=8856
As in that thread, I want to be clear that I don't want to sort by date,
but rather incorporate date information into the score by which I
2008 Jul 12
1
add_term
I used to use document.add_term("term"); to associate a document with a
term that did not appear in the HTML, but the add_term function might have changed,
as I no longer get results for associated terms.
What would be the new way to do it?
Thank you
2014 Mar 17
2
[GSOC 2014] Indexing INEX dataset
Hi Olly,
Wouldn't setting the weight of terms in the title back to normal (e.g. 5 to 1)
with the line below automatically adjust the wdfs and field lengths?
indexer.index_text(title, 5, "S"); -> indexer.index_text(title, 1, "S");
If it does not, then we should include that part in the patch too. I would like to
create a patch for xapian-letor for resolving common code of xapian.
2014 Mar 11
2
[GSOC 2014] Indexing INEX dataset
On Tue, Mar 11, 2014 at 03:20:31PM +0100, Parth Gupta wrote:
> >
> > On current trunk, we index the title with prefix "S" by default in
> > omindex, though with a wdf inc of 5 rather than 1:
> >
> > indexer.index_text(title, 5, "S");
> >
> > So I don't think you need that change to omindex now.
>
> Yes, but please
2013 Oct 30
2
Lucene 3.6.2 backend for xapian (#25)
[Replying to xapian-devel, as I think a wider audience would be useful]
On Mon, Oct 21, 2013 at 11:24:51PM +0800, jiangwen jiang wrote:
> Yes, it's less efficient. A Lucene database has multiple segments, and each
> segment can be treated as an independent database. The same term may exist in >=
> 1 segments.
Sorry for taking a while to respond - I've been both busy and mulling
this