similar to: Project: Posting list encoding improvements

Displaying 20 results from an estimated 1000 matches similar to: "Project: Posting list encoding improvements"

2013 Mar 08
2
Gsoc-2013
Hi, I am Chinmay Naik, an undergraduate in Computer Science at Bangalore Institute of Technology, Bangalore. I am an experienced programmer and good with C,C++,Python,Java,OpenGL and would love to participate in Gsoc-13. >From the ideas listed, i am interested to work on the project "posting list encoding improvements". I am a newbie to Xapian but would like to get involved and get a
2012 Apr 20
2
Posting list encoding improvements - pfd encoding & var len encoding comparison program
Hi all, I wrote a program that implement the variable length encoding and fixed length encoding, and compares their index size and speed of search doc length. You can see the comparison result from the attachment snapshot. 1. The posting list is in all memory; 2. The search strategy of fixed length encoding is skipping with exponential step (1, 2, 4, 8, ...). Once exceeds the desired doc id, back
2004 Dec 21
1
Search::Xapian add_database'd search results are odd?
Sorry if this is the wrong forum to discuss Search::Xapian issues -- this just seems like the best place.. Anyways, I've been testing out using $db->add_database() when searching, and it seems like the docids I'm getting out of it are incorrect, almost as though they're "double" what they should be (numerically)... the docids that exist should be around 950,000 and
2010 Oct 22
1
overlapping docids when searching on multiple databases?
Just a quick question - it seems to me that it's entirely possible to get overlapping docids when searching on multiple databases? For instance: open database1 add database2 to database1 search db1+db2 if docid 10 exists in both databases, is there any way of telling which which database to retrieve the document from? /Per Jessen, Z?rich
2023 May 03
1
manual flushing thresholds for deletes?
On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote: > Olly Betts <olly at survex.com> wrote: > > This will also effectively ignore boolean terms, assuming you're giving > > them wdf of 0 (because $3 here is the collection frequency, which is > > sum(wdf(term)) over all documents). > > Should boolean terms be ignored when estimating flushing >
2016 Jun 27
2
xapian-letor: FeatureVector discussion
Hello James, Parth, Following our discussion on IRC and on code review, the way FeatureVector class works needs some discussion. Presently, the FeatureVector class is defined as follows, with a fixed number of feature count (19): class FeatureVector::Internal : public Xapian::Internal::intrusive_base{ friend class FeatureVector; double label; double score;
2012 Apr 01
2
Project: QueryParser Reimplementation, to Olly Betts and Dan Colish
*Hi all,* * * *The following is my general idea for the project. For a complete query parser I still need to consider more details. Please give me feedback because the description of this project is lack of detailed information, and I can submit my proposal without giant deviation.* * * design principle of query parsing: 1) better understanding user input. All search engine do is understanding
2014 May 10
2
some trouble when devising skiplist
Hi, I was confronted with some trouble, I describe the trouble in my journal http://trac.xapian.org/wiki/GSoC2014/Posting%20list%20encoding%20improvements/Journal#May10 And corresponding code is in my git. Would you like to give me some help? ------------------ Shangtong Zhang,Second Year Undergraduate, School of Computer Science, Fudan University, China. -------------- next part
2023 May 03
1
manual flushing thresholds for deletes?
Olly Betts <olly at survex.com> wrote: > On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote: > > Olly Betts <olly at survex.com> wrote: > > > 10 seems too long. You want the mean word length weighted by frequency > > > of occurrence. For English that's typically around 5 characters, which > > > is 5 bytes. If we go for +1 that's:
2017 Dec 18
2
How to get the serialise score returned in Xapian::KeyMaker->operator().
On Sat, Dec 16, 2017 at 10:11:40PM +0000, Olly Betts wrote: > Unfortunately the sort key isn't currently exposed via the public API. > It's available internally and it seems like it ought to be accessible > but there's no accessor method for it - I can add one but that won't > help for existing releases. I've added MSetIterator::get_sort_key() to master in
2013 Jun 19
2
Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote: > On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote: > > The advantage of compact - it runs approximately 8 times as fast (we > > are CPU limited in each case - writing to tmpfs first, then rsyncing > > to the destination) and it takes approximately 75% of the space of a > > fresh database with maximum
2012 Apr 14
1
[xapian] a bug fixed in brass_database.cc
Hi all, I fixed a bug in brass_database.cc. The bug is: *FIXME: this should be done by checking memory usage, not the number of* *changes. We could also look at the amount of data the inverter object* *currently holds.* I also modified the simpleindex.cc so that it now supports batch files indexing. -- Weixian Zhou Department of Computer Science and Engineering University at Buffalo, SUNY
2017 Jun 05
2
Logging the click data
Hi James, > ID: some identifier for each query > QUERY: text of the query (when the query is run) > URLs: every URL displayed (or alternatively, the Xapian docid — this > might be easier) > OFFSET: otherwise you'll have difficulty coping with result pages other > than the first page (when this happens, the query ID should probably > remain the same, and when you aggregate
2016 Mar 10
2
Introduction and Doubts
Tf-idf is most used used weighting scheme is easy to understand and has been used in other frameworks like lucene and many other places. okapi bm25(implemented in xapian) is theoretically better/improved measure than tf-idf and i am looking into various other weighting scheme which are there in xapian or can be implemented like TF-ICF(term frequecy inverse corpus frequency),TF-RF(term
2017 Apr 08
2
Omega: Missing support for newer weighting schemes
On Sat, Apr 08, 2017 at 09:11:22PM +0100, James Aylett wrote: > On 8 Apr 2017, at 19:15, Vivek Pal <vivekpal.dtu at gmail.com> wrote: > > >> and the details of which weighting schemes were available in which version > >> isn't a key part of the $set command itself. > > > > Do you suggest dropping that piece of information out? Since the reason behind
2018 Jan 03
2
Storing the documents text: data record or value ?
Hi, Following the Recoll snippets generation performance problem caused by the new positions list storage scheme in Xapian 1.4, I am experimenting with generating snippets from the complete document text stored in the index. This increases the index size much less than I would have expected (around 10-15% apparently with my home directory data), which is good news obviously. I have tried
2017 Apr 09
3
Omega: Missing support for newer weighting schemes
On Sun, Apr 09, 2017 at 11:34:07PM +0530, Vivek Pal wrote: > > Each scheme already has a human-readable name, and Xapian::Registry > > can map that to an "examplar" object of the right type, so we > > could take a string like "bm25 1 0.8", see the first word is "bm25" > > and get a BM25Weight object, then call parse_params("1 0.8") on
2017 Sep 28
1
Weighting the author of a doc when that term can also appear as a frequent term in other docs
We have a corpus of academic papers. Sometimes it happens that there is an academic controversy and one paper is a response or rebuttal to another paper. The name of the author of the first paper may appear many times in the second paper. So in light of this, how should we set our weight on the author field? Here is an example: http://www.nber.org/papers/w11215  in which the term
2011 Aug 16
3
Bayesian Relative Survival Analysis in R?
Hi all, May i know does R has packages or code to run "Bayesian Relative Survival Analysis"? I have look through Bayesian Survival Analysis(2001) by Joseph George Ibrahim<http://www.google.com/search?tbo=p&tbm=bks&q=inauthor:%22Joseph+George+Ibrahim%22>, Ming-Hui Chen<http://www.google.com/search?tbo=p&tbm=bks&q=inauthor:%22Ming-Hui+Chen%22>, Debajyoti
2014 Nov 23
2
GSoc Project Idea Weighting Schemes (Ranking)
Hi, I am Abhishek Currently Xapian::Weight follows BM25 scheme, many models such as the Divergence from Randomness (DfR) family of models, Unigram Language Model and the Bi-gram Language Model implemented two years ago in GSoc 2012 yet not merged to the master. The new weighing schemes or improvement in implementing the previous models to change the default scheme of BM25 from SMART with