thr3ads.net - similar to: "Project: Posting list encoding improvements"

Gsoc-2013

2013 Mar 08

Hi, I am Chinmay Naik, an undergraduate in Computer Science at Bangalore Institute of Technology, Bangalore. I am an experienced programmer, good with C, C++, Python, Java and OpenGL, and would love to participate in Gsoc-13. From the ideas listed, I am interested in working on the project "posting list encoding improvements". I am a newbie to Xapian but would like to get involved and get a

Posting list encoding improvements - pfd encoding & var len encoding comparison program

2012 Apr 20

Hi all, I wrote a program that implements the variable length encoding and fixed length encoding, and compares their index size and the speed of searching for a doc length. You can see the comparison result from the attachment snapshot. 1. The posting list is held entirely in memory; 2. The search strategy of fixed length encoding is skipping with exponential step (1, 2, 4, 8, ...). Once exceeds the desired doc id, back
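The two ideas named in this snippet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the poster's program: the byte layout (7 data bits per byte, high bit as a continuation flag), the delta coding of docids, and all function names are hypothetical choices, and because the snippet is cut off before it describes the backward step, a binary search over the last interval is assumed here.

```python
import bisect


def vbyte_encode(n):
    """Encode a non-negative int using 7 data bits per byte (assumed layout)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_postings(docids):
    """Variable-length encode the gaps between sorted docids."""
    out = bytearray()
    prev = 0
    for d in docids:
        out += vbyte_encode(d - prev)
        prev = d
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild the sorted docid list."""
    docids = []
    total = n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            total += n
            docids.append(total)
            n = shift = 0
    return docids


def skip_to(docids, target, start=0):
    """Index of the first docid >= target, probing with steps 1, 2, 4, 8, ..."""
    n = len(docids)
    if start >= n or docids[start] >= target:
        return start
    lo, step = start, 1
    hi = lo + step
    while hi < n and docids[hi] < target:
        lo = hi
        step *= 2
        hi = lo + step
    hi = min(hi, n)
    # docids[lo] < target, so the answer lies in (lo, hi]; the binary
    # search used for this final interval is an assumption.
    return bisect.bisect_left(docids, target, lo + 1, hi)


docids = [3, 7, 8, 20, 21, 50, 1000, 100000]
encoded = encode_postings(docids)
assert decode_postings(encoded) == docids
print(len(encoded), "bytes;", "first docid >= 22 is at index", skip_to(docids, 22))
```

Random access by index, as skip_to uses it, needs decoded or fixed-width entries; a variable-length stream has to be decoded sequentially unless extra skip information is stored, which is presumably why the snippet pairs the exponential skipping with the fixed length encoding.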

Search::Xapian add_database'd search results are odd?

2004 Dec 21

Sorry if this is the wrong forum to discuss Search::Xapian issues; this just seems like the best place. Anyway, I've been testing out using $db->add_database() when searching, and it seems like the docids I'm getting out of it are incorrect, almost as though they're "double" what they should be (numerically)... the docids that exist should be around 950,000 and

overlapping docids when searching on multiple databases?

2010 Oct 22

Just a quick question - it seems to me that it's entirely possible to get overlapping docids when searching on multiple databases? For instance: open database1; add database2 to database1; search db1+db2. If docid 10 exists in both databases, is there any way of telling which database to retrieve the document from? /Per Jessen, Zürich

manual flushing thresholds for deletes?

2023 May 03

On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > This will also effectively ignore boolean terms, assuming you're giving
> > them wdf of 0 (because $3 here is the collection frequency, which is
> > sum(wdf(term)) over all documents).
>
> Should boolean terms be ignored when estimating flushing
>

xapian-letor: FeatureVector discussion

2016 Jun 27

Hello James, Parth, Following our discussion on IRC and on code review, the way the FeatureVector class works needs some discussion. Presently, the FeatureVector class is defined as follows, with a fixed feature count (19):
    class FeatureVector::Internal : public Xapian::Internal::intrusive_base {
        friend class FeatureVector;
        double label;
        double score;

Project: QueryParser Reimplementation, to Olly Betts and Dan Colish

2012 Apr 01

Hi all, The following is my general idea for the project. For a complete query parser I still need to consider more details. Please give me feedback, because the description of this project lacks detailed information, so that I can submit my proposal without deviating too far. Design principle of query parsing: 1) better understanding user input. All search engine do is understanding

some trouble when devising skiplist

2014 May 10

Hi, I have run into some trouble, which I describe in my journal: http://trac.xapian.org/wiki/GSoC2014/Posting%20list%20encoding%20improvements/Journal#May10 The corresponding code is in my git repository. Could you give me some help? Shangtong Zhang, Second Year Undergraduate, School of Computer Science, Fudan University, China.

manual flushing thresholds for deletes?

2023 May 03

Olly Betts <olly at survex.com> wrote:
> On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > 10 seems too long. You want the mean word length weighted by frequency
> > > of occurrence. For English that's typically around 5 characters, which
> > > is 5 bytes. If we go for +1 that's:

How to get the serialise score returned in Xapian::KeyMaker->operator().

2017 Dec 18

On Sat, Dec 16, 2017 at 10:11:40PM +0000, Olly Betts wrote:
> Unfortunately the sort key isn't currently exposed via the public API.
> It's available internally and it seems like it ought to be accessible
> but there's no accessor method for it - I can add one but that won't
> help for existing releases.
I've added MSetIterator::get_sort_key() to master in

Compact databases and removing stale records at the same time

2013 Jun 19

On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum

[xapian] a bug fixed in brass_database.cc

2012 Apr 14

Hi all, I fixed a bug in brass_database.cc. The bug is: "FIXME: this should be done by checking memory usage, not the number of changes. We could also look at the amount of data the inverter object currently holds." I also modified simpleindex.cc so that it now supports batch file indexing. -- Weixian Zhou, Department of Computer Science and Engineering, University at Buffalo, SUNY

Logging the click data

2017 Jun 05

Hi James,
> ID: some identifier for each query
> QUERY: text of the query (when the query is run)
> URLs: every URL displayed (or alternatively, the Xapian docid - this
> might be easier)
> OFFSET: otherwise you'll have difficulty coping with result pages other
> than the first page (when this happens, the query ID should probably
> remain the same, and when you aggregate

Introduction and Doubts

2016 Mar 10

Tf-idf is the most used weighting scheme, is easy to understand, and has been used in other frameworks such as Lucene and in many other places. Okapi BM25 (implemented in Xapian) is theoretically a better measure than tf-idf, and I am looking into various other weighting schemes which exist in Xapian or could be implemented, like TF-ICF (term frequency inverse corpus frequency), TF-RF (term

Omega: Missing support for newer weighting schemes

2017 Apr 08

On Sat, Apr 08, 2017 at 09:11:22PM +0100, James Aylett wrote:
> On 8 Apr 2017, at 19:15, Vivek Pal <vivekpal.dtu at gmail.com> wrote:
>
> >> and the details of which weighting schemes were available in which version
> >> isn't a key part of the $set command itself.
> >
> > Do you suggest dropping that piece of information out? Since the reason behind

Storing the documents text: data record or value ?

2018 Jan 03

Hi, Following the Recoll snippets generation performance problem caused by the new positions list storage scheme in Xapian 1.4, I am experimenting with generating snippets from the complete document text stored in the index. This increases the index size much less than I would have expected (around 10-15% apparently with my home directory data), which is good news obviously. I have tried

Omega: Missing support for newer weighting schemes

2017 Apr 09

On Sun, Apr 09, 2017 at 11:34:07PM +0530, Vivek Pal wrote:
> > Each scheme already has a human-readable name, and Xapian::Registry
> > can map that to an "exemplar" object of the right type, so we
> > could take a string like "bm25 1 0.8", see the first word is "bm25"
> > and get a BM25Weight object, then call parse_params("1 0.8") on

Weighting the author of a doc when that term can also appear as a frequent term in other docs

2017 Sep 28

We have a corpus of academic papers. Sometimes it happens that there is an academic controversy and one paper is a response or rebuttal to another paper. The name of the author of the first paper may appear many times in the second paper. So in light of this, how should we set our weight on the author field? Here is an example: http://www.nber.org/papers/w11215 in which the term

Bayesian Relative Survival Analysis in R?

2011 Aug 16

Hi all, May I know whether R has packages or code to run "Bayesian Relative Survival Analysis"? I have looked through Bayesian Survival Analysis (2001) by Joseph George Ibrahim<http://www.google.com/search?tbo=p&tbm=bks&q=inauthor:%22Joseph+George+Ibrahim%22>, Ming-Hui Chen<http://www.google.com/search?tbo=p&tbm=bks&q=inauthor:%22Ming-Hui+Chen%22>, Debajyoti

GSoc Project Idea Weighting Schemes (Ranking)

2014 Nov 23

Hi, I am Abhishek. Currently Xapian::Weight follows the BM25 scheme; many models, such as the Divergence from Randomness (DfR) family of models, the Unigram Language Model and the Bi-gram Language Model, were implemented two years ago in GSoC 2012 but have not yet been merged to master. The new weighting schemes or improvement in implementing the previous models to change the default scheme of BM25 from SMART with