search for: bm25

Displaying 20 results from an estimated 65 matches for "bm25".

2011 Feb 18
1
Is it possible to reset the parameters in BM25 each time a new query enters?
Hi guys, I'm trying to improve the search results of our collection by tuning the parameters in the BM25 weighting schema. Since our collection includes several databases, such as for pictures, websites, etc., I would like to use different values of the same schema to calculate the weights. Yet, rebuilding each time after the change was done to the head file seems not an optimal approach and costs too...
2010 Nov 01
1
floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)
I am using BM25 with k1=0 and min_normlen=1 to get weights unaffected by document length and term frequency in the document (min_normlen=1 isn't necessary I guess) and am expecting single-term weights to be identical for all matches. I have added a document value to steer such general search queries and it...
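The expectation in the snippet above (identical single-term weights once k1=0) can be checked with a minimal sketch. It uses the textbook BM25 term component, not Xapian's exact implementation; the function name `bm25_term_weight`, its parameter names, and the default values are assumptions for illustration only.

```python
def bm25_term_weight(idf, wdf, normlen, k1=1.0, b=0.5):
    """Textbook BM25 term component (simplified; not Xapian's exact code).

    idf     -- inverse-document-frequency factor for the term
    wdf     -- within-document frequency of the term (must be > 0)
    normlen -- document length divided by the average document length
    """
    length_factor = (1.0 - b) + b * normlen
    # The tf part is computed first so that k1=0 collapses it to wdf / wdf.
    tf_part = ((k1 + 1.0) * wdf) / (k1 * length_factor + wdf)
    return idf * tf_part


# With k1=0 neither wdf nor document length influences the result.
short_doc = bm25_term_weight(idf=2.5, wdf=1, normlen=0.2, k1=0.0)
long_doc = bm25_term_weight(idf=2.5, wdf=40, normlen=7.0, k1=0.0)
```

In this simplified form both calls return the plain idf value, so any differences seen in practice would have to come from elsewhere in the real computation.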
2016 Mar 10
2
Introduction and Doubts
Tf-idf is the most widely used weighting scheme; it is easy to understand and has been used in other frameworks like Lucene and many other places. Okapi BM25 (implemented in Xapian) is theoretically a better/improved measure than tf-idf, and I am looking into various other weighting schemes which are in Xapian or can be implemented, like TF-ICF (term frequency inverse corpus frequency) and TF-RF (term frequency-relevance frequency), for evaluating the speed...
2017 Apr 08
2
Omega: Missing support for newer weighting schemes
...ves in modules we've built on the C++ API that seems a strong hint that this functionality might belong in the API instead. Each scheme already has a human-readable name, and Xapian::Registry can map that to an "exemplar" object of the right type, so we could take a string like "bm25 1 0.8", see the first word is "bm25" and get a BM25Weight object, then call parse_params("1 0.8") on it to create the correct Weight object (broadly similar to how unserialise() is handled). Then we can document the available schemes and the parameters they take in one pla...
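The name-then-parameters idea in the snippet above could look roughly like the following sketch. Everything here is hypothetical: `BM25Sketch`, `REGISTRY` and `weight_from_string` are illustrative stand-ins, not Xapian's API, and the sketch attaches no meaning to the individual numbers.

```python
class BM25Sketch:
    """Hypothetical stand-in for a weighting scheme object."""

    name = "bm25"

    def __init__(self, params=()):
        self.params = tuple(params)

    def parse_params(self, text):
        # Build a new object of the same type from a parameter string.
        return BM25Sketch(float(value) for value in text.split())


# Hypothetical registry: scheme name -> exemplar object of the right type.
REGISTRY = {BM25Sketch.name: BM25Sketch()}


def weight_from_string(spec):
    """Split 'name params...' and let the exemplar parse its own parameters."""
    name, _, params = spec.strip().partition(" ")
    try:
        exemplar = REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown weighting scheme: {name!r}") from None
    return exemplar.parse_params(params)


weight = weight_from_string("bm25 1 0.8")
```

Keeping the parsing inside each scheme means the caller only needs the name lookup, which is the appeal of the approach described in the message.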
2017 Sep 28
1
Weighting the author of a doc when that term can also appear as a frequent term in other docs
...t on the author field to like 300, that would cause a search for "Moore's Law" to be dominated by results written by authors named Moore. One suggestion someone had was what if the 300th mention of Hoxby was not as important as the first. I tried to read https://xapian.org/docs/bm25.html and I think I conclude that as long as f is small relative to L or K, the value of the expression will increase linearly with f. To make it less than linear, we might invoke > BM25 originally introduced another constant, as a power to which f and > K are raised. However, Stephen rem...
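The linear-versus-saturating behaviour discussed in the snippet above can be seen in a small sketch of the generic f / (K + f) shape. The function name `tf_saturation` and the sample values of K are assumptions chosen for illustration; this is not the full formula from the linked page.

```python
def tf_saturation(f, K):
    """Generic saturating term-frequency factor f / (K + f)."""
    return f / (K + f)


# When f is small relative to K, the factor grows almost linearly with f.
ratio_large_K = tf_saturation(300, K=10000.0) / tf_saturation(1, K=10000.0)

# When f is large relative to K, the factor flattens out below 1.
ratio_small_K = tf_saturation(300, K=1.0) / tf_saturation(1, K=1.0)
```

With a large K the 300th mention still adds nearly as much as the first, whereas with a small K, 300 mentions are worth less than twice one mention.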
2013 May 15
0
Better parsing of BM25 parameters in Omega
Hello guys, as discussed on IRC, I have written some code for better parsing of BM25 parameters in Omega. If no parameters are specified, it uses the defaults for all of them. However, if some are specified and some are not, or if invalid values are given for any of them, it throws an error. https://github.com/aarshkshah1992/xapian/commit/ac0a11f5d8ff975fad1e96e63764eab9b04dfcfb -Rega...
2017 Apr 09
3
Omega: Missing support for newer weighting schemes
On Sun, Apr 09, 2017 at 11:34:07PM +0530, Vivek Pal wrote: > > Each scheme already has a human-readable name, and Xapian::Registry > > can map that to an "exemplar" object of the right type, so we > > could take a string like "bm25 1 0.8", see the first word is "bm25" > > and get a BM25Weight object, then call parse_params("1 0.8") on it to > > create the correct Weight object (broadly similar to how unserialise() > > is handled). > > If I followed correctly, since the set_wei...
2014 Mar 22
2
[GSOC 2014] Indexing INEX dataset
For unsupervised approaches like BM25 this approach works well, but Letor does not need special weighting for the title in this form, as it assigns weights to title features separately itself. But I see your concern: it would be a problem when BM25 is used on the index with this setup. Hence it's preferable to take a note of this uplift in ti...
2014 Nov 23
2
GSoc Project Idea Weighting Schemes (Ranking)
Hi, I am Abhishek. Currently Xapian::Weight follows the BM25 scheme; many models, such as the Divergence from Randomness (DfR) family of models, the Unigram Language Model and the Bi-gram Language Model, were implemented two years ago in GSoC 2012 yet have not been merged to master. The new weighting schemes or improvement in implementing the previous models to change the de...
2012 Jul 17
1
Can not use custom weight scheme with python binding
Hi, I'm trying to use custom weight with python binding. My test code is like this. class TinkerWeight(xapian.Weight): def __init__(self): pass def name(self): return "Tinker" def serialize(self): return "" def get_sumpart(*args): return 1 def get_maxpart(*args): return 1 def get_sumextra(*args):
2016 Jul 24
2
Weighting Schemes: Evaluation results
...uation/tree/evaluation/run and a README for getting started with xapian-evaluation module. Hopefully, it might be of help to those who are new to evaluating weighting schemes in Xapian :) Comparing the MAP to assess the retrieval effectiveness, some interesting results have emerged as follows: 1. BM25+ : 0.100415 and BM25: 0.101771 BM25 does a slightly better job here. My guess is that BM25+ is falling short because maybe we lack very long documents in the data-set collection. Also, I'm thinking of revisiting the PR of BM25+ patch and cross-checking it with the original BM25+ formula to spot any...
2013 Aug 26
2
Backend for Lucene format indexes-How to get doclength
...document length is > > probably the simplest answer. > > There's tf-idf weighting scheme on svn master, is it suitable for lucene > backend? Yes - TfIdfWeight doesn't ever use the document length (at least with the normalisations currently implemented). You could also use BM25 with parameter b=0. Cheers, Olly
2013 Feb 07
0
Ideas for allowing specification of weighing scheme for Eset
...with the parameters, if he does not want to use the default values) to build the Eset (rather than using the hard-coded TradWeight scheme with default k=1), as Olly had suggested that we can probably get better terms (a more relevant Eset) for query expansion if we use something like BM25 (or allow the user to use a self-coded scheme) for ranking the terms. I read up the code for the proxy, internal and iterator classes of Eset and Mset to get a feel of how those sets work. I then traced the working of Enquire::get_eset() (understood it well other than how a Termlist tree is buil...
2014 Mar 04
2
Test Dataset for performance and accuracy analysis
Hi Parth, I implemented DFR algorithms in Xapian as a part of GSOC last year under the mentorship of Olly. This year, I want to work on analyzing and optimizing the performance of the DFR algorithms and comparing them with BM25. I also want to work on profiling the query expansion schemes and test the relevance (precision and recall) / speed (time taken) of the algorithms. However, for this, I need a well defined data set containing a considerable amount of textual data, query logs containin...
2012 Mar 31
1
Project: Posting list encoding improvements
...ller: use gamma codes to encode the gap between docids instead of docids. Last question about the weighting schemes project: do we only need to implement existing weighting schemes, rather than come up with new ideas? And is our mission to find a weighting scheme that could replace the default BM25 in Xapian? -- Weixian Zhou Department of Computer Science and Engineering University at Buffalo, SUNY
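The gap-encoding idea mentioned in the snippet above can be sketched with Elias gamma codes. This is a minimal illustration using bit strings for readability, not Xapian's posting list format; the function names are assumptions, and docids are assumed to be positive and strictly increasing.

```python
def gamma_encode(n):
    """Elias gamma code of n >= 1: (bit length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("gamma codes are only defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_docids(docids):
    """Encode the gaps between successive docids rather than the docids."""
    bits = []
    previous = 0
    for docid in docids:
        bits.append(gamma_encode(docid - previous))
        previous = docid
    return "".join(bits)


def decode_docids(bits):
    """Invert encode_docids by reading one gamma code at a time."""
    docids = []
    pos = 0
    previous = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        previous += gap
        docids.append(previous)
    return docids


encoded = encode_docids([3, 4, 9, 20])  # gaps are 3, 1, 5, 11
decoded = decode_docids(encoded)
```

Small gaps get short codes (a gap of 1 costs a single bit), which is why encoding gaps tends to beat encoding raw docids for dense posting lists.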
2018 Jan 22
2
How to get the serialise score returned in Xapian::KeyMaker->operator().
...s slower than KeyMaker. I think the reason may be: we only use one Xapian::Query of PostingSource, and the upper bound of our get_weight() cannot work on a single PostingSource. So some optimisations don't work, and instead waste time. What do you think about this? Also, we found the BM25 algorithm is fast in Xapian, so we wonder whether we can modify our get_weight() function to adapt the BM25 algorithm. If so, the type of the termfreq of a document should be double. I am wondering whether it works to just re-typedef Xapian::termcount to double? Does it have a negative impact on other places in Xapian...
2019 Mar 19
3
Project Proposal in GSoC 2019
Hi All, I am interested in applying for the two projects listed in the Xapian GSoC 2019 project ideas list: "Learning to Rank Stabilisation" and "Weighting Schemes". I have downloaded the codebase and am going through some of the commits related to the Letor API, BM25, and DFR weighting schemes. Can anyone tell me how to write the formal proposal for the above-mentioned projects? Thanks and Regards, -Sourav.
2013 Sep 02
2
Backend for Lucene format indexes-How to get doclength
On Mon, Sep 02, 2013 at 09:21:48AM +0800, jiangwen jiang wrote: > TfIdfWeight and BM25(b=0) also need wdf_upper_bound, which does not exist in > Lucene backends. If you don't provide an implementation of wdf_upper_bound(), the default is to use the collection frequency of the term, so provided that information is available in the lucene files, the lack of wdf_upper_bound informat...
2016 Jun 10
2
Weighting Schemes -- Project Progress
Hello everyone, I have been working on adding support for the BM25+ weighting function for the last couple of weeks. Initially, I considered modifying bm25weight.cc to add support for the BM25+ function without disturbing the functionality of BM25. But that didn't work out very well. A day or two was spent trying to refactor and debug the same code. Later, I took...
2011 Apr 01
2
New Idea on Ranking in IR
.... Only today I got a mail from one of the professors from Europe, whom I am going to join for a Ph.D., about GSoC and more precisely Xapian. Generally the ranking is unsupervised, where the rank list is produced based on the score provided by the ranking function. Ranking functions like BM25, TF-IDF and so on are unsupervised. So we give the rank list in decreasing order of the score. Learning to rank, by contrast, involves supervised learning. If we can extract features for query and initially retrieved document pairs, then we can say which document should come above which. Basically search engine requ...