Olly Betts
2013-Aug-26 02:17 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Mon, Aug 26, 2013 at 09:41:07AM +0800, jiangwen jiang wrote:
> > For now, using weighting schemes which don't use document length is
> > probably the simplest answer.
>
> There's tf-idf weighting scheme on svn master, is it suitable for lucene
> backend?

Yes - TfIdfWeight doesn't ever use the document length (at least with
the normalisations currently implemented).

You could also use BM25 with parameter b=0.

Cheers,
Olly
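To illustrate why b=0 removes the dependence on document length, here is a minimal sketch of the textbook BM25 per-term weight. It is not Xapian's actual BM25Weight code (which has further parameters), and the function name, argument names and default values below are assumptions made for the example only.

```python
def bm25_term_weight(wdf, doc_len, avg_doc_len, idf, k1=1.0, b=0.5):
    """Textbook BM25 weight of one term in one document (illustrative only).

    wdf         -- within-document frequency of the term
    doc_len     -- length of this document
    avg_doc_len -- average document length in the collection
    idf         -- inverse document frequency factor, supplied by the caller
    """
    # Normalised document length: 1.0 for an average-length document.
    norm_len = doc_len / avg_doc_len
    # b interpolates between "ignore length" (b=0) and "fully normalise" (b=1).
    length_factor = (1 - b) + b * norm_len
    return idf * (k1 + 1) * wdf / (k1 * length_factor + wdf)


# With b=0 the length factor is always 1, so doc_len has no effect.
short_doc = bm25_term_weight(wdf=3, doc_len=10, avg_doc_len=100, idf=1.5, b=0)
long_doc = bm25_term_weight(wdf=3, doc_len=1000, avg_doc_len=100, idf=1.5, b=0)
print(short_doc == long_doc)
```

With any b > 0 the two calls above would give different weights, which is the case a backend without stored document lengths cannot support.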
jiangwen jiang
2013-Sep-02 01:21 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
TfIdfWeight and BM25 (b=0) also need wdf_upper_bound, which does not exist in the Lucene backend. I think this data will be calculated when doing copydatabase; I will update the code later.

Regards
Olly Betts
2013-Sep-02 06:56 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Mon, Sep 02, 2013 at 09:21:48AM +0800, jiangwen jiang wrote:
> TfIdfWeight and BM25(b=0) also need wdf_upper_bound, it is not exists in
> Lucene backends.

If you don't provide an implementation of wdf_upper_bound(), the default
is to use the collection frequency of the term, so provided that
information is available in the lucene files, the lack of wdf_upper_bound
information isn't a show stopper.

> I think this data will be caculated when doing copydatabase, I will update
> the code later

That's probably a good plan though.

Cheers,
Olly
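The fallback described in this reply can be sketched as follows. This is a toy model, not Xapian source: the class names, method names and the in-memory postings dictionary are all hypothetical. It only shows why the collection frequency is a valid (if loose) upper bound on wdf, and what a tighter bound computed once (for example during a copy) would look like.

```python
class BackendSketch:
    """Hypothetical backend that knows only per-document wdf values."""

    def __init__(self, postings):
        # postings maps term -> {docid: wdf}
        self.postings = postings

    def collection_freq(self, term):
        # Collection frequency: sum of the term's wdf over all documents.
        return sum(self.postings.get(term, {}).values())

    def wdf_upper_bound(self, term):
        # Default: no stored bound, so fall back to the collection frequency.
        # No single document's wdf can exceed the sum over all documents.
        return self.collection_freq(term)


class BackendWithStoredBound(BackendSketch):
    """Hypothetical backend that supplies the exact maximum wdf."""

    def wdf_upper_bound(self, term):
        # Tighter bound: the largest wdf actually seen for the term.
        return max(self.postings.get(term, {}).values(), default=0)


postings = {"lucene": {1: 2, 2: 5, 3: 1}}
print(BackendSketch(postings).wdf_upper_bound("lucene"))           # 8
print(BackendWithStoredBound(postings).wdf_upper_bound("lucene"))  # 5
```

Both values are safe upper bounds; the tighter one simply lets a matcher rule out low-scoring documents earlier.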