Olly Betts
2013-Sep-02 06:56 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Mon, Sep 02, 2013 at 09:21:48AM +0800, jiangwen jiang wrote:
> TfIdfWeight and BM25(b=0) also need wdf_upper_bound, which does not exist in
> Lucene backends.

If you don't provide an implementation of wdf_upper_bound(), the default is to use the collection frequency of the term, so provided that information is available in the Lucene files, the lack of wdf_upper_bound information isn't a show stopper.

> I think this data will be calculated when doing copydatabase; I will update
> the code later.

That's probably a good plan though.

Cheers,
Olly
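To see why the collection frequency works as a fallback, note that it is the sum of the term's wdf over every document, so it can never be smaller than the largest single wdf. The sketch below is a toy illustration in plain Python, not Xapian code; the posting list and the helper names `collection_freq` and `true_max_wdf` are made up for the example.

```python
# Toy posting list for one term: document id -> wdf (within-document frequency).
# The numbers are invented for illustration only.
postings = {1: 3, 2: 1, 7: 5}


def collection_freq(postings):
    """Total occurrences of the term across all documents (sum of wdf)."""
    return sum(postings.values())


def true_max_wdf(postings):
    """The tightest possible wdf upper bound for this term."""
    return max(postings.values(), default=0)


fallback_bound = collection_freq(postings)
tight_bound = true_max_wdf(postings)

# The fallback is always a valid bound, just a looser one.
print(fallback_bound >= tight_bound)
```

A looser bound is still correct for weighting; it may only make any optimisation that relies on the bound less effective than a tight one would.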
jiangwen jiang
2013-Sep-03 06:38 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Collection frequency means how many times a particular term appears across all documents. This data does not exist in Lucene backends (I will check on the Lucene mailing list later). Termfreq (how many documents contain a particular term) is the most similar data to collection frequency, but I don't think termfreq can be used in place of collection frequency. For now I am trying to calculate this data in copydatabase.

Thanks
Regards
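The difference between the two statistics can be shown on a tiny made-up corpus. This is only a sketch in plain Python, with a hypothetical helper `term_stats` and whitespace tokenisation assumed for simplicity; it does not read Lucene or Xapian files.

```python
from collections import Counter

# Invented three-document corpus: document id -> text.
docs = {
    1: "apple apple apple apple apple",
    2: "banana apple",
    3: "banana",
}


def term_stats(docs, term):
    """Return (termfreq, collection_freq, max_wdf) for `term`."""
    termfreq = 0         # number of documents containing the term
    collection_freq = 0  # total occurrences across all documents
    max_wdf = 0          # largest count within any single document
    for text in docs.values():
        wdf = Counter(text.split())[term]
        if wdf > 0:
            termfreq += 1
            collection_freq += wdf
            max_wdf = max(max_wdf, wdf)
    return termfreq, collection_freq, max_wdf


termfreq, collection_freq, max_wdf = term_stats(docs, "apple")
print(termfreq, collection_freq, max_wdf)
```

Here "apple" occurs in two documents but five times in one of them, so termfreq is smaller than the largest wdf and cannot serve as an upper bound, while the collection frequency can.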
jiangwen jiang
2013-Sep-15 04:06 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
The code is updated now; please see the latest code. I have also added copy-lucenedatabase.cc, which calculates wdf_upper_bound and stores it in a new file, stat.xapian. TfIdfWeight is used.

Regards
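As a rough idea of how such a statistic can be gathered during a copy, the sketch below tracks the largest wdf in the same pass that copies the documents. It is a plain-Python illustration under stated assumptions, not the contents of copy-lucenedatabase.cc: the function `copy_with_stats` and its input shape are hypothetical, and the message does not say how stat.xapian is laid out or whether the bound is kept per term or for the whole database, so a single database-wide value is assumed here and nothing is written to disk.

```python
def copy_with_stats(source_docs):
    """Copy documents and compute a database-wide wdf upper bound in one pass.

    `source_docs` is an iterable of (docid, {term: wdf}) pairs; this input
    shape is an assumption made for the example.
    """
    copied = {}
    wdf_upper_bound = 0
    for docid, term_wdfs in source_docs:
        copied[docid] = dict(term_wdfs)  # stand-in for the real copy step
        for wdf in term_wdfs.values():
            wdf_upper_bound = max(wdf_upper_bound, wdf)
    return copied, wdf_upper_bound


# Invented sample data.
source = [
    (1, {"lucene": 2, "index": 1}),
    (2, {"xapian": 4}),
    (3, {"index": 3, "xapian": 1}),
]

copied, bound = copy_with_stats(source)
print(bound)
```

Computing the value during the copy avoids a second scan of the postings, at the cost of having to recompute it whenever the source index changes.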