search for: wdf

Displaying 20 results from an estimated 69 matches for "wdf".

2013 Feb 19
2
Implementing tf-idf weighting scheme in Xapian
...rms which occur in a few documents) should be able to give a higher weight to the documents they index compared to terms which occur in many documents. Also, the higher the within-document frequency in the document, the more weight the term gives to the document. The basic formula is W(t,d) = wdf * log(N/termfreq). However, various normalizations can be applied to both wdf and idf. The extra per-document component will be 0 here, so get_maxextra() will return 0. Moreover, an upper bound on W(t,d) for get_maxpart() can be found easily for a particular normalization (if I have al...
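A minimal Python sketch of the basic formula quoted in this snippet, W(t,d) = wdf * log(N/termfreq), with no normalization applied. The function names and the idea of passing in a known upper bound on wdf are assumptions made for illustration; this is not Xapian's weighting API.

```python
import math


def tfidf_weight(wdf, termfreq, num_docs):
    """Unnormalized tf-idf weight: wdf * log(N / termfreq).

    wdf      -- within-document frequency of the term in this document
    termfreq -- number of documents the term occurs in
    num_docs -- N, the total number of documents in the collection
    """
    if wdf <= 0 or termfreq <= 0:
        return 0.0
    return wdf * math.log(num_docs / termfreq)


def tfidf_upper_bound(wdf_upper_bound, termfreq, num_docs):
    """Upper bound on the weight, assuming a bound on wdf is known.

    The weight grows monotonically with wdf (for termfreq <= N), so
    evaluating it at the largest possible wdf bounds every document's
    weight for this term.
    """
    return tfidf_weight(wdf_upper_bound, termfreq, num_docs)
```

With N = 1000, a term found in 10 documents outweighs one found in 500 documents at the same wdf, which is the behaviour the snippet describes.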
2007 Jan 24
1
how to properly extend s3 data.frames with s4 classes?
...y ------- (a) When extending an S3 data.frame with an S4 class adding a slot, it seems to be impossible to initialize objects of these "ExtendedDataframes" (XDF) with S3 data.frames. (b) Extending data.frames with an S4 class without a slot, i.e. creating a "WrappedDataframe" (WDF), seems to allow initialization with a data.frame, but the behaviour appears to be somewhat inconsistent. (c) Trying to be "smart" by extending the WrappedDataframe from (b) by adding a slot yields behaviour similar to (a), i.e. initialization with a WDF object fails although WDF...
2012 Jun 11
2
Define a variable on a non-standard year interval (Water Years)
Hello, I am trying to define a different interval for a "year". In hydrology, a "water year" is defined as the period between October 1st and September 30 of the following year. I was wondering how I might do this in R. Say I have a data.frame like the following and I want to extract a variable with the water year specs as defined above:
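The water-year definition in this question can be sketched as a small function. The thread asks about R; Python is used here only for illustration. The function name and the convention of labelling a water year by the calendar year in which it ends are assumptions, since the snippet does not say how the years should be labelled.

```python
from datetime import date


def water_year(d, start_month=10):
    """Return the water year that contains date d.

    Assumption: a water year runs from October 1 to September 30 of the
    following year and is labelled by the calendar year in which it ends.
    """
    return d.year + 1 if d.month >= start_month else d.year
```

Under that convention, 1 October 2011 and 30 September 2012 both fall in water year 2012.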
2007 Mar 21
1
scoring question
Hi All, I have just realized that if I set a query like 'green jelly bean', xapian will turn that query into 'green OR jelly OR bean'. This causes documents containing just one of the words to be considered a 100% hit. The behavior I would like to see is that each word gives a 33.3% hit, so that a document containing all 3 words gets placed above a document with only 1 or 2
2023 May 03
1
manual flushing thresholds for deletes?
...rom the document text. Ah, OK. > You can take frequency into account with something like this: > > xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}' > > This will also effectively ignore boolean terms, assuming you're giving > them wdf of 0 (because $3 here is the collection frequency, which is > sum(wdf(term)) over all documents). Should boolean terms be ignored when estimating flushing thresholds? They do have a wdf of 0 in my case. I'm indexing git commit SHA-1 hex (and soon SHA-256), so that's a lot of 40-64 cha...
2013 Mar 08
2
Gsoc-2013
Hi, I am Chinmay Naik, an undergraduate in Computer Science at Bangalore Institute of Technology, Bangalore. I am an experienced programmer and good with C, C++, Python, Java, OpenGL and would love to participate in Gsoc-13. From the ideas listed, I am interested in working on the project "posting list encoding improvements". I am a newbie to Xapian but would like to get involved and get a
2013 Mar 11
1
Implementation of the PL2 weighting scheme of the DFR Framework
...sampling or the risk gain (L) and within document frequency normalization H2(2) (as proposed by Amati in his PhD thesis). The formula for w(t,d) in this scheme is given by: w(t,d) = wqf * L * P, where wqf = within query frequency; L = Laplace law of after-effect sampling = 1 / (wdfn + 1); P = wdfn * log(wdfn / lambda) + (lambda - wdfn) * log(e) + 0.5 * log(2 * pi * wdfn); wdfn = wdf * (1 + c * log(average length of document in database / length of document d)) (H2 normalization); lambda = mean of the Poisson distribution = Collection frequency of the term...
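A rough Python sketch of w(t,d) = wqf * L * P as described in this snippet. Several points are assumptions and should be checked against the original message: logarithms are taken base 2; lambda is taken to be collection frequency divided by the number of documents, because the snippet is cut off at that point; and the length normalization uses the commonly published form wdfn = wdf * log2(1 + c * avg_len / doc_len), which differs from the expression quoted above and keeps wdfn positive for documents longer than average. The function and parameter names are hypothetical.

```python
import math


def pl2_weight(wqf, wdf, doc_len, avg_doc_len, collfreq, num_docs, c=1.0):
    """Sketch of a PL2-style term weight: wqf * L * P.

    Assumptions (not confirmed by the snippet): base-2 logarithms,
    lambda = collfreq / num_docs, and the usual Normalisation 2 form
    for the normalized within-document frequency.
    """
    if wdf <= 0:
        return 0.0
    # Normalized wdf (assumed standard form, see the note above the code).
    wdfn = wdf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Mean of the Poisson distribution (assumed: collfreq / N).
    lam = collfreq / num_docs
    # Laplace law of after-effect sampling.
    after_effect = 1.0 / (wdfn + 1.0)
    # Poisson term, matching P in the quoted formula term by term.
    poisson = (
        wdfn * math.log2(wdfn / lam)
        + (lam - wdfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * wdfn)
    )
    return wqf * after_effect * poisson
```

For a document of average length with wdf = 1 and c = 1, wdfn comes out as exactly 1, which makes the three terms of P easy to check by hand.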
2017 May 22
2
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
...K items=1886756 lastblock=417475 revision=6207 levels=3 root=83720 B-tree checked okay termlist table structure checked OK postlist: baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238 B-tree checked okay termfreq 197211 != # of entries 197210 collfreq 10861536 != sum wdf 10861533 termfreq 14189 != # of entries 14188 collfreq 98354 != sum wdf 98344 termfreq 9866 != # of entries 9865 collfreq 56453 != sum wdf 56443 termfreq 195141 != # of entries 195137 collfreq 8126093 != sum wdf 8126079 postlist table errors found: 8 position: baseB blocksize=8K items=180902610 la...
2005 May 25
1
[Fwd: Re: [Fwd: failure delivery]]
.... So I created some dummy variables by a factor called "events" and (really ugly!!) have TG, TG+1, TG+2, etc. Now I also have DEC1, and the calendar and data are such that in the period I'm forecasting I have TG+3 but this is NOT in the estimation data. There are also weekday factors (wdf) and some cross factors (Saturday + some special days is highly significant). The model is Sales ~ daynumber + wdf*events + wdf*specialevents where daynumber is the day sequence in the year and specialevents is a set of factors to tell when the business has promotional activities. The entire mo...
2023 Mar 27
1
manual flushing thresholds for deletes?
...e boolean terms which didn't come from the document text. You can take frequency into account with something like this: xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}' This will also effectively ignore boolean terms, assuming you're giving them wdf of 0 (because $3 here is the collection frequency, which is sum(wdf(term)) over all documents). > (that awk bit should be overflow-free) I don't see how to do the above as a rolling mean, so to be accurate for a large database it seems with awk you'll need to make two passes over the d...
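The awk one-liner computes a mean term length weighted by collection frequency. A minimal Python sketch of the same calculation follows; it assumes the (term, collection frequency) pairs have already been parsed from the tool's output, and the function name and input shape are hypothetical. Python integers have arbitrary precision, so the two running sums stay exact in a single pass.

```python
def mean_term_length(rows):
    """Mean term length weighted by collection frequency.

    rows -- iterable of (term, collfreq) pairs; assumed to be parsed
            already, the parsing step is not shown here.
    Upper-case ASCII letters are stripped from each term first,
    mirroring `tr -d A-Z` in the quoted pipeline.
    """
    total_chars = 0
    total_occurrences = 0
    for term, collfreq in rows:
        stripped = "".join(ch for ch in term if not ("A" <= ch <= "Z"))
        total_chars += len(stripped) * collfreq
        total_occurrences += collfreq
    if total_occurrences == 0:
        return 0.0
    return total_chars / total_occurrences
```

Terms with a collection frequency of 0, such as boolean terms indexed with a wdf of 0, add nothing to either sum and so drop out of the mean.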
2010 Jan 18
3
postlist: Tag containing meta information is corrupt.
Greetings, Using latest svn. I've noticed the following error when performing index merging: postlist: baseB blocksize=8K items=33962 lastblock=534 revision=1 levels=2 root=459 B-tree checked okay Tag containing meta information is corrupt. postlist table errors found: 1 I can still search on this index (I've only checked very small indexes), but merging is now a problem since I check
2008 May 14
4
GPL PV drivers for Windows - WDM version
I've been busily converting the xenpci and xenvbd drivers from WDF to WDM to resolve a few issues including potential licensing problems with the Microsoft WDF and to (hopefully) allow them to function as boot drivers when doing install and system recovery. It was a fairly major rewrite of xenpci and xenvbd, which are now working (booting and running without cra...
2023 May 03
1
manual flushing thresholds for deletes?
On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote: > Olly Betts <olly at survex.com> wrote: > > This will also effectively ignore boolean terms, assuming you're giving > > them wdf of 0 (because $3 here is the collection frequency, which is > > sum(wdf(term)) over all documents). > > Should boolean terms be ignored when estimating flushing > thresholds? They do have a wdf of 0 in my case. I'm indexing > git commit SHA-1 hex (and soon SHA-256), so that...
2017 May 24
0
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
...issues. > > The output of xapian-check follows. > xapian-check ~/.recoll/xapiandb [...] > postlist: > baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238 > B-tree checked okay > termfreq 197211 != # of entries 197210 > collfreq 10861536 != sum wdf 10861533 > termfreq 14189 != # of entries 14188 > collfreq 98354 != sum wdf 98344 > termfreq 9866 != # of entries 9865 > collfreq 56453 != sum wdf 56443 > termfreq 195141 != # of entries 195137 > collfreq 8126093 != sum wdf 8126079 > postlist table errors found: 8 [...] > To...
2008 May 18
11
Release 0.9.0 of GPL PV Drivers for Windows
I've just put up the latest release of the GPLPV drivers for Windows. This release involved a fairly big rewrite of the stuff that talks to Windows as I changed from WDF to WDM. WDF is a newer framework from Microsoft which makes it easier to write drivers as a lot of the state management stuff is done for you. It also means shipping a great big dll around with the drivers (note the difference in size between this and the previous version), and makes it really hard...
2006 Aug 24
0
[Rd] reshape scaling with large numbers of times/rows
...> > DF <- data.frame(X = gl(m, n), Y = 1:n, Z = letters[1:25]) > > > > system.time({Zn <- as.numeric(DF$Z) > > > + w2 <- xtabs(Zn ~ Y + X, DF) > > > + w2[w2 > 0] <- levels(DF$Z)[w2] > > > + w2[w2 == 0] <- NA > > > + WDF <- data.frame(Y=dimnames(w2)$Y) > > > + for (col in dimnames(w2)$X) { WDF[col]=w2[,col] } > > > + }) > > > [1] 131.888 1.240 135.945 0.000 0.000 > > > > dim(WDF) > > > [1] 70 4501 > > > > > > I'll have to look; mayb...
2014 Mar 11
2
[GSOC 2014] Indexing INEX dataset
...tletor.cc you need to run it once for each db and > in this case all you need to make sure is below line in omindex.cc while > indexing. > > indexer.index_text(title, 1,"S"); On current trunk, we index the title with prefix "S" by default in omindex, though with a wdf inc of 5 rather than 1: indexer.index_text(title, 5, "S"); So I don't think you need that change to omindex now. Cheers, Olly
2016 May 18
0
Weighting recent results
...:35:53PM -0400, Alex Aminoff wrote: > I was thinking about this some more: Is there a reason I can't just > weight by some function of recency at indexing time? > > $weight = get_weight_based_on_recency(...); > $tg->index_text($txt,$weight); The second parameter there is a WDF multiplier, which isn't really "weight". It depends on the weighting formula you're using (and the parameters set for it), but simply scaling up the WDF values for a whole document is likely to be counteracted by the corresponding increase in the document length (since that is SU...
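A small Python sketch of why scaling every WDF in a document tends to cancel out. It assumes the document length is the sum of the document's wdf values and uses a plain length-normalized frequency, wdf / document length, as a stand-in for the length normalization inside a real weighting formula. The helper names and the sample document are hypothetical.

```python
def normalized_wdf(wdfs, term):
    """Length-normalized frequency: wdf(term) / document length.

    Assumption: document length is the sum of all wdf values.
    """
    doc_len = sum(wdfs.values())
    return wdfs[term] / doc_len


def scale_wdfs(wdfs, multiplier):
    """Multiply every wdf in the document by the same factor."""
    return {term: wdf * multiplier for term, wdf in wdfs.items()}


doc = {"xapian": 2, "index": 3, "search": 5}
boosted = scale_wdfs(doc, 5)
```

Here `normalized_wdf(doc, "xapian")` and `normalized_wdf(boosted, "xapian")` give the same value, because the multiplier appears in both the numerator and the document length.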