Displaying 20 results from an estimated 69 matches for "wdf".
2013 Feb 19
2
Implementing tf-idf weighting scheme in Xapian
...rms which occur in a few
documents) should be able to give a higher weight to the documents they
index compared to terms which occur in many documents. Also, the higher the
within-document frequency in the document, the more weight is given by the
term to the document.
The basic formula is W(t,d) = wdf * log(N/termfreq).
However, various normalizations can be applied to both wdf and idf.
The extra per-document component will be 0 here, so get_maxextra() will
return 0.
Moreover, an upper bound on W(t,d) for get_maxpart() can be found
easily for a particular normalization (if I have al...
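As a rough illustration of the basic formula in the excerpt above, the following Python sketch computes W(t,d) = wdf * log(N/termfreq) with no normalization. The function names, and the idea of passing in a known upper bound on wdf, are assumptions made for illustration only; this is not Xapian's Weight API.

```python
import math


def tfidf_weight(wdf: int, termfreq: int, num_docs: int) -> float:
    """Unnormalized tf-idf weight: wdf * log(N / termfreq)."""
    if wdf < 0:
        raise ValueError("wdf must be non-negative")
    if not 0 < termfreq <= num_docs:
        raise ValueError("termfreq must be between 1 and num_docs")
    return wdf * math.log(num_docs / termfreq)


def tfidf_upper_bound(max_wdf: int, termfreq: int, num_docs: int) -> float:
    """Upper bound on the weight, given a (hypothetical) bound on wdf.

    The weight grows monotonically with wdf, so plugging in the largest
    possible wdf bounds the contribution of the term.
    """
    return tfidf_weight(max_wdf, termfreq, num_docs)
```

A term that occurs in every document gets weight 0 under this formula, since log(N/N) = 0.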
2007 Jan 24
1
how to properly extend s3 data.frames with s4 classes?
...y
-------
(a) When extending an S3 data.frame with an S4 class adding a slot, it
seems to be impossible to initialize objects of these
"ExtendedDataframes" (XDF) with S3 data.frames.
(b) Extending data.frames with an S4 class without a slot, i.e. creating
a "WrappedDataframe" (WDF), seems to allow initialization with a
data.frame, but the behaviour appears to be somewhat inconsistent.
(c) Trying to be "smart" by extending the WrappedDataframe from (b) by
adding a slot yields behaviour similar to (a), i.e. initialization
with a WDF object fails although WDF...
2012 Jun 11
2
Define a variable on a non-standard year interval (Water Years)
Hello,
I am trying to define a different interval for a "year". In hydrology,
a "water year" is defined as the period between October 1st and
September 30 of the following year. I was wondering how I might do
this in R. Say I have a data.frame like the following and I want to
extract a variable with the water year specs as defined above:
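The water-year rule described in the excerpt above can be sketched in Python as follows. The poster asks about R, so this only illustrates the date logic. Labelling a water year by the calendar year in which it ends is an assumption (a common hydrological convention), and the function name is hypothetical.

```python
from datetime import date


def water_year(d: date) -> int:
    """Return the water year containing date d.

    Assumption: a water year runs from October 1 to September 30 and is
    labelled by the calendar year in which it ends.
    """
    return d.year + 1 if d.month >= 10 else d.year
```

For example, October 1, 2011 and September 30, 2012 both fall in water year 2012 under this convention.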
2007 Mar 21
1
scoring question
Hi All
I have just realized that if I set a query like
'green jelly bean'
xapian will turn that query into
'green OR jelly OR bean'
This causes documents containing just one of the words to be considered
a 100% hit.
The behavior I would like to see is that each word gives a 33.3% hit, so
that a document containing all 3 words gets placed above a document with
only 1 or 2
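The behaviour requested in the excerpt above (each query word contributing an equal share of the score) amounts to scoring by the fraction of query terms matched. A minimal Python sketch of that idea, using plain whitespace tokenization and a hypothetical function name, and not Xapian's own scoring:

```python
def match_fraction(query: str, document: str) -> float:
    """Fraction of distinct query words that appear in the document."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms)
```

With the query 'green jelly bean', a document containing all three words scores 1.0, and one containing only two of them scores about 0.667.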
2023 May 03
1
manual flushing thresholds for deletes?
...rom the document text.
Ah, OK.
> You can take frequency into account with something like this:
>
> xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}'
>
> This will also effectively ignore boolean terms, assuming you're giving
> them wdf of 0 (because $3 here is the collection frequency, which is
> sum(wdf(term)) over all documents).
Should boolean terms be ignored when estimating flushing
thresholds? They do have a wdf of 0 in my case. I'm indexing
git commit SHA-1 hex (and soon SHA-256), so that's a lot of
40-64 cha...
2013 Mar 08
2
Gsoc-2013
Hi,
I am Chinmay Naik, an undergraduate in Computer Science at Bangalore
Institute of Technology, Bangalore.
I am an experienced programmer, good with C, C++, Python, Java and OpenGL, and
would love to participate in GSoC-13.
From the ideas listed, I am interested in working on the project "posting list
encoding improvements".
I am a newbie to Xapian but would like to get involved and get a
2013 Mar 11
1
Implementation of the PL2 weighting scheme of the DFR Framework
...sampling or the risk gain (L) and within document frequency normalization
H2(2) (as proposed by Amati in his PhD thesis).
The formula for w(t,d) in this scheme is given by:
w(t,d) = wqf * L * P
where
wqf = within query frequency
L = Laplace law of after-effect sampling = 1 / (wdfn + 1)
P = wdfn * log(wdfn / lambda) + (lambda - wdfn) * log(e) + 0.5 * log(2 * pi * wdfn)
wdfn = wdf * (1 + c * log(average length of document in database / length of document d)) (H2 normalization)
lambda = mean of the Poisson distribution = collection frequency of
the term...
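As a sketch of how the quoted w(t,d) = wqf * L * P could be evaluated, the Python function below takes the normalized frequency wdfn and the Poisson mean lambda as inputs, because the excerpt's definition of lambda is cut off. Using base-2 logarithms is an assumption (the explicit log(e) factor suggests a non-natural base), and the function name is hypothetical rather than part of Xapian.

```python
import math


def pl2_weight(wqf: float, wdfn: float, lam: float) -> float:
    """Evaluate w(t,d) = wqf * L * P from the quoted PL2 formula.

    wdfn: normalized within-document frequency (must be > 0).
    lam:  mean of the Poisson distribution (must be > 0), supplied by
          the caller because its definition is truncated in the excerpt.
    Assumption: all logarithms are base 2.
    """
    if wdfn <= 0 or lam <= 0:
        raise ValueError("wdfn and lam must be positive")
    laplace = 1.0 / (wdfn + 1.0)  # L in the excerpt
    poisson = (
        wdfn * math.log2(wdfn / lam)
        + (lam - wdfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * wdfn)
    )  # P in the excerpt
    return wqf * laplace * poisson
```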
2017 May 22
2
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
...K items=1886756 lastblock=417475 revision=6207 levels=3 root=83720
B-tree checked okay
termlist table structure checked OK
postlist:
baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
B-tree checked okay
termfreq 197211 != # of entries 197210
collfreq 10861536 != sum wdf 10861533
termfreq 14189 != # of entries 14188
collfreq 98354 != sum wdf 98344
termfreq 9866 != # of entries 9865
collfreq 56453 != sum wdf 56443
termfreq 195141 != # of entries 195137
collfreq 8126093 != sum wdf 8126079
postlist table errors found: 8
position:
baseB blocksize=8K items=180902610 la...
2005 May 25
1
[Fwd: Re: [Fwd: failure delivery]]
.... So I created some dummy
variables by a factor called "events" and (really ugly!!) have TG, TG+1,
TG+2, etc. Now I also have DEC1, and the calendar and data are such
that in the period I'm forecasting I have TG+3 but this is
NOT in the estimation data. There are also weekday factors (wdf) and some
cross factors (Saturday + some special days is highly significant).
The model is Sales ~ daynumber + wdf*events + wdf*specialevents
where daynumber is the day sequence in the year and specialevents is a
set of factors to tell when the business has promotional activities.
The entire mo...
2023 Mar 27
1
manual flushing thresholds for deletes?
...e
boolean terms which didn't come from the document text.
You can take frequency into account with something like this:
xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}'
This will also effectively ignore boolean terms, assuming you're giving
them wdf of 0 (because $3 here is the collection frequency, which is
sum(wdf(term)) over all documents).
> (that awk bit should be overflow-free)
I don't see how to do the above as a rolling mean, so to be accurate for
a large database it seems with awk you'll need to make two passes over
the d...
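For readers less familiar with awk, the frequency-weighted mean term length computed by the one-liner quoted above can be sketched in Python. This assumes whitespace-separated input lines in which the first field is the term and the third is its collection frequency, as the excerpt describes; the function name is hypothetical, and len() counts characters whereas awk's length may count bytes depending on locale.

```python
from typing import Iterable

# Same effect as `tr -d A-Z`: delete ASCII uppercase letters.
_DELETE_UPPER = str.maketrans("", "", "ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def mean_term_length(lines: Iterable[str]) -> float:
    """Mean term length weighted by collection frequency (field 3)."""
    total_length = 0  # sum of len(term) * collfreq
    total_freq = 0    # sum of collfreq
    for line in lines:
        fields = line.translate(_DELETE_UPPER).split()
        if len(fields) < 3:
            continue  # not a term line
        try:
            collfreq = int(fields[2])
        except ValueError:
            continue  # third field is not a number
        total_length += len(fields[0]) * collfreq
        total_freq += collfreq
    if total_freq == 0:
        raise ValueError("no terms with non-zero collection frequency")
    return total_length / total_freq
```

Terms with a collection frequency of 0 add nothing to either sum, which is how boolean terms indexed with a wdf of 0 drop out. Python integers do not overflow, so the two running sums stay exact.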
2010 Jan 18
3
postlist: Tag containing meta information is corrupt.
Greetings,
Using latest svn.
I've noticed the following error when performing index merging:
postlist:
baseB blocksize=8K items=33962 lastblock=534 revision=1 levels=2 root=459
B-tree checked okay
Tag containing meta information is corrupt.
postlist table errors found: 1
I can still search on this index (I've only checked very small indexes),
but merging is now a problem since I check
2008 May 14
4
GPL PV drivers for Windows - WDM version
I've been busily converting the xenpci and xenvbd drivers from WDF to WDM
to resolve a few issues including potential licensing problems with the
Microsoft WDF and to (hopefully) allow them to function as boot drivers
when doing install and system recovery.
It was a fairly major rewrite of xenpci, and xenvbd, which are now
working (booting and running without cra...
2023 May 03
1
manual flushing thresholds for deletes?
On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > This will also effectively ignore boolean terms, assuming you're giving
> > them wdf of 0 (because $3 here is the collection frequency, which is
> > sum(wdf(term)) over all documents).
>
> Should boolean terms be ignored when estimating flushing
> thresholds? They do have a wdf of 0 in my case. I'm indexing
> git commit SHA-1 hex (and soon SHA-256), so that...
2017 May 24
0
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
...issues.
>
> The output of xapian-check follows.
> xapian-check ~/.recoll/xapiandb
[...]
> postlist:
> baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
> B-tree checked okay
> termfreq 197211 != # of entries 197210
> collfreq 10861536 != sum wdf 10861533
> termfreq 14189 != # of entries 14188
> collfreq 98354 != sum wdf 98344
> termfreq 9866 != # of entries 9865
> collfreq 56453 != sum wdf 56443
> termfreq 195141 != # of entries 195137
> collfreq 8126093 != sum wdf 8126079
> postlist table errors found: 8
[...]
> To...
2008 May 18
11
Release 0.9.0 of GPL PV Drivers for Windows
I've just put up the latest release of the GPLPV drivers for Windows.
This release involved a fairly big rewrite of the stuff that talks to
Windows as I changed from WDF to WDM. WDF is a newer framework from
Microsoft which makes it easier to write drivers as a lot of the state
management stuff is done for you. It also means shipping a great big dll
around with the drivers (note the difference in size between this and
the previous version), and makes it really hard...
2006 Aug 24
0
[Rd] reshape scaling with large numbers of times/rows
...> > DF <- data.frame(X = gl(m, n), Y = 1:n, Z = letters[1:25])
> > > > system.time({Zn <- as.numeric(DF$Z)
> > > + w2 <- xtabs(Zn ~ Y + X, DF)
> > > + w2[w2 > 0] <- levels(DF$Z)[w2]
> > > + w2[w2 == 0] <- NA
> > > + WDF <- data.frame(Y=dimnames(w2)$Y)
> > > + for (col in dimnames(w2)$X) { WDF[col]=w2[,col] }
> > > + })
> > > [1] 131.888 1.240 135.945 0.000 0.000
> > > > dim(WDF)
> > > [1] 70 4501
> > >
> > > I'll have to look; mayb...
2014 Mar 11
2
[GSOC 2014] Indexing INEX dataset
...tletor.cc you need to run it once for each db and
> in this case all you need to make sure is below line in omindex.cc while
> indexing.
>
> indexer.index_text(title, 1,"S");
On current trunk, we index the title with prefix "S" by default in
omindex, though with a wdf inc of 5 rather than 1:
indexer.index_text(title, 5, "S");
So I don't think you need that change to omindex now.
Cheers,
Olly
2016 May 18
0
Weighting recent results
...:35:53PM -0400, Alex Aminoff wrote:
> I was thinking about this some more: Is there a reason I can't just
> weight by some function of recency at indexing time?
>
> $weight = get_weight_based_on_recency(...);
> $tg->index_text($txt,$weight);
The second parameter there is a WDF multiplier, which isn't really
"weight". It depends on the weighting formula you're using (and the
parameters set for it), but simply scaling up the WDF values for a whole
document is likely to be counteracted by the corresponding increase in
the document length (since that is SU...