On Wed, Jun 27, 2007 at 04:48:34PM +1000, David Morris-Oliveros wrote:
> Now I want to use the WDF to give more weight to terms that have
> appeared on that page throughout the life of the document, as opposed to
> terms that only appeared briefly. I thought of summing all the seconds
> that the term has appeared on that page, and using that as its WDF.
>
> However, this would give me WDFs well into the millions.
You don't seem to actually have asked a question, but I guess you want
to know if this is reasonable, or if it is likely to cause problems.
If you have other terms with more normally sized wdfs, the relative
weightings between the terms will probably not be sensible.
But if all the terms have similarly inflated wdfs, I can't really see
any major problems. The worst I can see is that the document length is
the sum of the wdfs of all the terms, and that is calculated in a 32-bit
unsigned integer - so a document with a few thousand terms each with a
wdf in the millions would approach 2^32 (about 4.3 billion) and risk
overflowing.
However, you could presumably just calculate your "term exposed time"
and then scale it down by a constant factor (1000, say) to give smaller
wdfs and document lengths, which would avoid the risk of overflow and
result in a smaller database (smaller values are stored in fewer bytes).
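
For what it's worth, here's a rough sketch of that scaling approach
using the Python bindings (the exposure_seconds mapping and the
index_page function are just illustrative names, standing in for however
you track per-term display time):

    import xapian

    SCALE = 1000  # constant factor to divide the raw seconds by

    def index_page(db, url, exposure_seconds):
        # exposure_seconds: hypothetical dict mapping term -> total
        # seconds the term has been visible on the page
        doc = xapian.Document()
        doc.set_data(url)
        for term, seconds in exposure_seconds.items():
            # scale down, but keep at least 1 so the term isn't
            # added with a wdf of 0
            wdf = max(1, seconds // SCALE)
            doc.add_term(term, wdf)
        db.add_document(doc)

    db = xapian.WritableDatabase("pages.db", xapian.DB_CREATE_OR_OPEN)
    index_page(db, "http://example.com/", {"xapian": 500000, "wdf": 120000})
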
> Plan B: normalize the time from 1..N where N is the number of terms that
> have ever appeared on the page and then just assign each term its order
> in that range.
I think you'd have to try both approaches to find which (if either!)
gives good results.
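
Plan B might look something like this (again just a sketch, reusing the
hypothetical exposure_seconds mapping from above):

    import xapian

    def index_page_ranked(db, url, exposure_seconds):
        # sort terms by total exposure time, shortest-lived first,
        # and use each term's rank (1..N) as its wdf
        doc = xapian.Document()
        doc.set_data(url)
        ranked = sorted(exposure_seconds, key=exposure_seconds.get)
        for rank, term in enumerate(ranked, 1):
            doc.add_term(term, rank)
        db.add_document(doc)

That keeps the wdfs (and so the document lengths) small however long a
page has existed, at the cost of discarding how big the gaps in exposure
time actually are.
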
Trying to factor in the lifetime of terms is certainly an interesting
idea - I'd love to hear how you get on.
Cheers,
Olly