Displaying 5 results from an estimated 5 matches for "unigrams".
2012 Apr 15
1
Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.
Hi,
I have implemented an initial prototype of the Xapian::Weight subclass for
Unigram Language Modelling to support UnigramLM weighting in Xapian. Other
changes include adding collection_frequency to the TermFreqs struct to store
the collection frequency of terms, some changes to support it in the Xapian
framework, and changing simplesearch.cc to search using the UnigramLMWeight class.
The following issues have not been
2016 Sep 05
1
Defragmentation of memory
Dear all developers,
I'm working with a lot of textual data in R and need to handle it batch
by batch. The problem is that I read in batches of 10 000 documents and do
some calculations that result in objects that consume quite a lot of memory
(calculating unigrams, 2-grams and 3-grams). In every iteration a new object
(~500 MB) is created (and I can't control the size, so a new object needs
to be created each iteration). These computations get slower with
every iteration (first iteration 7 sec, after 30 iterations 20-30 minutes
per iteration)...
2014 Nov 23
2
GSoc Project Idea Weighting Schemes (Ranking)
Hi,
I am Abhishek
Currently Xapian::Weight follows the BM25 scheme; many models, such as the
Divergence from Randomness (DfR) family of models, the Unigram Language Model
and the Bi-gram Language Model, were implemented two years ago in GSoC 2012 but
have not yet been merged to master.
New weighting schemes, or improvements to the implementations of the previous models,
to change the default scheme of BM25 from SMART with
2012 Apr 27
1
Handling Negative value due to logarithm of probabilities.
Hi,
Continuing the discussion from the Melange comments about the negative value
returned in the matcher due to the logarithm of probabilities.
*If we make K suitably large, we could clamp each log(K.Pi) to be >= 0,
and this change will only affect really low probability terms (those with
Pi < 1/K, so you can adjust K to suit):*
*W' = sum(i=1,...,n, max(log(K.Pi), 0))*
Did you mean for low
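
The clamping quoted above can be illustrated with a minimal sketch. It is written in Python rather than Xapian's C++, and the function name clamped_log_weight and the sample probabilities are hypothetical, chosen only for illustration; nothing here is part of the Xapian API. Non-positive probabilities are assumed to contribute 0, which the quoted formula does not specify.

```python
import math


def clamped_log_weight(probabilities, k):
    """Return W' = sum(max(log(k * p_i), 0)) over the given term probabilities.

    Terms with p_i < 1/k contribute exactly 0 instead of a negative value.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    total = 0.0
    for p in probabilities:
        if p <= 0:
            # Assumption: log is undefined here, so treat the term as clamped.
            continue
        total += max(math.log(k * p), 0.0)
    return total


# Hypothetical per-term probabilities for one document.
probs = [0.5, 0.02, 0.001]

# With k = 100, only the term with p < 1/100 is clamped to 0.
weight = clamped_log_weight(probs, 100)
```

The result is never negative, and raising K lowers the threshold 1/K below which a term stops contributing.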
2007 Jun 05
7
Chinese, Japanese, Korean Tokenizer.
Hi,
I am looking for a Chinese, Japanese and Korean tokenizer that can
be used to tokenize terms for CJK languages. I am not very familiar
with these languages; however, I think that these languages contain one
or more words in one symbol, which makes it more difficult to tokenize
into searchable terms.
Lucene has CJK Tokenizer ... and I am looking around to see if there is some
open source that we