Displaying 5 results from an estimated 5 matches for "unigram".
2012 Apr 15
1
Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.
Hi,
I have implemented an initial prototype of a Xapian::Weight subclass for
Unigram Language Modelling to support UnigramLM weighting in Xapian. Other
changes include adding collection_frequency to the TermFreqs struct to store
the collection frequency of terms, some changes to support it in the Xapian
framework, and changing simplesearch.cc to search using the UnigramLMWeight class.
Following issues h...
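For orientation, here is a minimal sketch of how a unigram language model
score can use a term's collection frequency. It is not the patch's code and
not the Xapian API: the snippet does not say which smoothing the prototype
uses, so Jelinek-Mercer smoothing, the function name unigram_lm_log_score and
the parameter lam are all assumptions made for illustration.

```python
import math

def unigram_lm_log_score(query_terms, doc_tf, doc_len, coll_freq, coll_len, lam=0.5):
    """Log-probability of the query under a smoothed unigram document model.

    Hypothetical helper (not part of Xapian). Assumes Jelinek-Mercer smoothing:
        p(t|d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|
    where cf(t) is the collection frequency of t and |C| the collection length.
    """
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = coll_freq.get(term, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the log is undefined.
            return float("-inf")
        score += math.log(p)
    return score
```

Because every factor is a probability, the summed log score is never positive,
which matters for any framework that expects non-negative weights.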
2016 Sep 05
1
Defragmentation of memory
Dear all developers,
I'm working with a lot of textual data in R and need to handle it batch
by batch. The problem is that I read in batches of 10 000 documents and do
some calculations that result in objects that consume quite some memory
(calculate unigrams, 2-grams and 3-grams). In every iteration a new object
(~ 500 MB) is created (and I can't control the size, so a new object needs
to be created each iteration). The speed of these computations decreases
every iteration (first iteration 7 sec, after 30 iterations 20-30 minutes
per iteration...
2014 Nov 23
2
GSoC Project Idea Weighting Schemes (Ranking)
Hi,
I am Abhishek.
Currently Xapian::Weight follows the BM25 scheme. Many models, such as the
Divergence from Randomness (DfR) family of models, the Unigram Language Model
and the Bi-gram Language Model, were implemented two years ago in GSoC 2012
but have not yet been merged to master.
The idea is new weighting schemes, or improvements to the implementation of
the previous models, to change the default scheme of BM25 from SMART with
reference to this paper www.aclweb.org/anthology/P10-1...
2012 Apr 27
1
Handling Negative value due to logarithm of probabilities.
...else
log(K.Pi)
)
In case neither works, returning 0 would be the only option.
Moreover, selecting a large enough K would be a tricky task, as no K would
be large enough, since log(x) -> -inf as x -> 0.
Should we approach selecting the value of K statistically? I mean running
the unigram weighting scheme on a large collection, observing the lowest
probability that can be found, and hence approximating the value of K. Or
is there any other method?
I asked the same question on Stack Overflow.
http://goo.gl/ykwN4
They suggested:
*"Could you simple take the negative of the logar...
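The scaling idea discussed in this post can be sketched as follows. This is
only an illustration under stated assumptions: the function name
clamped_log_weight is hypothetical, and falling back to 0 whenever
log(K * p) would be negative is one reading of the "return 0" option
mentioned above, not code from the thread.

```python
import math

def clamped_log_weight(p, k):
    """Return log(k * p), clamped at 0 so the weight is never negative.

    p: term probability under the unigram model (0 <= p <= 1)
    k: scaling constant (hypothetical; the post leaves its value open)
    """
    if p <= 0.0:
        # log(0) is undefined, so fall back to 0.
        return 0.0
    return max(0.0, math.log(k * p))
```

Any probability below 1/k still ends up clamped to 0, which is the post's
point that no fixed K is large enough for every collection.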
2007 Jun 05
7
Chinese, Japanese, Korean Tokenizer.
Hi,
I am looking for a Chinese, Japanese and Korean tokenizer that can
be used to tokenize terms for CJK languages. I am not very familiar
with these languages; however, I think that these languages contain one
or more words in one symbol, which makes it more difficult to tokenize
into searchable terms.
Lucene has CJK Tokenizer ... and I am looking around if there is some
open source that we