thr3ads.net - search: "unigrams"

Displaying 5 results from an estimated 5 matches for "unigrams".

Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.

2012 Apr 15

Hi, I have implemented an initial prototype of the Xapian::Weight subclass for Unigram Language Modelling to support UnigramLM weighting in Xapian. Other changes include adding collection_frequency to the TermFreqs struct to store the collection frequency of terms, some changes to the Xapian framework to support it, and changing simplesearch.cc to search using the UnigramLMWeight class. The following issues have not been
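
To illustrate why a per-term collection frequency is useful for unigram language model weighting, here is a minimal Python sketch. It is not the Xapian patch: the function name, the parameter names, and the choice of linear interpolation (Jelinek-Mercer) smoothing with `lam=0.7` are all assumptions made for illustration, since the post does not say which smoothing the prototype uses.

```python
import math
from collections import Counter


def unigram_lm_score(query_terms, doc_terms, collection_freq, collection_len, lam=0.7):
    """Score a document as the sum of log P(term | document).

    Hypothetical helper: P(term | document) is the document's relative term
    frequency, linearly interpolated with the term's relative frequency in
    the whole collection so that unseen terms do not give probability zero.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_freq.get(term, 0) / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection of two tokenized documents.
docs = [["unigram", "language", "model"], ["memory", "model"]]
collection_freq = Counter(term for doc in docs for term in doc)
collection_len = sum(len(doc) for doc in docs)
```

Every score is a sum of logarithms of probabilities, so it is zero or negative, which is the situation the later thread about negative values addresses.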

Defragmentation of memory

2016 Sep 05

Dear all developers, I'm working with a lot of textual data in R and need to handle it batch by batch. The problem is that I read in batches of 10 000 documents and do some calculations that result in objects that consume quite some memory (calculate unigrams, 2-grams and 3-grams). In every iteration a new object (~500 MB) is created (and I can't control the size, so a new object needs to be created each iteration). The speed of these computations decreases every iteration (first iteration 7 sec, after 30 iterations 20-30 minutes per iteration)...
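
For readers unfamiliar with the computation described, the per-batch n-gram counting can be sketched as follows. This is a Python illustration, not the poster's R code; the function names and whitespace tokenization are assumptions, and the sketch does not address the slowdown the post asks about.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(batch, max_n=3):
    """Count unigrams up to max_n-grams over one batch of documents.

    Returns a dict mapping n to a Counter of n-gram tuples.
    Tokenization by whitespace is a simplifying assumption.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for doc in batch:
        tokens = doc.split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


batch_counts = count_ngrams(["a b a b", "b c"])
```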

GSoC Project Idea Weighting Schemes (Ranking)

2014 Nov 23

Hi, I am Abhishek. Currently Xapian::Weight follows the BM25 scheme; many models, such as the Divergence from Randomness (DfR) family of models, the Unigram Language Model and the Bi-gram Language Model, were implemented two years ago in GSoC 2012 but are not yet merged to master. The new weighting schemes or improvement in implementing the previous models to change the default scheme of BM25 from SMART with

Handling Negative value due to logarithm of probabilities.

2012 Apr 27

Hi, in continuation of the discussion in the Melange comments about the negative value returned in the matcher due to the logarithm of probabilities: *If we make K suitably large, we could clamp each log(K.Pi) to be >= 0, and this change will only affect really low probability terms (those with Pi < 1/K, so you can adjust K to suit):* *W' = sum(i=1,...,n, max(log(K.Pi), 0))* Did you mean for low
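
The quoted formula W' = sum(i=1,...,n, max(log(K.Pi), 0)) can be sketched in Python as below. The function name and the example values are hypothetical; the sketch assumes every Pi is strictly positive, since log(0) is undefined.

```python
import math


def clamped_log_weight(probs, k):
    """Compute W' = sum over i of max(log(k * p_i), 0).

    Scaling each probability by k before taking the logarithm shifts the
    terms upward; any term with p_i < 1/k would still be negative, so it
    is clamped to 0. Assumes every p_i > 0.
    """
    return sum(max(math.log(k * p), 0.0) for p in probs)


# With k = 100, the term 0.001 is below 1/k and contributes 0,
# while 0.5 contributes log(50).
example_weight = clamped_log_weight([0.5, 0.001], 100)
```

A larger K moves the 1/K threshold down, so fewer terms are clamped, at the cost of larger weights overall.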

Chinese, Japanese, Korean Tokenizer.

2007 Jun 05

Hi, I am looking for a Chinese, Japanese and Korean tokenizer that can be used to tokenize terms for CJK languages. I am not very familiar with these languages; however, I think that these languages contain one or more words in one symbol, which makes it more difficult to tokenize into searchable terms. Lucene has CJK Tokenizer ... and I am looking around if there is some open source that we
