2012 Apr 15
Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.
Hi, I have implemented an initial prototype of the Xapian::Weight subclass for Unigram Language Modelling to support UnigramLM weighting in Xapian. Other changes include adding collection_frequency to the TermFreqs struct to store the collection frequency of terms, some changes to the Xapian framework to support it, and changing search to use the UnigramLMWeight class. Following issues h...
2016 Sep 05
Defragmentation of memory
Dear all developers, I'm working with a lot of textual data in R and need to handle it batch by batch. The problem is that I read in batches of 10,000 documents and do some calculations that result in objects that consume quite a lot of memory (calculating unigrams, 2-grams and 3-grams). In every iteration a new object (~500 MB) is created (and I can't control the size, so a new object needs to be created each iteration). The speed of these computations decreases every iteration (first iteration 7 sec, after 30 iterations 20-30 minutes per iteration...
2014 Nov 23
GSoC Project Idea Weighting Schemes (Ranking)
Hi, I am Abhishek. Currently Xapian::Weight follows the BM25 scheme; many models, such as the Divergence from Randomness (DfR) family of models, the Unigram Language Model and the Bi-gram Language Model, were implemented two years ago in GSoC 2012 yet have not been merged to master. The new weighting schemes or improvements in implementing the previous models to change the default scheme of BM25 from SMART with reference to this paper
2012 Apr 27
Handling Negative value due to logarithm of probabilities.
...else log(K.Pi) ) In case both don't work, returning 0 would be the only option. Moreover, selecting a large enough K would be a tricky task, as no K would be large enough since log(x) -> -inf as x -> 0. Should we approach selecting the value of K statistically, by which I mean running the unigram weighting scheme on a large collection and observing the lowest probability that can be found, and hence approximating the value of K, or use some other method? I asked the same question on Stack Overflow. They suggested: *"Could you simple take the negative of the logar...
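The scaling-and-clamping idea in the excerpt above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch or from Xapian: the function name `clamped_log_weight` and the default value of `k` are hypothetical, and the thread itself points out that no fixed K is large enough for every probability.

```python
import math

def clamped_log_weight(p, k=1e6):
    """Return log(k * p), floored at 0 so the weight is never negative.

    Hypothetical helper: `k` is an assumed scaling constant, not a value
    taken from the thread.
    """
    if p <= 0.0:
        # log(x) -> -inf as x -> 0, so fall back to 0 here.
        return 0.0
    # For p < 1/k the logarithm is negative; clamp it to 0.
    return max(0.0, math.log(k * p))
```

With these assumptions, any probability below 1/k contributes a weight of 0, which is exactly the information loss the excerpt worries about when K is chosen too small.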
2007 Jun 05
Chinese, Japanese, Korean Tokenizer.
Hi, I am looking for a Chinese, Japanese and Korean tokenizer that can be used to tokenize terms for CJK languages. I am not very familiar with these languages; however, I think that these languages contain one or more words in one symbol, which makes it more difficult to tokenize text into searchable terms. Lucene has CJK Tokenizer ... and I am looking around if there is some open source that we