search for: ngrams

Displaying 7 results from an estimated 7 matches for "ngrams".

2019 Mar 07
3
Ask for advice on exact requirements to fix #699 mixed CJK numbers
I am working on "#699 Better tokenisation of mixed CJK numbers" and have implemented a partial patch covering Chinese for this ticket. The current code works well on the special test cases, and all tests in xapian-core still pass. But I'm unsure of the exact requirements: how much performance we can afford to pay to handle more cases, and if there are better methods to
2018 Feb 10
1
How to let Xapian support Chinese searching
I installed EPrints, but it cannot search Chinese text. EPrints uses Xapian to index data; how can I make Xapian support Chinese searching? Thanks a lot!
2018 Oct 04
2
Indexing Chinese?
My second (and hopefully last) question: is there any more news on indexing Chinese characters and words? Searching online mostly returns results from a decade ago or more, with nothing very conclusive. How close is this to possible? For the time being I'm doing some pre-processing on long strings of Chinese, breaking on punctuation in order to avoid errors. But I have some large corpora of
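The punctuation-based pre-processing mentioned in this message could look roughly like the sketch below. This is only an illustration: the function name `split_on_punctuation` and the particular set of punctuation characters are assumptions, not taken from the poster's code.

```python
import re

# Assumed set of common CJK and ASCII punctuation plus whitespace;
# a real corpus may need a broader set.
_PUNCT = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_on_punctuation(text):
    """Split text at runs of punctuation/whitespace, dropping empty pieces."""
    return [piece for piece in _PUNCT.split(text) if piece]


segments = split_on_punctuation("我喜欢搜索，也喜欢索引。你呢？")
print(segments)
```

Each resulting segment is a shorter run of Chinese text that can be indexed separately.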
2012 Apr 27
1
Handling Negative value due to logarithm of probabilities.
Hi, In continuation of the discussion in the Melange comments about the negative value returned in the matcher due to the logarithm of probabilities: "If we make K suitably large, we could clamp each log(K.Pi) to be >= 0, and this change will only affect really low probability terms (those with Pi < 1/K, so you can adjust K to suit): W' = sum(i=1,...,n, max(log(K.Pi), 0))" Did you mean for low
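The clamped sum quoted in this message, W' = sum(i=1,...,n, max(log(K.Pi), 0)), can be sketched as follows. The function name `clamped_log_weight` and the sample values are hypothetical, and the sketch assumes every Pi is strictly positive; it is not Xapian's weighting code.

```python
import math


def clamped_log_weight(probabilities, k):
    """Return sum(max(log(k * p), 0)) over the probabilities.

    Assumes every p > 0. Terms with p < 1/k contribute 0
    instead of a negative value.
    """
    return sum(max(math.log(k * p), 0.0) for p in probabilities)


# Hypothetical sample values: the last probability is below 1/K,
# so its log term is negative and gets clamped to 0.
probs = [0.5, 0.01, 0.0001]
K = 1000.0
weight = clamped_log_weight(probs, K)
print(weight)
```

Raising K lowers the threshold 1/K, so fewer terms are clamped.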
2018 Feb 13
2
How to set environment variable XAPIAN_CJK_NGRAM?
...
> On Sat, Feb 10, 2018 at 08:26:52PM +0800, Peter Zhao wrote:
>> I installed Eprints, but it can not search Chinese. EPRINTS use
>> Xapian to index data, how to let Xapian support CHINESE searching?
>
> Current releases support indexing ngrams for CJK text - to enable this
> you need to pass FLAG_CJK_NGRAM to TermGenerator when indexing and to
> QueryParser when searching.
>
> You can also activate this flag without code changes by setting
> environment variable XAPIAN_CJK_NGRAM to a non-empty value (don't forget
> to ex...
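To illustrate what indexing n-grams for CJK text means in general, here is a minimal pure-Python sketch that produces overlapping unigrams and bigrams from a run of CJK characters. It is not Xapian's tokeniser: the function name `cjk_ngrams` and the `max_n` parameter are hypothetical, and the exact terms Xapian generates with FLAG_CJK_NGRAM may differ.

```python
def cjk_ngrams(text, max_n=2):
    """Return all overlapping n-grams of length 1..max_n, in text order."""
    grams = []
    for start in range(len(text)):
        for n in range(1, max_n + 1):
            # Only emit n-grams that fit entirely inside the text.
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


terms = cjk_ngrams("中文搜索")
print(terms)
```

Because the same n-gram splitting has to happen on both sides, the flag is needed at index time and at query time, as the quoted reply says.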
2018 Oct 04
0
Indexing Chinese?
We are using a fork of Xapian for this at the Cyrus IMAP project [1], using the Unicode library word segmentation for Chinese, Japanese and Korean [2]. We have been using it in production at FastMail for about two years and are generally happy with it; the search results improved over using ngrams. There's a pull request open to merge the patch upstream [3], but it is still to be decided how best to add this to Xapian. Currently, the upstream patch doesn't build cleanly on the master branch, but I'll look into making it compile cleanly next week. Cheers, Robert [1] https://github....
2017 Sep 07
0
Revolutions blog: August 2017 roundup
....revolutionanalytics.com) and every month I post a summary of articles from the previous month of particular interest to readers of r-help. In case you missed them, here are some articles related to R from the month of August: Using the featurizeText function in the MicrosoftML package to extract ngrams from unstructured text: http://blog.revolutionanalytics.com/2017/08/text-featurization-microsoftml.html A joyplot visualizes the probabilities associated with phrases like "highly likely" and "little chance" by a sample of 46 Redditors: http://blog.revolutionanalytics.com/2017/0...