search for: ngram

Displaying 7 results from an estimated 7 matches for "ngram".

2019 Mar 07
3
Ask for advice on exact requirements to fix #699 mixed CJK numbers
...me, I would fix the code in the next commit.) If it's better to create a pull request, please tell me. (Below is my explanation of the code, in case it is not clear to read.) The current code only supports the cases where mixed Chinese numbers are embedded in the CJK characters that are sent to CJKNgramIterator. It extracts the whole number as one token instead of as 1-grams. The code was added in the operator++ of CJKNgramIterator in cjk-tokenizer.cc, to minimize both the modification to existing code and the harm to modularity. The current implementation would pass the test cases...
2018 Feb 10
1
How to let Xapian support Chinese searching
I installed EPrints, but it cannot search Chinese. EPrints uses Xapian to index data; how can I make Xapian support Chinese searching? Thanks a lot!
2018 Oct 04
2
Indexing Chinese?
My second (and hopefully last) question: is there any more news on indexing Chinese characters and words? Searching online mostly returns results from a decade ago or more, with nothing very conclusive. How close is this to possible? For the time being I'm doing some pre-processing on long strings of Chinese, breaking on punctuation in order to avoid errors. But I have some large corpora of
2012 Apr 27
1
Handling Negative value due to logarithm of probabilities.
Hi, In continuation of the discussion in the Melange comments about the negative value returned in the matcher due to the logarithm of probabilities: "If we make K suitably large, we could clamp each log(K.Pi) to be >= 0, and this change will only affect really low probability terms (those with Pi < 1/K, so you can adjust K to suit): W' = sum(i=1,...,n, max(log(K.Pi), 0))" Did you mean for low
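The clamped sum quoted in this snippet, W' = sum(i=1,...,n, max(log(K.Pi), 0)), can be illustrated with a minimal Python sketch. The function name, the sample probabilities, and the value of K are all made up for illustration; this is not Xapian's matcher code.

```python
import math


def clamped_log_weight(probabilities, k):
    """Compute W' = sum_i max(log(k * p_i), 0) for strictly positive p_i.

    Hypothetical helper: it only illustrates the formula quoted in the
    message and is not taken from Xapian.
    """
    return sum(max(math.log(k * p), 0.0) for p in probabilities)


# Made-up per-term probabilities; the last one is below 1/K.
probs = [0.2, 0.05, 0.0001]
k = 1000.0

# Without clamping, the low-probability term contributes a negative value.
unclamped = sum(math.log(k * p) for p in probs)

# With clamping, only terms with p < 1/k are affected (they contribute 0).
clamped = clamped_log_weight(probs, k)
```

With these sample values only the third term (0.0001 < 1/1000) is clamped, so the clamped sum exceeds the unclamped one and no individual term is negative.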
2018 Feb 13
2
How to set environment variable XAPIAN_CJK_NGRAM?
Olly, Thanks a lot! I installed Xapian 1.2.25 on Ubuntu 14.04. How to set environment variable XAPIAN_CJK_NGRAM? I'm a newbie to Xapian. Best wishes, Peter At 2018-02-12 20:00:02, xapian-discuss-request at lists.xapian.org wrote: >Send Xapian-discuss mailing list submissions to > xapian-discuss at lists.xapian.org > >To subscribe or unsubscribe via the World Wide Web, visit > htt...
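One possible way to set the variable asked about here, sketched in Python: pass XAPIAN_CJK_NGRAM in the environment of the process that does the indexing or searching. This assumes that any non-empty value enables the n-gram tokenizer, which should be checked against the documentation for the installed Xapian version. The child command below is a stand-in that only echoes the variable; it is not a real indexer.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable.
# Assumption: any non-empty value ("1" here) is enough to enable it.
env = dict(os.environ, XAPIAN_CJK_NGRAM="1")

# Stand-in child process (hypothetical): a real setup would launch the
# indexer or search front end here instead of this one-line echo.
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('XAPIAN_CJK_NGRAM', ''))"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)

# The value the child process actually saw.
seen_by_child = result.stdout.strip()
```

The variable has to be present in the environment of every process that indexes or searches, so a setup that indexes from one service and searches from another would need it set in both.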
2018 Oct 04
0
Indexing Chinese?
We are using a fork of Xapian for this at the Cyrus IMAP project [1], using the Unicode library word segmentation for Chinese, Japanese and Korean [2]. We have been using it in production at FastMail for about 2 years and are generally happy with it; the search results improved over using ngrams. There's a pull request open to merge the patch upstream [3], but it's to be decided how best to add this to Xapian. Currently, the upstream patch doesn't build cleanly on the master branch, but I'll look into making it compile cleanly next week. Cheers, Robert [1] https://github...
2017 Sep 07
0
Revolutions blog: August 2017 roundup
....revolutionanalytics.com) and every month I post a summary of articles from the previous month of particular interest to readers of r-help. In case you missed them, here are some articles related to R from the month of August: Using the featurizeText function in the MicrosoftML package to extract ngrams from unstructured text: http://blog.revolutionanalytics.com/2017/08/text-featurization-microsoftml.html A joyplot visualizes the probabilities associated with phrases like "highly likely" and "little chance" by a sample of 46 Redditors: http://blog.revolutionanalytics.com/2017/...