search for: trigrams

Displaying 5 results from an estimated 5 matches for "trigrams".

2009 Jan 22
4
text vector clustering
Hi, I am a new user of R, using R 2.8.1 on Windows 2003. I have a csv file with a single column which contains 30,000 students' names. There were typo errors while entering these student names. The actual list of names is < 1000. However, we don't have that list for keyword search. I am interested in grouping/clustering these names by letter-to-letter similarity. Are there any
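One common approach to this kind of fuzzy name grouping is character-trigram similarity. A minimal sketch (in Python rather than R, and with illustrative names and an assumed Jaccard threshold of 0.4 — tune it for real data):

```python
# Hedged sketch: group similar names by character-trigram overlap,
# measured with Jaccard similarity. Greedy single-pass clustering:
# each name joins the first cluster whose representative is close enough.

def trigrams(s):
    s = f"  {s.lower()} "                 # pad so short names still yield trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def cluster(names, threshold=0.4):
    clusters = []                          # list of (representative_trigrams, members)
    for name in names:
        grams = trigrams(name)
        for rep, members in clusters:
            if jaccard(grams, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((grams, [name]))
    return [members for _, members in clusters]

print(cluster(["John Smith", "Jon Smith", "Jhon Smith", "Mary Jones"]))
```

Here "John Smith", "Jon Smith", and "Jhon Smith" land in one cluster while "Mary Jones" gets its own; a threshold near 0.4–0.6 usually separates typos from genuinely different names.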
2014 Mar 03
2
Project: Weighting Schemes
Hello Sir, I am Reetesh Ranjan, a 3rd-year undergraduate student at the *INDIAN INSTITUTE OF TECHNOLOGY BHU, Varanasi* — one of the premier engineering colleges of India. I have gone through your webpage thoroughly and I am very interested in the work that you are undertaking on *Project: Weighting Schemes*. I earnestly wish to work under your guidance, and to learn and progress through this experience.
2007 Mar 28
2
Moving indextext.cc into core.
One of the items on the ToDo list for version 1.0 at http://wiki.xapian.org/TodoFor1_2e0#preview is: "Rework Omega's indextext.cc as a xapian-core "TextSplitter" class." I've been wondering about this for a while now. Currently, we have the Query Parser in Xapian core, but no text processing. Clearly, it makes sense to have a "text splitter" class in
2005 Oct 08
1
*wildcard* support?
Hello, First I wanted to say thanks for a great piece of software — thanks Olly and others who've contributed! I know that Xapian supports right-truncation, if that's the proper name for wildcard support, as in a search for "xapia*". I don't believe Xapian supports wildcards on both sides of a term, correct? Is this something that is technically infeasible, unpalatable
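Double-sided wildcards are exactly where trigram indexes are typically used: index every trigram of each term, intersect the posting lists for the pattern's trigrams, then verify candidates with a real substring check. A toy sketch (the term list and padding scheme are illustrative, not Xapian's implementation):

```python
# Hedged sketch: a toy trigram index answering infix wildcard queries
# like *apia* by trigram-posting-list intersection plus a verification pass.
from collections import defaultdict

def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

class TrigramIndex:
    def __init__(self, terms):
        self.terms = list(terms)
        self.postings = defaultdict(set)   # trigram -> set of term ids
        for tid, term in enumerate(self.terms):
            for g in trigrams(term):
                self.postings[g].add(tid)

    def infix_search(self, pattern):
        """Return terms containing `pattern`, as in a *pattern* query."""
        grams = trigrams(pattern)
        if not grams:                      # pattern shorter than 3 chars: scan
            return [t for t in self.terms if pattern in t]
        ids = set.intersection(*(self.postings[g] for g in grams))
        # trigram overlap is necessary but not sufficient, so verify:
        return [self.terms[i] for i in sorted(ids) if pattern in self.terms[i]]

idx = TrigramIndex(["xapian", "utopian", "apiary", "search"])
print(idx.infix_search("apia"))            # -> ['xapian', 'apiary']
```

The intersection prunes almost all terms cheaply; the final substring check removes false positives where the trigrams appear but not contiguously.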
2007 Jun 05
7
Chinese, Japanese, Korean Tokenizer.
Hi, I am looking for a Chinese, Japanese, and Korean tokenizer that can be used to tokenize terms for CJK languages. I am not very familiar with these languages; however, I think these languages can contain one or more words in a single symbol, which makes it more difficult to tokenize them into searchable terms. Lucene has a CJK Tokenizer ... and I am looking around if there is some open source that we
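The usual language-agnostic trick (and what Lucene's CJKTokenizer family does) is to emit overlapping character bigrams for CJK runs, since words aren't space-delimited. A minimal sketch — the Unicode ranges covered and the sample text are illustrative simplifications, not a complete CJK definition:

```python
# Hedged sketch: character-bigram tokenization for CJK runs, with plain
# lowercase word tokens for Latin/digit runs. Covers only two CJK blocks
# for brevity; a real tokenizer handles more ranges (Hangul, extensions).

def is_cjk(ch):
    code = ord(ch)
    return (0x4E00 <= code <= 0x9FFF        # CJK Unified Ideographs
            or 0x3040 <= code <= 0x30FF)    # Hiragana + Katakana

def emit_bigrams(run):
    if len(run) == 1:
        return [run[0]]                     # lone CJK char: keep as unigram
    return [run[i] + run[i + 1] for i in range(len(run) - 1)]

def tokenize(text):
    tokens, word, cjk = [], [], []
    def flush():
        if word:
            tokens.append("".join(word)); word.clear()
        if cjk:
            tokens.extend(emit_bigrams(cjk)); cjk.clear()
    for ch in text:
        if is_cjk(ch):
            if word:
                tokens.append("".join(word)); word.clear()
            cjk.append(ch)
        elif ch.isalnum():
            if cjk:
                tokens.extend(emit_bigrams(cjk)); cjk.clear()
            word.append(ch.lower())
        else:
            flush()
    flush()
    return tokens

print(tokenize("Lucene有中文分词器"))
# -> ['lucene', '有中', '中文', '文分', '分词', '词器']
```

Bigrams over-generate terms compared to a dictionary-based segmenter, but they need no language resources and work uniformly across Chinese and Japanese text.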