thr3ads.net - search: "bigrams"

Displaying 12 results from an estimated 12 matches for "bigrams".


Adding Bi-gram in the QueryParser and Object.

2012 Jun 29


...Query((failed at 1 NEAR 11 assertion at 2) OR failed assertion at 3) *Implementation:* Since all the terms detected as near are added to the class *Terms*, when we ask for queries from the class *Terms* using as_near_query, as_adj_query, or as_opwindow_query, we can just add the bigrams while iterating over the list of terms during parsing. *Adj:* exactly similar to *NEAR* (bigram can be added). *Phrase:* terms given in quotes. Since these are terms the user wants to have together, a bigram can be added. Implementation is similar to NEAR and ADJ. *Phrased:* Single term which is actually t...

SVD for reducing dimensions

2002 Nov 17


...n La.svd(x) : argument to La.svd must be numeric or complex
> xs <- svd(x)
> ncol(xs$v)
[1] 500
> nrow(xs$v)
[1] 500
> nrow(xs$u)
[1] 500
> ncol(xs$u)
[1] 500

Also, how should I locate the million or so less common words in the space generated by this? Running svd on the full bigrams sounds infeasible; it would be a 200GB matrix, for a start. Really I just want to 'predict' their location rather than build the classifier with a larger set. Thank you for your time, Corrin Lakeland

[LLVMdev] Spell Correction Efficiency

2011 Jan 15


...ch was used in Clang, notably for auto-completion, based on the
>> Levenshtein distance.
>>
>> It turns out I just came upon this link today:
>> http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html
>>
>> The idea is to use bigrams (two-letter parts of a word) to build an index
>> of the form (bigram > all words containing this bigram), then use this index
>> to retrieve all the words with enough bigrams in common with the word you
>> are currently trying to approximate.
>>
>> This drastically...
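The bigram index described in this snippet can be sketched roughly as follows. This is a minimal illustration in Python, not the code discussed on the list: the function names, the sample vocabulary, and the `min_shared` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def char_bigrams(word):
    """Return the set of 2-letter substrings of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def build_index(vocabulary):
    """Map each bigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in char_bigrams(word):
            index[gram].add(word)
    return index

def candidates(index, query, min_shared=2):
    """Return the words sharing at least min_shared bigrams with query."""
    counts = defaultdict(int)
    for gram in char_bigrams(query):
        for word in index.get(gram, ()):
            counts[word] += 1
    return sorted(w for w, n in counts.items() if n >= min_shared)

# Hypothetical vocabulary, for illustration only.
vocabulary = ["vector", "victor", "value", "velocity"]
index = build_index(vocabulary)
print(candidates(index, "vectr"))  # prints ['vector']
```

Only the words returned by `candidates` would then need a full edit-distance comparison, which is where the saving over scanning the whole vocabulary would come from.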

GSOC 2016 project on Ranking

2016 Mar 04


Hello Sir, I am a third-year student at the Department of Mathematics at IIT Kharagpur. I have good experience in Information Retrieval and Machine Learning. I have read many chapters of the book Introduction to Information Retrieval. Recently I have been working on a project on tagging questions on a Q&A forum by ranking the tags and using probabilistic inference. I also have software development

Proposal for Integration of Bi-gram in Xapian Architecture

2012 Jun 03


Hi, I have made a proposal for changes to integrate bi-grams into the Xapian architecture on the wiki page. Bigram Integration Proposal: http://trac.xapian.org/wiki/GSoC2012/Bi-gram%20Language%20Modeling/Bi-gram%20Integration%20Proposal Since bi-gram integration will make some difference to how data is accessed from the back-end, it is better to get a review from the whole community. Moreover, I also have some

Xapian 1.3.5 snapshot performance and index size

2016 Apr 12


...te:
> This way, "to be or not to be" gets from 11 s to 0.6 s, and "to be of
> the" gets from 12 s to 0.9 s. Which is of course brilliant!
>
> I think that I can dump my plan of indexing compound terms for runs of
> common words :)

We had been experimenting with bigrams to accelerate phrases, and not having to go that route was one motivation for the key order change. The bigram terms would add significantly to DB size, and to cache pressure.

> > I'm not sure there's an easy solution to the position table not coming
> > out compact in this c...
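To illustrate the trade-off mentioned here, the following is a rough Python sketch of indexing word bigrams as extra terms. It is not Xapian's implementation or API: the in-memory index, the function names, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def index_document(index, doc_id, text):
    """Add each word and each adjacent word pair of text to the index."""
    words = text.lower().split()
    for word in words:
        index[word].add(doc_id)
    for first, second in zip(words, words[1:]):
        index[first + " " + second].add(doc_id)

def phrase_candidates(index, phrase):
    """Return documents containing every adjacent word pair of the phrase."""
    words = phrase.lower().split()
    # A one-word phrase has no bigrams, so fall back to the word itself.
    terms = [a + " " + b for a, b in zip(words, words[1:])] or words
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents, for illustration only.
docs = {1: "to be or not to be", 2: "not to be of the opinion", 3: "be or to"}
index = defaultdict(set)
for doc_id, text in docs.items():
    index_document(index, doc_id, text)

single_terms = sum(1 for term in index if " " not in term)
print(phrase_candidates(index, "to be or"))  # prints {1}
print(single_terms, len(index))              # prints 7 15
```

A two-word phrase is answered from a single posting list with no position data. For longer phrases the intersection only yields candidates, since the bigrams may occur at unrelated places in a document, so positions would still have to be checked. The term count in this toy example roughly doubles, which is the kind of index growth the snippet is concerned about.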

Index indexed words

2010 Jan 18


Hello, We would like to create Google- or Firefox-like "search hints". If someone types "abc", the search system should suggest some possible hints. I think Firefox does it by indexing 3-character parts of the domain name. If you enter parts, you get some hints. Thank you very much, Marcus

GSoc 2017 Introduction(Weighting Schemes)

2017 Mar 05


Hello Everyone, I am a second-year graduate student at IIIT-Bangalore and my interest is in the field of Information Retrieval. I have successfully compiled Xapian from source and have implemented some examples. While going through the project list, the Weighting Schemes project is the one I was looking to contribute to. So I went through the xapian-core/weight where most of the schemes are already

*wildcard* support?

2005 Oct 08


Hello, First I wanted to say thanks for a great piece of software, thanks Olly and others who've contributed! I know that Xapian supports right-truncating, if that's the proper name for wildcard support, as in a search for "xapia*". I don't believe Xapian supports wildcards on both sides of a term, correct? Is this something that is technically unfeasible, unpalatable

Chinese, Japanese, Korean Tokenizer.

2007 Jun 05


Hi, I am looking for a Chinese, Japanese and Korean tokenizer that can be used to tokenize terms for CJK languages. I am not very familiar with these languages; however, I think that they can contain one or more words in one symbol, which makes it more difficult to tokenize them into searchable terms. Lucene has CJK Tokenizer ... and I am looking around if there is some open source that we

Xapian 1.3.5 snapshot performance and index size

2016 Apr 11


Olly Betts writes:
> On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> > Some might notice the 50% index size increase. Excessive index size is
> > already one relatively rare, but recurring complaint. Except if I did
> > something wrong: I'm actually quite surprised by it.
>
> Did you try compacting the resulting databases?

using package tm to find phrases

2009 Aug 13


I am using the package "tm" for text-mining of abstracts and would like to use it to find instances of gene names that may contain white space, for instance "gene regulatory protein 1". The default behavior of tm is to parse this into 4 separate words, but I would like to use the class constructor "dictionary" to define phrases such as the one just mentioned. Is this
