thr3ads.net - search: "bigram"

Displaying 12 results from an estimated 12 matches for "bigram".

Adding Bi-gram in the QueryParser and Object.

2012 Jun 29

...y Object: Near: two or more terms with NEAR in between. This type of query requires the terms to fall within a window of 10 words. Since we are already seeking the two words within a 10-word window, it won't hurt to also match them as a bigram, because having them adjacent is even better. (Bigram can be added.) Example: Failed NEAR Assertion. Current parser output: Query((failed at 1 NEAR 11 assertion at 2)). Output with bigram: Query((failed at 1 NEAR 11 assertion at 2) OR failed assertion at 3). Implementation: Since all the terms detected as near are added to class *T...

SVD for reducing dimensions

2002 Nov 17

...uld be ok), which I will call a feature vector. The distance between two words represents the similarity of the contexts of the words, so big and little have very similar contexts and should get a similar representation. Basically, the goal is to build something similar to a thesaurus. I have computed bigram counts between the n most common words, for varying values of n between 500 and 5000. These are saved to a file which I can load with read.table. This matrix is symmetric and far from sparse, although I can adjust the sparseness by changing the bigram window. First question, should I scale th...
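A minimal sketch of the kind of windowed bigram count matrix described in this snippet, written in Python rather than the poster's R. The function name `cooccurrence_matrix` and its parameters are hypothetical, the toy sentence is invented for illustration, and the SVD step itself is not shown.

```python
from collections import Counter


def cooccurrence_matrix(tokens, n_common, window=1):
    """Count co-occurrences among the n_common most frequent words.

    window=1 counts adjacent pairs only; a larger window counts more
    pairs and so produces a denser matrix.
    """
    # Rank words by frequency, breaking ties alphabetically.
    ranked = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:n_common]]
    row = {word: i for i, word in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            if left in row and right in row:
                # Increment both cells so the matrix stays symmetric.
                counts[row[left]][row[right]] += 1
                counts[row[right]][row[left]] += 1
    return vocab, counts


tokens = "the big dog saw the little dog".split()
vocab, counts = cooccurrence_matrix(tokens, n_common=4, window=1)
for word, line in zip(vocab, counts):
    print(word, line)
```

In this toy input the rows for "big" and "little" come out identical, which is the sense in which words with similar contexts get similar feature vectors.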

[LLVMdev] Spell Correction Efficiency

2011 Jan 15

...ch was used in Clang, notably for auto-completion, based on the Levenshtein distance. It turns out I just came upon this link today: http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html The idea is to use bigrams (2-letter parts of a word) to build an index of the form (bigram > all words containing this bigram), then use this index to retrieve all the words with enough bigrams in common with the word you are currently trying to approximate. This drastically...
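The bigram index described in this snippet can be sketched roughly as follows. This is an illustrative Python sketch, not the code discussed on the list: the function names, the `min_shared` threshold and the sample vocabulary are all assumptions.

```python
from collections import Counter, defaultdict


def bigrams(word):
    """Return the set of 2-letter substrings of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def build_index(vocabulary):
    """Map each bigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in bigrams(word):
            index[gram].add(word)
    return index


def candidates(index, query, min_shared=2):
    """Return words sharing at least min_shared distinct bigrams with query."""
    shared = Counter()
    for gram in bigrams(query):
        for word in index.get(gram, ()):
            shared[word] += 1
    return sorted(word for word, n in shared.items() if n >= min_shared)


vocabulary = ["return", "retain", "result", "resume", "value"]
index = build_index(vocabulary)
print(candidates(index, "retrun"))
```

Only the words that survive this filter would then need the more expensive edit-distance comparison, instead of the whole vocabulary.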

GSOC 2016 project on Ranking

2016 Mar 04

Hello Sir, I am a third-year student at the Department of Mathematics at IIT Kharagpur. I have good experience in Information Retrieval and Machine Learning. I have read many chapters of the book Introduction to Information Retrieval. I am currently working on a project that tags questions on a Q&A forum by ranking the tags and using probabilistic inference. I also have software development

Proposal for Integration of Bi-gram in Xapian Architecture

2012 Jun 03

Hi, I have made a proposal for changes to integrate bi-grams in the Xapian architecture on the wiki page. Bigram Integration Proposal: http://trac.xapian.org/wiki/GSoC2012/Bi-gram%20Language%20Modeling/Bi-gram%20Integration%20Proposal Since bi-gram integration will change how data is accessed from the back-end, it's better to get a review from the whole community. Moreover, I also have some doubts w...

Xapian 1.3.5 snapshot performance and index size

2016 Apr 12

...te:
> This way, "to be or not to be" gets from 11 S to 0.6 S, and "to be of the" gets from 12 S to 0.9 S. Which is of course brilliant !
>
> I think that I can dump my plan of indexing compound terms for runs of common words :)
We had been experimenting with bigrams to accelerate phrases, and not having to go that route was one motivation for the key order change. The bigram terms would add significantly to DB size, and to cache pressure.
> > I'm not sure there's an easy solution to the position table not coming out compact in this...
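To illustrate the trade-off mentioned in this snippet, here is a rough Python sketch of indexing adjacent word pairs as extra terms so that a phrase can be narrowed down without positional data. It does not reflect how Xapian stores or queries anything; the function names and the two sample documents are hypothetical.

```python
from collections import defaultdict


def index_documents(docs):
    """Build postings for single words and for adjacent word pairs."""
    words = defaultdict(set)
    pairs = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for token in tokens:
            words[token].add(doc_id)
        for left, right in zip(tokens, tokens[1:]):
            pairs[(left, right)].add(doc_id)
    return words, pairs


def phrase_candidates(pairs, phrase):
    """Intersect the postings of every adjacent pair in the phrase."""
    tokens = phrase.lower().split()
    postings = [pairs.get(pair, set()) for pair in zip(tokens, tokens[1:])]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "to be or not to be that is the question",
    2: "not to be confused with the other question",
}
words, pairs = index_documents(docs)
print(sorted(phrase_candidates(pairs, "to be or not to be")))
print(len(words), "word terms,", len(pairs), "pair terms")
```

For phrases longer than two words the result is only a candidate set, since a document can contain every pair without containing the whole phrase contiguously. The pair vocabulary is also larger than the word vocabulary even in this tiny example, which is the size and cache cost the reply refers to.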

Index indexed words

2010 Jan 18

Hello, We would like to create Google- or Firefox-like "search hints". If someone types "abc", the search system should suggest some possible hints. I think Firefox does it by indexing 3-character pieces of the domain name. If you enter part of a name, you get some hints. Thank you very much, Marcus

GSoc 2017 Introduction(Weighting Schemes)

2017 Mar 05

...successfully compiled Xapian from source and have implemented some examples. While going through the project list, the Weighting Schemes project is the one I was looking to contribute to. So I went through xapian-core/weight, where most of the schemes are already present, and I also went through the Bigram-model, which was outside the tree and not merged yet. Can anyone please give me a pointer to which weighting schemes are not implemented yet, so that I can start looking at them? Regards, Prachi Prakash Final year Graduate Student LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/ gith...

*wildcard* support?

2005 Oct 08

Hello, First I wanted to say thanks for a great piece of software, thanks Olly and others who've contributed! I know that Xapian supports right-truncation, if that's the proper name for wildcard support, as in a search for "xapia*". I don't believe Xapian supports wildcards on both sides of a term, correct? Is this something that is technically infeasible, unpalatable

Chinese, Japanese, Korean Tokenizer.

2007 Jun 05

Hi, I am looking for a Chinese, Japanese and Korean tokenizer that can be used to tokenize terms for CJK languages. I am not very familiar with these languages; however, I think that they can contain one or more words in a single symbol, which makes it more difficult to tokenize text into searchable terms. Lucene has CJK Tokenizer ... and I am looking around if there is some open source that we

Xapian 1.3.5 snapshot performance and index size

2016 Apr 11

Olly Betts writes:
> On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> > Some might notice the 50% index size increase. Excessive index size is already one relatively rare, but recurring complaint. Except if I did something wrong: I'm actually quite surprised by it.
>
> Did you try compacting the resulting databases?

using package tm to find phrases

2009 Aug 13

I am using the package "tm" for text-mining of abstracts and would like to use it to find instances of gene names that may contain white space, for instance "gene regulatory protein 1". The default behavior of tm is to parse this into 4 separate words, but I would like to use the class constructor "dictionary" to define phrases such as the one just mentioned. Is this
