Gaurav Arora
2012-Jun-29 00:35 UTC
[Xapian-devel] Adding Bi-gram in the QueryParser and Object.
Hi all,

I have jotted down a plan for adding bi-grams to the Query object through the QueryParser. Please find attached a sequence diagram showing my understanding of how the parser works and how the query is built from the tokens provided by the lexer. I have highlighted in blue the areas where I think bi-grams could be added.

In integrating bi-grams into the parser and Query, our aim is to generate bi-grams for all consecutive terms and add them to the query. The following are the categories the lexer sends to the parser to form the Query object:

*Near:* Two or more terms with NEAR between them. This type of query requires the terms to occur within a window of 10 words. Since we are already looking for the terms within a 10-word window, it won't hurt to add them as a bi-gram as well: finding them next to each other is an even better match. *(Bigram can be added)*

*Example:*
Failed NEAR Assertion

*Current parser output:*
Query((failed at 1 NEAR 11 assertion at 2))

*Output with bigram:*
Query((failed at 1 NEAR 11 assertion at 2) OR failed assertion at 3)

*Implementation:* All terms detected as part of a NEAR expression are added to the class *Terms*, so when we ask *Terms* for a query using as_near_query, as_adj_query or as_opwindow_query, we can add the bi-grams while iterating over the list of terms.

*Adj:* Handled exactly like *NEAR*. *(Bigram can be added)*

*Phrase:* Terms given in quotes. These are terms the user wants to appear together. *(Bigram can be added)* Implementation is similar to NEAR and ADJ.

*Phrased:* A single term which is actually two or more terms linked by punctuation. These can be treated as bi-grams, since the terms must occur together. *(Bigram can be added)* Implementation is similar to NEAR and ADJ.

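To make the "add bi-grams while iterating over the list of terms" step concrete, here is a rough sketch in Python rather than in the parser's C++. It is not Xapian code: the function names are hypothetical, the query is only modelled as a string, and joining the two words with a space is an assumption taken from the "failed assertion at 3" example output above.

```python
def consecutive_bigrams(terms):
    """Return one bi-gram string per pair of adjacent terms, in order.

    Hypothetical helper: joins each pair with a space (an assumption).
    """
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]


def or_with_bigrams(base, terms):
    """Describe `base` ORed with every bi-gram of `terms`.

    This only builds a readable string; a real implementation would
    combine Query objects inside the Terms / TermGroup classes.
    """
    parts = [f"({base})"] + consecutive_bigrams(terms)
    return " OR ".join(parts)


terms = ["failed", "assertion"]
print(consecutive_bigrams(terms))
print(or_with_bigrams("failed NEAR assertion", terms))
```

A list of n terms gives n - 1 bi-grams, so a single term adds nothing to the query.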
*Group:* A group of terms separated only by whitespace. *(Bigram can be added)*

*Implementation:* All terms detected as a group are added to the class *TermGroup*, so when we ask *TermGroup* for a query using as_group_query, we can add the bi-grams while iterating over the list of terms.

*Wild:* *Partial:* *Synonym:* These are expansions which follow a pattern or the synonyms of a term. They pull in a lot of similar terms and form a query from all of those words, so considering them for bi-grams doesn't seem important. Please suggest if you feel they should be included.

*BRA-KET:* These are bracketed expressions. The grammar currently has the rule *BRA expr(E) KET*, so any scope for bi-grams in the query inside the brackets will already have been considered while working on the inner expression.

*ValueRange:* No relation to bi-grams.

*Love:* *Hate:* Here we are either trying to exclude a term from the query or dealing with a single term, so we can refrain from adding bi-grams for these.

*bool_operator:*
https://github.com/sehaj-sk/xapian/blob/mybranch/xapian-core/docs/queryparser_new.rst#boolean-query

Boolean operators are handled by grammar rules of the following type:

bool_arg(E) bool_operator bool_arg(P).

Since each *bool_arg* is an *expr*, i.e. already a Query object, getting bi-grams from them would be difficult. Please suggest something.

Example:
assertion OR failed

*Current parsed query:*
Query((Zfail at 1 OR Zassert at 2))

*Since the terms have already been converted to Query objects, how can we make bi-grams for such a simple OR of terms?*

The major work of handling bi-grams will be done by adding bi-grams to the terms while iterating over them in the *TermGroup* and *Terms* classes.

Please guide me and provide feedback on how to add bi-grams to the Query object.

Thanks,
Gaurav Arora
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120629/7087a087/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Queryparser_view.jpg
Type: image/jpeg
Size: 96877 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120629/7087a087/attachment-0002.jpg>