Kevin Duraj
2007-Oct-01 22:02 UTC
[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.
Xapians!

If tomorrow the Xapian search engine achieved the same performance and search results as Google, we would still not be able to beat Google, because we would only have created a copy of the searches that already exist in the Google search engine. However, there is a way to beat anyone, and there is a way to beat Google successfully as well; just do not give up. Some see it as implementing Ajax, or some cool interface, marketing or some other nonsense. However, as I see it, the one way to beat Google is to implement Natural Language Processing, to enable the user to ask a question as a natural human sentence and receive different results based on the way the question was asked.

What is interesting is that the simplest thing for a human to do is the most difficult for a computer: recognizing the meaning of a sentence. You have an easier time recognizing my misspellings and malformed sentences than a computer has recognizing the meaning of a perfect sentence. Natural Language Processing is not a new thing, and a lot of work has been done that yielded inconsistent results.

What I am trying to point out is that we need to start thinking about natural language processing when laying the infrastructure for Xapian. So far we have the following search operators: OP_AND, OP_AND_MAYBE, OP_AND_NOT, OP_ELITE_SET, OP_FILTER, OP_NEAR, OP_OR, OP_PHRASE, OP_VALUE_RANGE and OP_XOR, and we could add one more: OP_NLP.

What we can do now is implement OP_NLP to tag nouns, adjectives, adverbs, punctuation, foreign words, etc., calculate the relations between them, and assign a boost value to the most frequently occurring kind of term in the query, for example nouns.

Search query example: What is Kevin Duraj doing?
OP_NLP would analyze the sentence as follows:
[what=pronoun, question|is=verb|kevin=noun|duraj=noun|doing=verb|?=punctuation]

We have nouns dominating the question. Therefore in the Xapian search engine we look first for the dominating nouns, in this case my name, Kevin Duraj, and then within the results we search for the next most dominant verb and pronoun.

PS: Can you see the future?

--
Cheers
Kevin Duraj
http://MyHealthcare.com
Los Angeles, California
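The staged lookup described in this message can be sketched roughly as follows. This is only a minimal Python illustration, not an existing Xapian feature: OP_NLP does not exist, the lexicon `TOY_LEXICON` is a hypothetical stand-in for a real part-of-speech tagger, and the tie-break order in `PRIORITY` (noun, then verb, then pronoun) is an assumption taken from the order the message mentions. Note that with the tags given in the message, nouns and verbs each occur twice, so nouns only come first because of that tie-break.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would use a trained POS tagger.
TOY_LEXICON = {
    "what": "pronoun",
    "is": "verb",
    "kevin": "noun",
    "duraj": "noun",
    "doing": "verb",
    "?": "punctuation",
}

# Assumed tie-break order when two tags are equally frequent.
PRIORITY = {"noun": 0, "verb": 1, "pronoun": 2}


def tokenize(query):
    """Lower-case the query and split question marks into their own tokens."""
    return query.lower().replace("?", " ?").split()


def tag(tokens):
    """Pair each token with its tag from the toy lexicon ('unknown' if absent)."""
    return [(tok, TOY_LEXICON.get(tok, "unknown")) for tok in tokens]


def search_stages(tagged):
    """Group terms by tag, most frequent tag first, ignoring punctuation."""
    counts = Counter(pos for _, pos in tagged if pos != "punctuation")
    ordered = sorted(
        counts,
        key=lambda pos: (-counts[pos], PRIORITY.get(pos, len(PRIORITY))),
    )
    return [(pos, [tok for tok, p in tagged if p == pos]) for pos in ordered]


tagged = tag(tokenize("What is Kevin Duraj doing?"))
stages = search_stages(tagged)
for pos, terms in stages:
    print(pos, terms)
```

Each stage would then become one narrowing step: search for the first stage's terms, and filter or re-rank the results with the later stages.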
James Aylett
2007-Oct-02 12:11 UTC
[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.
On Mon, Oct 01, 2007 at 02:01:50PM -0700, Kevin Duraj wrote:
> Search query example: What is Kevin Duraj doing?
> OP_NLP would analyze sentence as follow:
> [what=pronoun, question|is=verb|kevin=noun|duraj=noun|doing=verb|?=punctuation]

'What' isn't a pronoun, but never mind.

You're suggesting a fairly primitive level of NLP - is there any evidence to suggest this will give good results? For instance, there's no way you could use that strategy to deal with referents. Also, how are you planning on coping with ambiguity in part of speech? ('dove' is both a noun and a verb.)

Couldn't you do this separately from Xapian by judicious fiddling with the generated query? Get the raw unstemmed terms (aka 'words' in this context), figure out how you want to treat them, and construct a new query which reflects the weighting you want to apply. (Bear in mind that BM25 takes into account the within-query frequency of a term as well as the within-document frequency, and the defaults include this.)

I have no idea how this applies to other languages. (Well, I do, but only for Latin, the Romance languages and to an extent the Germanic ones. That's not all that useful on the web.)

> PS: Can you see the future?

I think it's orange ;-)

J

--
James Aylett
xapian.org
james@tartarus.org               uncertaintydivision.org
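To make the within-query-frequency remark concrete, here is a small pure-Python sketch of the query-side factor in the standard BM25 formula, (k3 + 1) * wqf / (k3 + wqf). It does not call Xapian; `k3 = 1.0` is an assumed parameter value, and `build_weighted_terms` and the `boosts` table are hypothetical helpers for illustration. In Xapian itself, as far as I know, a wqf can be supplied when constructing a term query, but check the API documentation before relying on that.

```python
def bm25_query_factor(wqf, k3=1.0):
    """Query-side BM25 factor for a term that occurs wqf times in the query."""
    return (k3 + 1.0) * wqf / (k3 + wqf)


def build_weighted_terms(words, boosts):
    """Map each raw word to a wqf: 1 per occurrence unless the caller boosts it."""
    wqf = {}
    for word in words:
        wqf[word] = wqf.get(word, 0) + boosts.get(word, 1)
    return wqf


# Raw, unstemmed words from the example question.
words = ["what", "is", "kevin", "duraj", "doing"]

# Hypothetical boosts chosen outside the search engine, e.g. by a POS tagger.
boosts = {"kevin": 3, "duraj": 3}

terms = build_weighted_terms(words, boosts)
for term, wqf in terms.items():
    print(term, wqf, round(bm25_query_factor(wqf), 3))
```

With these assumed values an unboosted term gets a factor of 1.0 and a term with wqf 3 gets 1.5. The factor saturates towards k3 + 1 as wqf grows, so repeating a term in the query is a bounded boost rather than a linear one.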
Kevin Duraj
2007-Oct-09 19:02 UTC
[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.
On 10/1/07, David P. Novakovic <davidnovakovic@gmail.com> wrote:
> disclaimer: I work in NLP research, so I'm a believer. I'm also not a
> xapian dev, so I could be wrong :)
>
> While I do think that NLP will play a big role in the future of
> search, what makes you think that Google doesn't have the resources to
> do it better? :P
>
> Anyway, you have mentioned two techniques from NLP there, which are
> part of speech tagging and question asking. It becomes very unwieldy,
> very quickly to include stop words which tend to overshadow other more
> meaningful relationships in the text, and manually tag every term in
> every context in the system. This would lead to large overheads in the
> core engine. Question asking is an area of research that is still
> getting a lot of attention, and just like most other areas of NLP it
> is accepted that there is no single way of doing things. It depends
> highly on the data you are indexing/querying.
>
> I believe one of the wonderful things about xapian is that it's fast,
> simple and does the job better than a simple keyword search all of the
> time, just as many other search engines do.
>
> All natural language search companies (except powerset) have
> acknowledged they stand little chance against Google, and instead
> address a particular niche.

David,

Throughout history, the majority of people were told many times that they stood little chance against knowledge, freedom and progress. Do you know what happened each time to those who told the majority that it stood little chance? They were humiliated or are no longer here. That is what we, the majority, learn from history.

But I am puzzled about something. What makes you think that any corporation can compete with open source and not fail in time? Or what makes you think that any corporation can hire more programmers than the open source community?

--
Cheers
Kevin Duraj
http://MyHealthcare.com
Los Angeles, California

> While I hate to be a buzz kill, this is a very very large area of
> research, not something we can dive head first into and just implement
> straight away.
>
> my 2c.
>
> David
Ron Kass
2007-Dec-07 23:48 UTC
[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.
Two companies to keep in mind when considering NLP search are PowerSet and OpinMind.

Actually, allowing effective NLP-based search requires (from what I know) both slightly more complex query parsing (understanding the context and semantic relations between parts of the document) and much more complex indexing (storing relations between terms).

For example, searching for:
  who shot dick cheney?
and searching for:
  who dick cheney shot?

Both contain the same words, so a simple parser as you suggested would produce the same compiled query. It shouldn't; these are two different questions.

Also, within the documents, if a document contains "John Smith was fired upon by Dick Cheney", the relations between the individuals and the actions are very important. How else will you know whether it's a document describing Dick Cheney shooting John Smith or the other way around?

I do agree with you, though, that NLP is a very important aspect of intelligent searching and is a way to "beat Google". I think Google is aware of it too, though ;)
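The word-order point can be shown with a few lines of Python. This is only a toy sketch under stated assumptions: `tokens` is a naive whitespace tokenizer, and `asked_role` is a hypothetical hard-coded rule for this one sentence pattern, not a real parser and not anything Xapian provides.

```python
from collections import Counter


def tokens(query):
    """Naive tokenizer: lower-case, drop question marks, split on whitespace."""
    return query.lower().replace("?", "").split()


def asked_role(query, verb="shot"):
    """Toy rule: 'who <verb> X' asks for the agent, 'who X <verb>' for the patient."""
    toks = tokens(query)
    if len(toks) > 1 and toks[0] == "who" and toks[1] == verb:
        return "agent"
    if len(toks) > 1 and toks[0] == "who" and toks[-1] == verb:
        return "patient"
    return "unknown"


q1 = "who shot dick cheney?"
q2 = "who dick cheney shot?"

# As bags of words the two questions are indistinguishable...
same_bag = Counter(tokens(q1)) == Counter(tokens(q2))
# ...but the token order differs, and so does what is being asked.
same_order = tokens(q1) == tokens(q2)

print(same_bag, same_order)
print(asked_role(q1), asked_role(q2))
```

A purely term-based query would treat both questions identically, which is why distinguishing them needs positional or relational information at both query time and index time.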