thr3ads.net - Xapian discuss - [Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing. [Oct 2007]


Kevin Duraj

2007-Oct-01 22:02 UTC

[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.

Xapians!

If the Xapian search engine achieved the same performance and search
results as Google tomorrow, we still would not beat Google, because we
would only have created a copy of searches that already exist in
Google's search engine. However, there is a way to beat anyone, and
there is a way to beat Google successfully as well; just do not give
up. Some see it as implementing Ajax, or some cool interface,
marketing, or some other nonsense. As I see it, the one way to beat
Google is to implement Natural Language Processing, so that a user can
ask a question as a natural human sentence and receive different
results depending on how that sentence was phrased.

What is interesting is that the simplest thing for a human to do is
the most difficult for a computer: recognizing the meaning of a
sentence. You have an easier time recognizing my misspellings and
malformed sentences than a computer has recognizing the meaning of a
perfect sentence. Natural Language Processing is not a new thing, and
a lot of work has been done that yielded inconsistent results.

What I am trying to point out is that we need to start thinking about
natural language processing when laying down infrastructure for
Xapian. So far we have the following search operators: OP_AND,
OP_AND_MAYBE, OP_AND_NOT, OP_ELITE_SET, OP_FILTER, OP_NEAR, OP_OR,
OP_PHRASE, OP_VALUE_RANGE and OP_XOR, and we could add one more,
OP_NLP.

What we can do now is implement OP_NLP to tag nouns, adjectives,
adverbs, punctuation, foreign words, etc., calculate the relations
between them, and assign a boost value to the terms whose part of
speech occurs most often in the query, for example nouns.

Search query example: What is Kevin Duraj doing?
OP_NLP would analyze the sentence as follows:
[what=pronoun, question|is=verb|kevin=noun|duraj=noun|doing=verb|?=punctuation]

Nouns dominate the question. Therefore, in the Xapian search engine we
look first for the dominant nouns, in this case my name, Kevin Duraj,
and then within those results we search for the next most dominant
parts of speech, the verb and the pronoun.
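
A rough sketch of the staged approach described above, in Python. The
lexicon, the function names and the tie-break order are all assumptions
made for illustration; nothing here is an existing Xapian feature, and a
real part-of-speech tagger would replace the hand-written lexicon. Note
that in this example nouns and verbs actually tie at two each, so the
sketch needs an assumed priority order to put nouns first.

```python
# Hypothetical hand-written lexicon standing in for a real POS tagger.
TOY_LEXICON = {
    "what": "pronoun",
    "is": "verb",
    "kevin": "noun",
    "duraj": "noun",
    "doing": "verb",
    "?": "punctuation",
}

# Assumed tie-break order when two parts of speech occur equally often.
TAG_PRIORITY = ["noun", "verb", "pronoun"]


def tag_query(text):
    """Split a question into lower-cased tokens and tag each one."""
    tokens = text.lower().replace("?", " ?").split()
    return [(tok, TOY_LEXICON.get(tok, "unknown")) for tok in tokens]


def search_stages(tagged):
    """Group terms by tag, most frequent tag first (ties by TAG_PRIORITY)."""
    groups = {}
    for tok, tag in tagged:
        if tag in TAG_PRIORITY:  # skips punctuation and unknown words
            groups.setdefault(tag, []).append(tok)
    ordered = sorted(
        groups,
        key=lambda tag: (-len(groups[tag]), TAG_PRIORITY.index(tag)),
    )
    return [(tag, groups[tag]) for tag in ordered]


stages = search_stages(tag_query("What is Kevin Duraj doing?"))
for tag, terms in stages:
    print(tag, terms)
```

Each stage would then narrow the results of the previous one: first
documents matching the nouns, then the verbs within those, and so on.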

PS: Can you see the future?

-- 
Cheers
  Kevin Duraj
  http://MyHealthcare.com
  Los Angeles, California

James Aylett

2007-Oct-02 12:11 UTC

On Mon, Oct 01, 2007 at 02:01:50PM -0700, Kevin Duraj wrote:
> Search query example: What is Kevin Duraj doing?
> OP_NLP  would analyze sentence as follow:
> [what=pronoun, question|is=verb|kevin=noun|duraj=noun|doing=verb|?=punctuation]

'What' isn't a pronoun, but never mind. You're suggesting a fairly
primitive level of NLP - is there any evidence to suggest this will
give good results? For instance, there's no way you could use that
strategy to deal with referents. Also, how are you planning on coping
with ambiguity in part-of-speech? ('dove' is both a noun and a verb.)

Couldn't you do this separately to Xapian by judicious fiddling with
the generated query? Get the raw unstemmed terms (aka 'words' in this
context) and figure out how you want to treat them, and construct a
new query which reflects the weighting you want to apply. (Bear in
mind that BM25 takes into account the within-query-frequency of a
term as well as the within-document-frequency, and the defaults
include this.)
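
To make that concrete, here is a minimal sketch in Python of rewriting
the query outside Xapian. It takes words that have already been tagged
by whatever means and turns them into a within-query frequency (wqf) per
term. The weight table and the function name are assumptions for
illustration, not part of any Xapian API, and the numbers are arbitrary.

```python
# Assumed wqf per part of speech; the values are illustrative only.
WQF_BY_TAG = {"noun": 3, "verb": 2}


def weighted_terms(tagged_words, default_wqf=1, drop=("punctuation",)):
    """Map (word, tag) pairs to {term: wqf}.

    A word that appears more than once accumulates weight, which mirrors
    how a repeated query term ends up with a higher wqf.
    """
    terms = {}
    for word, tag in tagged_words:
        if tag in drop:
            continue
        term = word.lower()
        terms[term] = terms.get(term, 0) + WQF_BY_TAG.get(tag, default_wqf)
    return terms


tagged = [
    ("What", "pronoun"),
    ("is", "verb"),
    ("Kevin", "noun"),
    ("Duraj", "noun"),
    ("doing", "verb"),
    ("?", "punctuation"),
]
print(weighted_terms(tagged))
```

As far as I know, the Xapian bindings accept a wqf as the second
argument when constructing a single-term query, so each (term, wqf) pair
could become one term query and the lot could be combined with OP_OR.
Treat that as something to check against the API documentation rather
than as tested code.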

I have no idea how this applies to other languages. (Well, I do, but
only for Latin, the Romance languages and, to an extent, the Germanic
languages. That's not all that useful on the web.)

> PS: Can you see the future?

I think it's orange ;-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org

Kevin Duraj

2007-Oct-09 19:02 UTC

On 10/1/07, David P. Novakovic <davidnovakovic@gmail.com> wrote:
> disclaimer: I work in NLP research, so I'm a believer. I'm also not a
> xapian dev, so I could be wrong :)
>
> While i do think that NLP will play a big role in the future of
> search, what makes you think that Google doesn't have the resources to
> do it better? :P
>
> Anyway, you have mentioned two techniques from NLP there, which are
> part of speech tagging and question asking. It becomes very unwieldy,
> very quickly to include stop words which tend to overshadow other more
> meaningful relationships in the text, and manually tag every term in
> every context in the system. This would lead to large overheads in the
> core engine. Question asking is an area of research that is still
> getting a lot of attention, and just like most other areas of NLP it
> is accepted that there is no single way of doing things. It depends
> highly on the data you are indexing/querying.
>
> I believe one of the wonderful things about xapian is that it's fast,
> simple and does the job better than a simple keyword search all of the
> time, just as many other search engines do.
>
> All natural language search companies (except powerset) have
> acknowledged they stand little chance against Google, and instead
> address a particular niche.
David,

Throughout history, the majority of people have been told many times
that they stand little chance against knowledge, freedom and progress.
Do you know what happened each time to those who told the majority
that it stood little chance? They were humiliated, or they are no
longer here. That is what we, the majority, learn from history.

But I am puzzled about something. What makes you think that any
corporation can compete with open source and not fail in time? Or what
makes you think that any corporation can hire more programmers than
the open source community?

-- 
Cheers
  Kevin Duraj
  http://MyHealthcare.com
  Los Angeles, California


> While I hate to be a buzz kill, this is a very very large area of
> research, not something we can dive head first into and just implement
> straight away.
>
> my 2c.
>
> David
>

Ron Kass

2007-Dec-07 23:48 UTC

Two companies to keep in mind when considering NLP search are PowerSet
and OpinMind.
Actually, allowing effective NLP-based search requires (from what I know)
both slightly more complex query parsing (understanding the context and
semantic relations between parts of the document) and much more complex
indexing (storing relations between terms).

For example:
Searching for: who shot dick chaney?
and 
Searching for: who dick chaney shot?

Both contain the same words, so a simple parser such as the one you
suggested would produce the same compiled query. It shouldn't; these are
two different questions.
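
A quick way to see the problem, sketched in Python with made-up helper
names: once the two questions are reduced to a bag of words they are
identical, and only the word order tells them apart.

```python
from collections import Counter


def word_sequence(query):
    """Lower-case the query, drop the trailing '?', keep word order."""
    return query.lower().rstrip("?").split()


def bag_of_words(query):
    """Same words, but with the order thrown away."""
    return Counter(word_sequence(query))


q1 = "who shot dick chaney?"
q2 = "who dick chaney shot?"

print(bag_of_words(q1) == bag_of_words(q2))    # same bag of words
print(word_sequence(q1) == word_sequence(q2))  # different word order
```

Any parser that only looks at which words are present, however they are
weighted, cannot separate the two.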

Also, in the documents: if a document contains "John Smith was fired
upon by Dick Chaney", the relations between the individuals and the
actions are very important. How else will you know whether it's a
document describing Dick Chaney shooting John Smith or the other way
around?
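
One possible shape for "storing relations between terms", sketched in
Python. The class and its methods are hypothetical, and the relation is
annotated by hand here; extracting it automatically from the sentence is
the hard part and is not attempted.

```python
from collections import defaultdict


class RelationIndex:
    """Toy index of (agent, action, patient) relations per document."""

    def __init__(self):
        self._by_agent = defaultdict(set)
        self._by_patient = defaultdict(set)

    def add(self, doc_id, agent, action, patient):
        # Record who did the action and who it was done to.
        self._by_agent[(agent, action)].add(doc_id)
        self._by_patient[(patient, action)].add(doc_id)

    def docs_where_agent(self, agent, action):
        return set(self._by_agent.get((agent, action), ()))

    def docs_where_patient(self, patient, action):
        return set(self._by_patient.get((patient, action), ()))


index = RelationIndex()
# Hand-annotated relation for "John smith was fired upon by dick chaney":
# the passive voice makes dick chaney the agent and john smith the patient.
index.add(1, agent="dick chaney", action="fire upon", patient="john smith")

print(index.docs_where_agent("dick chaney", "fire upon"))
print(index.docs_where_agent("john smith", "fire upon"))
```

With roles stored this way, a question about who did the firing and a
question about who was fired upon look up different keys, even though
both mention the same two names.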


I do agree with you, though, that NLP is a very important aspect of
intelligent searching and is a way to "beat Google". I think Google is
aware of it too, though ;)
-- 
View this message in context:
http://www.nabble.com/How-to-beat-Google-aka-Xapian---Natural-Language-Processing.-tf4551151.html#a12990884
Sent from the Xapian - Discuss mailing list archive at Nabble.com.
