thr3ads.net - Xapian devel - [Xapian-devel] GSOC 2012 : QueryParser Reimplementation [Mar 2012]


Sehaj Singh Kalra

2012-Mar-20 13:49 UTC

[Xapian-devel] GSOC 2012 : QueryParser Reimplementation

Hello, I am Sehaj Singh Kalra, an undergraduate student from India studying
Computer Science and Engineering at the Indian Institute of Technology Delhi
(IIT Delhi). I want to work on the idea "QueryParser Reimplementation".
With the background I have in this field, I am fully comfortable with this
project.

I have gone through the specification and the QueryParser
documentation (which I believe is not complete), and I am currently going
through the source code of the current parser implementation.
I have some doubts:
1. How is Multiple Language Support handled in Xapian? While going through
the source code I found that the parser invokes the term generator class to
convert query to terms. Accordingly it would depend on what stage other
processes like stemming are being done.
2. As I understand it, the main motivation for reimplementing the parser by
hand, rather than using Flex & Bison or the Lemon generator, is to make error
state recovery fast, since mistakes are bound to happen in natural language
and NLP (Natural Language Processing) differs from processing a computer
language. If there is any other aspect to it, please guide me.
3. To what extent does the xapian queryparser at present take part in
optimising the search?

Based on my understanding so far, I would also like to extend the
project by proposing some more things:
1. Pre-analysing the query and making changes at the parser level,
using some algorithms, so as to make the search more efficient.
2. Aiding relevancy ranking.
3. Maintaining a log of the queries searched, and processing and ranking them
using algorithms. Using these logs will make the parser more efficient.
The things proposed above will lead to some pre-search filtering, which is
best done at the parser level. Moreover, since the parser would be hand-written
rather than generated, integrating these things will make the parser more
efficient.

I will be happy to participate in this project during Google Summer of Code
2012 to implement these ideas.

Cheers,
Sehaj

Olly Betts

2012-Mar-22 04:54 UTC


[Xapian-devel] GSOC 2012 : QueryParser Reimplementation

On Tue, Mar 20, 2012 at 07:19:34PM +0530, Sehaj Singh Kalra wrote:
> I have went through the specification and through Query Parser
> documentation (which I believe is not complete), and I am currently going
> through the source code of current parser implementation.
> I have some doubts :
> 1. How is Multiple Language Support handled in Xapian? While going through
> the source code I found that the parser invokes the term generator class to
> convert query to terms.  Accordingly it would depend on what stage other
> processes like stemming are being done.
There's no explicit support, but you can do it with the available
tools.

One approach is to decide that each query is in a particular language
(either detected, which is hard to do reliably on a short string, or
specified in the UI) and filter it to only search documents in the same
language (either specified in metadata, or detected, which works much
better for longer text).

Another is to parse the query once for each stemming language you are
interested in, combine each result with a corresponding language filter,
and then OR them all together.
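
As a rough illustration of that second approach, the sketch below parses the same query string once per language, wraps each result in a language filter, and ORs the branches together. It is a toy model and not Xapian's API: `Term`, `Op`, `STEMMERS`, `parse` and `multilingual_query` are hypothetical names, the "stemmers" are crude suffix strippers, and the "L" + language-code filter term is an assumed indexing convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str


@dataclass(frozen=True)
class Op:
    kind: str        # "AND", "OR" or "FILTER"
    children: tuple  # sub-queries, in order


# Hypothetical suffix strippers standing in for real stemmers.
STEMMERS = {
    "en": lambda w: w[:-1] if w.endswith("s") else w,
    "fr": lambda w: w[:-2] if w.endswith("es") else w,
}


def parse(query, lang):
    """Toy parser: lower-case, stem for one language, AND the terms."""
    stem = STEMMERS[lang]
    return Op("AND", tuple(Term(stem(w.lower())) for w in query.split()))


def multilingual_query(query, langs):
    """Parse once per language, filter each branch, then OR the branches."""
    branches = []
    for lang in langs:
        parsed = parse(query, lang)
        # Assumption: documents carry a boolean term "L<code>" for their language.
        branches.append(Op("FILTER", (parsed, Term("L" + lang))))
    return Op("OR", tuple(branches))
```

Each branch only matches documents tagged with its own language, so terms stemmed for one language are never matched against documents in another.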
> 2. The main motivation for parser re-implementation and not using Flex &
> bison or lemon generator according to me is to make error state recovery
> fast since in natural languages, mistakes are bound to happen as well as
> NLP(Natural Language Processing) is different from processing of computer
> language.  If there is any other aspect related with it, please guide me.
Error recovery is part of it, though the speed isn't really the issue
(parsing the query is insignificant in time compared to executing it)
but rather parsing queries which a formal grammar thinks are a syntax
error.  The user query is a mix of formal grammar (we want to support
operators with precedence and brackets to control that) and more free
form text.  Generated parsers are great for the first part, and much
less good for the rest.  Also, we currently try to work around some
of these issues in the lexer, which makes things more complicated.
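
To make the point about syntax errors concrete, here is a minimal sketch of a hand-written recursive descent parser that supports brackets and AND/OR precedence but never rejects its input. Everything in it is an assumption for illustration (the tuple-based tree, the `TolerantParser` class, and the particular recovery rules); it is not the design of Xapian's QueryParser. Stray close brackets are skipped, missing ones are implied, and an operator keyword with nothing after it is treated as an ordinary word.

```python
import re


def tokenize(text):
    # Brackets are single tokens; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", text)


def join(op, left, right):
    # Combine two optional subtrees, dropping empty sides.
    if left is None:
        return right
    if right is None:
        return left
    return (op, left, right)


class TolerantParser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.toks[i] if i < len(self.toks) else None

    def is_operator(self, word):
        # A keyword is only an operator if an operand can follow it.
        return self.peek() == word and self.peek(1) not in (None, ")")

    def parse(self):
        tree = None
        while self.peek() is not None:
            if self.peek() == ")":
                self.pos += 1  # stray close bracket: ignore it
            else:
                tree = join("AND", tree, self.parse_or())
        return tree

    def parse_or(self):
        left = self.parse_and()
        while self.is_operator("OR"):
            self.pos += 1
            left = join("OR", left, self.parse_and())
        return left

    def parse_and(self):
        left = self.parse_atom()
        while self.peek() not in (None, ")") and not self.is_operator("OR"):
            if self.is_operator("AND"):
                self.pos += 1  # explicit AND; otherwise AND is implicit
            left = join("AND", left, self.parse_atom())
        return left

    def parse_atom(self):
        tok = self.peek()
        self.pos += 1
        if tok == "(":
            inner = None if self.peek() in (None, ")") else self.parse_or()
            if self.peek() == ")":
                self.pos += 1  # a missing ")" is simply implied
            return inner
        return ("TERM", tok.lower())


def parse_query(text):
    return TolerantParser(text).parse()
```

With these rules, `parse_query("a OR (b")` yields the same tree as `parse_query("a OR b")`, and `parse_query("cats AND")` treats the trailing `AND` as a search term instead of reporting an error.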
> 3. To what extent does the xapian queryparser at present take part in
> optimising the search?
Hardly at all, it just tries to build sane Query object trees which
reflect the intended meaning.

The query optimisation is done as it is converted into a PostList tree,
which means all Query objects benefit from these optimisations, not only
those built from parsing user strings by a QueryParser object.
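
A toy sketch of why optimising at that stage helps every caller: below, the conversion step applies one made-up optimisation (merging nested nodes that use the same operator), and a tree built by hand benefits exactly as much as one that came from a parser. The tuple representation and the `build_postlist` function are hypothetical, and the flattening rule is only a stand-in; it is not a description of what Xapian's matcher actually does.

```python
def build_postlist(query):
    """Convert a query tree into an execution plan, optimising on the way."""
    if query[0] == "TERM":
        return query
    op, *children = query
    flat = []
    for child in map(build_postlist, children):
        if child[0] == op:
            # Stand-in optimisation: merge a same-operator child into its parent.
            flat.extend(child[1:])
        else:
            flat.append(child)
    return (op, *flat)


a, b, c = ("TERM", "a"), ("TERM", "b"), ("TERM", "c")

# A tree as a parser might produce it (left-nested)...
from_parser = ("OR", ("OR", a, b), c)
# ...and one assembled directly by application code (right-nested).
built_by_hand = ("OR", a, ("OR", b, c))

plan_parsed = build_postlist(from_parser)
plan_manual = build_postlist(built_by_hand)
```

Both inputs end up as the same flat three-way OR, regardless of how the tree was constructed.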
> Based on the understanding till now, I would also like to extent the
> project by proposing some more things :
> 1. Pre-Analysing the query and making efficient changes at parser level
> using some algorithms so as to make the search more efficient.
I'm not at all clear what you have in mind, but I suspect this is the
wrong place to optimise the query.
> 2. Aid in Relevancy Ranking
You'll need to elaborate on this one too...
> 3. To maintain a log of queries searched and processing and ranking them
> using algorithms . Using of these logs will make the parser more efficient.
How will it make the parser more efficient?
> The things proposed above will lead to some pre-search filtering which is
> best done at Parser level.
What will get filtered here?

Cheers,
    Olly

Sehaj Singh Kalra

2012-Mar-22 07:50 UTC


[Xapian-devel] GSOC 2012 : QueryParser Reimplementation

I have sent a reply with some attached photographs, but the Xapian mailing
list says that the message is awaiting approval because its size is too big.
Kindly approve it.
Cheers,
Sehaj

On Thu, Mar 22, 2012 at 1:17 PM, Sehaj Singh Kalra <sehaj.sk at gmail.com> wrote:
> As you mentioned, the user query is a mix of formal grammar (we want to
> support operators with precedence and brackets to control that) and more
> free form text.
> I was suggesting some ways to improve the latter.
> I have attached some pics with this mail, kindly go through it and you
> will have some idea as to what I was trying to say.
> Maintaining logs will improve parser as the present query can be matched
> against the recent queries. This way, suppose for example, if we find the
> exact query, the time taken by search engine
> can be reduced. Also even if the exact query can't be found,  this will
> help parser in making sane and better Query object trees by matching
> against some logs and using algorithms like longest common sub-sequence
> etc. This way query can be modified a  bit to make more sense from the free
> form text.
>
> These were the plans suggested to improve parser functioning.
> Please guide me, about the other ways in which the parser can be modified
> for better outputs.
>
> Note : The pics attached are from a patent document whose URL is
> http://www.google.co.in/patents?hl=en&lr=&vid=USPAT6766320&id=Q4MSAAAAEBAJ&oi=fnd&dq=query+parser+for+search+engine&printsec=abstract#v=onepage&q=query%20parser%20for%20search%20engine&f=false
> This is just used to reflect the ideas which are present in many search
> papers.
>
> Cheers,
> Sehaj

Olly Betts

2012-Mar-22 08:26 UTC


[Xapian-devel] GSOC 2012 : QueryParser Reimplementation

On Thu, Mar 22, 2012 at 01:20:06PM +0530, Sehaj Singh Kalra wrote:
> I have sent a reply with some attached photographs but xapian mailing list
> is saying that message is awaiting approval because it's size is big.
I'm not sure how photographs would be relevant, but anyway the size
limit is in place for a reason.  Please put large attachments on your
website and send a link.  That way only those who want them have to
download them.

Cheers,
    Olly
