thr3ads.net - Xapian devel - GSoC aspirant - guruprasad hegde [Mar 2018]

Guruprasad Hegde

2018-Mar-23 07:05 UTC

GSoC aspirant - guruprasad hegde

Hi,

I plan to propose the 'Math-aware search' project.

After a literature review on the topic, I found that either the Tangent or
the MIaS system would be a good starting point, so I studied both closely.

I plan to pick Tangent because it performs better. It also has good
literature (a thesis report and a few papers) and reference code
available.

I have summarised both systems below, and I welcome any opinion on the choice.

Tangent:
Indexing stage:
Each document contains math formulae and text. Text is indexed in the
usual way.

Preprocessing: Math formula (Presentation MathML) => symbol layout tree =>
generate symbol pair tuples
Indexing: store the tuples in the inverted index
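
To make the "generate symbol pair tuples" step concrete, here is a minimal
sketch in Python. It assumes a much simplified symbol layout tree: the `Node`
class, the edge labels ('n', 'a', 'b') and the `window` parameter are
illustrative choices, not taken from the Tangent reference code, and the
tuples Tangent actually stores may carry more information.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One symbol in a simplified symbol layout tree (hypothetical structure)."""
    symbol: str
    # Edge label -> child node. Assumed labels: 'n' = next symbol on the
    # baseline, 'a' = above (superscript), 'b' = below (subscript).
    children: dict = field(default_factory=dict)


def symbol_pairs(root, window=None):
    """Return (ancestor, descendant, path) for every ancestor/descendant pair.

    `path` is the string of edge labels leading from ancestor to descendant.
    If `window` is given, pairs further apart than `window` edges are skipped.
    """
    pairs = []

    def descend(ancestor, node, path):
        for label, child in node.children.items():
            new_path = path + label
            if window is None or len(new_path) <= window:
                pairs.append((ancestor.symbol, child.symbol, new_path))
                descend(ancestor, child, new_path)

    def visit(node):
        descend(node, node, "")
        for child in node.children.values():
            visit(child)

    visit(root)
    return pairs


# x^2 + y: 'x' has the superscript '2' and is followed by '+', then 'y'.
tree = Node("x", {"a": Node("2"), "n": Node("+", {"n": Node("y")})})
for pair in symbol_pairs(tree):
    print(pair)
# ('x', '2', 'a')
# ('x', '+', 'n')
# ('x', 'y', 'nn')
# ('+', 'y', 'n')
```

Each tuple would then become one term in the inverted index. A small `window`
keeps the number of tuples per formula down, at the cost of ignoring
relationships between symbols that are far apart.
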
Searching stage:
Query (Presentation MathML) => symbol layout tree => generate symbol pair
tuples => form a query with the logical OR operator => select candidate
documents using the Dice coefficient metric => re-rank the documents using
the MSS metric.
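
The candidate selection step can be sketched as follows. This is a
simplified, set-based Dice coefficient, 2|Q ∩ D| / (|Q| + |D|), over symbol
pair tuples. The function names, the in-memory `index` dictionary and the
sample tuples are assumptions for illustration, and the exact scoring in the
Tangent paper may differ.

```python
def dice(query_pairs, doc_pairs):
    """Dice coefficient between two collections of symbol pair tuples."""
    q, d = set(query_pairs), set(doc_pairs)
    if not q and not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))


def top_candidates(query_pairs, index, k=10):
    """Rank formulae sharing at least one tuple with the query (OR semantics).

    `index` maps a formula id to its set of tuples. A real system would read
    posting lists from the inverted index instead of scanning every formula.
    """
    scored = [(dice(query_pairs, pairs), fid) for fid, pairs in index.items()]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:k]


query = {("x", "2", "a"), ("x", "+", "n")}
index = {
    "f1": {("x", "2", "a"), ("x", "+", "n"), ("+", "y", "n")},
    "f2": {("x", "2", "a")},
    "f3": {("a", "b", "n")},
}
print([(round(score, 3), fid) for score, fid in top_candidates(query, index)])
# [(0.8, 'f1'), (0.667, 'f2')]
```

Only the top candidates from this cheap step would be passed on to the more
expensive MSS re-ranking.
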

MIaS:
Indexing stage:

Preprocessing: Math formula (Presentation MathML) => tokenization =>
formula (token) modification
Indexing: index each token with its proper weight (discussed in the paper)

Formula modification = ordering + unification of variables + unification
of constants

Searching stage:
Query (Presentation MathML) => formula modification => form a query with
the logical OR operator => retrieve using a text search engine
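
The "formula modification" step can be illustrated with a rough sketch. It
assumes formulae are already parsed into nested tuples of the form
(operator, operand, ...). The placeholder names `id1`, `id2` and `const`, the
set of commutative operators and the helper names are illustrative
assumptions, and the weighting that MIaS applies to each variant is left out.

```python
import re

COMMUTATIVE = {"+", "*"}                 # assumed set of commutative operators
NUMBER = re.compile(r"\d+(\.\d+)?")


def order(expr):
    """Sort operands of commutative operators so a+b and b+a look the same."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [order(arg) for arg in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)


def unify(expr):
    """Replace variables with id1, id2, ... and numbers with 'const'."""
    names = {}

    def walk(e):
        if isinstance(e, tuple):
            op, *args = e
            return (op, *[walk(arg) for arg in args])
        if NUMBER.fullmatch(e):
            return "const"
        # Number variables by order of first appearance.
        return names.setdefault(e, f"id{len(names) + 1}")

    return walk(expr)


f1 = ("+", ("*", "a", "2"), "b")         # a*2 + b
f2 = ("+", "y", ("*", "3", "x"))         # y + 3*x

print(order(f1))         # ('+', 'b', ('*', '2', 'a'))
print(unify(order(f1)))  # ('+', 'id1', ('*', 'const', 'id2'))
print(unify(order(f2)))  # ('+', 'id1', ('*', 'const', 'id2'))
```

After ordering and unification the two formulae become identical, which is
what lets a plain text search engine match structurally similar formulae.
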

I plan to send the draft proposal by the end of the day.

I have also put some thought into the implementation.
I believe the major work is in the preprocessing and searching stages (the
latter needs a new weighting metric implementation). The existing indexing
technique can be used for the math part as well.
My plan is to implement formula-only retrieval first (documents contain only
math) and add keyword support (document = text + math) later.
Support for queries in LaTeX format would also be added later.

Please let me know if you have any comments or questions on the points I
mentioned.
Sorry for the delay. I would like to mention that I am actively preparing
by reading the Xapian codebase and the literature.
Thank you.

Regards,
Guruprasad

Link for Tangent paper:
https://www.cs.rit.edu/~rlaz/files/ntcir2016_tangent.pdf
Link for MIaS paper:
https://link.springer.com/chapter/10.1007/978-3-642-22673-1_16

Gaurav Arora

2018-Mar-26 10:51 UTC

> I plan to pick Tangent because it performs better. Also, it has a good
> literature(thesis report and few papers available) and reference code
> available.

Great! Having reference code would definitely help. It seems a decent
choice.

> Tangent:
> Indexing stage:
> Each document contains math formula and text. Text indexing is done in a
> usual way.
>
> Preprocessing: Math formula (Presentation MathML) => symbol layout tree =>
> generate symbol pair tuples
> Indexing: store the tuples in the inverted index
> Searching stage:
> Query(PresentationMathML) => symbol layout tree => Generate symbol pair
> tuples => Form a query with logical OR operator => Candidate documents
> selection using dice coefficient metric => ReRanking the documents using
> MSS metric.

I think entering Presentation MathML as a query would be difficult for
users. As a first step, we should aim to parse LaTeX during query
processing. The query format should be LaTeX or some other representation
that is easy for a user to type into a search box. For example, I can search
on Google by typing X^{2} + Y^{2}: Google Search
<https://www.google.co.in/search?q=x%5E%7B2%7D%2BY%5E%7B2%7D&gws_rd=cr&dcr=0&ei=Zcu4WtW2DMXPvgTOoL2gCA>

Check out this link <http://saskatoon.cs.rit.edu/min/>, where you can draw
an equation and search on Tangent, Google and other search engines.
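
As a rough idea of what the very first stage of LaTeX query handling could
look like, here is a minimal tokenizer sketch in Python. The token classes
and the `tokenize_latex` name are assumptions for illustration. A real
implementation would still need a parser on top of this to build the symbol
layout tree, or could convert LaTeX to MathML with an existing tool.

```python
import re

TOKEN = re.compile(r"""
    \\[A-Za-z]+       # control word such as \frac or \alpha
  | \\.               # control symbol such as \{ or \,
  | \d+(?:\.\d+)?     # number
  | [A-Za-z]          # single-letter identifier
  | \S                # any other visible character: ^ _ { } + - = ( ) ...
""", re.VERBOSE)


def tokenize_latex(query):
    """Split a LaTeX math query into a flat list of tokens, dropping spaces."""
    return TOKEN.findall(query)


print(tokenize_latex("X^{2} + Y^{2}"))
# ['X', '^', '{', '2', '}', '+', 'Y', '^', '{', '2', '}']
```
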

I expect re-ranking should be kept as the last module, since it brings some
complexity: we would need to store the symbol layout tree efficiently with
each document to be able to calculate the MSS metric.

Have you estimated how much time the following tasks would take to
implement, and how complex they would be?
1. Converting Presentation MathML -> symbol layout tree -> symbol pair
tuples.
2. Ranking formulae using the Dice coefficient metric.

> I also put some thoughts on implementation here.
> I believe the major work is in preprocessing and searching stage(new
> weight metric implementation). Existing indexing technique can be used for
> math part as well.
> My plan is to implement only formulae retrieval first(document has only
> math) and add keyword support(document = text + math) later.

One can use the set of Wikipedia documents
<http://www.cs.rit.edu/~rlaz/Wiki_formulas_v0.1.tar.bz2> provided by MathIR
at NTCIR for initial indexing. They contain only math formulae. In general,
adding keyword support won't be very difficult, so it would be OK to focus
initially on documents that contain only math.

> Later also add support for the query in latex format.

I think queries would need to support LaTeX from the start, since we can't
expect users to enter Presentation MathML. The query should be something a
user can enter easily; it would be difficult to write a markup language in
a search box. Tangent seems to support both MathML and TeX. We can support
both, or start with TeX.

> Link for Tangent paper:
> https://www.cs.rit.edu/~rlaz/files/ntcir2016_tangent.pdf

You could also refer to their initial paper at
https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf
Reference implementation: https://github.com/DPRL/tangent

- Gaurav Arora

Guruprasad Hegde

2018-Mar-26 14:05 UTC

Please find the draft proposal at this link:
https://github.com/guruhegde/xapian-gsoc-proposal
It is still a work in progress.

Question: If we index math terms (symbol pair tuples) in the same DB as
the text data, do you think that implicitly adding a new field prefix to
the math terms would help performance in some way, for cases like searching
only text terms or only math terms?
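
To illustrate what is meant by a field prefix for math terms, here is a
small Python sketch. By Xapian convention user-defined term prefixes start
with 'X', but the `XMATH` prefix, the `:` and `|` separators and the helper
names are all made up for this example. The strings it produces are what
would be added to a document as terms.

```python
MATH_PREFIX = "XMATH"   # hypothetical prefix reserved for math terms


def math_term(pair):
    """Encode one (symbol, symbol, path) tuple as a prefixed term string."""
    s1, s2, path = pair
    return f"{MATH_PREFIX}:{s1}|{s2}|{path}"


def document_terms(words, pairs):
    """Terms for a document that contains both text and math."""
    return [word.lower() for word in words] + [math_term(p) for p in pairs]


def is_math_term(term):
    return term.startswith(MATH_PREFIX + ":")


terms = document_terms(["Pythagorean", "theorem"],
                       [("x", "2", "a"), ("x", "+", "n")])
print(terms)
# ['pythagorean', 'theorem', 'XMATH:x|2|a', 'XMATH:x|+|n']
print([t for t in terms if is_math_term(t)])
# ['XMATH:x|2|a', 'XMATH:x|+|n']
```

With such a prefix a math term can never collide with an ordinary word, so
text-only and math-only queries use disjoint sets of terms. Whether that
also helps performance is the open question here.
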

Regards,
Guruprasad


On Mon, Mar 26, 2018 at 3:27 PM, Gaurav Arora <gauravarora.daiict at
gmail.com> wrote:
>
>> I thought that if I start with MathML as input and build the core, I
>> can then extend the system to support other query/document types by
>> looking for third-party tools available for C++. At the moment, I don't
>> have any idea about this. What do you think?
>>
>> We can look into the options in the bonding period too. For now, I can
>> make LaTeX to MathML the first step in the proposal and shuffle the
>> steps later, right?
>
> The proposal needs to account for that, i.e. it should state that search
> through LaTeX will be supported and merged before the end of GSoC. It
> can be done at any point. It's perfectly fine to build the core using
> the MathML representation initially.
>
>> Generating the symbol layout tree requires implementing a parser. I
>> guess it involves a good amount of text processing. Since it's a
>> standard problem, I hope it should not be hard, but it requires handling
>> many scenarios. I plan to read about the parser and try implementing
>> small examples first in the coming days.
>
> That would be great :)
>
>> I feel generating symbol pairs will be easy once I build the tree.
>>
>> Do you think I should come up with some sort of pseudocode in the
>> proposal?
>
> Would definitely help.
>
>> With other weight metric implementations available and with the
>> existing indexing structure, I feel getting the stats and implementing
>> this would not be hard.
>
> A basic check and estimate would help to gauge the time this would take,
> so that the project timeline can be planned accordingly.
>
>> I have been working on the draft. I am really sorry about the delay in
>> the draft. Hope to make up for that with some good work :)
>
> The sooner you show us the draft version, the better your chance of
> getting feedback from us and improving your proposal.
>
> - Gaurav Arora

Guruprasad Hegde

2018-Apr-24 09:53 UTC

Hi All,

Thank you for accepting the proposal.

For the next couple of days, I plan to re-read the related papers with the
implementation in mind, and to understand the Xapian code (backend part).

Please suggest any other activity I can do.

Regards,
Guruprasad

