Hi,

I plan to propose the 'Math Aware search' project. After a literature review on the topic, I found that either the Tangent or the MIaS system would be a good starting point, so I studied both systems closely. I plan to pick Tangent because it performs better. It also has good literature (a thesis report and a few papers) and reference code available. Below is a summary of both systems; I welcome any opinion on the choice.

Tangent:

Indexing stage:
Each document contains math formulae and text. Text indexing is done in the usual way.
  Preprocessing: Math formula (Presentation MathML) => symbol layout tree => generate symbol pair tuples
  Indexing: store the tuples in an inverted index

Searching stage:
  Query (Presentation MathML) => symbol layout tree => generate symbol pair tuples => form a query with the logical OR operator => select candidate documents using the Dice coefficient metric => re-rank the documents using the MSS metric

MIaS:

Indexing stage:
  Preprocessing: Math formula (Presentation MathML) => tokenization => formula (token) modification
  Indexing: index each token with a proper weight (discussed in the paper)
  Formula modification = ordering + unification of variables + unification of constants

Searching stage:
  Query (Presentation MathML) => formula modification => form a query with the logical OR operator => retrieve using a text search engine

I plan to send the draft proposal by the end of the day.

I have also put down some thoughts on implementation here. I believe the major work is in the preprocessing and searching stages (a new weight metric implementation). The existing indexing technique can be used for the math part as well. My plan is to implement formula-only retrieval first (documents contain only math) and add keyword support (document = text + math) later. After that, I would also add support for queries in LaTeX format.

Please let me know if you have any comments or questions on the points I mentioned. Sorry for the delay.
I would also like to mention that I am actively preparing by reading the Xapian codebase and the literature.

Thank you.

Regards,
Guruprasad

Link for Tangent paper: https://www.cs.rit.edu/~rlaz/files/ntcir2016_tangent.pdf
Link for MIaS paper: https://link.springer.com/chapter/10.1007/978-3-642-22673-1_16

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20180323/bbb1722a/attachment.html>
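The Tangent preprocessing pipeline summarized above (Presentation MathML => symbol layout tree => symbol pair tuples) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Tangent's reference code: it handles only `mi`, `mn`, `mo`, `mrow`, `msup` and `msub`, and the edge labels (`n`, `a`, `b`), the `window` parameter and the tuple layout (ancestor, descendant, path) are simplifications chosen here for illustration. The papers describe the exact tuple format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """One symbol in a (simplified) symbol layout tree."""
    symbol: str
    # Outgoing edges keyed by an assumed relation label:
    # 'n' = next symbol on the baseline, 'a' = superscript, 'b' = subscript.
    edges: dict = field(default_factory=dict)


def build(elem):
    """Return (head, tail) of the layout subtree for one MathML element."""
    tag = elem.tag.split('}')[-1]  # drop an XML namespace if present
    if tag in ('mi', 'mn', 'mo'):
        node = Node((elem.text or '').strip())
        return node, node
    if tag in ('math', 'mrow'):
        head = tail = None
        for child in elem:
            h, t = build(child)
            if h is None:  # empty group, nothing to link
                continue
            if head is None:
                head = h
            else:
                tail.edges['n'] = h  # chain siblings along the baseline
            tail = t
        return head, tail
    if tag in ('msup', 'msub'):
        base, script = list(elem)
        base_head, base_tail = build(base)
        script_head, _ = build(script)
        base_tail.edges['a' if tag == 'msup' else 'b'] = script_head
        return base_head, base_tail
    raise ValueError(f'unsupported element: {tag}')


def symbol_pairs(root, window=None):
    """List (ancestor, descendant, path) tuples; window caps the path length."""
    pairs = []

    def descend(ancestor, node, path):
        for label, child in node.edges.items():
            new_path = path + label
            if window is not None and len(new_path) > window:
                continue
            pairs.append((ancestor.symbol, child.symbol, new_path))
            descend(ancestor, child, new_path)

    def visit(node):
        descend(node, node, '')
        for child in node.edges.values():
            visit(child)

    if root is not None:
        visit(root)
    return pairs


def pairs_from_mathml(xml_text, window=None):
    """Parse a Presentation MathML string and return its symbol pair tuples."""
    head, _ = build(ET.fromstring(xml_text))
    return symbol_pairs(head, window)
```

For x^2 + y, written as `<math><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mi>y</mi></math>`, this yields tuples such as ('x', '2', 'a') and ('x', '+', 'n'). A real parser would need to cover many more elements (fractions, roots, matrices), which is where most of the preprocessing effort is likely to go.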
> I plan to pick Tangent because it performs better. Also, it has a good
> literature(thesis report and few papers available) and reference code
> available.

Great! Having reference code would definitely help. It seems a decent choice.

> Tangent:
> Indexing stage:
> Each document contains math formula and text. Text indexing is done in a
> usual way.
>
> ======preprocessing================= ===indexing===
> Math Formula(PresentationMathML) => Symbol Layout Tree => Generate Symbol
> pair tuples => Store in Inverted Index
> Searching stage:
> Query(PresentationMathML) => symbol layout tree => Generate symbol pair
> tuples => Form a query with logical OR operator=> Candidate documents
> selection using dice coefficient metric => ReRanking the documents using
> MSS metric.

I think entering Presentation MathML as a query would be difficult for users. As a first step, we should aim to parse LaTeX during query processing. The query should be LaTeX or some other representation that is easy for a user to type into a search box. For example, I can search on Google by typing X^{2} + Y^{2}:
Google Search <https://www.google.co.in/search?q=x%5E%7B2%7D%2BY%5E%7B2%7D&gws_rd=cr&dcr=0&ei=Zcu4WtW2DMXPvgTOoL2gCA>

Check out this link <http://saskatoon.cs.rit.edu/min/>, where you can draw an equation and search on Tangent, Google and other search engines.

I expect re-ranking should be kept as the last module, since it has some complexities: for example, we would need to store the symbol layout tree efficiently with the document to be able to calculate the MSS metric.

Have you estimated how much time the following tasks would take to implement, and how complex they would be?
1. Implementation to convert Presentation MathML -> symbol layout tree -> symbol pair tuples
2. Implementation to rank formulae using the Dice coefficient metric

> I also put some thoughts on implementation here.
> I believe the major work is in preprocessing and searching stage(new
> weight metric implementation). Existing indexing technique can be used for
> math part as well.
> My plan is to implement only formulae retrieval first(document has only
> math) and add keyword support(document = text + math) later.

One can use the set of Wikipedia documents <http://www.cs.rit.edu/~rlaz/Wiki_formulas_v0.1.tar.bz2> provided by MathIR at NTCIR for initial indexing. They contain only math formulae. In general, adding keyword support won't be very difficult. It would be OK to focus initially on documents containing only math.

> Later also add support for the query in latex format.

I expect the query would need to support LaTeX from the start, since we can't expect users to enter Presentation MathML. It should be something a user can easily enter as a search query; it would be difficult to write a markup language in a search text box. Tangent seems to support both MathML and TeX. We can support both, or start with TeX.

> Link for Tangent paper:https://www.cs.rit.edu/~
> rlaz/files/ntcir2016_tangent.pdf

You could also refer to their initial paper at https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf
Reference implementation: https://github.com/DPRL/tangent

- Gaurav Arora
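To help with the estimate for the Dice coefficient task mentioned above, here is a minimal Python sketch of candidate selection. It assumes each formula is represented as a set of symbol pair tuples and scores a candidate as 2|Q ∩ D| / (|Q| + |D|). The function names, the in-memory `index` dictionary and the `top_k` parameter are hypothetical and exist only for illustration. Whether the real system should use sets or multisets of tuples is left open here.

```python
def dice(query_pairs, doc_pairs):
    """Dice coefficient between two collections of symbol pair tuples."""
    q, d = set(query_pairs), set(doc_pairs)
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))


def select_candidates(query_pairs, index, top_k=10):
    """Return up to top_k (doc_id, score) pairs, best score first.

    `index` is a hypothetical dict mapping doc_id -> iterable of tuples.
    Documents that share no tuple with the query are dropped.
    """
    scored = [(dice(query_pairs, pairs), doc_id) for doc_id, pairs in index.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Sort by descending score, breaking ties by doc_id for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


query = [('x', '2', 'a'), ('x', '+', 'n')]
index = {
    'A': [('x', '2', 'a'), ('x', '+', 'n'), ('+', 'y', 'n')],
    'B': [('x', '2', 'a')],
    'C': [('a', 'b', 'n')],
}
ranked = select_candidates(query, index, top_k=2)
```

In an actual Xapian-based implementation the intersection sizes would presumably come from the postings matched by the OR query rather than from in-memory sets, so this only illustrates the scoring arithmetic.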
Please find the draft proposal at this link: https://github.com/guruhegde/xapian-gsoc-proposal
It is still a work in progress.

Question: If we index math terms (symbol pair tuples) in the same DB along with the text data, do you think that implicitly adding a new field prefix for math terms would help performance in cases like searching only text terms or only math terms?

Regards,
Guruprasad

On Mon, Mar 26, 2018 at 3:27 PM, Gaurav Arora <gauravarora.daiict at gmail.com> wrote:

>> I thought if I start with the MathML as input and build the core, then I
>> can extend the system to support any other query/document type by looking
>> for third party tools available for c++. At the moment, I don't have any
>> idea about this. What do you think?
>>
>> We can look for the option in bonding period too. For now, I can make
>> latex to mathml as first step in proposal and shuffle the steps later right?
>
> The proposal needs to account for that, i.e. it should plan for search
> through LaTeX to be supported and merged before the end of GSoC. It can
> be done at any time. It's perfectly fine to build the core using the
> MathML representation initially.
>
>> Generating symbol layout tree requires implementing parser. I guess it
>> involves good amount of text processing. Since it's standard problem, I
>> hope it should not be hard, but requires handling many scenarios. I plan to
>> read about the parser and try implementing small examples first in coming
>> days.
>
> That would be great :)
>
>> I feel generating symbol pair will be easy once I build the tree.
>>
>> Do you think I should come up with some sort of pseudocode in proposal?
>
> Would definitely help.
>
>> With other weight metric implementations available and with existing
>> indexing structure, I feel getting the stats and implementing this would
>> not be hard.
>
> A basic check and estimate would help to gauge the time this would take,
> so that you can plan the project timeline accordingly.
>
>> I have been working on the draft. I am really sorry about the delay in
>> draft. Hope to make up for that with some good work :)
>
> The sooner you show us the draft version, the better your chance of
> getting feedback from us and improving your proposal.
>
> - Gaurav Arora
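One way to picture the field prefix question raised above is the following Python sketch. It assumes a hypothetical prefix `XM` for math terms (Xapian's documented convention is that user-defined prefixes start with `X`) and a made-up encoding of a symbol pair tuple into a term string. Text terms are lowercased, so they can never collide with the capitalized prefix. It does not use the Xapian API and says nothing definitive about performance; it only shows how the two kinds of terms could share one term space while staying distinguishable.

```python
MATH_PREFIX = "XM"  # hypothetical prefix chosen for this sketch


def math_term(pair):
    """Encode an (ancestor, descendant, path) tuple as one prefixed term.

    The 'path:ancestor:descendant' layout is an assumption; a real encoding
    would also need to escape symbols that contain ':'.
    """
    ancestor, descendant, path = pair
    return f"{MATH_PREFIX}{path}:{ancestor}:{descendant}"


def document_terms(text, pairs):
    """Return the text terms followed by the prefixed math terms."""
    text_terms = [word.lower() for word in text.split()]
    return text_terms + [math_term(p) for p in pairs]


def is_math_term(term):
    """Lowercased text terms never start with the uppercase prefix."""
    return term.startswith(MATH_PREFIX)


terms = document_terms("Pythagoras theorem", [("x", "2", "a"), ("x", "+", "n")])
math_only = [t for t in terms if is_math_term(t)]
text_only = [t for t in terms if not is_math_term(t)]
```

With such a scheme, a text-only query would only ever look up unprefixed terms and a math-only query only `XM` terms, so the main benefit is probably avoiding collisions between the two vocabularies.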
Hi All,

Thank you for accepting the proposal.

For the next couple of days, I plan to re-read the related papers with the implementation in mind and to understand the Xapian code (the backend part). Please suggest any other activity I could do.

Regards,
Guruprasad

On Mon, Mar 26, 2018 at 4:05 PM, Guruprasad Hegde <guruhegde1308 at gmail.com> wrote:

> Please find the draft proposal with this link:
> https://github.com/guruhegde/xapian-gsoc-proposal
> It is still work in progress.
>
> Question: If we index math terms(symbol pair tuples) in the same DB along
> with the text data, do you think, adding field prefix(making a new one)
> implicitly for math terms, help in some way w.r.t performance for cases
> like searching only text terms or only math terms?
>
> Regards,
> Guruprasad
>
> On Mon, Mar 26, 2018 at 3:27 PM, Gaurav Arora <
> gauravarora.daiict at gmail.com> wrote:
>
>>> I thought if I start with the MathML as input and build the core, then I
>>> can extend the system to support any other query/document type by looking
>>> for third party tools available for c++. At the moment, I don't have any
>>> idea about this. What do you think?
>>>
>>> We can look for the option in bonding period too. For now, I can make
>>> latex to mathml as first step in proposal and shuffle the steps later right?
>>
>> Proposal need to account for doing that. i.e proposal should account that
>> before end of GSOC search through latex should be supported and merged. It
>> can be done anytime. It's perfectly fine to build the core using MathML
>> representation initially.
>>
>>> Generating symbol layout tree requires implementing parser. I guess it
>>> involves good amount of text processing. Since it's standard problem, I
>>> hope it should not be hard, but requires handling many scenarios. I plan to
>>> read about the parser and try implementing small examples first in coming
>>> days.