Jiarong Wei
2014-Mar-11 02:37 UTC
[Xapian-devel] [GSOC 2013] Question about indexing INEX dataset
Hi,

I'm trying to use Omega to index the INEX dataset for Letor, but omindex told me these XML files are of an unknown type. Olly told me I could tell omindex to handle them as HTML. (Thanks Olly :) ) Is that appropriate? Parth, could you give me some suggestions?

Thank you!

Jiarong Wei
Parth Gupta
2014-Mar-11 10:40 UTC
[Xapian-devel] [GSOC 2013] Question about indexing INEX dataset
Yes, treating them as HTML is fine. We did not face any problems with it.

Parth
Olly Betts
2014-Mar-11 23:50 UTC
[Xapian-devel] [GSOC 2013] Question about indexing INEX dataset
On Tue, Mar 11, 2014 at 11:40:42AM +0100, Parth Gupta wrote:
> Yes, treating them as HTML is fine. We did not face any problems with it.

It's not a bad way to get things working - the XML format uses <title>, so that will be picked up by the HTML parser, and the document's body text is in a <bdy> tag, which the parser will not understand, so it just gathers up all the text inside it.

But before the <bdy> tag there is some other metadata inside other tags which the HTML parser won't know either, so it will treat these as more body text, when we really want to handle this metadata specially.

It wouldn't be hard to add a special parser for this format which handles this better - opendocparse.cc and opendocparse.h are probably a good similar example to look at.

Cheers,
Olly
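To make the distinction concrete, here is a minimal Python sketch of the split described above: the title, the metadata that appears before <bdy>, and the body text are kept apart instead of being merged into one blob of text. This is not Omega code and not how opendocparse.cc works; it only illustrates the idea. Only <title> and <bdy> come from the thread. The sample document, the <header>, <id> and <category> element names, and the split_document function are assumptions made up for illustration, and the sketch assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Only <title> and <bdy> are named in the
# thread; <header>, <id> and <category> are invented for illustration.
SAMPLE = """<article>
  <header>
    <title>Information retrieval</title>
    <id>15271</id>
    <category>Computer science</category>
  </header>
  <bdy>Information retrieval is the <b>science</b> of searching.</bdy>
</article>"""


def split_document(xml_text):
    """Separate title, pre-body metadata and body text (illustrative only)."""
    root = ET.fromstring(xml_text)

    # The title is what the HTML parser would already pick up.
    title = (root.findtext(".//title") or "").strip()

    # Body text: everything inside <bdy>, with nested markup flattened
    # and runs of whitespace collapsed.
    bdy = root.find(".//bdy")
    body = ""
    if bdy is not None:
        body = " ".join("".join(bdy.itertext()).split())

    # Metadata: leaf elements seen in document order before <bdy>,
    # excluding the title. An HTML parser would lump these into the body.
    metadata = {}
    for elem in root.iter():
        if elem is bdy:
            break
        is_leaf = len(elem) == 0
        text = (elem.text or "").strip()
        if is_leaf and elem.tag != "title" and text:
            metadata.setdefault(elem.tag, []).append(text)

    return {"title": title, "metadata": metadata, "body": body}


doc = split_document(SAMPLE)
print(doc["title"])
print(doc["metadata"])
print(doc["body"])
```

With the fields separated like this, an indexer could put the metadata into its own fields or prefixed terms rather than indexing it as body text. How that maps onto Omega is left open here.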