Sehaj Singh Kalra
2012-Jul-13 12:39 UTC
[Xapian-devel] Need Suggestions for Sentence Breaking Implementation
Hi,

I have been working on a Link Grammar interface so that POS tagging can be used while indexing documents. The interface header and the implementation file are complete, and you can view them at < https://github.com/sehaj-sk/xapian/commit/052d634e1986bcf5607e43f52ac3e07646920196> and < https://github.com/sehaj-sk/xapian/commit/3015223662986d7a180d77101d6f4664f6552144> respectively.

After that I tried using it to index documents. Here's the code that uses the Link Grammar interface for POS-tagged indexing in termgenerator: < https://github.com/sehaj-sk/xapian/commit/4ed8e505b44581fcc038598ec0b7cd011e42f8da>

I have added a simple example in the xapian-core/examples directory that shows the results of this feature. The example is at < https://github.com/sehaj-sk/xapian/commit/75c2e4749e9084fca5f390b88d565cb117e90d38>

At present it can index only single sentences, so to index a large text I need to break it into sentences first. I need suggestions for doing this Sentence Boundary Disambiguation. Please suggest any paper or algorithm that could be coded up, or any existing library that could be used. The focus at present is on English only.

I have done some searching and here's what I have found -

1. There's a Wikipedia article that covers the problem and the available solutions. < http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation >

2. There are not many solutions available in C/C++. Almost all of them are in either Python or Java.

3. There's a sentence boundary detection algorithm defined by the Unicode Standard. It's at < http://www.unicode.org/reports/tr29/#Sentence%5FBoundaries >

4. An existing C++ API that does this is the BreakIterator class documented here - < http://icu-project.org/apiref/icu4c/classBreakIterator.html > . Here's a line from its doc: "The text boundary positions are found according to the rules described in Unicode Standard Annex #29, Text Boundaries, and Unicode Standard Annex #14, Line Breaking Properties. These are available at < http://www.unicode.org/reports/tr14/ > and < http://www.unicode.org/reports/tr29/ > ."

5. Someone suggested that I use the PCRE (Perl Compatible Regular Expressions) library < http://www.pcre.org/ > (though I don't know much about Perl), so that Perl-style regular expressions can be used from C++. The Wikipedia page above mentions some Perl-compatible regular expressions for sentence breaking. Other than that, someone also suggested taking this Perl module < http://search.cpan.org/~achimru/Lingua-Sentence-1.00/lib/Lingua/Sentence.pm> and coding it up in C++ using PCRE.

6. There are also a lot of research papers on this problem.

Looking forward to quick guidance and suggestions.

Thanks,
Sehaj
Olly Betts
2012-Jul-16 03:56 UTC
[Xapian-devel] Need Suggestions for Sentence Breaking Implementation
On Fri, Jul 13, 2012 at 06:09:38PM +0530, Sehaj Singh Kalra wrote:
> 4. An existing C++ API that does this is BreakIterator class present here -
> < http://icu-project.org/apiref/icu4c/classBreakIterator.html > .
> Here's a line from its doc: "The text boundary positions are found
> according to the rules described in Unicode Standard Annex #29, Text
> Boundaries, and Unicode Standard Annex #14, Line Breaking Properties. These
> are available at < http://www.unicode.org/reports/tr14/ > and <
> http://www.unicode.org/reports/tr29/ > ."

ICU is rather a big dependency, but this is probably a good choice for initial development as it means you can get on with the indexing part, rather than spending time coding up the same Unicode algorithms from scratch.

> 5. Someone suggested me to use PCRE (Perl Compatible Regular Expressions)
> Library < http://www.pcre.org/ > (though I don't know much about Perl) , to
> use Perl Based Regular Expressions and code them in C++ using PCRE. The
> wikipedia page above mentions some Perl Compatible Regular Expressions for
> sentence breaking. Other than that, some one also made a suggestion to use
> this Perl module <
> http://search.cpan.org/~achimru/Lingua-Sentence-1.00/lib/Lingua/Sentence.pm>
> and code it up in C++ using PCRE.

I wouldn't take the regular expression route. While regular expressions can probably do a reasonable job of finding sentence boundaries, ultimately they can't express everything you can with code, so at some point you'll probably find you have to rewrite it without regular expressions anyway.

Cheers,
Olly
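
To illustrate the point about code versus regular expressions, here is a minimal sketch of a hand-coded sentence splitter in plain C++ (standard library only). It is not the UAX #29 algorithm and not what ICU's BreakIterator does; it is a rough English-only heuristic. The function name split_sentences and the abbreviation list are hypothetical and are not part of Xapian, Link Grammar, or ICU. The checks on the preceding word and the following character are the kind of context-dependent logic that is easy to extend in code.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration, not a Xapian API).
// Splits ASCII English text into sentences with a few hand-written rules.
std::vector<std::string> split_sentences(const std::string& text) {
    // Assumed, deliberately tiny abbreviation list; a real one would be longer.
    static const std::set<std::string> abbreviations = {
        "Mr", "Mrs", "Dr", "Prof", "vs", "e.g", "i.e"};

    auto is_space = [](char ch) {
        return std::isspace(static_cast<unsigned char>(ch)) != 0;
    };
    auto is_closer = [](char ch) {
        return ch == '.' || ch == '!' || ch == '?' ||
               ch == '"' || ch == '\'' || ch == ')';
    };

    std::vector<std::string> sentences;
    const std::size_t n = text.size();
    std::size_t start = 0;  // start of the sentence being collected

    for (std::size_t i = 0; i < n; ++i) {
        const char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;

        // Absorb further terminators and closing quotes/brackets.
        std::size_t end = i + 1;
        while (end < n && is_closer(text[end])) ++end;

        // A boundary must be followed by whitespace or the end of the text,
        // which rules out cases such as "3.15" or the first dot of "p.m.".
        if (end < n && !is_space(text[end])) { i = end - 1; continue; }

        if (c == '.') {
            // Rule 1: a period after a known abbreviation is not a boundary.
            std::size_t w = i;
            while (w > start && !is_space(text[w - 1])) --w;
            if (abbreviations.count(text.substr(w, i - w))) {
                i = end - 1;
                continue;
            }
            // Rule 2: a lowercase letter after the period suggests the
            // sentence continues.
            std::size_t next = end;
            while (next < n && is_space(text[next])) ++next;
            if (next < n &&
                std::islower(static_cast<unsigned char>(text[next]))) {
                i = end - 1;
                continue;
            }
        }

        sentences.push_back(text.substr(start, end - start));
        start = end;
        while (start < n && is_space(text[start])) ++start;
        i = start - 1;  // safe: start >= end >= 1; the loop's ++i resumes at start
    }

    // Keep any trailing text that has no terminator.
    if (start < n) sentences.push_back(text.substr(start));
    return sentences;
}
```

Each string returned by split_sentences(text) could then be passed to the single-sentence indexing code one at a time. For initial development the ICU BreakIterator route is still likely to be the more robust choice; this only shows how rules that are awkward to express in a regular expression fit naturally into ordinary code.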