This is a list of some libraries that I have been looking at.

The idea is to discuss the advantages and disadvantages of adding some of
these libraries to Xapian.

If anyone knows of another library that could be added to the list, that
would be great!

Libfreexl:
* For Excel (.xls)
* Last release: 2018-02
* Info: gaia-gis.it/fossil/freexl/index
* License: MPL tri-license

=============================

Libzip:
* For zip archives (C library)
* Last release: 2018-04
* Info: libzip.org
* License: 3-clause BSD

Libzipios++:
* For zip archives
* Last release: 2019-04
* Info: zipios.sourceforge.net
* License: GNU Lesser General Public License (LGPL)

I have been thinking about unzip. It is widely used in omindex, and it might
be an option to replace unzip with one of these libraries. I know that it is
not the best solution, but it could be something to consider for some
formats.

=============================

Djvulibre:
* For DjVu files
* Last release: 2015-02
* Info: djvu.sourceforge.net
* License: GNU General Public License version 2

=============================

Libe-book:
* For ebook formats
* Last release: 2018-01
* It shows little activity
* Status: Beta
* Info: sourceforge.net/projects/libebook/
* License: GNU Lesser GPL 2.1+ and MPL 2.0+

I have been reading the code of this library, but it seems a bit complex.
It could be a good option, but it will take a while to figure out how it
works.

=============================

Libetonyek-dev:
* For Apple iWork documents
* Status: Beta
* Info: wiki.documentfoundation.org/DLP/Libraries/libetonyek
* License: MPL 2.0+

=============================

Libabw:
* For AbiWord documents
* Last release: 2017-12
* Info: wiki.documentfoundation.org/DLP/Libraries/libabw
* License: MPL 2.0

=============================

Other Options:
* libreoffice-dev (SDK)
* libmarkdown2-dev

On Fri, Jun 14, 2019 at 08:52:51AM -0300, Bruno Baruffaldi wrote:
> This is a list of some libraries that I have been looking at.
>
> The idea is to discuss the advantages and disadvantages of adding some of
> these libraries to Xapian.

I think we should prioritise formats which are widely used (among
current and potential users of Omega particularly), and also formats
which we don't already support (or which we could support better by
using a library).

> If anyone knows of another library that could be added to the list, that
> would be great!
>
> Libfreexl:
> * For Excel (.xls)
> * Last release: 2018-02
> * Info: gaia-gis.it/fossil/freexl/index
> * License: MPL tri-license

I've not come across this before. It looks like it is currently only
used in GIS software, which is probably more interested in numbers than
text, so before we commit a lot of effort to supporting it I'd suggest
we try it out and compare how it does with the command line tool we
currently use (xls2csv).

> Libzip:
> * For zip archives (C library)
> * Last release: 2018-04
> * Info: libzip.org
> * License: 3-clause BSD
>
> Libzipios++:
> * For zip archives
> * Last release: 2019-04
> * Info: zipios.sourceforge.net
> * License: GNU Lesser General Public License (LGPL)
>
> I have been thinking about unzip. It is widely used in omindex, and it might
> be an option to replace unzip with one of these libraries. I know that it is
> not the best solution, but it could be something to consider for some
> formats.

I'd suggest libarchive for zip files - it's widely used, and supports
reading other archive formats rather than just zip files (I actually
wrote a prototype patch for omindex a while back to support indexing
files in archive files which used libarchive, though it hasn't been
merged yet).

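To illustrate why a reader that handles several archive formats is attractive: without one, the calling code would need something like the signature check below before deciding which tool to run. This is only a minimal sketch using well-known file signatures; `Container` and `sniff_container` are hypothetical names made up for illustration, not part of omindex or libarchive, and libarchive does its own format detection internally.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration only: not omindex or libarchive API.
enum class Container { Zip, Gzip, SevenZip, Tar, Unknown };

// Guess the container type from the first bytes of a file.
// `head` should hold at least the first 262 bytes if tar is to be detected.
Container sniff_container(const std::string& head) {
    auto starts_with = [&head](const char* magic, std::size_t len) {
        return head.size() >= len && head.compare(0, len, magic, len) == 0;
    };
    // Local file header signature of a non-empty zip archive.
    if (starts_with("PK\x03\x04", 4)) return Container::Zip;
    // gzip member header.
    if (starts_with("\x1f\x8b", 2)) return Container::Gzip;
    // 7z signature.
    if (starts_with("7z\xbc\xaf\x27\x1c", 6)) return Container::SevenZip;
    // POSIX (ustar) tar archives carry "ustar" at byte offset 257.
    if (head.size() >= 262 && head.compare(257, 5, "ustar") == 0)
        return Container::Tar;
    return Container::Unknown;
}
```

Every extra format means another branch here plus another external tool to invoke, which is the maintenance cost a single multi-format library would remove.
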
I think replacing unzip is probably one to prioritise, since we use it
for a number of common formats.

> Djvulibre:
> * For DjVu files
> * Last release: 2015-02
> * Info: djvu.sourceforge.net
> * License: GNU General Public License version 2

While DjVu is an interesting format, it doesn't seem to be widely used
and we can already index these files using the command line djvutxt
tool.

> Libe-book:
> * For ebook formats
> * Last release: 2018-01
> * It shows little activity
> * Status: Beta
> * Info: sourceforge.net/projects/libebook/
> * License: GNU Lesser GPL 2.1+ and MPL 2.0+
>
> I have been reading the code of this library, but it seems a bit complex.
> It could be a good option, but it will take a while to figure out how it
> works.

This is used by libreoffice.

There's a command line tool in the libe-book source to extract text
(though for some reason this tool isn't packaged for Debian, it seems).
You can see the source here, which shows how to use the API to extract
text:

https://sources.debian.org/src/libe-book/0.1.3-1/src/conv/text/ebook2text.cpp/

This would add support for several popular formats we don't currently
support at all, so it seems another one to prioritise.

> Libetonyek-dev:
> * For Apple iWork documents
> * Status: Beta
> * Info: wiki.documentfoundation.org/DLP/Libraries/libetonyek
> * License: MPL 2.0+

We use this via a command line tool currently. I'd guess iWork is popular
on Macs, so this is probably a good candidate.

> Libabw:
> * For AbiWord documents
> * Last release: 2017-12
> * Info: wiki.documentfoundation.org/DLP/Libraries/libabw
> * License: MPL 2.0

This is an XML-based format which we have a built-in parser for, so
there's probably not a lot to gain from using an external library.
It's also not a very widely used format in my experience.

> Other Options:
> * libreoffice-dev (SDK)

I guess this is "libreofficekit"?

I actually maintain a command line tool which is a thin wrapper
around that:

https://gitlab.com/ojwb/lloconv

It works pretty well, but it's rather slow even when reusing the
lok::Office() object (lloconv has a feature where it can fork a
daemon process to allow such reuse).

Much of the import code libreoffice uses has now been split out into
libraries (like libabw, libe-book and libetonyek from your list) and
I think we'd do better to use such libraries directly.

You can find a list of these libraries here:

https://www.documentliberation.org/projects/#import-libs

Cheers,
    Olly

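For the freexl versus xls2csv comparison suggested earlier in this message, one rough way to put a number on "how it does" is to compare the sets of words each extractor produces for the same spreadsheet. The sketch below assumes both outputs are already available as strings; `word_set` and `jaccard` are hypothetical helper names, not Xapian or Omega API, and the tokeniser only understands ASCII, so it is a starting point rather than a proper evaluation.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Split text into a set of lowercased ASCII alphanumeric words.
// Bytes outside ASCII act as separators in the default "C" locale.
std::set<std::string> word_set(const std::string& text) {
    std::set<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            words.insert(current);
            current.clear();
        }
    }
    if (!current.empty()) words.insert(current);
    return words;
}

// Jaccard similarity |A n B| / |A u B|; defined as 1.0 when both sets are empty.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t common = 0;
    for (const auto& w : a) common += b.count(w);
    return static_cast<double>(common) / (a.size() + b.size() - common);
}
```

A score near 1 would suggest the two extractors recover much the same text; where they differ, listing the words found by only one side is likely to be more informative than the score itself.
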
Hello,

I have been looking at libarchive and it seems a great candidate!

I think we can also add libstaroffice
<https://github.com/fosnola/libstaroffice> and libmarkdown2-dev. I wasn't
sure about adding libmarkdown2-dev to the list because I couldn't find much
information about it.

On Sat, Jun 15, 2019 at 00:49, Olly Betts (olly at survex.com) wrote:
> I'd suggest libarchive for zip files - it's widely used, and supports
> reading other archive formats rather than just zip files (I actually
> wrote a prototype patch for omindex a while back to support indexing
> files in archive files which used libarchive, though it hasn't been
> merged yet).
>
> [...]

--
Regards,
Bruno Baruffaldi