Bruno Baruffaldi
2019-Mar-21 12:31 UTC
[GSoC] Questions about project Text-Extraction Libraries
Hello!

I have a few questions related to the Text-Extraction Libraries project.

Firstly, I think that isolating library bugs in subprocesses could work, but I am not sure how to handle deadlocks or infinite loops. I feel that using a timer is the only way to deal with them, but I would like to know what you think.

Secondly, I have been reading the source code of omindex, but I cannot figure out whether it is possible to group all file formats under the same interface. When indexing files, are all file formats treated in a similar way, or are there special formats that require different handling (beyond the use of external filters)?

Finally, I want to know whether omindex uses multithreading for indexing files, or whether you think it could be implemented to speed indexing up.

Cheers,
Bruno Baruffaldi
On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> Firstly, I think that trying to isolate library bugs in subprocesses could get to work, but I am not sure about how to handle deadlocks or infinite loops. I feel that using a timer is the only way to deal with it but I would like to know what you think about it.

There's already code to set a CPU time limit for filter subprocesses (using setrlimit()) and to implement an inactivity timeout (by using select() to wait for the connection file descriptor to become readable or a timeout to be reached) - see runfilter.cc. I think both mechanisms should be usable for this project (the CPU time limit would need to allow for CPU time used by the child process processing previous files).

> Secondly, I have been reading the source code of omindex, but I cannot figure out if it is possible to group all file formats under the same interface. When indexing files, are all file formats treated in a similar way, or are there special formats that require a different work (beyond the use of external filters)?

A few do - e.g. for PDF files we currently need to run pdfinfo and pdftotext on the file, PostScript files are first converted to a temporary PDF (because there doesn't seem to be a Unicode-aware filter which converts PostScript to text), etc.

It may be possible to come up with a common interface still though.

> To sum up, I want to know if omindex use multithreading for indexing files or if you consider that it could be implemented to speed it up.

Currently there isn't really any parallelism in omindex. It would help when indexing formats which are CPU intensive to extract text from (an extreme case is if you're running OCR to index image files).

When dealing with external filters, the extra isolation that subprocesses give us makes that a better approach than launching threads - if a library used by a thread crashes the process then the indexer dies, while if that happens in a subprocess the parent indexer process can recover easily.

Potentially we could have concurrent child processes working on different documents. I'd suggest that it's better to focus on getting the subprocesses to work individually first before trying to get them to run in parallel, but to keep in mind that we're likely to want to instantiate multiple concurrent instances while implementing them.
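
The two mechanisms mentioned above (a CPU time limit via setrlimit() and an inactivity timeout via select()) can be sketched roughly as follows. This is only an illustration for POSIX systems and is not the code in runfilter.cc: `ChildResult` and `run_isolated` are hypothetical names, and the sketch forks a fresh child for every call, so it sidesteps the point about accounting for CPU time already used on previous files by a long-lived child.

```cpp
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstdlib>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical result type: what the parent learns about one isolated run.
struct ChildResult {
    bool ok = false;         // child exited normally with status 0
    bool timed_out = false;  // inactivity timeout fired and the child was killed
    std::string output;      // everything the child wrote to the pipe
};

// Hypothetical helper: run `work` in a forked child. `work` receives the
// write end of a pipe and sends its extracted text through it.
ChildResult run_isolated(const std::function<void(int)>& work,
                         rlim_t cpu_seconds, int inactivity_seconds)
{
    assert(inactivity_seconds > 0);
    int fds[2];
    if (pipe(fds) < 0) throw std::runtime_error("pipe() failed");
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        throw std::runtime_error("fork() failed");
    }
    if (pid == 0) {
        // Child: lower the soft CPU limit, then do the risky work.
        close(fds[0]);
        struct rlimit lim;
        if (getrlimit(RLIMIT_CPU, &lim) == 0) {
            if (lim.rlim_max != RLIM_INFINITY && cpu_seconds > lim.rlim_max)
                cpu_seconds = lim.rlim_max;
            lim.rlim_cur = cpu_seconds;  // SIGXCPU is delivered at this limit
            setrlimit(RLIMIT_CPU, &lim);
        }
        try {
            work(fds[1]);
        } catch (...) {
            _exit(1);
        }
        _exit(0);
    }
    // Parent: read until EOF, giving up if the pipe stays silent too long.
    // A child that closes the pipe and then hangs is not covered here.
    close(fds[1]);
    ChildResult result;
    char buf[4096];
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fds[0], &readfds);
        struct timeval tv;
        tv.tv_sec = inactivity_seconds;
        tv.tv_usec = 0;
        int r = select(fds[0] + 1, &readfds, nullptr, nullptr, &tv);
        if (r < 0 && errno == EINTR) continue;
        if (r <= 0) {  // timeout (r == 0) or select() error
            result.timed_out = (r == 0);
            kill(pid, SIGKILL);
            break;
        }
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n > 0) {
            result.output.append(buf, static_cast<size_t>(n));
        } else if (n == 0) {
            break;  // EOF: the child closed its end or died
        } else if (errno != EINTR) {
            kill(pid, SIGKILL);  // read error: stop the child and give up
            break;
        }
    }
    close(fds[0]);
    int status = 0;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);  // always reap the child
    } while (w < 0 && errno == EINTR);
    result.ok = !result.timed_out && w == pid &&
                WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return result;
}
```

With this shape, a crash, a CPU-bound infinite loop, and a deadlock all look the same to the parent: `ok` is false and the indexer can skip the file and carry on.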
Bruno Baruffaldi
2019-Mar-23 18:42 UTC
[GSoC] Questions about project Text-Extraction Libraries
Thanks! That was really useful!

I wanted to share my approach to this project in the hope that you can give me some feedback. I think that a design which anticipates the incorporation of new file formats is the most suitable way to solve the problem. The attached sketch shows:

* Bug_Box: responsible for encapsulating and handling errors.
* File_extrator: presents a common interface for the different formats.
* File_X: encapsulates a particular library for the X file format.
* File_Hadle: responsible for directing the extraction. More specifically, it determines the file format and which extractor to use.
* Ominex: represents the rest of the project.

Organizing the code in this way focuses on two fundamental goals:

* The possibility of swapping one library for another that fulfills the same purpose without affecting the rest of the project.
* The possibility of extending Xapian's support for file formats.

One of the major advantages is that a programmer who wishes to add support for a new file format, or improve an existing one, would only need to modify the objects shown in red. With proper documentation, this kind of task should not be complex.

I know it is an ambitious approach, but I think that with good documentation it would give the project great flexibility, and programmers would have the option of adapting Xapian to their needs.

**The image only presents a simple scheme to explain the idea; I do not consider it a design for the project. I believe that we should discuss different design patterns to choose the most suitable one.

Cheers,
Bruno Baruffaldi

On Sat, Mar 23, 2019 at 12:10, Olly Betts (olly at survex.com) wrote:
> On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> > Firstly, I think that trying to isolate library bugs in subprocesses could get to work, but I am not sure about how to handle deadlocks or infinite loops.
> > I feel that using a timer is the only way to deal with it but I would like to know what you think about it.
>
> There's already code to set a CPU time limit for filter subprocesses (using setrlimit()) and to implement an inactivity timeout (by using select() to wait for the connection file descriptor to become readable or a timeout to be reached) - see runfilter.cc. I think both mechanisms should be usable for this project (the CPU time limit would need to allow for CPU time used by the child process processing previous files).
>
> > Secondly, I have been reading the source code of omindex, but I cannot figure out if it is possible to group all file formats under the same interface. When indexing files, are all file formats treated in a similar way, or are there special formats that require a different work (beyond the use of external filters)?
>
> A few do - e.g. for PDF files we currently need to run pdfinfo and pdftotext on the file, PostScript files are first converted to a temporary PDF (because there doesn't seem to be a Unicode-aware filter which converts PostScript to text), etc.
>
> It may be possible to come up with a common interface still though.
>
> > To sum up, I want to know if omindex use multithreading for indexing files or if you consider that it could be implemented to speed it up.
>
> Currently there isn't really any parallelism in omindex. It would help when indexing formats which are CPU intensive to extract text from (an extreme case is if you're running OCR to index image files).
>
> When dealing with external filters, the extra isolation that subprocesses give us makes that a better approach than launching threads - if a library used by a thread crashes the process then the indexer dies, while if that happens in a subprocess the parent indexer process can recover easily.
>
> Potentially we could have concurrent child processes working on different documents.
> I'd suggest that it's better to focus on getting the subprocesses to work individually first before trying to get them to run in parallel, but to keep in mind that we're likely to want to instantiate multiple concurrent instances while implementing them.

--
Regards,
Bruno Baruffaldi

A non-text attachment was scrubbed...
Name: sketch.png
Type: image/png
Size: 11183 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20190323/1e5ed24a/attachment-0001.png>
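
As a rough illustration of the proposal in this message (one common interface, one class per format, and a dispatcher that picks the extractor), here is a minimal C++ sketch. All names in it (`Extractor`, `PlainTextExtractor`, `ExtractorRegistry`) are hypothetical and are neither taken from omindex nor from the attached sketch, and choosing the extractor by file extension is an assumption made only to keep the example short. The plain-text reader stands in for a class that would wrap a real extraction library.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <map>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical common interface that every format-specific extractor implements.
class Extractor {
  public:
    virtual ~Extractor() = default;
    // Return the text content of the file at `path`, or throw on failure.
    virtual std::string extract(const std::string& path) = 0;
};

// Stand-in for a class wrapping one library: it just reads the file as text.
class PlainTextExtractor : public Extractor {
  public:
    std::string extract(const std::string& path) override {
        std::ifstream in(path, std::ios::binary);
        if (!in) throw std::runtime_error("cannot open " + path);
        std::ostringstream text;
        text << in.rdbuf();
        return text.str();
    }
};

// Hypothetical dispatcher: maps a file extension to a factory for an extractor.
class ExtractorRegistry {
  public:
    using Factory = std::function<std::unique_ptr<Extractor>()>;

    // Supporting a new format only requires a new Extractor subclass and one
    // call to add(); the rest of the indexer is untouched.
    void add(const std::string& extension, Factory factory) {
        assert(factory != nullptr);
        factories_[extension] = std::move(factory);
    }

    // Pick an extractor from the text after the last '.' in `path`.
    std::unique_ptr<Extractor> create_for(const std::string& path) const {
        std::string::size_type dot = path.rfind('.');
        std::string ext =
            (dot == std::string::npos) ? std::string() : path.substr(dot + 1);
        auto it = factories_.find(ext);
        if (it == factories_.end())
            throw std::invalid_argument("no extractor registered for " + path);
        return it->second();
    }

  private:
    std::map<std::string, Factory> factories_;
};
```

Swapping one library for another then means replacing a single subclass, which matches the two goals listed in the message. Error handling is reduced to exceptions here; in a design that follows the advice earlier in the thread, each `extract()` call would instead run in a subprocess so that a crashing library cannot take down the indexer.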