Bruno Baruffaldi
2019-Mar-23 18:42 UTC
[GSoC] Questions about project Text-Extraction Libraries
Thanks! That was really useful!

I wanted to share my approach to this project in the hope that you can give me some feedback.

I think that a design that anticipates the addition of new file formats is the most suitable way to solve the problem.

In the attached sketch we can see:
* Bug_Box: It is responsible for encapsulating and handling errors.
* File_extrator: It presents an interface for the different formats.
* File_X: Encapsulates a particular library for the X file format.
* File_Hadle: It is responsible for directing the extraction. More specifically, it determines the file format and which extractor to use.
* Ominex: It represents the rest of the project.

The idea of organizing the code in this way focuses on two fundamental items:
* The possibility of replacing a particular library with another that fulfills the same purpose without affecting the project.
* The possibility of extending Xapian's support for file formats.

One of the major advantages is that if a programmer wishes to add support for a new file format or improve an existing one, they only need to modify the objects shown in red. With proper documentation, this kind of task should not be complex.

I know it is an ambitious approach, but I think that with good documentation it would give the project great flexibility, and programmers would have the option of adapting Xapian to their needs.

**The image only presents a simple scheme to explain the idea; I do not consider it a design for the project. I believe that we should discuss different design patterns to choose the most suitable one.

Cheers,
Bruno Baruffaldi

On Sat, Mar 23, 2019 at 12:10, Olly Betts (olly at survex.com) wrote:
> On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> > Firstly, I think that trying to isolate library bugs in subprocesses
> > could work, but I am not sure how to handle deadlocks or infinite
> > loops. I feel that using a timer is the only way to deal with it, but I
> > would like to know what you think about it.
>
> There's already code to set a CPU time limit for filter subprocesses
> (using setrlimit()) and to implement an inactivity timeout (by using
> select() to wait for the connection file descriptor to become readable
> or a timeout to be reached) - see runfilter.cc. I think both mechanisms
> should be usable for this project (the CPU time limit would need to
> allow for CPU time used by the child process processing previous
> files).
>
> > Secondly, I have been reading the source code of omindex, but I cannot
> > figure out if it is possible to group all file formats under the same
> > interface. When indexing files, are all file formats treated in a similar
> > way, or are there special formats that require different handling (beyond
> > the use of external filters)?
>
> A few do - e.g. for PDF files we currently need to run pdfinfo and
> pdftotext on the file, PostScript files are first converted to a
> temporary PDF (because there doesn't seem to be a Unicode-aware
> filter which converts PostScript to text), etc.
>
> It may be possible to come up with a common interface still though.
>
> > Finally, I want to know if omindex uses multithreading for indexing files
> > or if you consider that it could be implemented to speed it up.
>
> Currently there isn't really any parallelism in omindex. It would help
> when indexing formats which are CPU intensive to extract text from
> (an extreme case is if you're running OCR to index image files).
>
> When dealing with external filters, the extra isolation that
> subprocesses give us makes that a better approach than launching
> threads - if a library used by a thread crashes the process then the
> indexer dies, while if that happens in a subprocess the parent indexer
> process can recover easily.
>
> Potentially we could have concurrent child processes working on
> different documents. I'd suggest that it's better to focus on getting
> the subprocesses to work individually first before trying to get them to
> run in parallel, but to keep in mind that we're likely to want to
> instantiate multiple concurrent instances while implementing them.

--
Regards,
Bruno Baruffaldi

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sketch.png
Type: image/png
Size: 11183 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20190323/1e5ed24a/attachment-0001.png>
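The two mechanisms mentioned in the quoted reply (a CPU time limit via setrlimit() and an inactivity timeout via select()) can be sketched roughly as follows. This is a minimal POSIX illustration, not the code in runfilter.cc: `run_with_limits`, `RunStatus` and `RunResult` are hypothetical names, and running the command through /bin/sh is an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <string>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical names throughout; this is not the code in runfilter.cc.
enum class RunStatus { Ok, Timeout, Failed };

struct RunResult {
    RunStatus status;
    std::string output;
};

// Run `cmd` via /bin/sh and capture its stdout. The child gets a CPU time
// limit of `cpu_seconds`; the parent gives up if no output arrives for
// `idle_seconds`.
RunResult run_with_limits(const char* cmd, int cpu_seconds, int idle_seconds) {
    assert(cmd && cpu_seconds > 0 && idle_seconds > 0);
    RunResult result{RunStatus::Failed, std::string()};
    int fds[2];
    if (pipe(fds) < 0) return result;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return result;
    }
    if (pid == 0) {
        // Child: use its own process group so the parent can kill the
        // filter together with anything it spawns.
        setpgid(0, 0);
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        struct rlimit limit;
        limit.rlim_cur = limit.rlim_max = cpu_seconds;
        setrlimit(RLIMIT_CPU, &limit);
        execl("/bin/sh", "sh", "-c", cmd, static_cast<char*>(nullptr));
        _exit(127);  // Only reached if exec failed.
    }

    // Parent: set the group here too, so it exists whichever side runs first.
    setpgid(pid, pid);
    close(fds[1]);
    result.status = RunStatus::Ok;
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fds[0], &readable);
        struct timeval timeout;
        timeout.tv_sec = idle_seconds;
        timeout.tv_usec = 0;
        int ready = select(fds[0] + 1, &readable, nullptr, nullptr, &timeout);
        if (ready < 0) {
            if (errno == EINTR) continue;
            result.status = RunStatus::Failed;
            break;
        }
        if (ready == 0) {  // Nothing readable within idle_seconds.
            result.status = RunStatus::Timeout;
            break;
        }
        char buf[4096];
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n > 0) {
            result.output.append(buf, static_cast<size_t>(n));
        } else if (n == 0) {
            break;  // EOF: the child closed its end of the pipe.
        } else if (errno != EINTR) {
            result.status = RunStatus::Failed;
            break;
        }
    }
    close(fds[0]);
    if (result.status != RunStatus::Ok) kill(-pid, SIGKILL);

    int wstatus = 0;
    while (waitpid(pid, &wstatus, 0) < 0) {
        if (errno != EINTR) {
            result.status = RunStatus::Failed;
            return result;
        }
    }
    if (result.status == RunStatus::Ok &&
        !(WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0)) {
        result.status = RunStatus::Failed;  // Includes death by SIGXCPU.
    }
    return result;
}
```

For a long-lived child that handles many files, a fixed RLIMIT_CPU like this would not be enough on its own: as noted in the quoted reply, the limit would have to allow for the CPU time already used on previous files.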
On Sat, Mar 23, 2019 at 03:42:36PM -0300, Bruno Baruffaldi wrote:
> Thanks!
> That was really useful!
>
> I wanted to share my approach to this project in the hope that you can
> give me some feedback.
>
> I think that a design that anticipates the addition of new file formats
> is the most suitable way to solve the problem.
>
> In the attached sketch we can see:
> * Bug_Box: It is responsible for encapsulating and handling errors.
> * File_extrator: It presents an interface for the different formats.
> * File_X: Encapsulates a particular library for the X file format.
> * File_Hadle: It is responsible for directing the extraction. More
> specifically, it determines the file format and which extractor to use.
> * Ominex: It represents the rest of the project.

I'm not entirely sure what these boxes are meant to actually be
(classes? programs? something else?), but in general I'd tend to steer
GSoC projects towards an evolutionary approach rather than trying to
rewrite everything in sight, or even refactor everything into some
entirely new structure.

With an evolutionary approach you can get to something that basically
works much sooner, and then fill in the missing pieces, fix bugs, etc.
It lends itself much better to incremental cycles of implement, test,
document, review, merge, which is easier to work through for both
mentors and students, and if the work doesn't get fully completed, at
least there's something to show for it.

With a revolutionary approach, there's nothing you can show working
for much longer, and you'll need to do a lot of extra testing for
all the existing functionality to make sure your reimplementation
works (unfortunately there's currently no testsuite for omindex you can
lean on here).

Review is painful because it involves wading through thousands of
lines of code, so you're likely to need to wait longer for a review
because it's harder for mentors to find enough time in one go for
that.

And if the work doesn't get fully completed, there's a big pile of
non-functioning code, which it's unlikely anyone is going to have the
time or enthusiasm to do anything further with.

More specifically to this case we already have code which encapsulates
extraction in a subprocess for an external filter program, and code
which determines the file format and which extractor to use. If you
are proposing to replace those, you're going to need to convince us
what you think is deficient about the existing code, how you can
do better, and why that's a good use of the limited GSoC coding time
(if you spend time doing X, then you can't do Y).

> The idea of organizing the code in this way focuses on two fundamental
> items:
> * The possibility of replacing a particular library with another that
> fulfills the same purpose without affecting the project.

That's achievable without a major restructure (e.g. wrap each library
in a helper program).

> * The possibility of extending Xapian's support for file formats.

People have been adding new file formats for years within the current
structure.

> One of the major advantages is that if a programmer wishes to add
> support for a new file format or improve an existing one, they only
> need to modify the objects shown in red. With proper documentation,
> this kind of task should not be complex.

We've had prospective GSoC students who were new to the codebase add
support for new formats successfully, which suggests it isn't all that
complex currently.

To show what's typically involved, here's the patch to add support for
iWork documents (which is the most recent format added):

https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee

(The gen-mimemap tweak is only because this happened to mean the
generated mimetype lookup table now needs 2 byte offsets).

Cheers,
Olly
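As a rough illustration of why running a library in a separate process protects the indexer, the sketch below runs an extraction callback in a forked child and reports failure if the child crashes. It is a simplification under stated assumptions: a real helper program would be a separate executable rather than a fork of the indexer, and `extract_isolated` and the `extractor` callback are hypothetical names, not Xapian APIs.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <functional>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run `extractor` in a child process and collect the
// text it returns. Returns false if the child crashed, threw, or failed.
bool extract_isolated(const std::function<std::string()>& extractor,
                      std::string& text) {
    assert(extractor);
    text.clear();
    int fds[2];
    if (pipe(fds) < 0) return false;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {
        // Child: a crash in here (e.g. inside a buggy library) only kills
        // this process, not the parent.
        close(fds[0]);
        int code = 0;
        try {
            std::string out = extractor();
            size_t done = 0;
            while (done < out.size()) {
                ssize_t n = write(fds[1], out.data() + done, out.size() - done);
                if (n < 0) {
                    if (errno == EINTR) continue;
                    code = 1;
                    break;
                }
                done += static_cast<size_t>(n);
            }
        } catch (...) {
            code = 1;
        }
        std::_Exit(code);  // Skip atexit handlers inherited from the parent.
    }

    // Parent: read until the child closes the pipe, then check how it ended.
    close(fds[1]);
    char buf[4096];
    for (;;) {
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n > 0) {
            text.append(buf, static_cast<size_t>(n));
        } else if (n == 0 || errno != EINTR) {
            break;
        }
    }
    close(fds[0]);

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

When this returns false, the parent is still running and can skip the file or report the error. Any text read before the failure may be partial, so a caller would probably want to discard it.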
Bruno Baruffaldi
2019-Mar-27 15:52 UTC
[GSoC] Questions about project Text-Extraction Libraries
I think you are right, and I will try another approach.

One last question: I was wondering whether it would be worth falling back to an external filter (when one is available) if a particular library fails at run time. Have you considered it?

--
Regards,
Bruno Baruffaldi
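The fallback idea raised in this message could look something like the sketch below: try a list of extractors in order and use the first one that succeeds. Whether omindex should behave this way is the open question here. The sketch only illustrates the control flow, and `Extractor` and `extract_with_fallback` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical extractor signature: fill `text` from the file at `path` and
// return true on success. A library-based extractor and an external-filter
// extractor would both be wrapped to fit it.
using Extractor = std::function<bool(const std::string& path, std::string& text)>;

// Try each extractor in order (e.g. library first, external filter second)
// and stop at the first one that succeeds.
bool extract_with_fallback(const std::vector<Extractor>& extractors,
                           const std::string& path,
                           std::string& text) {
    assert(!path.empty());
    for (const Extractor& extract : extractors) {
        text.clear();  // Discard partial output from a failed attempt.
        if (extract && extract(path, text)) return true;
    }
    text.clear();
    return false;
}
```

The order of the list sets the preference, so a format with no external filter would simply have a single entry.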