thr3ads.net - Xapian devel - [GSoC] Questions about project Text-Extraction Libraries [Mar 2019]

Bruno Baruffaldi

2019-Mar-21 12:31 UTC

[GSoC] Questions about project Text-Extraction Libraries

Hello!

I have a few questions about the Text-Extraction Libraries project.

Firstly, I think that isolating library bugs in subprocesses could work,
but I am not sure how to handle deadlocks or infinite loops. I feel that
using a timer is the only way to deal with them, but I would like to know
what you think.

Secondly, I have been reading the source code of omindex, but I cannot
figure out whether it is possible to group all file formats under the same
interface. When indexing files, are all file formats treated in a similar
way, or are there special formats that require different handling (beyond
the use of external filters)?

Finally, I would like to know whether omindex uses multithreading for
indexing files, or whether you think it could be implemented to speed
indexing up.

Cheers,
   Bruno Baruffaldi

Olly Betts

2019-Mar-23 15:09 UTC

On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> Firstly, I think that isolating library bugs in subprocesses could work,
> but I am not sure how to handle deadlocks or infinite loops. I feel that
> using a timer is the only way to deal with them, but I would like to know
> what you think.

There's already code to set a CPU time limit for filter subprocesses
(using setrlimit()) and to implement an inactivity timeout (by using
select() to wait for the connection file descriptor to become readable
or a timeout to be reached) - see runfilter.cc.  I think both mechanisms
should be usable for this project (the CPU time limit would need to
allow for CPU time used by the child process processing previous
files).
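
The two mechanisms described above can be sketched roughly as follows. This is a minimal illustration and not the code in runfilter.cc: the helper names `set_cpu_limit`, `wait_readable` and `run_with_limits` are hypothetical, the child runs a plain function instead of a real extraction library, and the sketch does not account for CPU time already used on previous files.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <string>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: lower the soft CPU time limit of the calling process.
// The hard limit is left alone; the request is clamped to it if needed.
bool set_cpu_limit(rlim_t seconds) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CPU, &lim) != 0) return false;
    bool clamp = lim.rlim_max != RLIM_INFINITY && lim.rlim_max < seconds;
    lim.rlim_cur = clamp ? lim.rlim_max : seconds;
    return setrlimit(RLIMIT_CPU, &lim) == 0;
}

// Hypothetical helper: wait until fd is readable (or at EOF).
// Returns false if timeout_secs pass first, or on error.
bool wait_readable(int fd, int timeout_secs) {
    assert(fd >= 0 && fd < FD_SETSIZE);
    for (;;) {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        struct timeval tv;
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;
        int r = select(fd + 1, &fds, nullptr, nullptr, &tv);
        if (r >= 0) return r > 0;
        if (errno != EINTR) return false;
        // Interrupted by a signal: retry (this restarts the full timeout).
    }
}

// Run work() in a child process and collect what it writes to out_fd.
// Returns false if the child goes quiet for timeout_secs, fails, or is killed.
bool run_with_limits(void (*work)(int out_fd), int timeout_secs,
                     rlim_t cpu_secs, std::string& out) {
    int fds[2];
    if (pipe(fds) != 0) return false;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {
        // Child: apply the CPU limit, then do the risky work.
        close(fds[0]);
        if (!set_cpu_limit(cpu_secs)) _exit(1);
        work(fds[1]);
        _exit(0);
    }
    // Parent: read until EOF, giving up after a period of inactivity.
    close(fds[1]);
    bool ok = true;
    char buf[4096];
    for (;;) {
        if (!wait_readable(fds[0], timeout_secs)) { ok = false; break; }
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n > 0) { out.append(buf, static_cast<size_t>(n)); continue; }
        if (n == 0) break;  // EOF: the child closed its end of the pipe.
        if (errno != EINTR) { ok = false; break; }
    }
    close(fds[0]);
    if (!ok) kill(pid, SIGKILL);  // Stuck or broken child: stop it.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
    return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The CPU limit catches busy loops, while the inactivity timeout catches a child that is blocked (for example in a deadlock) and so uses no CPU time at all.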

> Secondly, I have been reading the source code of omindex, but I cannot
> figure out whether it is possible to group all file formats under the same
> interface. When indexing files, are all file formats treated in a similar
> way, or are there special formats that require different handling (beyond
> the use of external filters)?

A few do - e.g. for PDF files we currently need to run pdfinfo and
pdftotext on the file, PostScript files are first converted to a
temporary PDF (because there doesn't seem to be a Unicode-aware
filter which converts PostScript to text), etc.

It may be possible to come up with a common interface still though.

> Finally, I would like to know whether omindex uses multithreading for
> indexing files, or whether you think it could be implemented to speed
> indexing up.

Currently there isn't really any parallelism in omindex.  It would help
when indexing formats which are CPU intensive to extract text from
(an extreme case is if you're running OCR to index image files).

When dealing with external filters, the extra isolation that
subprocesses give us makes that a better approach than launching
threads - if a library used by a thread crashes the process then the
indexer dies, while if that happens in a subprocess the parent indexer
process can recover easily.
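
As a rough illustration of that recovery, the sketch below runs a piece of work in a forked child and lets the parent classify the outcome with waitpid(). The names `ChildResult` and `run_isolated` are hypothetical and are not part of omindex; a real helper would also need to pass the extracted text back to the parent.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical outcome type for one isolated piece of work.
enum class ChildResult { Ok, Failed, Crashed };

// Run work() in a child process so that a crash there cannot take down
// the caller. work() returns 0 on success, non-zero on failure.
ChildResult run_isolated(int (*work)()) {
    assert(work != nullptr);
    pid_t pid = fork();
    if (pid < 0) return ChildResult::Failed;
    if (pid == 0) {
        // Child: the exit status carries the result back to the parent.
        _exit(work());
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return ChildResult::Failed;
    }
    // Killed by a signal (SIGSEGV, SIGABRT, ...): the library crashed,
    // but the parent is unaffected and can move on to the next file.
    if (WIFSIGNALED(status)) return ChildResult::Crashed;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return ChildResult::Ok;
    return ChildResult::Failed;
}
```

With threads there is no equivalent of the `Crashed` branch, because a fatal signal in any thread terminates the whole process.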

Potentially we could have concurrent child processes working on
different documents.  I'd suggest that it's better to focus on getting
the subprocesses to work individually first before trying to get them to
run in parallel, but to keep in mind that we're likely to want to
instantiate multiple concurrent instances while implementing them.
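
One possible shape for concurrent child processes is sketched below, purely as an assumption about how it might look: each document gets its own forked child, and the parent reaps them afterwards. The function name `process_concurrently` is hypothetical, and the sketch starts every child at once with no cap on the number of workers, which a real indexer would need.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Start one child per document, then wait for all of them.
// work() returns 0 on success; result[i] says whether docs[i] succeeded.
std::vector<bool> process_concurrently(const std::vector<std::string>& docs,
                                       int (*work)(const std::string&)) {
    assert(work != nullptr);
    std::vector<pid_t> pids;
    pids.reserve(docs.size());
    for (const std::string& doc : docs) {
        pid_t pid = fork();
        if (pid == 0) _exit(work(doc));  // Child: handle one document.
        pids.push_back(pid);             // Parent: -1 here means fork failed.
    }
    std::vector<bool> ok;
    ok.reserve(pids.size());
    for (pid_t pid : pids) {
        if (pid < 0) {
            ok.push_back(false);
            continue;
        }
        int status = 0;
        pid_t r;
        do {
            r = waitpid(pid, &status, 0);
        } while (r < 0 && errno == EINTR);
        // A crash or a non-zero exit marks only this document as failed.
        ok.push_back(r == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0);
    }
    return ok;
}
```

Because every child is independent, a failure on one document does not affect the results for the others.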

Bruno Baruffaldi

2019-Mar-23 18:42 UTC

Thanks!
That was really useful!

I wanted to share my approach to this project with the hope that you can
give me some feedback.

I think that a design which anticipates the addition of new file formats
is the most suitable way to solve the problem.

In the attached sketch we can see:
* Bug_Box: It is responsible for encapsulating and handling errors.
* File_extrator: It presents an interface for the different formats.
* File_X: Encapsulates a particular library for the X file format.
* File_Hadle: It is responsible for directing the extraction. More
specifically, it determines the file format and which extractor to use.
* Ominex: It represents the rest of the project.
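
To make the roles listed above more concrete, here is a minimal sketch of a common extractor interface with a dispatcher that picks an extractor by format. All names (`Extractor`, `PlainTextExtractor`, `TagStrippingExtractor`, `ExtractorRegistry`, `make_demo_registry`) are hypothetical, the formats are keyed by MIME type as an assumption, and the two toy extractors stand in for real libraries.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Common interface that every format-specific extractor implements.
class Extractor {
  public:
    virtual ~Extractor() = default;
    virtual std::string extract(const std::string& data) const = 0;
};

// Toy extractor: plain text needs no conversion.
class PlainTextExtractor : public Extractor {
  public:
    std::string extract(const std::string& data) const override {
        return data;
    }
};

// Toy extractor: drops everything between '<' and '>'.
// A real one would wrap a proper parsing library instead.
class TagStrippingExtractor : public Extractor {
  public:
    std::string extract(const std::string& data) const override {
        std::string text;
        bool in_tag = false;
        for (char ch : data) {
            if (ch == '<') in_tag = true;
            else if (ch == '>') in_tag = false;
            else if (!in_tag) text += ch;
        }
        return text;
    }
};

// Dispatcher: maps a format to the extractor that handles it.
class ExtractorRegistry {
    std::map<std::string, std::unique_ptr<Extractor>> extractors_;

  public:
    void add(const std::string& mime, std::unique_ptr<Extractor> e) {
        assert(e != nullptr);
        extractors_[mime] = std::move(e);
    }

    // Throws std::out_of_range if no extractor is registered for mime.
    std::string extract(const std::string& mime,
                        const std::string& data) const {
        auto it = extractors_.find(mime);
        if (it == extractors_.end())
            throw std::out_of_range("no extractor for " + mime);
        return it->second->extract(data);
    }
};

// Supporting a new format means adding one class and one add() call here.
ExtractorRegistry make_demo_registry() {
    ExtractorRegistry reg;
    reg.add("text/plain", std::unique_ptr<Extractor>(new PlainTextExtractor));
    reg.add("text/html", std::unique_ptr<Extractor>(new TagStrippingExtractor));
    return reg;
}
```

Swapping one library for another then only touches the class that wraps it, and the rest of the indexer keeps calling `extract()` through the registry.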

The idea of organizing the code in this way focuses on two fundamental
items:
* The possibility of changing a particular library for another that
fulfills the same purpose without affecting the project.
* The possibility of extending Xapian's support to new file formats.

One of the major advantages is that a programmer who wishes to add support
for a new file format, or improve an existing one, would only need to
modify the objects shown in red. With proper documentation, this kind of
task should not be complex.

I know it is an ambitious approach, but I think that with good
documentation it would give the project great flexibility, and
programmers would have the option of adapting Xapian to their needs.

**The image only presents a simple scheme to explain the idea; I do not
consider it a design for the project. I believe that we should discuss
different design patterns and choose the most suitable one.

Cheers,
   Bruno Baruffaldi

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sketch.png
Type: image/png
Size: 11183 bytes
Desc: not available
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20190323/1e5ed24a/attachment-0001.png>
