thr3ads.net - Xapian devel - [GSoC] Questions about project Text-Extraction Libraries [Mar 2019]

Bruno Baruffaldi

2019-Mar-23 18:42 UTC

[GSoC] Questions about project Text-Extraction Libraries

Thanks!
That was really useful!

I wanted to share my approach to this project with the hope that you can
give me some feedback.

I think that a design that anticipates the incorporation of new
file formats is the most suitable way to solve the problem.

In the attached sketch we can see:
* Bug_Box: It is responsible for encapsulating and handling errors.
* File_extrator: It presents an interface for the different formats.
* File_X: Encapsulates a particular library for the X file format.
* File_Hadle: It is responsible for directing the extraction. More
specifically, it determines the file format and which extractor to use.
* Ominex: It represents the rest of the project.
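
To make the roles above a bit more concrete, here is a rough sketch of one
possible reading, assuming the boxes are C++ classes (the sketch itself does
not say so). All names here (Extractor, PlainTextExtractor,
ExtractorRegistry) are hypothetical and are not taken from the omindex
source:

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface (the "File_extrator" box): one implementation
// per file format.
class Extractor {
  public:
    virtual ~Extractor() = default;
    // Return the plain text extracted from the file at `path`.
    virtual std::string extract(const std::string& path) const = 0;
};

// Stand-in for a "File_X" box. A real one would call into a library;
// this one only returns a marker string so the sketch stays runnable.
class PlainTextExtractor : public Extractor {
  public:
    std::string extract(const std::string& path) const override {
        return "text from " + path;
    }
};

// Plays the role of the "File_Hadle" box: picks an extractor by MIME type.
class ExtractorRegistry {
    std::map<std::string, std::unique_ptr<Extractor>> by_mimetype;

  public:
    void add(const std::string& mimetype, std::unique_ptr<Extractor> e) {
        by_mimetype[mimetype] = std::move(e);
    }

    std::string extract(const std::string& mimetype,
                        const std::string& path) const {
        auto it = by_mimetype.find(mimetype);
        if (it == by_mimetype.end())
            throw std::runtime_error("no extractor for " + mimetype);
        return it->second->extract(path);
    }
};
```

Under this reading, supporting a new format would mean writing one more
Extractor subclass and registering it under its MIME type.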

The idea of organizing the code in this way focuses on two fundamental
items:
* The possibility of changing a particular library for another that
fulfills the same purpose without affecting the project.
* The possibility of extending Xapian's support in terms of file format.

One of the major advantages is that a programmer who wishes to add
support for a new file format, or improve an existing one, would only
need to modify the objects shown in red. With proper documentation,
this kind of task should not be complex.

I know it is an ambitious approach, but I think that with good
documentation it would give the project great flexibility, and
programmers would have the option of adapting Xapian to their needs.

Note: the image only presents a simple scheme to explain the idea; I do
not consider it a design for the project. I believe that we should discuss
different design patterns to choose the most suitable one.

Cheers,
   Bruno Baruffaldi

On Sat, 23 Mar 2019 at 12:10, Olly Betts (olly at survex.com) wrote:
> On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> > Firstly, I think that trying to isolate library bugs in subprocesses
> > could work, but I am not sure about how to handle deadlocks or
> > infinite loops. I feel that using a timer is the only way to deal
> > with it but I would like to know what you think about it.
>
> There's already code to set a CPU time limit for filter subprocesses
> (using setrlimit()) and to implement an inactivity timeout (by using
> select() to wait for the connection file descriptor to become readable
> or a timeout to be reached) - see runfilter.cc.  I think both mechanisms
> should be usable for this project (the CPU time limit would need to
> allow for CPU time used by the child process processing previous
> files).
>
> > Secondly, I have been reading the source code of omindex, but I cannot
> > figure out if it is possible to group all file formats under the same
> > interface. When indexing files, are all file formats treated in a
> > similar way, or are there special formats that require different
> > handling (beyond the use of external filters)?
>
> A few do - e.g. for PDF files we currently need to run pdfinfo and
> pdftotext on the file, PostScript files are first converted to a
> temporary PDF (because there doesn't seem to be a Unicode-aware
> filter which converts PostScript to text), etc.
>
> It may be possible to come up with a common interface still though.
>
> > Finally, I want to know if omindex uses multithreading for indexing
> > files or if you consider that it could be implemented to speed it up.
>
> Currently there isn't really any parallelism in omindex.  It would help
> when indexing formats which are CPU intensive to extract text from
> (an extreme case is if you're running OCR to index image files).
>
> When dealing with external filters, the extra isolation that
> subprocesses give us makes that a better approach than launching
> threads - if a library used by a thread crashes the process then the
> indexer dies, while if that happens in a subprocess the parent indexer
> process can recover easily.
>
> Potentially we could have concurrent child processes working on
> different documents.  I'd suggest that it's better to focus on getting
> the subprocesses to work individually first before trying to get them to
> run in parallel, but to keep in mind that we're likely to want to
> instantiate multiple concurrent instances while implementing them.
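
For reference, the two mechanisms mentioned in the quoted reply (a CPU time
limit via setrlimit() and an inactivity timeout via select()) can be
sketched roughly as below. This is only an illustration under my own
assumptions, not the code in runfilter.cc: the function name run_filter, its
parameters and the error messages are all made up for the example.

```cpp
#include <cerrno>
#include <signal.h>
#include <stdexcept>
#include <string>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run `cmd` via /bin/sh and return what it writes to stdout.
// Throws if the child is silent for `idle_secs` or is killed by a signal
// (for example SIGXCPU once it has used `cpu_secs` of CPU time).
std::string run_filter(const std::string& cmd, rlim_t cpu_secs, int idle_secs) {
    int fds[2];
    if (pipe(fds) < 0) throw std::runtime_error("pipe() failed");

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        throw std::runtime_error("fork() failed");
    }
    if (child == 0) {
        // Child: cap CPU time, send stdout to the pipe, then exec.
        struct rlimit lim = { cpu_secs, cpu_secs };
        setrlimit(RLIMIT_CPU, &lim);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", cmd.c_str(), static_cast<char*>(nullptr));
        _exit(127);  // Only reached if exec failed.
    }

    close(fds[1]);
    std::string out;
    bool timed_out = false;
    while (true) {
        // Wait until the pipe is readable or the inactivity timeout expires.
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fds[0], &readfds);
        struct timeval tv = { idle_secs, 0 };
        int r = select(fds[0] + 1, &readfds, nullptr, nullptr, &tv);
        if (r < 0) {
            if (errno == EINTR) continue;
            break;
        }
        if (r == 0) {
            timed_out = true;
            break;
        }
        char buf[4096];
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n <= 0) break;  // EOF or read error.
        out.append(buf, static_cast<size_t>(n));
    }
    close(fds[0]);

    if (timed_out) kill(child, SIGKILL);
    int status = 0;
    waitpid(child, &status, 0);
    // The parent survives whatever happened to the child and just reports it.
    if (timed_out) throw std::runtime_error("filter timed out");
    if (WIFSIGNALED(status)) throw std::runtime_error("filter killed by a signal");
    return out;
}
```

RLIMIT_CPU counts CPU time over the whole lifetime of the process, so a
long-lived child that handles many files would need its limit raised as it
goes, which is the caveat raised in the reply above.
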
-- 
Regards, Bruno Baruffaldi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sketch.png
Type: image/png
Size: 11183 bytes
Desc: not available
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20190323/1e5ed24a/attachment-0001.png>

Olly Betts

2019-Mar-26 22:38 UTC

[GSoC] Questions about project Text-Extraction Libraries

On Sat, Mar 23, 2019 at 03:42:36PM -0300, Bruno Baruffaldi wrote:
> Thanks!
> That was really useful!
> 
> I wanted to share my approach to this project with the hope that you can
> give me some feedback.
> 
> I think that a design that anticipates the incorporation of new
> file formats is the most suitable way to solve the problem.
> 
> In the attached sketch we can see:
> * Bug_Box: It is responsible for encapsulating and handling errors.
> * File_extrator: It presents an interface for the different formats.
> * File_X: Encapsulates a particular library for the X file format.
> * File_Hadle: It is responsible for directing the extraction. More
> specifically, it determines the file format and which extractor to use.
> * Ominex: It represents the rest of the project.

I'm not entirely sure what these boxes are meant to actually be
(classes? programs? something else?), but in general I'd tend to steer
GSoC projects towards an evolutionary approach rather than trying to
rewrite everything in sight, or even refactor everything into some
entirely new structure.

With an evolutionary approach you can get to something that basically
works much sooner, and then fill in the missing pieces, fix bugs, etc.
It lends itself much better to incremental cycles of implement, test,
document, review, merge, which is easier to work through for both
mentors and students, and if the work doesn't get fully completed, at
least there's something to show for it.

With a revolutionary approach, there's nothing you can show working
for much longer, and you'll need to do a lot of extra testing for
all the existing functionality to make sure your reimplementation
works (unfortunately there's currently no testsuite for omindex you can
lean on here).

Review is painful because it involves wading through thousands of
lines of code, so you're likely to need to wait longer for a review
because it's harder for mentors to find enough time in one go for
that.

And if the work doesn't get fully completed, there's a big pile of
non-functioning code, which it's unlikely anyone is going to have the
time or enthusiasm to do anything further with.

More specifically to this case we already have code which encapsulates
extraction in a subprocess for an external filter program, and code
which determines the file format and which extractor to use.  If you
are proposing to replace those, you're going to need to convince us
what you think is deficient about the existing code, how you can
do better, and why that's a good use of the limited GSoC coding time
(if you spend time doing X, then you can't do Y).

> The idea of organizing the code in this way focuses on two fundamental
> items:
> * The possibility of changing a particular library for another that
> fulfills the same purpose without affecting the project.

That's achievable without a major restructure (e.g. wrap each library
in a helper program).

> * The possibility of extending Xapian's support in terms of file
> format.

People have been adding new file formats for years within the current
structure.

> One of the major advantages is that if a particular programmer wishes to
> add support for a new file format or improve an existing one, they should
> only modify the objects that are in red. In this way, with a proper
> documentation this kind of tasks should not be a complex task.

We've had prospective GSoC students who were new to the codebase add
support for new formats successfully, which suggests it isn't all that
complex currently.

To show what's typically involved, here's the patch to add support for
iWork documents (which is the most recent format added):

https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee

(The gen-mimemap tweak is only because this happened to mean the
generated mimetype lookup table now needs 2 byte offsets).

Cheers,
    Olly

Bruno Baruffaldi

2019-Mar-27 15:52 UTC

[GSoC] Questions about project Text-Extraction Libraries

I think you are right, and I will try another approach.

One last question: I was wondering whether it would be worth falling back
to an external filter (when one is available) if a particular library fails
at run time.

Have you considered it?
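
To make the question concrete, the fallback I have in mind could look
roughly like this. It is only a sketch: ExtractFn and extract_with_fallback
are hypothetical names, and it assumes a failure is reported as a C++
exception (a crash inside a library would have to be caught by running it in
a subprocess instead).

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical signature shared by a library-based extractor and a
// wrapper around an external filter.
using ExtractFn = std::function<std::string(const std::string&)>;

// Try each extractor in order and return the first result. If all of
// them fail, rethrow the last error message.
std::string extract_with_fallback(const std::vector<ExtractFn>& extractors,
                                  const std::string& path) {
    std::string last_error = "no extractor available";
    for (const auto& extract : extractors) {
        try {
            return extract(path);
        } catch (const std::exception& e) {
            last_error = e.what();  // Remember why this one failed.
        }
    }
    throw std::runtime_error(last_error);
}
```

The list would hold the library-based extractor first and the external
filter second, so the filter only runs when the library has failed.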


-- 
Regards, Bruno Baruffaldi
