thr3ads.net - dovecot - [Dovecot] FTS Plugin design [Apr 2009]

Rui Carneiro

2009-Apr-13 10:18 UTC

[Dovecot] FTS Plugin design

Hi all,

I am currently working on changes to the solr plugin: I want it to index
the content of attachments as well. I have started reading the plugin's
source, but I am having trouble understanding how it works.

I don't yet understand the plugin's design or how plugins are called from
the core system, and I was wondering if anyone could help me with that.

Sorry if these questions sound stupid, but I am a newcomer to Dovecot.

Regards,
Rui Carneiro

Timo Sirainen

2009-Apr-15 22:23 UTC

[Dovecot] FTS Plugin design

On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote:
> I didn't understood yet what is the plugin's design and how the plugins are
> called from the core system and I was wondering if anyone could help me with
> that.

fts-storage.c hooks into all the functions in mail-storage API that it
needs to. Currently indexing isn't done while messages are being saved,
but instead just before searching. The searching functions are:

 - fts_mailbox_search_init() tries to figure out if FTS can optimize the
search. If it can, it checks whether the FTS index is up-to-date and, if
not, starts the indexing.

 - fts_mailbox_search_next_nonblock() continues the indexing (or
searching after indexing) for a while. The idea is that IMAP connection
is able to process other commands while doing a long-running search. So
fts plugin indexes FTS_SEARCH_NONBLOCK_COUNT (50) messages at a time. It
would be nice if that value was dynamically calculated and also based on
bytes instead of messages, but that's maybe too much trouble.

 - fts_mailbox_search_next_update_seq() uses the fts search results and
updates mail-storage's search stuff so that it doesn't go through
messages that don't match.

 - fts_build_mail() indexes a single mail. It parses the messages and
returns the data in small blocks. For text/* and message/rfc822 parts
those blocks are currently sent to FTS backend. This is where I think
you should look into hooking your attachment parsing. Change
fts_build_want_index_part() to look for more content-types that you're
interested in and then before feeding the blocks to FTS backend put them
through your own converter function, something like:

int attachment_extract_text(struct attachment_extract_context *ctx,
const struct message_block *input, struct message_block *output);
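
A minimal sketch of how a converter with that shape could behave. The
struct definitions below are simplified, hypothetical stand-ins (they are
not Dovecot's real struct message_block), and the return-value convention
is an assumption made for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins: NOT Dovecot's real structs. */
struct text_block {
	const unsigned char *data;
	size_t size;
};

struct attachment_extract_context {
	const char *content_type; /* e.g. "text/plain", "application/pdf" */
	size_t blocks_seen;       /* how many input blocks were offered */
};

/* Assumed convention: returns 1 if `output` holds text to index,
   0 if this block produced no text, -1 on error. */
int attachment_extract_text(struct attachment_extract_context *ctx,
			    const struct text_block *input,
			    struct text_block *output)
{
	assert(ctx != NULL && input != NULL && output != NULL);
	if (ctx->content_type == NULL)
		return -1;
	ctx->blocks_seen++;
	if (strncmp(ctx->content_type, "text/", 5) == 0) {
		/* Plain text needs no conversion: pass the block through. */
		*output = *input;
		return 1;
	}
	/* Other types would need a real converter; emit nothing here. */
	output->data = NULL;
	output->size = 0;
	return 0;
}
```

The caller would feed `output` to the FTS backend only when the function
returns 1, so a converter is free to swallow blocks until it has enough
input to produce text.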



Rui Carneiro

2009-Apr-17 09:03 UTC

[Dovecot] FTS Plugin design

Thank you for all the tips. The design looks clearer to me now.

I have one more question. I looked into fts_build_want_index_part() and saw
that I need to add a flag to message_part_flags. What value should I choose?
My first approach was to follow your scheme and set
MESSAGE_PART_FLAG_ATTACHMENT = 0x16. Is there any problem with this?

I have already changed parse_content_type() to set ctx->part->flags
correctly, but if I use my custom flag Dovecot assumes that all attachment
lines are headers. I also tried setting ctx->part->flags to TEXT, and then
the fts_backend was fed all attachment lines correctly.

I don't know if this is related to the value of
MESSAGE_PART_FLAG_ATTACHMENT or if I am missing something (like setting
block.hdr = NULL, or some more code to handle the new flag).
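
One thing that can be checked independently of Dovecot, assuming
message_part_flags is a set of bit flags: 0x16 is binary 10110, so it sets
three bits at once (0x10, 0x04 and 0x02) rather than one, and it would
overlap any existing flags that use those bits. The helper names in this
sketch are hypothetical, and whether the overlap explains the header
behaviour is not established here:

```c
#include <assert.h>
#include <stdbool.h>

/* True when exactly one bit is set, i.e. the value can serve as a
   stand-alone bit flag. */
bool is_single_bit_flag(unsigned int value)
{
	return value != 0 && (value & (value - 1)) == 0;
}

/* Counts the bits set in value by clearing the lowest one each pass. */
unsigned int bits_set(unsigned int value)
{
	unsigned int n = 0;

	for (; value != 0; value &= value - 1)
		n++;
	return n;
}
```

With these helpers, is_single_bit_flag(0x16) is false and bits_set(0x16)
is 3, while a power of two such as 0x10 passes the check. A new flag would
normally take the next power of two that the enum does not already use.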

Thank you,
Rui Carneiro

-- 
mobile: +351 963446125
mail: rui.arc at gmail.com
mail: ei04073 at fe.up.pt
website: paginas.fe.up.pt/~ei04073

Steffen Kaiser

2009-Apr-23 10:42 UTC

[Dovecot] FTS Plugin design

On Wed, 22 Apr 2009, Rui Carneiro wrote:
>
> I will talk with the developers of those applications about the possibility
> of supporting stdin input (if not supported yet).
>
>> I think the API that fts plugin uses to do the conversion should be
>> generic enough that both approaches would work. Then it would be easier
>> to implement one or another or both eventually.
>
> I think I will try the external applications approach. My available
> development time is not too much.

Actually, considering what the xls-to-HTML converter recently did to our
webmail frontend, I suggest indexing "alien" formats asynchronously,
maybe in a low-priority process. That would not only guard against
potentially long conversion times and heavy resource requirements, but also
keep MUAs from re-initiating the search and forcing the IMAP server to
index the same file several times simultaneously.

Bye,

-- 
Steffen Kaiser

rui.carneiro at portugalmail.net

2009-Apr-23 11:27 UTC

[Dovecot] FTS Plugin design

On Thu, Apr 23, 2009 at 5:47 AM, <tomas at tuxteam.de> wrote:

    Note that some formats might require to seek to some point in the file [1]
    (typically the end), so reading from stdin is awkward (it would require
    stdin to be seekable, so either the app or the caller would have to put
    the whole file somewhere anyway).

    [1] Notably PDF has some index tables at EOF - 1k if I remember
    correctly.

I hadn't thought of that before, but I think you are right. The only
question here is whether to write the data to memory or to disk.
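
For the disk variant, a minimal sketch using only standard C: the incoming
stream is copied into an anonymous temporary file, which can then be
seeked freely. The function name spool_to_seekable() is hypothetical, and
how the resulting file would be handed to an external converter is left
open:

```c
#include <assert.h>
#include <stdio.h>

/* Copies everything from `in` into an anonymous temporary file and
   returns that file rewound to its start, or NULL on error. */
FILE *spool_to_seekable(FILE *in)
{
	char buf[4096];
	size_t n;
	FILE *tmp;

	assert(in != NULL);
	tmp = tmpfile();
	if (tmp == NULL)
		return NULL;
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		if (fwrite(buf, 1, n, tmp) != n) {
			fclose(tmp);
			return NULL;
		}
	}
	if (ferror(in) || fflush(tmp) != 0) {
		fclose(tmp);
		return NULL;
	}
	rewind(tmp);
	return tmp;
}
```

After spooling, a reader can call fseek(tmp, -1024, SEEK_END) to reach
tables stored near the end of the file, which is not possible on a pipe.
Keeping the data in memory instead avoids disk I/O but costs as much RAM
as the largest attachment.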

Thank you all,
Rui Carneiro

--
Portugalmail, Comunicações S.A.
portugalmail.net

Rui Carneiro

2009-May-05 11:08 UTC

[Dovecot] FTS Plugin design

Hi again,

On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen <tss at iki.fi> wrote:
>  - fts_build_mail() indexes a single mail. It parses the messages and
> returns the data in small blocks. For text/* and message/rfc822 parts
> those blocks are currently sent to FTS backend. This is where I think
> you should look into hooking your attachment parsing. Change
> fts_build_want_index_part() to look for more content-types that you're
> interested in and then before feeding the blocks to FTS backend put them
> through your own converter function, something like:
>
> int attachment_extract_text(struct attachment_extract_context *ctx,
> const struct message_block *input, struct message_block *output);

Let's take the example of an application/pdf content-type. Before I can
convert the PDF data to text, I need to gather all of it. The current
process feeds the FTS backend small pieces of data, which are appended in
the "build_more" functions (e.g. fts_backend_solr_build_more()).

So where should I call attachment_extract_text()? In
fts_backend_solr_build_more(), without appending to cmd until the data has
been extracted? Or should I gather all the data earlier (e.g. in
fts_build_mail()) and send it to the FTS backend all at once?
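
To illustrate the second option, here is a minimal sketch of a growable
buffer that collects the blocks of one MIME part so the whole part can be
converted in a single step. struct part_buffer and its functions are
hypothetical names, not part of Dovecot, which has its own buffer API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer collecting one MIME part's body. */
struct part_buffer {
	unsigned char *data;
	size_t used;
	size_t alloc;
};

/* Appends one parser block; returns 0 on success, -1 if out of memory. */
int part_buffer_append(struct part_buffer *buf,
		       const unsigned char *data, size_t size)
{
	assert(buf != NULL);
	if (size > buf->alloc - buf->used) {
		size_t new_alloc = buf->alloc == 0 ? 4096 : buf->alloc;
		unsigned char *p;

		/* Double the capacity until the new block fits. */
		while (new_alloc - buf->used < size)
			new_alloc *= 2;
		p = realloc(buf->data, new_alloc);
		if (p == NULL)
			return -1;
		buf->data = p;
		buf->alloc = new_alloc;
	}
	if (size > 0)
		memcpy(buf->data + buf->used, data, size);
	buf->used += size;
	return 0;
}

/* Releases the storage so the buffer can be reused for the next part. */
void part_buffer_reset(struct part_buffer *buf)
{
	free(buf->data);
	buf->data = NULL;
	buf->used = buf->alloc = 0;
}
```

The caller would append every block belonging to the part, run the
converter on buf.data once the part ends, feed the resulting text to the
backend, and then reset the buffer. Collecting in the generic layer rather
than in a backend's build_more function would keep the conversion
independent of Solr.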

I hope I've made myself clear.

Regards,
Rui Carneiro
-- 
Portugalmail, Comunicações S.A.
portugalmail.net

Rui Carneiro

2009-May-22 17:24 UTC

[Dovecot] FTS Plugin design

Hi Timo,

I have almost finished the changes to the fts plugin. So far it seems to
work fine with attachments (extracting their text and sending it to Solr).
The only problem I have is the maximum size of the command (cmd) that we
can send to Solr:

#define SOLR_CMDBUF_SIZE (1024*64)

At the moment, if we send a message bigger than this value, the fts plugin
crashes.

Is there anything on your TODO list that solves this problem?
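
One generic way around a fixed-size command buffer, assuming the crash
comes from appending more data than the buffer can hold, is to flush the
buffer whenever it fills instead of writing past its end. Everything in
this sketch (struct cmdbuf, the callback type, the tiny CMDBUF_SIZE) is
hypothetical and not taken from the Solr backend's code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CMDBUF_SIZE 8 /* tiny on purpose; the value quoted above is 1024*64 */

typedef void flush_fn(const char *data, size_t size, void *context);

struct cmdbuf {
	char data[CMDBUF_SIZE];
	size_t used;
	flush_fn *flush;
	void *context;
};

/* Sends the buffered bytes to the callback and empties the buffer. */
void cmdbuf_flush(struct cmdbuf *buf)
{
	if (buf->used > 0) {
		buf->flush(buf->data, buf->used, buf->context);
		buf->used = 0;
	}
}

/* Appends input of any length, flushing whenever the buffer fills. */
void cmdbuf_append(struct cmdbuf *buf, const char *data, size_t size)
{
	assert(buf != NULL && buf->flush != NULL);
	while (size > 0) {
		size_t room = sizeof(buf->data) - buf->used;
		size_t n = size < room ? size : room;

		memcpy(buf->data + buf->used, data, n);
		buf->used += n;
		data += n;
		size -= n;
		if (buf->used == sizeof(buf->data))
			cmdbuf_flush(buf);
	}
}

/* Example callback: only counts the bytes it is handed. */
void count_bytes(const char *data, size_t size, void *context)
{
	(void)data;
	*(size_t *)context += size;
}
```

In a real backend the callback would write each chunk to the HTTP
connection, which requires the request body to be streamed rather than
built completely in memory before sending.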

Regards,
Rui Carneiro

PS: I will send you my code for your approval as soon as possible :)

-- 
Portugalmail, Comunicações S.A.
portugalmail.net
