Hi all,

I am currently developing some changes to the solr plugin. I want the plugin to also index the content of attachments. I have started looking at the plugin's source, but I am having trouble understanding how it works. I have not yet understood the plugin's design or how plugins are called from the core system, and I was wondering if anyone could help me with that.

Sorry if these questions sound stupid, but I am a newcomer to Dovecot.

Regards,
Rui Carneiro
On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote:
> I have not yet understood the plugin's design or how plugins are
> called from the core system, and I was wondering if anyone could help
> me with that.

fts-storage.c hooks into all the functions in the mail-storage API that it needs to. Currently indexing isn't done while messages are being saved, but instead just before searching. The searching functions are:

- fts_mailbox_search_init() tries to figure out if FTS can optimize the search. If it does, it tries to figure out if the FTS index is up-to-date and if not, starts the search.

- fts_mailbox_search_next_nonblock() continues the indexing (or searching after indexing) for a while. The idea is that the IMAP connection is able to process other commands while doing a long-running search. So the fts plugin indexes FTS_SEARCH_NONBLOCK_COUNT (50) messages at a time. It would be nice if that value was dynamically calculated and also based on bytes instead of messages, but that's maybe too much trouble.

- fts_mailbox_search_next_update_seq() uses the fts search results and updates mail-storage's search stuff so that it doesn't go through messages that don't match.

- fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to the FTS backend.

This is where I think you should look into hooking your attachment parsing. Change fts_build_want_index_part() to look for more content-types that you're interested in and then, before feeding the blocks to the FTS backend, put them through your own converter function, something like:

    int attachment_extract_text(struct attachment_extract_context *ctx,
                                const struct message_block *input,
                                struct message_block *output);
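A minimal, self-contained sketch of the hook described above. It is not Dovecot code: `struct message_block` is a simplified stand-in with only `data` and `size`, and `struct attachment_extract_context`, `want_index_part()` and the placeholder converter body are hypothetical names chosen for illustration. Only the `attachment_extract_text()` signature comes from the message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Simplified stand-in for Dovecot's struct message_block (assumption:
   only the two fields needed for this sketch are modelled). */
struct message_block {
	const unsigned char *data;
	size_t size;
};

/* Hypothetical per-part converter state. */
struct attachment_extract_context {
	const char *content_type;
};

/* Hypothetical counterpart of fts_build_want_index_part(): accept text
   parts, message/rfc822 and one additional attachment type. */
bool want_index_part(const char *content_type)
{
	if (strncasecmp(content_type, "text/", 5) == 0)
		return true;
	if (strcasecmp(content_type, "message/rfc822") == 0)
		return true;
	return strcasecmp(content_type, "application/pdf") == 0;
}

/* Converter with the signature suggested above. This placeholder passes
   text parts through unchanged and produces no output for other types;
   a real converter would extract text there.
   Returns 0 on success, -1 on error. */
int attachment_extract_text(struct attachment_extract_context *ctx,
			    const struct message_block *input,
			    struct message_block *output)
{
	if (ctx == NULL || input == NULL || output == NULL)
		return -1;
	if (strncasecmp(ctx->content_type, "text/", 5) == 0) {
		*output = *input;
	} else {
		output->data = NULL;
		output->size = 0;
	}
	return 0;
}
```

The caller would check `want_index_part()` once per MIME part and then run every block of an accepted part through the converter before handing the output to the backend.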
Thank you for all the tips. The design looks clearer to me now.

I have one more question. I looked into fts_build_want_index_part() and saw that I need to add some flags to message_part_flags. What values should I choose? My first approach was to follow your scheme and set MESSAGE_PART_FLAG_ATTACHMENT = 0x16. Is there any problem with this?

I have already changed parse_content_type() to set ctx->part->flags correctly, but if I use my custom flag, Dovecot assumes that all attachment lines are headers. I also tried setting ctx->part->flags to TEXT, and in that case the FTS backend was fed all attachment lines correctly.

I don't know whether this is related to the value of MESSAGE_PART_FLAG_ATTACHMENT or whether I am missing something (such as setting block.hdr = NULL, or some more code to handle the new flags).

Thank you,
Rui Carneiro
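One thing that may be worth ruling out for the flag value: flag enums of this kind normally use one bit per flag, and 0x16 is 0x10 | 0x04 | 0x02, so it sets three bits at once. The sketch below uses hypothetical flag names and values (they are illustrative and not copied from Dovecot's message_part_flags) to show how such a value overlaps existing single-bit flags.

```c
#include <stdbool.h>

/* Hypothetical flag set; names and values are illustrative only. */
enum part_flags {
	PART_FLAG_A          = 0x01,
	PART_FLAG_B          = 0x02,
	PART_FLAG_C          = 0x04,
	PART_FLAG_TEXT       = 0x08,
	PART_FLAG_E          = 0x10,
	/* A new flag should use a bit that no other flag uses. */
	PART_FLAG_ATTACHMENT = 0x100
};

/* True if value has exactly one bit set, i.e. is usable as a flag. */
bool is_single_bit(unsigned int value)
{
	return value != 0 && (value & (value - 1)) == 0;
}

/* True if value shares at least one bit with flag. */
bool overlaps(unsigned int value, unsigned int flag)
{
	return (value & flag) != 0;
}
```

With these definitions, `is_single_bit(0x16)` is false and `overlaps(0x16, PART_FLAG_B)` is true, so any code that tests for the flags at 0x02, 0x04 or 0x10 would also match a part marked with 0x16. Whether that explains the header behaviour depends on the real flag values in the Dovecot version being patched.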
On Wed, 22 Apr 2009, Rui Carneiro wrote:
> I will talk with the developers of those applications about the
> possibility of supporting stdin input (if not supported yet).
>
>> I think the API that fts plugin uses to do the conversion should be
>> generic enough that both approaches would work. Then it would be easier
>> to implement one or another or both eventually.
>
> I think I will try the external applications approach. My available
> development time is limited.

Actually, considering what the xls-to-HTML converter recently did to our webmail frontend, I suggest indexing "alien" formats asynchronously, perhaps in a low-priority process. This would not only avoid potentially long conversion times and high resource requirements, but also prevent MUAs from re-initiating the search and forcing the IMAP server to index the same file several times simultaneously.

Bye,

--
Steffen Kaiser
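One possible way to keep two processes from converting the same attachment at the same time is an exclusive lock file, sketched below with plain POSIX calls. The function names and the idea of one lock file per attachment are assumptions made for this illustration; nothing like this is described as existing in the fts plugin.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to claim the conversion of one attachment. lock_path is a
   hypothetical per-attachment lock file name. Returns an open fd if
   this process got the claim, or -1 with errno == EEXIST if another
   process already holds it. */
int claim_conversion(const char *lock_path)
{
	return open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/* Release the claim once the converted text has been indexed. */
void release_conversion(const char *lock_path, int fd)
{
	(void)close(fd);
	(void)unlink(lock_path);
}
```

A process that fails to get the claim can skip the attachment and let the search continue. A lock file left behind by a crashed process would need separate handling, which this sketch does not attempt.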
On Thu, Apr 23, 2009 at 5:47 AM, <tomas at tuxteam.de> wrote:
> Note that some formats might require seeking to some point in the file [1]
> (typically the end), so reading from stdin is awkward (it would require
> stdin to be seekable, so either the app or the caller would have to put
> the whole file somewhere anyway).
>
> [1] Notably PDF has some index tables at EOF - 1k if I remember correctly.

I hadn't thought of that before, but I think you are right. The only question now is whether to write the data to memory or to disk.

Thank you all,
Rui Carneiro

--
Portugalmail, Comunicações S.A.
www.portugalmail.net
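For the "to disk" option, a minimal sketch using only standard C: the non-seekable input is copied into an anonymous temporary file, after which the converter can seek to the end. The function names are hypothetical, and `tmpfile()` is used only as the simplest portable choice, not because the plugin is known to do it this way.

```c
#include <stdio.h>

/* Copy a possibly non-seekable stream (such as stdin) into an anonymous
   temporary file so that the converter can seek in it.
   Returns the rewound temporary file, or NULL on error. */
FILE *spool_to_seekable(FILE *in)
{
	char buf[4096];
	size_t n;
	FILE *tmp = tmpfile();

	if (tmp == NULL)
		return NULL;
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		if (fwrite(buf, 1, n, tmp) != n) {
			fclose(tmp);
			return NULL;
		}
	}
	if (ferror(in) || fflush(tmp) != 0) {
		fclose(tmp);
		return NULL;
	}
	rewind(tmp);
	return tmp;
}

/* Read up to len bytes from the end of a seekable file, as a PDF reader
   would do to locate the tables near the end of the file.
   Returns the number of bytes read, or -1 on error. */
long read_tail(FILE *f, char *out, long len)
{
	long size;

	if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0)
		return -1;
	if (len > size)
		len = size;
	if (fseek(f, -len, SEEK_END) != 0)
		return -1;
	return (long)fread(out, 1, (size_t)len, f);
}
```

Keeping the data in memory instead avoids the disk write but ties memory use to the attachment size, so a size threshold that switches between the two would be a reasonable compromise.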
Hi again,

On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen <tss at iki.fi> wrote:
> - fts_build_mail() indexes a single mail. It parses the messages and
> returns the data in small blocks. For text/* and message/rfc822 parts
> those blocks are currently sent to FTS backend. This is where I think
> you should look into hooking your attachment parsing. Change
> fts_build_want_index_part() to look for more content-types that you're
> interested in and then before feeding the blocks to FTS backend put them
> through your own converter function, something like:
>
> int attachment_extract_text(struct attachment_extract_context *ctx,
> const struct message_block *input, struct message_block *output);

Let's take the example of an application/pdf content type. Before I can convert the PDF data to text, I need to gather all of it first. The current process feeds the FTS backend with small pieces of data, which are appended in the "build_more" functions (e.g. fts_backend_solr_build_more()).

So where should I call attachment_extract_text()? In fts_backend_solr_build_more(), without appending to cmd until the data has been extracted? Or should I gather all the data earlier (e.g. in fts_build_mail()) and send it to the FTS backend all at once?

I hope I've made myself clear.

Regards,
Rui Carneiro
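The second option in the question (gather first, convert once) can be sketched as a growing buffer that collects every block of a part and hands the complete data to a converter callback at the end of the part. All names here are hypothetical, and the example callback only records the size it receives; it stands in for a real text extractor.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator for the raw bytes of one attachment part. */
struct part_buffer {
	unsigned char *data;
	size_t used;
	size_t alloc;
};

/* Append one parser block, growing the buffer geometrically.
   Returns 0 on success, -1 if memory runs out. */
int part_buffer_append(struct part_buffer *buf,
		       const unsigned char *data, size_t size)
{
	if (size == 0)
		return 0;
	if (size > buf->alloc - buf->used) {
		size_t new_alloc = buf->alloc == 0 ? 4096 : buf->alloc;
		unsigned char *p;

		while (size > new_alloc - buf->used)
			new_alloc *= 2;
		p = realloc(buf->data, new_alloc);
		if (p == NULL)
			return -1;
		buf->data = p;
		buf->alloc = new_alloc;
	}
	memcpy(buf->data + buf->used, data, size);
	buf->used += size;
	return 0;
}

/* Called once at the end of the part: pass the whole attachment to the
   converter callback, then reset the buffer for the next part. */
int part_buffer_finish(struct part_buffer *buf,
		       int (*convert)(const unsigned char *data, size_t size,
				      void *context),
		       void *context)
{
	int ret = convert(buf->data, buf->used, context);

	free(buf->data);
	memset(buf, 0, sizeof(*buf));
	return ret;
}

/* Example callback for illustration: records the size it was given. */
int record_size(const unsigned char *data, size_t size, void *context)
{
	(void)data;
	*(size_t *)context = size;
	return 0;
}
```

Gathering in the caller keeps the backend unaware of attachment formats, at the cost of holding a whole attachment in memory until the part ends.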
Hi Timo,

I have almost finished the changes to the fts plugin. So far it seems to work fine with attachments (extracting them and sending them to Solr).

My only problem is the maximum size of the command (cmd) that we can send to Solr:

#define SOLR_CMDBUF_SIZE (1024*64)

Currently, if we send a message bigger than this value, the fts plugin crashes. Is there anything on your TODO list that solves this problem?

Regards,
Rui Carneiro

PS: I will send you my code for your approval as soon as possible :)
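One common way around a fixed-size command buffer is to flush it whenever it fills up instead of assuming the whole message fits. The sketch below shows that idea in isolation, assuming the transport can send the request body in several pieces. `struct cmd_buffer`, `cmd_buffer_append()` and the flush callback are hypothetical and are not the plugin's actual code; only the 64 kB size comes from the message.

```c
#include <stddef.h>
#include <string.h>

#define CMDBUF_SIZE (1024*64) /* same value as SOLR_CMDBUF_SIZE above */

/* Hypothetical fixed-size command buffer with a flush callback that
   sends the buffered bytes onwards (for example into the request). */
struct cmd_buffer {
	char data[CMDBUF_SIZE];
	size_t used;
	int (*flush)(const char *data, size_t size, void *context);
	void *context;
};

/* Append size bytes, flushing whenever the buffer fills up, so input of
   any length is accepted without writing past the end of data[].
   Returns 0 on success, -1 if the flush callback fails. */
int cmd_buffer_append(struct cmd_buffer *buf, const char *data, size_t size)
{
	while (size > 0) {
		size_t room = sizeof(buf->data) - buf->used;
		size_t n = size < room ? size : room;

		memcpy(buf->data + buf->used, data, n);
		buf->used += n;
		data += n;
		size -= n;
		if (buf->used == sizeof(buf->data)) {
			if (buf->flush(buf->data, buf->used, buf->context) < 0)
				return -1;
			buf->used = 0;
		}
	}
	return 0;
}

/* Example flush callback for illustration: only counts flushed bytes. */
int count_flushed(const char *data, size_t size, void *context)
{
	(void)data;
	*(size_t *)context += size;
	return 0;
}
```

The remaining bytes in `data[]` still need one final flush when the command is complete. Whether this applies directly depends on how the plugin builds and posts its request, which the message does not show.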