On 7 Jan 2019, at 16.05, Joan Moreau via dovecot <dovecot at dovecot.org> wrote:
>
> Hi
>
> Can anyone answer these specifically?
>
> Q1: get_last_uid -> Is this the last UID indexed (which may not be the greatest value), or the greatest value (which may not be the latest)? (The code of the existing plugins is unclear about this; Solr looks for the greatest, for instance.)

All the mails are always supposed to be indexed from the beginning to the last indexed mail. If there's a gap, indexer first indexes all the missing mails. So the latest UID is supposed to be the greatest UID. (Supporting out-of-order indexing would be rather difficult to keep track of.)

> Q2: When indexing an email, the data is not passed by "build_key". Why so? What is the link with "build_more"?

The idea is that it calls something like:

 - build_key(type=hdr, hdr_name=From)
 - build_more("tss at iki.fi")
 - build_key(type=hdr, hdr_name=Subject)
 - build_more("Re: Solr -> Xapian ?")
 - build_key(type=body_part)
 - build_more("message body piece")
 - build_more("message body piece2")
 ...

> Q3: Searching/lookup: the header in which to look (must be at least among "cc, to, from, subject, body") does not appear in the 'struct' data. Where can I find it?

lookup() gets struct mail_search_arg *args, which contains the entire IMAP SEARCH query. This could be used for more or less complex query builders.

In case of a single header search, you should have args->args->hdr_field_name contain the header name and args->args->value.str contain the content you're searching for.

> Q4: Refresh: this is very unclear. How could there not be the "latest" view of the index? What is the real meaning of this function?

In case of Xapian it might not matter if it automatically refreshes its indexes between each query. But with some other indexes this could happen:

 - IMAP session is opened
 - IMAP SEARCH is run, which opens and searches the index
 - a new mail is delivered to the mailbox and indexed
 - IMAP SEARCH is run. Without refresh() it doesn't see the newly indexed mail and doesn't include it in the search results.

> Q5: Rescan: is it just about removing all indexes for a specific mailbox?

It's run when "doveadm fts rescan" is run manually. Usually that's only run manually to fix up some brokenness. So it's intended to verify that the current mailbox contents match the FTS indexes:

 - If there are any mails in FTS index that no longer exist in the actual mailbox, delete those mails from FTS
 - If FTS is missing any mails in the middle of the mailbox, make sure that the next mailbox indexing will index those missing mails. I think currently this basically means reindexing all the mails since the first missing mail, even the mails that are already in the index.

fts-lucene implements this, but other FTS backends are lazy and simply rebuild all mails. Actually fts-solr is bad because it doesn't even delete the extra mails.

> Q6: lookup_multi: isn't the function the same for all plugins (see below)?
>> and finally, for fts_backend_xxxx_lookup_multi, why is that backend dependent?

This function is called only when searching in virtual folders. So for example the virtual "All mails" folder, which would contain all mails in all folders. In that case boxes[] would contain a list of all the user's folders, except Trash and Spam. If lookup_multi() isn't implemented (left as NULL), the search is run separately via lookup() for each folder. With lookup_multi() there can be just one lookup, and the backend can filter only the wanted folders and return them directly. So it's an optimization for FTS indexes that support user-global searches rather than only per-folder searches.

>> static int fts_backend_xapian_lookup_multi(struct fts_backend *_backend, struct mailbox *const boxes[], struct mail_search_arg *args, enum fts_lookup_flags flags, struct fts_multi_result *result)
>> {
>>         int i = 0;
>>
>>         while (boxes[i] != NULL) {
>>                 if (fts_backend_xapian_lookup(_backend, boxes[i], args, flags, &result->box_results[i]) < 0)
>>                         return -1;
>>                 i++;
>>         }
>>         return 0;
>> }

See fts_backend_lookup_multi() - if you leave lookup_multi=NULL it basically does this.

>> For "rescan" and "optimize", wouldn't it be the Dovecot core that indicates which mails are to be dismissed (expunged), or asks again for indexing of a particular (or every) UID? Why would the backend be aware of the transactions on the mailbox?

rescan() is about fixing up a more or less broken index, or simply to verify that it's all ok. So core doesn't know what messages exist in the FTS index and can't request specific reindexing or expunging. I guess an alternative API could have been to have functions that iterate through all mails in the index, and use that to implement rescan in core. Now thinking about it, that sounds like a simpler and better way.

optimize() is currently done only when explicitly running "doveadm fts optimize", which requests running a slower index optimization. Depends on the FTS backend whether this is useful or not.

>> There is already "fts_backend_xxx_update_expunge", so I believe the management of the expunged messages is *NOT* in the backend, right?

Normally when mails are expunged, update_expunge() is called to notify FTS backend that it should delete the mail also from FTS index.

>> .flags = FTS_BACKEND_FLAG_NORMALIZE_INPUT,
>> -> what other flags?

You probably want to use FTS_BACKEND_FLAG_FUZZY_SEARCH only, like Solr. See enum fts_backend_flags in fts-api-private.h
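The per-folder fallback described above for lookup_multi can be sketched in a few lines of C. This is only an illustration under stated assumptions: the mock_* types and search_boxes() are hypothetical stand-ins invented for this sketch, not Dovecot's real structs or functions, and the real fts_backend_lookup_multi() does more (for example, it allocates the per-folder results).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT Dovecot's real structs. */
struct mock_box {
    const char *name;
    int match_count;   /* hits a search in this folder yields; <0 simulates an error */
};

struct mock_result {
    int hits;
};

struct mock_backend {
    int (*lookup)(struct mock_backend *be, const struct mock_box *box,
                  const char *query, struct mock_result *result);
    /* Optional: NULL means the backend has no user-global search. */
    int (*lookup_multi)(struct mock_backend *be,
                        const struct mock_box *const boxes[],
                        const char *query, struct mock_result results[]);
};

/* Per-folder search: report the folder's preset hit count. */
static int mock_lookup(struct mock_backend *be, const struct mock_box *box,
                       const char *query, struct mock_result *result)
{
    (void)be;
    (void)query;
    if (box->match_count < 0)
        return -1;
    result->hits = box->match_count;
    return 0;
}

/* Use lookup_multi when the backend provides it; otherwise run one
   lookup() per folder of the NULL-terminated boxes[] array. */
static int search_boxes(struct mock_backend *be,
                        const struct mock_box *const boxes[],
                        const char *query, struct mock_result results[])
{
    size_t i;

    assert(boxes[0] != NULL);
    if (be->lookup_multi != NULL)
        return be->lookup_multi(be, boxes, query, results);

    for (i = 0; boxes[i] != NULL; i++) {
        if (be->lookup(be, boxes[i], query, &results[i]) < 0)
            return -1;
    }
    return 0;
}
```

A backend that can only search one folder at a time would leave lookup_multi as NULL and rely on the loop; a backend with a user-wide index could answer the whole boxes[] list in one query.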
Ok.

Additional questions:

 - For rescan: who is responsible for passing the mails again? Does the Dovecot core send all the emails to be indexed again, or must the FTS backend somehow access the mailbox and read all the emails itself? Wouldn't simply saying "delete the whole index, and get_last_uid is now 0" be the easy way? Or must the backend process all the emails (and block the current thread, since a mailbox may be quite large)?

 - For get_last_uid: this is still unclear. "If there's a gap, indexer first indexes all the missing mails" -> this means that at a certain point the indexer may be rebuilding an earlier email, so the *last* UID is something different from the max. And how does the indexer know whether there is a gap without calling the FTS backend (which it does not, as there is no function for that)?

On 2019-01-08 04:24, Timo Sirainen wrote:
> [...]
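One way to read the answers so far: if the backend only ever reports the highest UID it holds, the "lazy" rescan Timo mentions (rebuild everything) amounts to resetting that value to 0, after which the next indexing run passes every mail again. The sketch below illustrates that reading only. All mock_* names are hypothetical, and it assumes the core walks the mailbox UIDs in ascending order and indexes everything above get_last_uid; the thread does not confirm how the real indexer detects gaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal backend state: the highest UID indexed so far. */
struct mock_fts_index {
    uint32_t last_uid;
    size_t indexed_count;
};

static uint32_t mock_get_last_uid(const struct mock_fts_index *idx)
{
    return idx->last_uid;
}

/* "Lazy" rescan: forget everything, so the next run reindexes all mails. */
static void mock_rescan(struct mock_fts_index *idx)
{
    idx->last_uid = 0;
    idx->indexed_count = 0;
}

/* Index every UID above last_uid. uids[] must be sorted ascending.
   Returns the number of mails indexed by this call. */
static size_t mock_index_mailbox(struct mock_fts_index *idx,
                                 const uint32_t *uids, size_t n)
{
    size_t i, added = 0;
    uint32_t last = mock_get_last_uid(idx);

    for (i = 0; i < n; i++) {
        assert(i == 0 || uids[i] > uids[i - 1]);
        if (uids[i] <= last)
            continue;   /* already indexed */
        /* A real backend would receive build_key/build_more calls here. */
        idx->last_uid = uids[i];
        idx->indexed_count++;
        added++;
    }
    return added;
}
```

Under these assumptions the backend never reads the mailbox itself: after mock_rescan() the caller simply indexes from the start again on its next run.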
Also,

1 - What does "subargs" represent in mail_search_args?

2 - I wrote my first code, and the error I get when compiling within the Dovecot architecture is:

  In file included from fts-xapian-plugin.c:4:
  fts-xapian-plugin.h:6:1: error: unknown type name 'using'; did you mean 'uint'?
   using namespace std;

If I remove this, the Xapian library also complains about the "namespace" keyword:

  In file included from /usr/include/xapian.h:47,
                   from fts-backend-xapian.c:11:
  /usr/include/xapian/types.h:31:1: error: unknown type name 'namespace'; did you mean 'i_isspace'?
   namespace Xapian {

Can someone shed some light on this?

Thanks

On 2019-01-09 09:58, Joan Moreau via dovecot wrote:
> [...]
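Both errors above come from a C compiler reading C++ source: "using" and "namespace" are C++ keywords, and xapian.h is a C++ header, so any file that includes it has to be compiled as C++. A common pattern for mixing the two (offered here as a sketch, not something the thread prescribes) is to keep the C++ in its own translation unit and expose a small C-linkage interface through an opaque handle. Every xap_* name below is hypothetical, and the implementation is a plain-C stand-in so that the sketch compiles on its own; in a real plugin that part would live in a C++ file that includes xapian.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* --- What a C-safe wrapper header could declare (hypothetical names) --- */
#ifdef __cplusplus
extern "C" {   /* only seen by a C++ compiler; gives the functions C linkage */
#endif

struct xap_handle;   /* opaque: C code never sees the C++ types behind it */

struct xap_handle *xap_open(const char *path);
int xap_add_term(struct xap_handle *h, const char *term);
size_t xap_term_count(const struct xap_handle *h);
void xap_close(struct xap_handle *h);

#ifdef __cplusplus
}
#endif

/* --- Stand-in implementation in plain C (assumption for this sketch).
   The real one would be C++ and would wrap the Xapian objects. --- */
struct xap_handle {
    const char *path;
    size_t terms;
};

struct xap_handle *xap_open(const char *path)
{
    struct xap_handle *h;

    assert(path != NULL);
    h = malloc(sizeof(*h));
    if (h == NULL)
        return NULL;
    h->path = path;
    h->terms = 0;
    return h;
}

/* Count a term; reject NULL handles and empty terms. */
int xap_add_term(struct xap_handle *h, const char *term)
{
    if (h == NULL || term == NULL || *term == '\0')
        return -1;
    h->terms++;
    return 0;
}

size_t xap_term_count(const struct xap_handle *h)
{
    return h->terms;
}

void xap_close(struct xap_handle *h)
{
    free(h);
}
```

With this split, the C files of the plugin include only the wrapper header and never see a C++ keyword. The alternative is to compile the whole plugin as C++ and wrap the Dovecot C headers in extern "C" on the C++ side.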
To get back to this "build_more" function, this is what I receive (see below). Two points:

 - The header name seems to be appended to the end of *data, though not always. Why?
 - Where is the body?

Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(mime-version)=1.0MIME-VERSION
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(mime-version)=1.0MIME-VERSION
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: Start indexing 'Sent' (/data/mail/grosjo.net/jom/xapian-indexes/db_49fdf110ec9bc14c375b0000d6a3092d)
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(content-type)=MULTIPART/ALTERNATIVE; BOUNDARY="=_87A48D791CC8B262204294719234352F"CONTENT-TYPE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(content-type)=MULTIPART/ALTERNATIVE; BOUNDARY="=_87A48D791CC8B262204294719234352F"CONTENT-TYPE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(date)=TUE, 22 JAN 2019 09:25:49 +0100DATE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(date)=TUE, 22 JAN 2019 09:25:49 +0100DATE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(from)="JOAN MOREAU" <JOM at GROSJO.NET>
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(from)="JOAN MOREAU" <JOM at GROSJO.NET>
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(to)="JOAN MOREAU" <JOAN.MOREAU at M4X.ORG>
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(to)="JOAN MOREAU" <JOAN.MOREAU at M4X.ORG>
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(subject)=TESTSUBJECT
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(subject)=TESTSUBJECT
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(user-agent)=ROUNDCUBE WEBMAIL/1.4-GITUSER-AGENT
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(user-agent)=ROUNDCUBE WEBMAIL/1.4-GITUSER-AGENT
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(message-id)=<1C18523A5A00849C8BE7970F44276F1B at GROSJO.NET>MESSAGE-ID
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(message-id)=<1C18523A5A00849C8BE7970F44276F1B at GROSJO.NET>MESSAGE-ID
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(x-sender)=JOM at GROSJO.NETX-SENDER
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(x-sender)=JOM at GROSJO.NETX-SENDER
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(content-transfer-encoding)=7BITCONTENT-TRANSFER-ENCODING
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(content-transfer-encoding)=7BITCONTENT-TRANSFER-ENCODING
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(content-type)=TEXT/PLAIN; CHARSET=UTF-8; FORMAT=FLOWEDCONTENT-TYPE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(content-type)=TEXT/PLAIN; CHARSET=UTF-8; FORMAT=FLOWEDCONTENT-TYPE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(content-transfer-encoding)=QUOTED-PRINTABLECONTENT-TRANSFER-ENCODING
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(content-transfer-encoding)=QUOTED-PRINTABLECONTENT-TRANSFER-ENCODING
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA(content-type)=TEXT/HTML; CHARSET=UTF-8CONTENT-TYPE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: DATA2(content-type)=TEXT/HTML; CHARSET=UTF-8CONTENT-TYPE
Jan 22 08:25:50 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: Done indexing 'Sent' (1 msgs in 3 ms, rate: 333.3)
Jan 22 08:25:50 gjserver dovecot[20984]: imap-login: Login: user=<jom at grosjo.net>, method=PLAIN, rip=127.0.0.1, lip=127.0.0.1, mpid=21699, secured, session=<99GeuQeANlt/AAAB>
Jan 22 08:25:51 gjserver dovecot[20984]: imap(jom at grosjo.net)<21699><99GeuQeANlt/AAAB>: Logged out in=20201 out=567147 deleted=0 expunged=0 trashed=0 hdr_count=200 hdr_bytes=62139 body_count=0 body_bytes=0
Jan 22 08:25:51 gjserver dovecot[20984]: indexer-worker(jom at grosjo.net)<20998><jyqauQeAMlt/AAAB:qHyjLY7TRlwGUgAA0thIag>: Indexed 1 messages in Sent (UIDs 60585..60585)
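A possible explanation for the header name trailing the value in the log above, not confirmed anywhere in this thread: if build_more hands the backend a pointer plus a length rather than a NUL-terminated string, then logging the chunk with a plain "%s" keeps reading past the end of the value into whatever follows it in memory. The sketch below shows the length-bounded alternative. format_chunk() and its parameters are hypothetical names for illustration, and the buffer layout used in the usage note is an assumption that mirrors the log output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a (data, size) chunk for logging without reading past `size`.
   Returns the number of characters written (excluding the NUL),
   or -1 if the output buffer is too small or formatting fails. */
static int format_chunk(char *out, size_t out_size, const char *hdr_name,
                        const unsigned char *data, size_t size)
{
    int n;

    assert(out != NULL && out_size > 0);
    /* "%.*s" prints at most `size` bytes, so the chunk does not need
       to be NUL-terminated. A plain "%s" would run on until the next NUL. */
    n = snprintf(out, out_size, "DATA(%s)=%.*s",
                 hdr_name, (int)size, (const char *)data);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

For a buffer holding "1.0MIME-VERSION" where only the first 3 bytes belong to the value, format_chunk() called with size 3 yields "DATA(mime-version)=1.0", whereas printing the same pointer with "%s" would give the "1.0MIME-VERSION" seen in the log.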