On 8/4/2021 1:24 AM, Vincent Brillault wrote:
> On a local dovecot cluster currently hosting roughly 2.1TB of data,
> using Solr as its FTS backend, we now have 256GB of data in Solr, split
> in 12 shard (to which replication adds 256GB of data through 12
> additional cores).
>
> I'm now trying to see if we can optimize that data. Looking at one core
> at random (22G), I see that the data is split mostly between
> - .pos files: 12G
> - .tim files: 4.2G
> - .doc files: 3.8G
> - .cfs files: 1.8G
>
> Looking around a bit, I found
> https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat.html
> (which is unfortunately a bit outdated I think) that explains each file
> content:
> - .tim: Term Dictionary
> - .tip: Term Index
> - .doc: Frequencies and Skip Data
> - .pos: Positions
> - .pay: Payloads and Offsets

This is completely off-topic for the dovecot list. I am involved with the Solr project, so I can discuss it. My message will also be off topic here.

You didn't say what version of Solr you're on. That document for Lucene 6.2.0 would be relevant for Solr 6.2.0. There are versions of that document for all Lucene releases, which have been in lock-step with Solr releases since one of the early 3.x versions. (Aside: Solr has been split into its own top-level Apache project, so there is no longer a guarantee moving forward that Solr X.Y.Z will be based on Lucene X.Y.Z)

Not all of the lucene file types will be involved on every install of Solr. It will depend on the configuration.

The .cfs file is a file where all of the other file types for a segment are compounded into a single file. Within that single file, each file type will use the same format as it would if it had its own extension. I'm not completely clear on when Lucene (under Solr's control) will choose the CFS format ...
but I think it happens when the segments are small, not large.

> Looking at Solr documentation on search
> (https://solr.apache.org/guide/8_6/the-standard-query-parser.html) it
> seems that position aware query are written as `"term1 term2"~[0-9]+`.
> Looking at the dovecot code
> (https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c),
> I don't see this kind of query being made, `~` only being used for fuzzy
> search.

Positions are required for a phrase query -- where the query text is in double quotes. The number after ~ on a phrase query refers to phrase slop -- think of it as a fuzziness factor for the phrase, not for each term. As you noticed, dovecot's FTS Solr plugin doesn't currently use phrase queries explicitly, but there's no guarantee that this will always be the case. Position data will only be accessed if it is needed for a query, so if it is not needed it should not affect query performance. I cannot speak as to whether the FTS Solr plugin relies on the autoGeneratePhraseQueries functionality, but if it does, then you definitely want position data in the index. That functionality can do a lot to improve relevancy ranking, so I would expect it to be instrumental in good full-text searching -- disabling positions will probably not help your search results.

If you want an in-depth discussion beyond this email, please subscribe to the solr-user mailing list and ask there.

Note that general Solr recommendations are to have enough space available that the Solr index can triple in size temporarily -- this is to accommodate all possible scenarios for Lucene segment merging. Running Solr on systems with limited disk space is not recommended.

Solr does have an "optimize" operation which will combine all the segments into one, removing space taken up by deleted documents as it works. Lucene calls that operation "forceMerge".
Running an optimize can help performance, but it's extremely resource intensive and can take a long time to run -- performance gets worse before it gets better. Also, the amount of performance gain is not usually significant.

Thanks,
Shawn
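[Editor's illustration] The point that phrase queries need position data while plain term queries do not can be seen in a toy inverted index. This is only a minimal sketch in plain Python, not Lucene's actual data structures or API: `build_index`, `term_query` and `phrase_query` are hypothetical helper names, and only exact phrases (no slop) are handled.

```python
from collections import defaultdict


def build_index(docs, with_positions=True):
    """Toy inverted index: term -> {doc_id: [positions]}.

    With with_positions=False only the doc ids are kept (positions are None),
    loosely mimicking a field indexed without position data.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(doc_id, []).append(pos)
            else:
                index[term][doc_id] = None
    return index


def term_query(index, terms):
    """Docs containing every term, in any order; needs no positions."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase_query(index, terms):
    """Docs containing the terms as an exact, adjacent phrase."""
    hits = set()
    for doc_id in term_query(index, terms):
        postings = [index[term][doc_id] for term in terms]
        if any(p is None for p in postings):
            raise ValueError("indexed without position data; cannot run phrase query")
        # A match starts at `start` if term i occurs at position start + i.
        if any(
            all((start + i) in postings[i] for i in range(1, len(terms)))
            for start in postings[0]
        ):
            hits.add(doc_id)
    return hits


docs = {1: "disk usage report", 2: "usage of the disk"}
full = build_index(docs)
no_pos = build_index(docs, with_positions=False)
```

With these toy documents, the term query for `disk` and `usage` matches both documents on either index, while the phrase query only works on the index that kept positions.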
Vincent Brillault
2021-Aug-05 07:00 UTC
Dovecot - FTS Solr: disk usage & position information?
Dear Shawn,

Thanks for your very complete answer!

> This is completely off-topic for the dovecot list. I am involved with
> the Solr project, so I can discuss it. My message will also be off
> topic here.

Sorry, maybe I didn't explain myself properly. I asked on the dovecot mailing list as I'm interested in:
- The interaction between Solr & dovecot: what dovecot really needs and uses from Solr.
- The reasons for the settings in the schema example in the dovecot repositories.

I think these are still interesting to discuss on the dovecot mailing list, but I'm extremely grateful for your feedback.

> You didn't say what version of Solr you're on. That document for Lucene
> 6.2.0 would be relevant for Solr 6.2.0.

Indeed, I should have. I'm using Solr 8.6, which is clearly not the same as Solr 6.2.0, but when looking at more recent versions of the documentation, I found no information about the use of each file. That's why I mentioned it was slightly outdated.

>> I don't see this kind of query being made, `~` only being used for fuzzy
>> search.
>
> Positions are required for a phrase query -- where the query text is in
> double quotes.

Yes, I discovered that while testing yesterday :D
```field "body" was indexed without position data; cannot run PhraseQuery```

> Right now you noticed that dovecot's FTS Solr plugin doesn't
> explicitly use phrase queries, but there's no guarantee that this will
> always be the case. Position data will only be accessed if it is needed
> for a query, so if it is not needed it should not affect query
> performance.

Of course, if the requirements of dovecot's FTS Solr plugin change, then the schema I'm using will have to change. This is why I'm asking here. Solr is a powerful engine, but searches within IMAP are more restricted.
As far as I understand, dovecot does not make use of all the features of Solr, only of a very small subset, and thus I believe it makes sense to try to optimize the configuration to deliver what it needs without spending too much compute or storage on features dovecot doesn't need.

> I cannot speak as to whether the FTS Solr plugin relies on
> the autoGeneratePhraseQueries functionality, but if it does, then you
> definitely want position data in the index. That functionality can do a
> lot to improve relevancy ranking, so I would expect it to be
> instrumental in good full-text searching -- disabling positions will
> probably not help your search results.

This is the main question and what I don't really understand. If the queries that dovecot generates from IMAP searches improve significantly with position data, then yes, it's clearly required. If they only improve marginally, then a cost/benefit analysis should be done.

Yesterday, I modified my test cluster to use `omitTermFreqAndPositions="true" omitPositions="true"` instead of `autoGeneratePhraseQueries="true"`. This is a painful operation as it requires dropping everything and re-indexing all the data, but at the end of the day, after re-indexing:
- Total disk usage for the test cluster went from 16.0 GB to 9.8 GB, so a 39% reduction in disk usage :)
- No .pos file created in the cores

Basic tests show no obvious change in the search results (after I removed autoGeneratePhraseQueries; before that, it failed in some cases).

Did any other Dovecot user try something similar?
(I've only found one post on the internet raising the question so far :/).

> If you want an in-depth discussion beyond this email, please subscribe
> to the solr-user mailing list and ask there.

Thanks, I'll take you up on your offer for the Solr-specific part, as I need to understand autoGeneratePhraseQueries better :)

> Note that general Solr recommendations are to have enough space
> available that the Solr index can triple in size temporarily -- this is
> to accommodate all possible scenarios for Lucene segment merging.
> Running Solr on systems with limited disk space is not recommended.

Well, it depends on what you define as "limited". I'd love to have infinite storage, but unfortunately every resource is always limited one way or another. Ensuring that each core can temporarily triple in size (required e.g. if one wants to split the shards to distribute them over more nodes) is one thing (that can have a limited impact if the shards are split in small enough sizes). Requiring double the size overall with no operational benefit is another ;). I'm just trying to understand how much storage we'll really need once the cluster is scaled to final use.

Thanks again Shawn for your contribution, it was quite helpful!

Cheers,
Vincent
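[Editor's illustration] The per-extension breakdown and the "no .pos file" check described in this message can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the index directory you pass in depends on your Solr installation (the thread does not state where the cores live).

```python
from collections import Counter
from pathlib import Path


def size_by_extension(index_dir):
    """Sum file sizes (in bytes) under index_dir, grouped by file extension."""
    sizes = Counter()
    for path in Path(index_dir).rglob("*"):
        if path.is_file():
            sizes[path.suffix or "(none)"] += path.stat().st_size
    return sizes


def has_position_files(index_dir):
    """True if any standalone .pos file exists under index_dir.

    Note: positions stored inside compound .cfs files are not detected here.
    """
    return size_by_extension(index_dir).get(".pos", 0) > 0


def report(index_dir):
    """Print extensions from largest to smallest with their share of the total."""
    sizes = size_by_extension(index_dir)
    total = sum(sizes.values())
    for ext, size in sizes.most_common():
        share = 100 * size / total if total else 0.0
        print(f"{ext:>8} {size / 2**30:8.2f} GiB {share:5.1f}%")
```

Calling `report()` on a core's index directory before and after re-indexing gives a comparison similar to the figures above.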
Vincent Brillault
2021-Sep-01 09:27 UTC
Dovecot - FTS Solr: disk usage & position information?
Dear all,

Just a status update, in case this can help others.

We went forward, disabled the indexing of position information, and re-indexed our mail data (over a couple of days to avoid overloading the systems). Before the re-indexing we had 1.33 TiB in our Solr indexes. After re-indexing, we had only 542 GiB: that's a 60% reduction in the storage requirements for our FTS indexes :)

So far, no issues or measurable differences in the quality of the FTS have been reported to us by our users.

From further debugging, as discussed on the solr-user mailing list (https://lists.apache.org/thread.html/rcdf8bb97be0839e57928ad5fa34501ec8a73392c11248db91206bc33%40%3Cusers.solr.apache.org%3E), I've come to the conclusion that, with the current integration between Dovecot and Solr (especially the fact that `"` is escaped), it's impossible to trigger phrase queries from user queries as long as autoGeneratePhraseQueries is false.

I've attached the schema.xml and solrconfig.xml we are now using with Solr 8.6.0, in case there is any interest from others. Let me know if you prefer a MR to update the xmls present in https://github.com/dovecot/core/tree/master/doc.

Cheers,
Vincent

-------------- next part --------------
A non-text attachment was scrubbed...
Name: solrconfig.xml
Type: text/xml
Size: 2856 bytes
Desc: not available
URL: <https://dovecot.org/pipermail/dovecot/attachments/20210901/cd1b4a30/attachment.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: schema.xml
Type: text/xml
Size: 3478 bytes
Desc: not available
URL: <https://dovecot.org/pipermail/dovecot/attachments/20210901/cd1b4a30/attachment-0001.xml>
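[Editor's illustration] The reduction figures quoted in this thread can be checked with a few lines of arithmetic. This is only a sketch: `reduction_percent` is a hypothetical helper, and the only assumption beyond the numbers in the messages is the usual conversion of 1 TiB = 1024 GiB.

```python
GIB_PER_TIB = 1024  # binary units: 1 TiB = 1024 GiB


def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before` (same unit for both)."""
    return 100 * (before - after) / before


# Test cluster (earlier message): 16.0 GB -> 9.8 GB
test_cluster = reduction_percent(16.0, 9.8)

# Production indexes (this message): 1.33 TiB -> 542 GiB, both expressed in GiB
production = reduction_percent(1.33 * GIB_PER_TIB, 542)

print(f"test cluster: {test_cluster:.1f}% smaller")
print(f"production:   {production:.1f}% smaller")
```

Both values round to the percentages given in the messages (about 39% and about 60%).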