Vincent Brillault
2021-Aug-04  07:24 UTC
Dovecot - FTS Solr: disk usage & position information?
Dear all,
On a local dovecot cluster currently hosting roughly 2.1TB of data,
using Solr as its FTS backend, we now have 256GB of data in Solr, split
across 12 shards (to which replication adds another 256GB of data through 12
additional cores).
I'm now trying to see if we can optimize that data. Looking at one core
at random (22G), I see that the data is split mostly between
- .pos files: 12G
- .tim files: 4.2G
- .doc files: 3.8G
- .cfs files: 1.8G
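
A per-extension breakdown like the one above can be reproduced with a short script. This is only a minimal sketch: it assumes the core's index files all live under one directory tree, and the function names `size_by_extension` and `report` are made up for illustration.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Return total bytes per file extension under index_dir (recursive)."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # "_0_Lucene50_0.pos" -> ".pos"; files such as "segments_5"
            # have no extension and are grouped under "(none)".
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def report(index_dir):
    """Print one line per extension, largest total first."""
    totals = size_by_extension(index_dir)
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{ext:8} {size / 2**30:8.2f} GiB")
```

Calling `report()` on the core's index directory (wherever that is on a given install) lists the extensions in descending order of size.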
Looking around a bit, I found
https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat.html
(which is unfortunately a bit outdated, I think) that explains the
content of each file:
- .tim: Term Dictionary
- .tip: Term Index
- .doc: Frequencies and Skip Data
- .pos: Positions
- .pay: Payloads and Offsets
So clearly the file naming convention has changed, but still, if .pos is
really position information ("lists of positions that each term occurs
at within documents."), this sounds rather useless for the dovecot
integration.
Looking at Solr documentation on search
(https://solr.apache.org/guide/8_6/the-standard-query-parser.html) it
seems that position-aware queries are written as `"term1 term2"~[0-9]+`.
Looking at the dovecot code
(https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c),
I don't see this kind of query being made, `~` only being used for fuzzy
search.
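
To make the distinction between the two uses of `~` concrete, here is a small sketch that builds both query forms and flags query strings that contain a multi-term quoted phrase. The helper names are hypothetical, and the regular expression is a rough heuristic rather than a real parser for the Solr query syntax.

```python
import re


def proximity_query(terms, slop):
    """Build a quoted phrase with slop, e.g. "term1 term2"~3."""
    return '"{}"~{}'.format(" ".join(terms), slop)


def fuzzy_query(term, max_edits=None):
    """Build a fuzzy single-term query: term~ or term~1."""
    return f"{term}~" if max_edits is None else f"{term}~{max_edits}"


# Heuristic: a double-quoted string holding at least two
# whitespace-separated terms is treated as a phrase.
_PHRASE = re.compile(r'"[^"\s]+(?:\s+[^"\s]+)+"')


def looks_position_dependent(query):
    """Return True if the query string appears to contain a phrase."""
    return bool(_PHRASE.search(query))
```

Running `looks_position_dependent` over a sample of logged queries would give a rough idea of whether phrases are ever sent to Solr, with the caveat that it cannot see phrase queries generated inside Solr itself.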
Has anyone ever tried to set omitTermFreqAndPositions or omitPositions
to true for the text fields in the Solr schema? It sounds like this
could greatly reduce the disk space used by Solr without losing any
feature. The only thing I'm not too clear about is the
"autoGeneratePhraseQueries" setting, which is enabled in
https://github.com/dovecot/core/blob/master/doc/solr-schema-7.7.0.xml.
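
For illustration, the change being considered could be applied to a schema programmatically as sketched below. Everything here is an assumption: `SAMPLE_SCHEMA` is a made-up minimal schema (not the one from the dovecot repository), the field type is assumed to be named `text`, and turning off `autoGeneratePhraseQueries` at the same time reflects the guess that generated phrase queries cannot be served without positions, which is exactly the open question. Changing these options would presumably also require a full reindex.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal schema, for illustration only.
SAMPLE_SCHEMA = """\
<schema name="example">
  <fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="true"/>
  <fieldType name="string" class="solr.StrField"/>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="body" type="text" indexed="true" stored="false"/>
  <field name="subject" type="text" indexed="true" stored="false"/>
</schema>
"""


def omit_positions(schema_xml, text_type="text", disable_auto_phrase=True):
    """Return schema XML with omitPositions="true" on fields of text_type."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("type") == text_type:
            field.set("omitPositions", "true")
    if disable_auto_phrase:
        # Assumption: without positions, auto-generated phrase queries
        # should be switched off as well.
        for ftype in root.iter("fieldType"):
            if ftype.get("name") == text_type:
                ftype.set("autoGeneratePhraseQueries", "false")
    return ET.tostring(root, encoding="unicode")
```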
Thanks in advance,
Vincent Brillault
PS: I have attached the schema we are using for completeness. It's based
on the one in the dovecot repo, with a bit of simplification for headers
that don't really require as much massaging.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: schema.xml
Type: text/xml
Size: 3068 bytes
Desc: not available
URL:
<https://dovecot.org/pipermail/dovecot/attachments/20210804/6bafbde7/attachment.xml>
On 8/4/2021 1:24 AM, Vincent Brillault wrote:
> On a local dovecot cluster currently hosting roughly 2.1TB of data,
> using Solr as its FTS backend, we now have 256GB of data in Solr, split
> in 12 shard (to which replication adds 256GB of data through 12
> additional cores).
>
> I'm now trying to see if we can optimize that data. Looking at one core
> at random (22G), I see that the data is split mostly between
> - .pos files: 12G
> - .tim files: 4.2G
> - .doc files: 3.8G
> - .cfs files: 1.8G
>
> Looking around a bit, I found
> https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat.html
> (which is unfortunately a bit outdated I think) that explains each file
> content:
> - .tim: Term Dictionary
> - .tip: Term Index
> - .doc: Frequencies and Skip Data
> - .pos: Positions
> - .pay: Payloads and Offsets

This is completely off-topic for the dovecot list. I am involved with
the Solr project, so I can discuss it. My message will also be off
topic here.

You didn't say what version of Solr you're on. That document for Lucene
6.2.0 would be relevant for Solr 6.2.0. There are versions of that
document for all Lucene releases, which have been in lock-step with Solr
releases since one of the early 3.x versions. (Aside: Solr has been
split into its own top-level Apache project, so there is no longer a
guarantee moving forward that Solr X.Y.Z will be based on Lucene X.Y.Z.)

Not all of the Lucene file types will be involved on every install of
Solr. It will depend on the configuration.

The .cfs file is a file where all of the other file types for a segment
are compounded into a single file. Within that single file, each file
type will use the same format as it would if it had its own extension.
I'm not completely clear on when Lucene (under Solr's control) will
choose the CFS format, but I think it happens when the segments are
small, not large.

> Looking at Solr documentation on search
> (https://solr.apache.org/guide/8_6/the-standard-query-parser.html) it
> seems that position aware query are written as `"term1 term2"~[0-9]+`.
> Looking at the dovecot code
> (https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c),
> I don't see this kind of query being made, `~` only being used for fuzzy
> search.

Positions are required for a phrase query -- where the query text is in
double quotes. The number after ~ on a phrase query refers to phrase
slop -- think of it as a fuzziness factor for the phrase, not for each
term. As you noticed, dovecot's FTS Solr plugin doesn't currently use
phrase queries explicitly, but there's no guarantee that this will
always be the case. Position data will only be accessed if it is needed
for a query, so if it is not needed it should not affect query
performance. I cannot speak as to whether the FTS Solr plugin relies on
the autoGeneratePhraseQueries functionality, but if it does, then you
definitely want position data in the index. That functionality can do a
lot to improve relevancy ranking, so I would expect it to be
instrumental in good full-text searching -- disabling positions will
probably not help your search results.

If you want an in-depth discussion beyond this email, please subscribe
to the solr-user mailing list and ask there.

Note that general Solr recommendations are to have enough space
available that the Solr index can triple in size temporarily -- this is
to accommodate all possible scenarios for Lucene segment merging.
Running Solr on systems with limited disk space is not recommended.

Solr does have an "optimize" operation which will combine all the
segments into one, removing space taken up by deleted documents as it
works. Lucene calls that operation "forceMerge". Running an optimize
can help performance, but it's extremely resource intensive and can take
a long time to run -- performance gets worse before it gets better.
Also, the amount of performance gain is not usually significant.

Thanks,
Shawn
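
The disk-space guideline mentioned in the reply can be turned into a quick arithmetic check. This sketch reads "can triple in size temporarily" as meaning that free space of twice the current index size should be available on top of the index itself; that reading, the default factor, and the function names are assumptions made for illustration.

```python
import shutil


def required_free_bytes(index_bytes, growth_factor=3):
    """Extra space needed for the index to grow to growth_factor times its size."""
    return (growth_factor - 1) * index_bytes


def has_merge_headroom(index_bytes, free_bytes, growth_factor=3):
    """True if free_bytes covers the temporary growth during merging."""
    return free_bytes >= required_free_bytes(index_bytes, growth_factor)


def check_volume(path, index_bytes):
    """Apply the check to the filesystem that holds path."""
    return has_merge_headroom(index_bytes, shutil.disk_usage(path).free)
```

Under this reading, a 22G core would want about 44G free on its volume before a large merge or an optimize.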