Hi all,
Has anyone implemented larger Dovecot+Solr clusters and would be willing to give
some details about how it works for you? My understanding about it so far is:
- SolrCloud isn't usable with Dovecot. Replication isn't useful, because nobody
wants to pay for double the disk space for indexes that could be regenerated
anyway. Autosharding isn't very useful either, because I think the shard keys
could be created in two possible ways: a) Mails would be distributed entirely
randomly across the cluster. This would make updates efficient, because the
writes would be fully distributed across all servers. But I think it would also
make reads somewhat inefficient, since all the servers would have to be searched
and the results combined. Also, if a server is lost, there's no easy way to
reindex the missing data, because that server would contain a piece of pretty
much every user's data. b) Shard keys could be created so that the same user
would typically go to only 1-2 servers. It would be possible (at least in
theory) to find a broken server's list of users and reindex only their data,
but I'm not sure if this method is any easier than the non-SolrCloud setup.
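Option (b) sounds a lot like Solr's compositeId routing, where a shard key
prefix in the document ID ("shardkey!docid") makes all documents with the same
prefix hash to the same shard range. A minimal sketch of the idea — the ID
layout and the hash below are my own illustration, not what the Dovecot
fts-solr plugin actually emits:

```python
import hashlib

def solr_doc_id(user: str, mailbox_guid: str, uid: int) -> str:
    """Build a compositeId-routed document ID: everything before '!'
    is the shard key, so all of one user's mail hashes to the same
    shard range and searches can be routed to just those shards."""
    return f"{user}!{mailbox_guid}/{uid}"

def route_hash(shard_key: str) -> int:
    """Stand-in for Solr's routing hash: same shard key in, same
    value out, hence the same shard for all of a user's documents."""
    return int(hashlib.md5(shard_key.encode()).hexdigest()[:8], 16)

# All of one user's mails carry the same shard key:
ids = [solr_doc_id("jane@example.com", "3a1f", uid) for uid in (1, 2, 3)]
assert len({i.split("!", 1)[0] for i in ids}) == 1
```

With routing like this a broken server's user list is at least knowable, which
is what would make selective reindexing possible in theory.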
- Without SolrCloud you'd then need to shard the data manually. This would be
easy enough to do by just assigning different users to different shards. But at
some point the index is going to become too large and you need to add more
shards and move some existing users to them. To keep the search performance good
during the move, I guess this could be done with a script that does: 1) reindex
user to new shard, 2) update userdb to point to new shard, 3) delete user from
old shard, 4) doveadm fts rescan the user to remove any mails already deleted
during the reindexing.
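The move steps above could be sketched roughly like this. The shard URLs, the
userdb update command, and the exact doveadm override syntax are assumptions to
adapt to the local setup; the script only builds the command list (a dry run),
it doesn't execute anything:

```python
import hashlib

# Example shard URLs (assumption - substitute the real instances):
SHARDS = ["http://solr1:8983/solr/dovecot",
          "http://solr2:8983/solr/dovecot"]

def shard_for_user(user: str, n_shards: int) -> int:
    """Stable user -> shard assignment; md5 keeps the mapping
    consistent across runs, unlike Python's salted hash()."""
    return int(hashlib.md5(user.encode()).hexdigest(), 16) % n_shards

def migration_commands(user: str, old_shard: str, new_shard: str) -> list:
    """Steps 1-4 from the text as a list of shell commands."""
    return [
        # 1) reindex the user against the new shard
        f"doveadm -o plugin/fts_solr=url={new_shard}/ index -u {user} '*'",
        # 2) update userdb to point the user at the new shard
        f"update-userdb --user {user} --solr-url {new_shard}",  # site-specific placeholder
        # 3) delete the user's documents from the old shard
        f"curl '{old_shard}/update?commit=true' -H 'Content-Type: text/xml' "
        f"--data-binary '<delete><query>user:{user}</query></delete>'",
        # 4) drop index entries for mails deleted during the reindex
        f"doveadm fts rescan -u {user}",
    ]
```

Doing the userdb switch only after step 1 finishes is what keeps searches
working (against the old shard) for the whole duration of the move.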
- It seems that a Solr index shouldn't grow above 200 GB or the performance
gets too bad? I've seen this claim on a few web pages. So each server should
likely be running multiple separate Solr instances (shards).
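Taking that ~200 GB figure at face value (it's only the rumored limit from
those pages, not something I've verified), the shard count per total index
size is simple to estimate:

```python
import math

MAX_SHARD_GB = 200  # the rumored practical ceiling per shard (unverified)

def shards_needed(total_index_gb: float) -> int:
    """How many separate Solr instances/shards a given total index
    size would need to stay under the per-shard cap."""
    return max(1, math.ceil(total_index_gb / MAX_SHARD_GB))

print(shards_needed(1500))  # a 1.5 TB index -> 8 shards
```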
-
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond
recommends NFS (or I guess any kind of shared filesystem), which does seem to
make sense. Since Dovecot wants to get instantly updated search results after
indexing, I think it?s probably better not to separate the indexing and
searching servers.
- It would be interesting to know what kind of hardware your Solr servers
currently have, how well they're performing, and what the bottlenecks are. From
the above URL it appears that disk I/O comes first, but if there's enough of
that available then CPU usage is second. I'm not quite sure where most of the
memory goes - caching?
- I'm guessing users are doing relatively few searches compared to how many new
emails are being indexed/deleted all the time?