On Wed, Oct 10, 2018 at 09:37:46AM +0300, Aki Tuomi wrote:
>
> On 09.10.2018 22:16, William Taylor wrote:
> > We have started seeing index corruption ever since we upgraded (we
> > believe) our imap servers from SL6 to Centos 7. Mail/Indexes are stored
> > on Netapps mounted via NFS. We have 2 lvs servers running surealived in
> > dr/wlc, 2 directors and 6 backend imap/pop servers.
> >
> > Most of the core dumps I've looked at for different users are like
> > "Backtrace 2" with some variations on folder path.
> >
> > This latest crash (Backtrace 1) is different from others I've seen.
> > It is also leaving 0byte files in the users .Drafts/tmp folder.
> >
> > # ls -s /var/spool/mail/15/00/user1/.Drafts/tmp | awk '{print $1}' |sort | uniq -c
> >    9692 0
> >       1 218600
> >
> > I believe the number of cores here is different from the number of tmp
> > files because this is when we moved the user to our debug server so we
> > could get the core dumps.
> > # ls -la /home/u/user1/core.* |wc -l
> > 8437
> >
> > Any help/insight would be greatly appreciated.
> >
> > Thanks,
> > William
> >
> >
> > OS Info:
> > CentOS Linux release 7.5.1804 (Core)
> > 3.10.0-862.14.4.el7.x86_64
> >
> > NFS:
> > # mount -t nfs |grep mail/15
> > 172.16.255.14:/vol/vol1/mail/15 on /var/spool/mail/15 type nfs
> > (rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nordirplus,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.255.14,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.16.255.14)
> >
> > Dovecot Info:
> > dovecot -n
> > # 2.1.17: /etc/dovecot/dovecot.conf
> >
>
> Hi!
>
> Thank you for your report, however, 2.1.17 is VERY old version of
> dovecot and this problem is very likely fixed in a more recent version.
>
> Aki
>

I realize it is an older release.

Are you saying that there is a bug in this version that affects RHEL 7.5
but not RHEL 6 or just use the newest version and maybe the problem goes
away?
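The zero-byte count quoted above comes from `ls -s`, which reports allocated blocks rather than byte length. As a rough sketch only, the same check could be done on byte size with `find`. This assumes a `find` that supports `-maxdepth` (GNU find, as shipped with CentOS 7) and file names without newlines; the function name `count_zero_byte_files` is hypothetical and not part of Dovecot.

```shell
# Hypothetical helper (not a Dovecot tool): print the number of regular
# files of exactly 0 bytes directly inside the given directory.
count_zero_byte_files() {
    dir=$1
    # -maxdepth 1: do not descend into subdirectories
    # -size 0c:    match files whose length is exactly 0 bytes
    # tr strips the padding some wc implementations add around the count
    find "$dir" -maxdepth 1 -type f -size 0c | wc -l | tr -d ' '
}

# Example call, using the path from the report:
#   count_zero_byte_files /var/spool/mail/15/00/user1/.Drafts/tmp
```

Running it periodically against a user's `tmp` directory would show whether new empty files keep appearing after each crash.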
On 10 October 2018 at 19:12 William Taylor <william.taylor@sonic.com> wrote:
> I realize it is an older release.
>
> Are you saying that there is a bug in this version that affects RHEL 7.5
> but not RHEL 6 or just use the newest version and maybe the problem goes
> away?

We have very limited interest in figuring out problems with (very) old dovecot versions. At minimum you need to show this problem with 2.2.36 or 2.3.2.1.

One thing you should make sure of is that you are not accessing the user with two different servers concurrently.

---
Aki Tuomi
On 10.10.2018 19:12, William Taylor wrote:
>>> OS Info:
>>> CentOS Linux release 7.5.1804 (Core)
>>> 3.10.0-862.14.4.el7.x86_64
>>>
>>> NFS:
>>> # mount -t nfs |grep mail/15
>>> 172.16.255.14:/vol/vol1/mail/15 on /var/spool/mail/15 type nfs
>>> (rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nordirplus,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.255.14,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=172.16.255.14)
>>>
>>> Dovecot Info:
>>> dovecot -n
>>> # 2.1.17: /etc/dovecot/dovecot.conf
>>>
>> Hi!
>>
>> Thank you for your report, however, 2.1.17 is VERY old version of
>> dovecot and this problem is very likely fixed in a more recent version.
>>
>> Aki
>>
> I realize it is an older release.
>
> Are you saying that there is a bug in this version that affects RHEL 7.5
> but not RHEL 6 or just use the newest version and maybe the problem goes
> away?

I can see from my CentOS 7 installation that it comes with the 2.2.10-8.el7 package. Did you install 2.1.17 specifically somehow?

I'm using dovecot 2.3.3 as packaged by the developers in CentOS 7 myself.

Good luck,
Reio
On 10/10/18 7:26 AM, Aki Tuomi wrote:
>> Are you saying that there is a bug in this version that affects RHEL 7.5
>> but not RHEL 6 or just use the newest version and maybe the problem goes
>> away?
>
> We have very limited interest in figuring out problems with (very) old
> dovecot versions. At minimum you need to show this problem with 2.2.36
> or 2.3.2.1.
>
> A thing you should make sure is that you are not accessing the user with
> two different servers concurrently.

The directors appear to be working fine so, no, users aren't hitting multiple back end servers.

To be clear, we don't suspect Dovecot as much - our deployment had been stable for years - but rather behavior changes between the RHEL6 and RHEL7 environments, particularly with regard to NFSv3. But we have been at a loss to find a smoking gun.

For various reasons, achieving stability (again) on the current version is very important while we continue to plan Dovecot and storage backend upgrades. Corruption leading to crashes is very infrequent percentage-wise, but it's enough to negatively impact performance and users: out of 5+ million sessions/day we're seeing ~5 instances, whereas on RHEL6 it would have been one every few months.

Has anyone else experienced any NFS/locking issues transitioning from RHEL6 to 7 with Netapp storage? Grasping at straws - perhaps compiler and/or system library issues interacting with Dovecot?

-K