Hi,

We have a serious issue on our POP/IMAP servers these days. The load average of the servers spikes up to 400-500 (as reported by the uptime command) during particular time periods, mostly around noon and in the evening, but each spike lasts only a few minutes.

We have 2 servers running Dovecot 1.1.20 behind a load balancer; we use Keepalived (1.1.13) for load balancing.

Server specification:
Operating System : CentOS 5.5 64bit
CPU cores : 16
RAM : 8GB

Mail and indexes are mounted on NFS (NetApp).

Below is the dovecot -n output, followed by top results during a high spike.

#############################################################################################
# 1.1.20: /usr/local/etc/dovecot.conf
# OS: Linux 2.6.28 x86_64 CentOS release 5.5 (Final)
log_path: /var/log/dovecot-info.log
info_log_path: /var/log/dovecot-info.log
syslog_facility: local1
protocols: imap imaps pop3 pop3s
listen(default): *:143
listen(imap): *:143
listen(pop3): *:110
ssl_listen(default): *:993
ssl_listen(imap): *:993
ssl_listen(pop3): *:995
ssl_cert_file: /usr/local/etc/ssl/certs/dovecot.pem
ssl_key_file: /usr/local/etc/ssl/private/dovecot.pem
disable_plaintext_auth: no
login_dir: /usr/local/var/run/dovecot/login
login_executable(default): /usr/local/libexec/dovecot/imap-login
login_executable(imap): /usr/local/libexec/dovecot/imap-login
login_executable(pop3): /usr/local/libexec/dovecot/pop3-login
login_greeting: Welcome to Popserver.
login_process_per_connection: no
max_mail_processes: 1024
mail_max_userip_connections(default): 100
mail_max_userip_connections(imap): 100
mail_max_userip_connections(pop3): 50
verbose_proctitle: yes
first_valid_uid: 99
first_valid_gid: 99
mail_location: maildir:~/Maildir:INDEX=/indexes/%h:CONTROL=/indexes/%h
mmap_disable: yes
mail_nfs_storage: yes
mail_nfs_index: yes
lock_method: dotlock
mail_executable(default): /usr/local/libexec/dovecot/imap
mail_executable(imap): /usr/local/libexec/dovecot/imap
mail_executable(pop3): /usr/local/libexec/dovecot/pop3
mail_plugins(default): quota imap_quota
mail_plugins(imap): quota imap_quota
mail_plugins(pop3): quota
mail_plugin_dir(default): /usr/local/lib/dovecot/imap
mail_plugin_dir(imap): /usr/local/lib/dovecot/imap
mail_plugin_dir(pop3): /usr/local/lib/dovecot/pop3
pop3_no_flag_updates(default): no
pop3_no_flag_updates(imap): no
pop3_no_flag_updates(pop3): yes
pop3_lock_session(default): no
pop3_lock_session(imap): no
pop3_lock_session(pop3): yes
pop3_client_workarounds(default):
pop3_client_workarounds(imap):
pop3_client_workarounds(pop3): outlook-no-nuls
lda:
  postmaster_address: admin at research.com
  mail_plugins: cmusieve quota mail_log
  mail_plugin_dir: /usr/local/lib/dovecot/lda
  auth_socket_path: /var/run/dovecot/auth-master
auth default:
  worker_max_count: 15
  passdb:
    driver: sql
    args: /usr/local/etc/dovecot-mysql.conf
  userdb:
    driver: sql
    args: /usr/local/etc/dovecot-mysql.conf
  userdb:
    driver: prefetch
  socket:
    type: listen
    client:
      path: /var/run/dovecot/auth-client
      mode: 432
      user: nobody
      group: nobody
    master:
      path: /var/run/dovecot/auth-master
      mode: 384
      user: nobody
      group: nobody
plugin:
  quota_warning: storage=95%% /usr/local/bin/quota-warning.sh 95 %u
  quota_warning2: storage=80%% /usr/local/bin/quota-warning.sh 80 %u
  quota: maildir:storage=64
##############################################################################################

##############################################################################################
top - 12:08:31 up 206 days, 10:45,  3 users,  load average: 189.88, 82.07, 55.97
Tasks: 771 total,   1 running, 767 sleeping,   1 stopped,   2 zombie
Cpu(s):  8.3%us,  7.6%sy,  0.0%ni,  8.3%id, 75.0%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:  16279824k total, 11913788k used,  4366036k free,   334308k buffers
Swap:  4192956k total,        0k used,  4192956k free, 10359492k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  408 mysql     18   0  384m  38m 4412 S 52.8  0.2  42221:44 mysqld
29326 nobody    15   0 22688  10m 1112 D  3.9  0.1   0:00.05 imap
29313 nobody    16   0 14892 4892 1000 S  3.1  0.0   0:00.07 imap
28129 nobody    16   0 23688 4148 1140 S  2.3  0.0   0:00.10 imap
28936 nobody    16   0 25048  13m 1108 S  2.3  0.1   0:00.06 imap
29302 nobody    16   0 15168 3924 1000 S  2.3  0.0   0:00.05 imap
29307 nobody    16   0 12864 2892 1000 S  2.3  0.0   0:00.03 imap
21956 root      15   0 13272 1652  800 R  1.6  0.0   0:06.46 top
29119 nobody    16   0 26000 9324 1096 D  1.6  0.1   0:00.11 imap
29158 nobody    16   0 12684 2728 1028 S  1.6  0.0   0:00.03 imap
29204 nobody    16   0 14400 4232 1000 S  1.6  0.0   0:00.08 imap
29262 nobody    16   0 25436 2776 1120 S  1.6  0.0   0:00.02 imap
 3397 root      10  -5     0    0    0 S  0.8  0.0  28:02.90 rpciod/2
 3401 root      10  -5     0    0    0 S  0.8  0.0 337:30.77 rpciod/6
 4218 root      16   0  8948 1548  588 S  0.8  0.0   8:05.53 dovecot
10214 root      15   0 50868 1968 1468 S  0.8  0.0   0:30.08 dovecot-auth
27454 nobody    16   0 19220 5048 1256 D  0.8  0.0   0:00.17 imap
27820 nobody    16   0 18916 4704 1252 D  0.8  0.0   0:00.08 imap
27925 nobody    16   0 11180 1524 1000 S  0.8  0.0   0:00.07 imap
28094 nobody    16   0 11536 2068 1148 S  0.8  0.0   0:00.13 imap
28150 nobody    16   0 12528 2396 1104 S  0.8  0.0   0:00.08 imap
28196 nobody    16   0 13976 3792 1148 S  0.8  0.0   0:00.21 imap
28292 nobody    16   0 11984 1808 1004 S  0.8  0.0   0:00.04 imap
28343 nobody    16   0 11584 2052 1152 D  0.8  0.0   0:00.02 imap
28352 nobody    15   0 11108 1492 1004 S  0.8  0.0   0:00.04 imap
28413 nobody    16   0 11220 1568  996 S  0.8  0.0   0:00.02 imap
28434 nobody    16   0 12492 2444 1004 S  0.8  0.0   0:00.05 imap
28467 nobody    16   0 11852 2044 1004 D  0.8  0.0   0:00.06 imap
28558 nobody    16   0 11904 2316 1168 S  0.8  0.0   0:00.02 imap
28643 nobody    16   0 11312 1688 1000 S  0.8  0.0   0:00.04 imap
28711 nobody    15   0 12048 1868 1004 S  0.8  0.0   0:00.04 imap
28779 nobody    16   0 23392  10m 1000 D  0.8  0.1   0:00.05 imap
############################################################################################

Regards,
Kavish Karkera
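Since the spikes last only a few minutes, it may help to capture process state automatically the moment the load crosses a threshold, rather than trying to catch one by hand. A minimal sketch using only standard tools (the threshold and log path below are arbitrary placeholders, not values from this setup):

#!/bin/sh
# Sample the load every 10 seconds; when the 1-minute load average
# exceeds THRESHOLD, snapshot top plus any D-state (uninterruptible
# sleep) processes, which is where iowait problems usually show up.
THRESHOLD=50                    # placeholder trigger value
LOG=/var/log/load-spike.log     # placeholder log path

while true; do
    load=$(awk '{ print int($1) }' /proc/loadavg)
    if [ "$load" -ge "$THRESHOLD" ]; then
        {
            date
            uptime
            top -b -n 1 | head -40
            ps -eo stat,pid,wchan:30,cmd | grep '^D'
        } >> "$LOG"
    fi
    sleep 10
done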
Hi,

If you run the following command while the server is under high load:

# ps -ostat,pid,time,wchan='WCHAN-xxxxxxxxxxxxxxxxxxxx',cmd ax | grep D

do you get back something like this?

STAT   PID     TIME WCHAN-xxxxxxxxxxxxxxxxxxxx CMD
D    18713 00:00:00 synchronize_srcu           dovecot/imap
D    18736 00:00:00 synchronize_srcu           dovecot/imap
D    18775 00:00:05 synchronize_srcu           dovecot/imap
D    20330 00:00:00 synchronize_srcu           dovecot/imap
D    20357 00:00:00 synchronize_srcu           dovecot/imap
D    20422 00:00:00 synchronize_srcu           dovecot/imap
D    20687 00:00:00 synchronize_srcu           dovecot/imap
S+   20913 00:00:00 pipe_wait                  grep D

If yes, it could be a problem with inotify in your kernel. You can try disabling inotify in the kernel with:

echo 0 > /proc/sys/fs/inotify/max_user_watches
echo 0 > /proc/sys/fs/inotify/max_user_instances

Full article: http://thread.gmane.org/gmane.linux.kernel/1315430

For me this resolved the problem; the load went down to < 1.00.

Regards,
Urban

On 21.08.2013 12:37, Kavish Karkera wrote:
> Hi,
>
> We have a serious issue on our POP/IMAP servers these days. The load average
> of the servers spikes up to 400-500 (as reported by the uptime command)
> during particular time periods, mostly around noon and in the evening, but
> each spike lasts only a few minutes.
>
> We have 2 servers running Dovecot 1.1.20 behind a load balancer; we use
> Keepalived (1.1.13) for load balancing.
>
> [dovecot -n output and top results snipped]
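One note on the workaround above: values written under /proc do not survive a reboot. A minimal sketch for making the same settings persistent, assuming the stock sysctl mechanism and the standard fs.inotify keys:

# Apply immediately (equivalent to the echo commands above):
sysctl -w fs.inotify.max_user_watches=0
sysctl -w fs.inotify.max_user_instances=0

# Persist across reboots via /etc/sysctl.conf:
cat >> /etc/sysctl.conf <<'EOF'
fs.inotify.max_user_watches = 0
fs.inotify.max_user_instances = 0
EOF
sysctl -p    # re-read /etc/sysctl.conf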
On 8/21/2013 5:37 AM, Kavish Karkera wrote:
> We have a serious issue on our POP/IMAP servers these days. The load average
> of the servers spikes up to 400-500 (as reported by the uptime command)
> during particular time periods, mostly around noon and in the evening, but
> each spike lasts only a few minutes.
>
> We have 2 servers running Dovecot 1.1.20 behind a load balancer; we use
> Keepalived (1.1.13) for load balancing.
>
> Server specification:
> Operating System : CentOS 5.5 64bit
> CPU cores : 16
> RAM : 8GB
>
> Mail and indexes are mounted on NFS (NetApp).
...
> Cpu(s):  8.3%us,  7.6%sy,  0.0%ni,  8.3%id, 75.0%wa,  0.0%hi,  0.8%si,  0.0%st
                                              ^^^^^^^
...
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   408 mysql     18   0  384m  38m 4412 S 52.8  0.2  42221:44 mysqld

This doesn't seem to be a dovecot issue. mysql apparently has 8 (or more) threads on 8 cores, all blocking on IO. I see a few possible causes:

1. The NetApp is unable to keep up with the request rate because:

   a. There are too few spindles in the RAID set backing this NFS volume,
      and/or the file(s) aren't properly striped across all spindles.

   b. An inappropriate RAID level is in use. The mysql job is apparently
      doing large table updates, and you're experiencing massive
      read-modify-write (RMW) latency from RAID5/6. This is why one should
      never put a transactional database, or one that sees large frequent
      table updates, on a parity RAID volume, unless the disks are SSDs.
      SSDs have no mechanical parts, so RMW latency is almost nonexistent.

2. Apparently 8 (or more) threads are concurrently accessing the same file
   or files, so the massive iowait could simply be the result of filesystem
   and/or NFS locking, NFS client caching issues, etc.

The cause of the massive iowait could be one or all of the above, or something else entirely; these are merely the typical causes. You seem to have a database job scheduled to run twice daily that triggers the problem. Identify this job, figure out what it does, why it does it, how necessary it is, and whether it can be scheduled to run at off-peak hours. If it can, you may want to simply do that, as fixing the IO latency problem itself may be expensive in hardware and/or labor dollars.

-- 
Stan
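To identify the twice-daily job, it may help to look at what mysqld is actually executing when a spike hits. A few quick checks (a sketch; the credentials and slow-log path are placeholders, not taken from the configuration above):

# 1. Look for cron jobs scheduled around noon/evening that touch MySQL:
grep -ri -e mysql -e dump /etc/crontab /etc/cron.d /var/spool/cron 2>/dev/null

# 2. During a spike, list the statements mysqld is running right now:
mysql -u root -p -e 'SHOW FULL PROCESSLIST;'

# 3. Check the slow query log for statements coinciding with the spikes
#    (enable/locate it via the log_slow_queries setting on MySQL 5.0;
#    the path below is a placeholder):
tail -n 100 /var/log/mysqld-slow.log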