Hi,
We have a serious issue on our POP/IMAP servers these days. The load
average of a server, as reported by the uptime command, spikes up to
400-500 during particular time periods, mostly around noon and in the
evening, but each spike lasts only a few minutes.

We have 2 servers running Dovecot 1.1.20 behind a load balancer; we use
keepalived (1.1.13) for the load balancing.
Server specification:
Operating System : CentOS 5.5 64bit
CPU cores : 16
RAM : 8GB
Mail and Indexes are mounted on NFS (NetApp).
Below is the output of dovecot -n, followed by top output captured during a high spike.
#############################################################################################
# 1.1.20: /usr/local/etc/dovecot.conf
# OS: Linux 2.6.28 x86_64 CentOS release 5.5 (Final)
log_path: /var/log/dovecot-info.log
info_log_path: /var/log/dovecot-info.log
syslog_facility: local1
protocols: imap imaps pop3 pop3s
listen(default): *:143
listen(imap): *:143
listen(pop3): *:110
ssl_listen(default): *:993
ssl_listen(imap): *:993
ssl_listen(pop3): *:995
ssl_cert_file: /usr/local/etc/ssl/certs/dovecot.pem
ssl_key_file: /usr/local/etc/ssl/private/dovecot.pem
disable_plaintext_auth: no
login_dir: /usr/local/var/run/dovecot/login
login_executable(default): /usr/local/libexec/dovecot/imap-login
login_executable(imap): /usr/local/libexec/dovecot/imap-login
login_executable(pop3): /usr/local/libexec/dovecot/pop3-login
login_greeting: Welcome to Popserver.
login_process_per_connection: no
max_mail_processes: 1024
mail_max_userip_connections(default): 100
mail_max_userip_connections(imap): 100
mail_max_userip_connections(pop3): 50
verbose_proctitle: yes
first_valid_uid: 99
first_valid_gid: 99
mail_location: maildir:~/Maildir:INDEX=/indexes/%h:CONTROL=/indexes/%h
mmap_disable: yes
mail_nfs_storage: yes
mail_nfs_index: yes
lock_method: dotlock
mail_executable(default): /usr/local/libexec/dovecot/imap
mail_executable(imap): /usr/local/libexec/dovecot/imap
mail_executable(pop3): /usr/local/libexec/dovecot/pop3
mail_plugins(default): quota imap_quota
mail_plugins(imap): quota imap_quota
mail_plugins(pop3): quota
mail_plugin_dir(default): /usr/local/lib/dovecot/imap
mail_plugin_dir(imap): /usr/local/lib/dovecot/imap
mail_plugin_dir(pop3): /usr/local/lib/dovecot/pop3
pop3_no_flag_updates(default): no
pop3_no_flag_updates(imap): no
pop3_no_flag_updates(pop3): yes
pop3_lock_session(default): no
pop3_lock_session(imap): no
pop3_lock_session(pop3): yes
pop3_client_workarounds(default):
pop3_client_workarounds(imap):
pop3_client_workarounds(pop3): outlook-no-nuls
lda:
postmaster_address: admin@research.com
mail_plugins: cmusieve quota mail_log
mail_plugin_dir: /usr/local/lib/dovecot/lda
auth_socket_path: /var/run/dovecot/auth-master
auth default:
worker_max_count: 15
passdb:
driver: sql
args: /usr/local/etc/dovecot-mysql.conf
userdb:
driver: sql
args: /usr/local/etc/dovecot-mysql.conf
userdb:
driver: prefetch
socket:
type: listen
client:
path: /var/run/dovecot/auth-client
mode: 432
user: nobody
group: nobody
master:
path: /var/run/dovecot/auth-master
mode: 384
user: nobody
group: nobody
plugin:
quota_warning: storage=95%% /usr/local/bin/quota-warning.sh 95 %u
quota_warning2: storage=80%% /usr/local/bin/quota-warning.sh 80 %u
quota: maildir:storage=64
##############################################################################################
##############################################################################################
top - 12:08:31 up 206 days, 10:45, 3 users, load average: 189.88, 82.07, 55.97
Tasks: 771 total, 1 running, 767 sleeping, 1 stopped, 2 zombie
Cpu(s): 8.3%us, 7.6%sy, 0.0%ni, 8.3%id, 75.0%wa, 0.0%hi, 0.8%si, 0.0%st
Mem: 16279824k total, 11913788k used, 4366036k free, 334308k buffers
Swap: 4192956k total, 0k used, 4192956k free, 10359492k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
408 mysql 18 0 384m 38m 4412 S 52.8 0.2 42221:44 mysqld
29326 nobody 15 0 22688 10m 1112 D 3.9 0.1 0:00.05 imap
29313 nobody 16 0 14892 4892 1000 S 3.1 0.0 0:00.07 imap
28129 nobody 16 0 23688 4148 1140 S 2.3 0.0 0:00.10 imap
28936 nobody 16 0 25048 13m 1108 S 2.3 0.1 0:00.06 imap
29302 nobody 16 0 15168 3924 1000 S 2.3 0.0 0:00.05 imap
29307 nobody 16 0 12864 2892 1000 S 2.3 0.0 0:00.03 imap
21956 root 15 0 13272 1652 800 R 1.6 0.0 0:06.46 top
29119 nobody 16 0 26000 9324 1096 D 1.6 0.1 0:00.11 imap
29158 nobody 16 0 12684 2728 1028 S 1.6 0.0 0:00.03 imap
29204 nobody 16 0 14400 4232 1000 S 1.6 0.0 0:00.08 imap
29262 nobody 16 0 25436 2776 1120 S 1.6 0.0 0:00.02 imap
3397 root 10 -5 0 0 0 S 0.8 0.0 28:02.90 rpciod/2
3401 root 10 -5 0 0 0 S 0.8 0.0 337:30.77 rpciod/6
4218 root 16 0 8948 1548 588 S 0.8 0.0 8:05.53 dovecot
10214 root 15 0 50868 1968 1468 S 0.8 0.0 0:30.08 dovecot-auth
27454 nobody 16 0 19220 5048 1256 D 0.8 0.0 0:00.17 imap
27820 nobody 16 0 18916 4704 1252 D 0.8 0.0 0:00.08 imap
27925 nobody 16 0 11180 1524 1000 S 0.8 0.0 0:00.07 imap
28094 nobody 16 0 11536 2068 1148 S 0.8 0.0 0:00.13 imap
28150 nobody 16 0 12528 2396 1104 S 0.8 0.0 0:00.08 imap
28196 nobody 16 0 13976 3792 1148 S 0.8 0.0 0:00.21 imap
28292 nobody 16 0 11984 1808 1004 S 0.8 0.0 0:00.04 imap
28343 nobody 16 0 11584 2052 1152 D 0.8 0.0 0:00.02 imap
28352 nobody 15 0 11108 1492 1004 S 0.8 0.0 0:00.04 imap
28413 nobody 16 0 11220 1568 996 S 0.8 0.0 0:00.02 imap
28434 nobody 16 0 12492 2444 1004 S 0.8 0.0 0:00.05 imap
28467 nobody 16 0 11852 2044 1004 D 0.8 0.0 0:00.06 imap
28558 nobody 16 0 11904 2316 1168 S 0.8 0.0 0:00.02 imap
28643 nobody 16 0 11312 1688 1000 S 0.8 0.0 0:00.04 imap
28711 nobody 15 0 12048 1868 1004 S 0.8 0.0 0:00.04 imap
28779 nobody 16 0 23392 10m 1000 D 0.8 0.1 0:00.05 imap
############################################################################################
Regards,
Kavish Karkera
Hi,

If you run the following command while the server is under high load:

# ps -ostat,pid,time,wchan='WCHAN-xxxxxxxxxxxxxxxxxxxx',cmd ax | grep D

do you get back something like this?

STAT   PID     TIME WCHAN-xxxxxxxxxxxxxxxxxxxx CMD
D    18713 00:00:00 synchronize_srcu           dovecot/imap
D    18736 00:00:00 synchronize_srcu           dovecot/imap
D    18775 00:00:05 synchronize_srcu           dovecot/imap
D    20330 00:00:00 synchronize_srcu           dovecot/imap
D    20357 00:00:00 synchronize_srcu           dovecot/imap
D    20422 00:00:00 synchronize_srcu           dovecot/imap
D    20687 00:00:00 synchronize_srcu           dovecot/imap
S+   20913 00:00:00 pipe_wait                  grep D

If yes, it could be a problem with inotify in your kernel.
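Since the spikes last only a few minutes, it can also help to log
D-state processes continuously and inspect the log after the fact. A
minimal sketch (the 5-second interval and the log path are arbitrary
choices, not from the command above):

  # Append a timestamped list of processes in uninterruptible sleep
  # (STAT beginning with "D") to a log file every 5 seconds.
  while true; do
      date
      ps -eo stat,pid,time,wchan:30,cmd | awk '$1 ~ /^D/'
      sleep 5
  done >> /var/tmp/dstate.log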
You can try to disable inotify in the kernel with:

echo 0 > /proc/sys/fs/inotify/max_user_watches
echo 0 > /proc/sys/fs/inotify/max_user_instances

Full article: http://thread.gmane.org/gmane.linux.kernel/1315430

For me this resolved the problem; the load went down to < 1.00.
Regards,
Urban

On 21.08.2013 12:37, Kavish Karkera wrote:
> [...]
On 8/21/2013 5:37 AM, Kavish Karkera wrote:
> We have a serious issue on our POP/IMAP servers these days. The load
> average of a server, as reported by the uptime command, spikes up to
> 400-500 during particular time periods, mostly around noon and in the
> evening, but each spike lasts only a few minutes.
> [...]
> Mail and Indexes are mounted on NFS (NetApp).
> [...]
> Cpu(s): 8.3%us, 7.6%sy, 0.0%ni, 8.3%id, 75.0%wa, 0.0%hi, 0.8%si, 0.0%st
                                          ^^^^^^^
>   PID USER  PR NI VIRT RES SHR  S %CPU %MEM    TIME+ COMMAND
>   408 mysql 18  0 384m 38m 4412 S 52.8  0.2 42221:44 mysqld

This doesn't seem to be a dovecot issue. mysql apparently has 8 (or
more) threads on 8 cores, all blocking on IO. I see a few possible
causes:

1. The NetApp is unable to keep up with the request rate because:

   a. There are too few spindles in the RAID set backing this NFS
      volume, and/or the file(s) aren't properly striped across all
      spindles.

   b. The RAID level is inappropriate. The mysql job is apparently
      doing large table updates and you're experiencing massive RMW
      (read-modify-write) latency from RAID5/6. This is why one should
      never put a transactional database, or one that sees large
      frequent table updates, on a parity RAID volume--unless the disks
      are SSD. SSDs have no mechanical parts, so RMW latency is almost
      nonexistent.

2. 8 (or more) threads are apparently accessing the same file or files
   concurrently, so the massive iowait could simply be the result of
   filesystem and/or NFS locking, NFS client caching issues, etc.

The cause of the massive iowait could be one or all of the above, or
could be something else entirely; these are just the typical causes.

You seem to have a database job scheduled to run twice daily that
triggers the problem. Identify this job, figure out what it does, why
it does it, how necessary it is, and whether it can be scheduled to run
at off-peak hours. If it can, you may want to simply reschedule it, as
fixing the IO latency problem itself may be expensive in hardware
and/or labor dollars.

-- 
Stan
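A starting point for identifying such a scheduled job is to list the
cron entries on the box and to capture what mysqld is executing while a
spike is in progress. A minimal sketch (the cron locations are the
CentOS defaults; the mysql invocation assumes working credentials, e.g.
via ~/.my.cnf):

  # Look for twice-daily scheduled jobs that might touch the database.
  crontab -l -u root
  crontab -l -u mysql
  ls /etc/cron.d /etc/cron.hourly /etc/cron.daily

  # While a spike is in progress, show the statements mysqld is running.
  mysql -e 'SHOW FULL PROCESSLIST;'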