Hello,
My Dovecot server is currently having intermittent issues when users try
to access their mailboxes.
When the issue occurs, the connection times out. I call it random
because its frequency varies and it does not depend on a particular
user or on the way they check their mailbox.
The duration of the problem is also random: sometimes it lasts less
than a minute, sometimes several hours.
When a user can't connect to their mailbox, they can generally still
receive their emails on their BlackBerry (I mention this only to show
that the problem doesn't seem to be tied to a particular account).
While the issue occurs, clients can't retrieve the headers from any
folder (INBOX, custom folders, Sent, ...) and can't read messages whose
headers the mail client already has.
During a timeout, users can still send emails (through the same jail,
although the client gets an error when it tries to save a copy to the
Sent folder) and can use, for example, the calendar server (on the same
machine).
Dovecot is running on FreeBSD 7.0 (32-bit) with 4 GB RAM, an Intel Xeon
quad-core @ 1.86 GHz, and 3 x 500 GB SATA-2 disks in RAID-5.
The box hosts several jails, including the mail jails (IMAP + SMTP,
ClamAV + SpamAssassin). The mail jails are fairly new (since August
2008) but had worked fine until the beginning of this year.
The server currently hosts 122 accounts.
My first thought was an I/O issue: maybe the disks are too busy, or
paging causes the timeouts. I checked with vmstat and top, but nothing
unusual showed up: there is always some free memory (between 90 and
300 MB) and very little paging, generally around 1 MB.
Page faults stay under a hundred; the few times they go above a hundred
(generally staying under 200), the next snapshot is back under 100. I
set the display to refresh every 5 seconds.
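To correlate the timeouts with the vmstat/top snapshots, it would help
to know exactly when the server stops answering. Below is a minimal
sketch in Python, assuming a plain IMAP listener on port 143 reachable
from the monitoring machine (the script name, host and port are
hypothetical). Every 5 seconds it opens a connection, waits for the
IMAP greeting, and prints a timestamped OK/FAIL line with the latency.
It only checks that Dovecot answers new connections; a hang that
appears only after login or when opening a folder would need a test
account and is not covered here.

```python
import socket
import sys
import time
from datetime import datetime


def probe_greeting(host, port=143, timeout=10.0):
    """Connect to an IMAP server and read its greeting line.

    Returns (ok, elapsed_seconds, detail). ok is True only when the first
    line is an untagged OK or PREAUTH response.
    """
    start = time.monotonic()
    try:
        # create_connection applies `timeout` to connect() and to later reads.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with sock.makefile("rb") as reader:
                line = reader.readline()
        ok = line.startswith((b"* OK", b"* PREAUTH"))
        detail = line.decode("ascii", "replace").strip() or "(connection closed)"
    except OSError as exc:  # refused, unreachable, or timed out
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return ok, time.monotonic() - start, detail


def monitor(host, port=143, interval=5.0, timeout=10.0, count=None):
    """Probe every `interval` seconds; print one timestamped line per probe."""
    done = 0
    while count is None or done < count:
        ok, elapsed, detail = probe_greeting(host, port, timeout)
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        status = "OK  " if ok else "FAIL"
        print(f"{stamp} {status} {elapsed:6.2f}s {detail}", flush=True)
        done += 1
        if count is None or done < count:
            time.sleep(interval)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: imap_probe.py HOST [PORT]", file=sys.stderr)
    else:
        monitor(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else 143)
```

It could be run from another machine, e.g. `python3 imap_probe.py
mail.example.org` (hypothetical hostname), with its output kept next to
the vmstat snapshots: a FAIL or a jump in latency gives an exact time to
look for in the Dovecot and system logs.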
I shut down all the jails not directly related to the mail service, but
the problem still occurred.
I also moved ClamAV and SpamAssassin away from IMAP and SMTP onto a
different box.
After that I reviewed the Dovecot configuration to lighten it, but the
only change I made was disabling fsync.
I upgraded the RAM; the added memory is being used, but nothing changed.
Our staff changes constantly; there were more of us (by around 10-15%)
before the problem started than there are now, which is why I don't
think it is related to the number of users.
I tried recreating some of the accounts affected by the issue, but the
problem came back.
On Friday I upgraded Dovecot from 1.1.7 to 1.1.11 (the original
installation predated 1.1.7; I upgraded to 1.1.7 some time ago).
I used a command script to log Thunderbird's IMAP activity; everything
looks fine, but there are no timestamps in the logs. So for now I can
only be sure that the problem is not caused by erroneous packets or data
being sent.
I read the Thunderbird (NSPR) logging documentation (
http://www.mozilla.org/projects/nspr/reference/html/prlog.html#25306 )
but there's no directive to add a timestamp to each log line.
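Since the NSPR log lines carry no time, one workaround is to add the
timestamps outside Thunderbird. Below is a minimal Python sketch,
assuming the IMAP log is written to a file (for example via the
NSPR_LOG_FILE environment variable); the script name is hypothetical.
It follows the file like `tail -f` and prefixes each new line with the
local time it was read, to the millisecond. The times reflect when a
line appeared in the file, not when Thunderbird generated it, so if the
log is buffered they are only approximate.

```python
import sys
import time
from datetime import datetime


def stamp(line, now=None):
    """Prefix one log line with a millisecond timestamp (local time)."""
    now = now if now is not None else datetime.now()
    text = line.rstrip("\r\n")
    return f"{now:%Y-%m-%d %H:%M:%S}.{now.microsecond // 1000:03d} {text}"


def follow(path, poll=0.2):
    """Yield complete lines appended to `path` after it is opened (like tail -f)."""
    with open(path, "r", errors="replace") as fh:
        fh.seek(0, 2)  # start at the current end of the file
        pending = ""
        while True:
            chunk = fh.readline()
            if not chunk:
                time.sleep(poll)  # nothing new yet; wait and retry
                continue
            pending += chunk
            if pending.endswith("\n"):  # only emit whole lines
                yield pending
                pending = ""


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: stamp_log.py LOGFILE", file=sys.stderr)
    else:
        for raw in follow(sys.argv[1]):
            print(stamp(raw), flush=True)
```

Started just before reproducing the problem (`python3 stamp_log.py
/path/to/imap.log`, hypothetical path), its output can be lined up with
the server-side logs and with the vmstat snapshots.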
The network is fine: I tried different configurations to make sure no
device is misbehaving.
Now I don't know what else to check to identify the issue.
If anyone has an idea I haven't mentioned here, or if I have
misinterpreted the data, I'd be glad to hear it.
Regards,
--
Bastien Semene
Network & Systems Administrator