Thank you, Aaron, for the quick answer. My comments are below:
On 27/07/2010 10:57, Aaron Thompson wrote:
> This looks like a disk issue: contention or wait time. It could be
> that the time needed to write that 80k message to all users'
> mailboxes is throttling your disk connection, or that it pushes some
> file size limit that moves the I/O into a larger set of blocks than
> smaller messages would use. It looks and sounds like you may be
> waiting for the disk to write those messages - I guess it depends on
> the size of *all*.
OK, I suspect the same. I intend to increase the block size from 2 KB to
4 KB and to split my 2 TB partition into 4-5 partitions of 400 GB each,
to share the load between the two main controllers of the storage
device. Do you think this would be a real improvement, or just more
overhead?
One question remains: is this contention caused by Debian (and its
I/O and OCFS2 layers) or by the storage device? I ran some I/O
benchmarks on Debian with OCFS2 and reached almost 100 MB/s! I know that
the benchmark profile differs from a mail environment (with a lot of
small files), but...
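
To get a number closer to the mail profile than a sequential benchmark gives, something like the sketch below could be run against the OCFS2 mount. It is only a rough illustration: the function name, the file count and the 2 KB payload are my own assumptions, and it forces every file to disk with fsync, which a real delivery agent may or may not do.

```python
import os
import tempfile
import time


def small_file_benchmark(directory, count=200, size=2048, sync=True):
    """Write `count` files of `size` bytes each; return files written per second."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"msg-{i:06d}")
        with open(path, "wb") as fh:
            fh.write(payload)
            if sync:
                # Force the data to the device, so the page cache
                # does not hide the real write latency.
                fh.flush()
                os.fsync(fh.fileno())
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # A temporary directory is only a placeholder here; pass dir=<mount point>
    # to TemporaryDirectory to exercise the filesystem under test.
    with tempfile.TemporaryDirectory() as target:
        rate = small_file_benchmark(target)
        print(f"{rate:.0f} small files/s")
```

Comparing the files/s figure on the OCFS2 volume with the same run on a local disk should show whether small synchronous writes, rather than raw bandwidth, are the limit.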
>
> Your load is a function of more than CPU - your IO Wait is in there
> somewhere also. I would suggest iostat, it may give you a better view
> of which disk is doing how much work. I believe this is packaged with
> a few other utilities as sysstat in Debian (I've been on RHEL for a
> while so make sure you check)
Today I have the 2 TB partition spread over 20 FC disks in a RAID 5
array. iostat didn't help much:
Device:        tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
dm-0       4637,25       6,99       2,07        7        2
dm-0       1491,18       2,91       0,00        2        0
dm-0       1535,51       2,58       0,41        2        0
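
One thing that can still be derived from these numbers is the average size of each request, by dividing the throughput by the transfers per second. The sketch below does that arithmetic for the three samples; the helper names are mine, and I am assuming that iostat counts 1 MB as 1024 KB.

```python
def parse_decimal(text):
    """Convert a number printed with a decimal comma (e.g. '4637,25') to float."""
    return float(text.replace(",", "."))


def avg_request_kb(tps, mb_read_s, mb_wrtn_s):
    """Average KB moved per transfer, assuming 1 MB = 1024 KB."""
    return (mb_read_s + mb_wrtn_s) * 1024 / tps


# (tps, MB_read/s, MB_wrtn/s) copied from the iostat samples for dm-0
SAMPLES = [
    ("4637,25", "6,99", "2,07"),
    ("1491,18", "2,91", "0,00"),
    ("1535,51", "2,58", "0,41"),
]

if __name__ == "__main__":
    for tps, rd, wr in SAMPLES:
        kb = avg_request_kb(parse_decimal(tps), parse_decimal(rd), parse_decimal(wr))
        print(f"{tps:>8} tps -> {kb:.2f} KB per request")
```

All three samples come out at about 2 KB per request, which matches the 2 KB block size. If that reading is right, the device is handling thousands of tiny operations per second while moving very little data, so the limit may be operations per second rather than bandwidth.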
Any other advice? Thanks again
Jeronimo
>
> Good Luck.
>
> @
>
> Aaron Thompson Applications Administrator / Database Administrator
> http://www.uni.edu/~prefect/ University of Northern Iowa
>
> "All it takes to fly is to hurl yourself at the ground... and miss."
> -Douglas Adams
>
> On 07/27/10 08:32, Jeronimo Bezerra wrote:
>> Hello all,
>>
>> I need some help understanding a situation with disk/OCFS2
>> performance. Let me introduce my environment:
>>
>> I use OCFS2 in a mail environment with almost 10k users, on an OCFS2
>> partition of 2 TB (~1 TB in use), holding a lot of small files with a
>> block size of 2 KB. It runs Debian Etch Linux on an IBM DS4500 storage
>> array with a QLA2340.
>>
>> A few weeks ago, I started to notice poor performance whenever a mail
>> is sent to all users (all-l), mainly when the e-mail is larger than
>> 80 KB (yes, I know, it shouldn't happen, but here we have friendly
>> fire!). This situation is new, because this environment is almost
>> 3 years old. When such an e-mail appears in my Postfix queue, after a
>> few seconds my load average goes to 100 -> 200 -> 300! Yesterday I put
>> these e-mails on hold in Postfix (postsuper -h ALL) and after one
>> minute the load average dropped to 2,31! One very strange thing is the
>> mpstat output at that moment of high load:
>>
>> 09:28:17 CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
>> 09:28:18 all    7,05    0,00    2,59   11,12    0,00    0,22    0,00   79,02  1788,12
>> 09:28:18   0   37,62    0,00   15,84   41,58    0,00    3,96    0,00    0,99  1790,10
>> 09:28:18   1    2,97    0,00    4,95    5,94    0,00    0,00    0,00   92,08     0,00
>> 09:28:18   2    0,00    0,00    1,98    6,93    0,00    0,00    0,00  112,87     0,00
>> 09:28:18   3    0,00    0,00    0,99    4,95    0,00    0,00    0,00  158,42     0,00
>> 09:28:18   4    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,99     0,00
>> 09:28:18   5    0,99    0,00    1,98   31,68    0,00    0,00    0,00   70,30     0,00
>> 09:28:18   6    0,00    0,00    0,00    0,00    0,00    0,00    0,00  185,15     0,00
>> 09:28:18   7    0,00    0,00    0,00    0,00    0,00    0,00    0,00   52,48     0,00
>> 09:28:18   8   29,70    0,00    7,92   57,43    0,00    0,00    0,00    6,93     0,00
>> 09:28:18   9    2,97    0,00    5,94   43,56    0,00    0,00    0,00   50,50     0,00
>> 09:28:18  10   47,52    0,00    3,96    1,98    0,00    0,00    0,00   54,46     0,00
>> 09:28:18  11    0,00    0,00    0,00    3,96    0,00    0,00    0,00   99,01     0,00
>> 09:28:18  12    0,00    0,00    0,00    0,00    0,00    0,00    0,00   99,01     0,00
>> 09:28:18  13    3,96    0,00    1,98    0,00    0,00    0,00    0,00   99,01     0,00
>> 09:28:18  14    0,00    0,00    0,00    0,00    0,00    0,00    0,00  138,61     0,00
>> 09:28:18  15    0,00    0,00    0,00    1,98    0,00    0,00    0,00   99,01     0,00
>>
>> 09:31:44 CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
>> 09:31:45 all    1,10    0,00    2,88   11,22    0,00    0,25    0,00   84,55  1811,76
>> 09:31:45   0    6,86    0,00   13,73   69,61    0,00    3,92    0,00    5,88  1810,78
>> 09:31:45   1    0,98    0,00    2,94    2,94    0,00    0,00    0,00   96,08     0,00
>> 09:31:45   2    0,98    0,00    1,96    9,80    0,00    0,00    0,00   90,20     0,00
>> 09:31:45   3    0,00    0,00    1,96    1,96    0,00    0,00    0,00   94,12     0,00
>> 09:31:45   4    0,98    0,00    0,00    0,00    0,00    0,00    0,00   99,02     0,00
>> 09:31:45   5    0,00    0,00    0,98    0,98    0,00    0,00    0,00   97,06     0,00
>> 09:31:45   6    0,00    0,00    2,94    4,90    0,00    0,00    0,00   95,10     0,00
>> 09:31:45   7    0,00    0,00    1,96    9,80    0,00    0,00    0,00   86,27     0,00
>> 09:31:45   8    1,96    0,00    5,88   50,00    0,00    0,00    0,00   41,18     0,00
>> 09:31:45   9    1,96    0,00    0,98    0,98    0,00    0,00    0,00   92,16     0,00
>> 09:31:45  10    0,98    0,00    2,94    8,82    0,00    0,00    0,00   84,31     0,00
>> 09:31:45  11    2,94    0,00    1,96    1,96    0,00    0,00    0,00   94,12     0,00
>> 09:31:45  12    0,00    0,00    1,96    0,98    0,00    0,00    0,00   97,06     0,00
>> 09:31:45  13    0,00    0,00    1,96    0,98    0,00    0,00    0,00   94,12     0,00
>> 09:31:45  14    0,00    0,00    1,96    7,84    0,00    0,00    0,00   95,10     0,00
>> 09:31:45  15    0,00    0,00    0,98    7,84    0,00    0,00    0,00   93,14     0,00
>>
>> I don't understand why only one CPU (of the 16) is at 100%
>> utilization at the moment of high load average, and why mpstat shows
>> that CPU 0 alone takes almost all interrupts/s. In htop, only CPU 0
>> shows high utilization, which is strange to me. At that moment the
>> DS4500 looks normal and shows about 7-8 MB/s from my mail host.
>>
>> So, how can I find out why my server has this bottleneck? Any
>> help would be appreciated.
>>
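
One way to dig into the interrupt question above is to total the counters in /proc/interrupts per CPU and see which lines account for the imbalance. The sketch below is a minimal parser for that file; the function name is mine, and it assumes the usual layout of one header line of CPU names followed by one line per interrupt source.

```python
import os


def per_cpu_interrupt_totals(text):
    """Sum the per-CPU counters of /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header line: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        fields = line.split()
        # fields[0] is the label ("16:", "NMI:", ...); the counters follow it.
        for i, value in enumerate(fields[1:1 + len(cpus)]):
            if not value.isdigit():
                break  # reached the description text
            totals[i] += int(value)
    return dict(zip(cpus, totals))


if __name__ == "__main__":
    path = "/proc/interrupts"
    if os.path.exists(path):
        with open(path) as fh:
            for cpu, total in per_cpu_interrupt_totals(fh.read()).items():
                print(f"{cpu:>6}: {total}")
```

If the totals confirm that CPU 0 handles nearly everything, the per-IRQ affinity masks under /proc/irq/<n>/smp_affinity are probably the next place to look, although I have not checked how this kernel assigns them.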
>> Thank you,
>>
>> Jeronimo Bezerra
>>
>>
>>
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>