Hi list

I have one OSS with hardware info like this:

CPU Intel(R) Xeon E5420 2.5 GHz
Chipset Intel 5000P
8 GB RAM

With this OSS, we are using 2 RAID-5 arrays as OSTs (each has 4 x 1.5 TB hard drives on an Adaptec 5805 RAID controller).

It worked quite smoothly before, but about 2 weeks ago I started seeing many warnings (I think they are warnings) like this in /var/log/messages:

Jan 25 08:41:23 OST6 kernel: Lustre: 9587:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 35s
Jan 25 08:41:34 OST6 kernel: Lustre: 9608:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 41s
Jan 25 08:41:34 OST6 kernel: Lustre: 9608:0:(filter_io_26.c:706:filter_commitrw_write()) Skipped 2 previous similar messages
Jan 25 08:41:35 OST6 kernel: Lustre: 9645:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 43s
Jan 25 08:58:10 OST6 kernel: Lustre: 9646:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 31s
Jan 25 08:59:39 OST6 kernel: Lustre: 9609:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 30s
Jan 25 09:01:05 OST6 kernel: Lustre: 9587:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 33s
Jan 25 09:03:23 OST6 kernel: Lustre: 9633:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 32s
Jan 25 09:11:25 OST6 kernel: Lustre: 9585:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 36s

I googled around and found that this can be a problem with oss_num_threads, so I brought it down to 64 (the formula I found in the 1.8 manual, thread_number = RAM * CPU cores / 128 MB, gives 256 for this machine):

options ost oss_num_threads=64

It still didn't help.

I thought these were only harmless warnings, but maybe I am wrong: our performance has gone down quite heavily (it may be for other reasons, but for now I only suspect the slow direct_io problem).

iostat -m 1 1
Linux 2.6.18-92.1.17.el5_lustre.1.8.0custom (OST6)      01/25/2010

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.02    2.86   25.01    0.00   72.10

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               1.30         0.01         0.00      11386       3469
sdb               1.30         0.01         0.00      11531       3469
sdc             131.50        12.40         0.26   11793218     249934
sdd             178.46        18.00         0.26   17124065     250334
md2               3.33         0.02         0.00      22915       2634
md1               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
drbd3           480.10        12.39         0.26   11789047     249639
drbd6           565.85        14.89         0.26   14168452     249211

So, could anyone please tell me whether these warnings impact our system performance or not? And if they do, please give me a solution or advice to resolve it.

Best regards
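For reference, a minimal sketch of the arithmetic behind that formula and where the option normally goes, assuming the single quad-core E5420 listed above; the /proc paths are the usual Lustre 1.8 layout and should be verified on your own build:

    # thread_number = RAM (MB) * CPU cores / 128 MB
    #   8192 MB * 4 cores / 128 MB = 256 threads (the value Lustre would pick itself)
    # capping it lower, e.g. in /etc/modprobe.conf on RHEL5:
    options ost oss_num_threads=64
    # the option only takes effect when the ost module is (re)loaded; the live
    # counts can be checked afterwards with:
    cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
    cat /proc/fs/lustre/ost/OSS/ost_io/threads_max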
Aaron Knister
2010-Jan-25 03:35 UTC
[Lustre-discuss] slow direct_io , slow journal .. in OST log
My best guess (and please correct me if I'm wrong) is that those messages are because the underlying block devices are slow to respond to I/O requests. It looks like you're using DRBD. What's your interconnect?

On Jan 24, 2010, at 9:42 PM, Lex wrote:

> Hi list
>
> I have one OSS with hardware info like this:
>
> CPU Intel(R) Xeon E5420 2.5 GHz
> Chipset Intel 5000P
> 8 GB RAM
>
> With this OSS, we are using 2 RAID-5 arrays as OSTs (each has 4 x 1.5 TB hard drives on an Adaptec 5805 RAID controller).
>
> [...]
>
> So, could anyone please tell me whether these warnings impact our system performance or not? And if they do, please give me a solution or advice to resolve it.
>
> Best regards
Thank you for your fast reply, Aaron

I'm using Gigabit Ethernet to synchronize data to our fail-over node. Is there something wrong with that? Tell me, please.

On Mon, Jan 25, 2010 at 10:35 AM, Aaron Knister <aaron.knister at gmail.com> wrote:

> My best guess (and please correct me if I'm wrong) is that those messages
> are because the underlying block devices are slow to respond to I/O
> requests. It looks like you're using DRBD. What's your interconnect?
>
> [...]
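To rule the replication link in or out, a quick hedged sketch of how the GigE path is often sanity-checked; the interface name and peer hostname are placeholders, not taken from the thread:

    cat /proc/drbd                       # connection state and any resync backlog
    ethtool eth1 | grep -i speed         # confirm the link really negotiated 1000 Mb/s
    # raw TCP throughput between the OSS and its failover peer:
    iperf -s                             # on the peer
    iperf -c failover-peer -t 30         # on this node

A GigE link tops out around 110-120 MB/s, so with DRBD replicating writes synchronously it can become the ceiling for OST write throughput; that is worth keeping in mind alongside the disk-side checks.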
Aaron Knister
2010-Jan-25 04:43 UTC
[Lustre-discuss] slow direct_io , slow journal .. in OST log
I don't necessarily think there's anything wrong with using DRBD or running it over gigabit ethernet. If you stop all I/O to the Lustre filesystem, what does an hdparm -t show on the sdc and drbd devices? Do you have any performance numbers for the drbd or underlying RAID devices?

On Jan 24, 2010, at 11:17 PM, Lex wrote:

> Thank you for your fast reply, Aaron
>
> I'm using Gigabit Ethernet to synchronize data to our fail-over node. Is there something wrong with that? Tell me, please.
>
> [...]
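A minimal sketch of how those numbers are often gathered, assuming the tests run on an idle (e.g. backup) node and only ever read from the devices; the device names simply follow the layout in the iostat output above:

    # sequential read from the raw RAID array and from the DRBD device on top of it
    hdparm -t /dev/sdc
    hdparm -t /dev/drbd3
    # direct-I/O reads that bypass the page cache, closer to what the OST does
    dd if=/dev/sdc   of=/dev/null bs=1M count=4096 iflag=direct
    dd if=/dev/drbd3 of=/dev/null bs=1M count=4096 iflag=direct

Comparing the raw array against the DRBD device shows how much overhead the replication layer adds; any write benchmarks should only ever be pointed at scratch storage, never a mounted OST.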
Erik Froese
2010-Jan-25 15:52 UTC
[Lustre-discuss] slow direct_io , slow journal .. in OST log
Is each OST's journal on its own physical disk? I've seen those messages when there isn't enough hardware dedicated to the journal device.

Erik

On Sun, Jan 24, 2010 at 11:43 PM, Aaron Knister <aaron.knister at gmail.com> wrote:

> I don't necessarily think there's anything wrong with using DRBD or running
> it over gigabit ethernet. If you stop all I/O to the Lustre filesystem, what
> does an hdparm -t show on the sdc and drbd devices? Do you have any
> performance numbers for the drbd or underlying RAID devices?
>
> [...]
I can't stop I/O to the Lustre filesystem as you suggested because it would bring our service down. Instead, I can use hdparm on our backup OST node; it has exactly the same hardware as the master one. This is the result:

hdparm -t /dev/sdc

/dev/sdc:
 Timing buffered disk reads:  1318 MB in  3.00 seconds = 439.01 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate ioctl for device

Is it helpful? About the performance numbers for the drbd or underlying RAID devices, could you please tell me exactly what info you want?

Many thanks

On Mon, Jan 25, 2010 at 11:43 AM, Aaron Knister <aaron.knister at gmail.com> wrote:

> I don't necessarily think there's anything wrong with using DRBD or running
> it over gigabit ethernet. If you stop all I/O to the Lustre filesystem, what
> does an hdparm -t show on the sdc and drbd devices? Do you have any
> performance numbers for the drbd or underlying RAID devices?
>
> [...]
Sorry Erik if I'm raising such a "bad" question, but could you tell me more about the OST journal device? I don't know what it is, and I haven't seen it before in the Lustre manual.

Best regards

On Mon, Jan 25, 2010 at 10:52 PM, Erik Froese <erik.froese at gmail.com> wrote:

> Is each OST's journal on its own physical disk? I've seen those messages
> when there isn't enough hardware dedicated to the journal device.
> Erik
>
> [...]
---------- Forwarded message ----------
From: Lex <lexluthor87 at gmail.com>
Date: Tue, Jan 26, 2010 at 9:05 AM
Subject: Re: [Lustre-discuss] slow direct_io , slow journal .. in OST log
To: Mark Hahn <hahn at mcmaster.ca>
Cc: Erik Froese <erik.froese at gmail.com>

On Tue, Jan 26, 2010 at 12:07 AM, Mark Hahn <hahn at mcmaster.ca> wrote:

>> Sorry Erik if I'm raising such a "bad" question, but could you tell me more
>> about the OST journal device? I don't know what it is, and I haven't seen it
>> before in the Lustre manual.
>
> an OST is an ext3-ish filesystem associated with some OST server processes.
> since it's basically ext3, the properties of ext3 apply, including the use
> of a journal, and the ability to use an "external" journal device.

I think we don't have any separate or external journal device like you describe. So could anyone tell me exactly what causes the "slow xxx" warnings in our OST logs that I described above? Does Lustre play any role in this situation, and how? What is the relationship between the journal device and my problem?

Best regards
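A quick, hedged way to confirm what Mark describes on a given OST, using the standard e2fsprogs tooling that ships with Lustre 1.8 (read-only against the OST block device; the device name is just an example taken from the iostat output earlier):

    dumpe2fs -h /dev/drbd3 2>/dev/null | grep -i journal
    # "Journal inode: 8"                -> internal journal, sharing the same RAID-5 spindles as the data
    # "Journal UUID" / "Journal device" -> an external journal device is configured

With only an internal journal, every synchronous OST write also does small journal commits on the same disks, which is one common source of "slow journal" style messages when the array is already busy.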
There was a problem with our RAID controller this morning: the RAID array was degraded (I reinstalled the hard drive and its state was "rebuilding" right after that), and in the messages log I saw many warnings like this:

Jan 26 13:59:13 OST6 kernel: Lustre: 9589:0:(filter.c:1363:filter_parent_lock()) lustre-OST0006: slow parent lock 49s
Jan 26 13:59:13 OST6 kernel: Lustre: 9589:0:(filter.c:1363:filter_parent_lock()) Skipped 10 previous similar messages
Jan 26 13:59:13 OST6 kernel: Lustre: 9638:0:(filter_io.c:398:filter_preprw_read()) lustre-OST0006: slow preprw_read setup 34s

I have never seen "slow parent lock" or "slow preprw_read" before. Maybe everything is going downhill fast and severely. Do you guys have any ideas about my situation? Just a guess, but I think it is being impacted by the performance drop during the RAID array rebuild.

Best regards

On Tue, Jan 26, 2010 at 9:05 AM, Lex <lexluthor87 at gmail.com> wrote:

> On Tue, Jan 26, 2010 at 12:07 AM, Mark Hahn <hahn at mcmaster.ca> wrote:
>
>> an OST is an ext3-ish filesystem associated with some OST server processes.
>> since it's basically ext3, the properties of ext3 apply, including the use
>> of a journal, and the ability to use an "external" journal device.
>
> I think we don't have any separate or external journal device like you
> describe. So could anyone tell me exactly what causes the "slow xxx" warnings
> in our OST logs that I described above?
>
> [...]
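A RAID-5 rebuild competes directly with client I/O, so slower service times are expected until it finishes. A hedged sketch of how the back-end latency is usually watched while it rebuilds (standard sysstat options, nothing Lustre-specific):

    # extended per-device statistics every 5 seconds: watch await (ms per request)
    # and %util on sdc/sdd and the drbd devices while the rebuild runs
    iostat -xm 5

If await stays in the hundreds of milliseconds or %util sits near 100% during the rebuild, the "slow ..." messages are simply Lustre reporting that back-end latency, and the controller's rebuild-priority setting (if the Adaptec tools expose one) is the main lever.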
Brian J. Murrell
2010-Jan-26 12:37 UTC
[Lustre-discuss] slow direct_io , slow journal .. in OST log
On Tue, 2010-01-26 at 14:12 +0700, Lex wrote:
> There was a problem with our RAID controller this morning, raid array
> was degraded ( I reinstalled the hard drive and its state was
> rebuilding right after that ) and from messages log, I saw many
> warnings like this:
>
> Jan 26 13:59:13 OST6 kernel: Lustre:
> 9589:0:(filter.c:1363:filter_parent_lock()) lustre-OST0006: slow
> parent lock 49s
> Jan 26 13:59:13 OST6 kernel: Lustre:
> 9589:0:(filter.c:1363:filter_parent_lock()) Skipped 10 previous
> similar messages
> Jan 26 13:59:13 OST6 kernel: Lustre:
> 9638:0:(filter_io.c:398:filter_preprw_read()) lustre-OST0006: slow
> preprw_read setup 34s

This is not surprising during a rebuild.

> I have never seen slow parent lock or preprw_read before. Maybe
> everything is going down fast and severely. Do all you guys have any
> ideas for my situation ?

About this and your previous message, I think somebody has already suggested that you are oversubscribing your disks. You need to tune the number of OST threads your OSSes are using so that they don't oversubscribe your disks. Details are in the manual.

b.

P.S. Please don't put the mailing list address in the BCC list. Doing that makes it less intuitive for people to reply back to the list.
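Brian's suggestion maps onto the same oss_num_threads option discussed earlier, plus a live ceiling that 1.8.x exposes. A minimal sketch, with the caveat that lowering threads_max does not stop threads that have already started until the OST is restarted (check the 1.8 manual for your point release):

    # how many I/O service threads are actually running vs. the current ceiling
    lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
    # lower the ceiling without editing modprobe.conf or reloading modules
    lctl set_param ost.OSS.ost_io.threads_max=64

Fewer threads means fewer large concurrent I/Os competing for the same 4-disk RAID-5 spindles, which is what "oversubscribing your disks" refers to.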
Hi all

I heard somewhere about an oversubscribing issue related to OST threads, but I just wonder why, even though I calculated it with the formula I found in the manual (thread_number = RAM * CPU cores / 128 MB - do correct me if there's something wrong with it, please), the oversubscription warnings still appear.

Maybe I have to choose my own value by trial and error, but is there any explanation for this situation?

@Erik: could you please describe your bottleneck problem with the journal device for me, in as much detail as possible?

On Tue, Jan 26, 2010 at 10:00 PM, Erik Froese <erik.froese at gmail.com> wrote:

> Sorry Lex I misread your email. I saw similar messages about my journal
> devices. The OST is an ext3+extra features filesystem. Each FS has an
> associated journal that CAN be on a separate device. It's supposed to speed
> up small file operations. Mine were oversubscribed and became a bottleneck.
>
> Erik
>
> On Mon, Jan 25, 2010 at 11:40 AM, Lex <lexluthor87 at gmail.com> wrote:
>
>> Sorry Erik if I'm raising such a "bad" question, but could you tell me more
>> about the OST journal device? I don't know what it is, and I haven't seen it
>> before in the Lustre manual.
>>
>> [...]
Hi guys, do you have any ideas about my issue and my question?

On Wed, Jan 27, 2010 at 8:59 AM, Lex <lexluthor87 at gmail.com> wrote:

> Hi all
>
> I heard somewhere about an oversubscribing issue related to OST threads, but
> I just wonder why, even though I calculated it with the formula I found in
> the manual (thread_number = RAM * CPU cores / 128 MB - do correct me if
> there's something wrong with it, please), the oversubscription warnings
> still appear.
>
> Maybe I have to choose my own value by trial and error, but is there any
> explanation for this situation?
>
> @Erik: could you please describe your bottleneck problem with the journal
> device for me, in as much detail as possible?
>
> [...]
Hi Erik

Thank you for your answer, but unfortunately I don't think I clearly understand your idea. It sounds like a tuning tip for ext3 on our disks, not for Lustre? (Do correct me if I'm misunderstanding you, please.)

I heard somewhere, and also think, that my problem is about the relationship between Lustre threads and hardware usage, and I am trying to find the exact value for my options in modprobe, but maybe there isn't any clear answer for my issue (given the way I have calculated the thread number on the OSTs). My problem may relate to the RAID rebuilding process, too.

I'll try changing the OST thread count to other values and hope it will give me some useful result. If I get some good news, I'll post it here for all you guys.

Once again, many thanks for helping me (especially you, Erik).

On Fri, Jan 29, 2010 at 11:11 PM, Erik Froese <erik.froese at gmail.com> wrote:

> I have a separate journal device for each OST. On the OSS I have
>
> /dev/dsk/ost00 and /dev/dsk/ostj00
> /dev/dsk/ost01 and /dev/dsk/ostj01
>
> and so on.
>
> The journals are very small (400 MB), but they do a lot of small random
> I/O. I had 8 journal devices sharing the same 2 physical disks. They were
> set up in a mirror. This was too many I/O ops for the disk to handle. We
> created another mirrored pair and moved half of the journal devices to the
> new pair. We should probably go even further and split them up more but
> it wastes space.
>
> We saw a lot of slow journal messages right before one of our OSTs crashed
> and became corrupted. We wound up removing the OST permanently after getting
> as many files off of it as we could.
>
> Erik
>
> [...]
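For completeness, a hedged sketch of how an external journal like the one Erik describes is typically set up for an ldiskfs OST. Note that this reformats the OST (all data on it is lost), the device names and the MGS NID below are placeholders, and the exact mkfs.lustre options should be checked against the 1.8 manual before use:

    # small, fast device (ideally its own mirrored spindles) to hold the journal;
    # the block size should match the OST filesystem (4096 here)
    mke2fs -O journal_dev -b 4096 /dev/sde1
    # format the OST so ldiskfs uses that device as its external journal
    mkfs.lustre --ost --fsname=lustre --mgsnode=192.168.1.10@tcp0 \
        --mkfsoptions="-J device=/dev/sde1" /dev/sdc

Keeping the journal off the RAID-5 data spindles takes the small synchronous journal commits away from the disks that serve the bulk I/O, which is exactly the bottleneck Erik ran into.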