Edward Walter
2010-Aug-09 13:44 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
Hello List,

We recently experienced a power failure (and subsequent UPS failure) which caused our Lustre filesystem to shut down hard. We were able to bring it back online but started seeing errors where the OSTs were being remounted read-only. We observed that all of the read-only OSTs were reporting an I/O error on the same block (the MMP block) and generating the following messages:

> Lustre: Server data-OST0004 on device /dev/sdd has started
> end_request: I/O error, dev sdd, sector 861112
> Buffer I/O error on device sdd, logical block 107639
> lost page write due to I/O error on sdd
> LDISKFS-fs error (device sdd): kmmpd: Error writing to MMP block
> end_request: I/O error, dev sdd, sector 0
> Buffer I/O error on device sdd, logical block 0
> lost page write due to I/O error on sdd
> LDISKFS-fs warning (device sdd): kmmpd: kmmpd being stopped since
> filesystem has been remounted as readonly.
> end_request: I/O error, dev sdd, sector 861112
> Buffer I/O error on device sdd, logical block 107639
> lost page write due to I/O error on sdd

We do have our OSTs set up for failover, but we manage access through the shared RAID array itself (using LUN fencing), so we don't need the MMP feature. We disabled MMP using tune2fs (tune2fs -O ^mmp /dev/sdd) on one set of OSTs. When we tried to mount these OSTs we received a message that the volume could not be mounted because MMP was not enabled.

We subsequently re-enabled MMP (tune2fs -O mmp /dev/sdd). Oddly, this did not return a message indicating the MMP interval or block number. Running 'tune2fs -l' indicates that MMP is enabled on the volume though. We also observed that the OST volumes we disabled MMP on are now indicating that MMP is enabled even though we did not re-enable it.

At this point we can mount the OST targets using ldiskfs in read-only mode. When we attempt to mount them as part of a Lustre volume we get the following error:

Aug 9 09:25:53 oss-0-25 kernel: LDISKFS-fs warning (device sdd): ldiskfs_multi_mount_protect: fsck is running on the filesystem
Aug 9 09:25:53 oss-0-25 kernel: LDISKFS-fs warning (device sdd): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1280954496, last update node: oss-0-25, last update device: /dev/sdd

We're not sure how to proceed at this point. It seems like all of the filesystem objects are present (df reports correct numbers). Has anyone seen this before and worked their way through getting things back online?

Note:
Lustre version = 1.6.6 (using Sun's RPMs)
OS = CentOS 5.2
Kernel = 2.6.18-92.1.10.el5_lustre.1.6.6smp

Thanks much.

-Ed Walter
Carnegie Mellon University
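For reference, the MMP metadata that kmmpd is complaining about can be read straight off the superblock. The exact field names are an assumption about what this Sun e2fsprogs build prints; newer builds show an "MMP block number" and "MMP update interval" line when the feature is enabled:

  dumpe2fs -h /dev/sdd | grep -i mmp
  tune2fs -l /dev/sdd | grep -i mmp

If those fields are reported, the MMP block number should line up with the logical block (107639) in the I/O errors above.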
Ken Hornstein
2010-Aug-09 13:53 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
> We recently experienced a power failure (and subsequent UPS failure)
> which caused our Lustre filesystem to shut down hard. We were able to
> bring it back online but started seeing errors where the OSTs were being
> remounted as read-only. We observed that all of the read-only OSTs were
> reporting an I/O error on the same block (the MMP block) and generating
> the following message:
> [...]

I had a similar issue once, but the issue was that the MMP block was corrupted. What finally fixed it was running tune2fs -E clear-mmp. Maybe that might solve the problem?

--Ken
Edward Walter
2010-Aug-09 14:57 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
Hi Ken,

Thanks for the tip. This gives me an MMP error though:

[root@oss-0-25 log]# tune2fs -E clear-mmp /dev/sdd
tune2fs 1.40.11.sun1 (17-June-2008)
tune2fs: MMP: appears fsck currently being run on the filesystem while trying to open /dev/sdd
Couldn't find valid filesystem superblock.

At the risk of being obvious: we're not running any kind of fsck operation on this volume. Also, I can still mount this volume read-only using ldiskfs as the filesystem type, so I'm suspicious of the filesystem superblock message.

Thanks.

-Ed

Ken Hornstein wrote:
>> We recently experienced a power failure (and subsequent UPS failure)
>> [...]
>
> I had a similar issue once, but the issue was that the MMP block was
> corrupted. What finally fixed it was running tune2fs -E clear-mmp.
> Maybe that might solve the problem?
>
> --Ken
Ken Hornstein
2010-Aug-09 15:03 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
> This gives me an MMP error though:
> [root@oss-0-25 log]# tune2fs -E clear-mmp /dev/sdd
> tune2fs 1.40.11.sun1 (17-June-2008)
> tune2fs: MMP: appears fsck currently being run on the filesystem while
> trying to open /dev/sdd
> Couldn't find valid filesystem superblock.

Oh, I forgot ... did you try adding the -f flag? E.g.:

# tune2fs -f -E clear-mmp /dev/sdd

According to the tune2fs man page, when you use clear-mmp you also need the -f flag. Still being able to mount the filesystem read-only makes sense to me, since that wouldn't affect fsck being run.

--Ken
Edward Walter
2010-Aug-09 15:12 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
Using 'tune2fs -f -E clear-mmp' causes tune2fs to segfault:

tune2fs 1.40.11.sun1 (17-June-2008)

Bad options specified.

Extended options are separated by commas, and may take an argument which
is set off by an equals ('=') sign.

Valid extended options are:
        stride=<RAID per-disk chunk size in blocks>
        stripe-width=<RAID stride*data disks in blocks>
        test_fs
        ^test_fs
Segmentation fault

Did you use a newer version of tune2fs/e2fsprogs? Our current version is e2fsprogs-1.40.11.sun1-0redhat. Do you know if it's safe to rev up versions of e2fsprogs while running an older Lustre kernel revision (1.6.6)?

Thanks again.

-Ed

Ken Hornstein wrote:
>> This gives me an MMP error though:
>> [...]
>
> Oh, I forgot ... did you try adding the -f flag? E.g.:
>
> # tune2fs -f -E clear-mmp /dev/sdd
>
> According to the tune2fs man page, when you use clear-mmp you also need
> the -f flag. Still being able to mount the filesystem read-only would
> make sense to me, since that wouldn't affect fsck being run.
>
> --Ken
Andreas Dilger
2010-Aug-09 15:15 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
On 2010-08-09, at 11:12, Edward Walter wrote:
> Using 'tune2fs -f -E clear-mmp' causes tune2fs to segfault:
> [...]
>
> Did you use a newer version of tune2fs/e2fsprogs? Our current version is
> e2fsprogs-1.40.11.sun1-0redhat. Do you know if it's safe to rev up versions
> of e2fsprogs while running an older Lustre kernel revision (1.6.6)?

Running newer e2fsprogs is OK, and in fact a lot of issues w.r.t. MMP were fixed in newer releases.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
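A minimal sketch of the upgrade-and-retry sequence, assuming the OST is unmounted and that a newer Lustre-patched e2fsprogs RPM has already been downloaded for this distribution (the exact package name and version below are illustrative, not a specific recommendation):

  rpm -Uvh e2fsprogs-1.41.x.sun1-0redhat.x86_64.rpm   # newer build containing the MMP fixes
  tune2fs -f -E clear-mmp /dev/sdd                    # clear the stale MMP block left by the crash
  tune2fs -l /dev/sdd | grep -i mmp                   # confirm the MMP feature/state afterwards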
Ken Hornstein
2010-Aug-09 15:22 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
> Using 'tune2fs -f -E clear-mmp' causes tune2fs to segfault:

Ewww ... well, not sure what to tell you about that.

> Did you use a newer version of tune2fs/e2fsprogs? Our current version
> is e2fsprogs-1.40.11.sun1-0redhat. Do you know if it's safe to rev up
> versions of e2fsprogs while running an older Lustre kernel revision (1.6.6)?

I am using e2fsprogs-1.41.6.sun1-0suse ... and I know that is old.

I was going to say that I don't know if revving up e2fsprogs is okay, but I see that Andreas already answered that one. I can't be 100% sure that upgrading e2fsprogs _will_ solve your problem, but I think it's worth a shot.

--Ken
laotsao 老曹
2010-Aug-09 15:49 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
hi

I did go through the various Lustre downloads. It seems that 1.6.7, 1.8, and 1.8.0.1 all ship e2fsprogs-1.40.11-sun1, while 1.8.1.1 has 1.41.6.sun1. Hopefully that version is good for your CentOS 5.2 and kernel version.

regards

On 8/9/2010 11:22 AM, Ken Hornstein wrote:
>> Using 'tune2fs -f -E clear-mmp' causes tune2fs to segfault:
> Ewww ... well, not sure what to tell you about that.
>
> [...]
>
> I was going to say that I don't know if revving up e2fsprogs is okay, but
> I see that Andreas already answered that one. I can't be 100% sure that
> upgrading e2fsprogs _will_ solve your problem, but I think it's worth
> a shot.
>
> --Ken
laotsao 老曹
2010-Aug-09 16:47 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
http://downloads.lustre.org/public/tools/e2fsprogs/

On 8/9/2010 11:49 AM, laotsao 老曹 wrote:
> hi
> I did go through the various Lustre downloads. It seems that 1.6.7, 1.8,
> and 1.8.0.1 all ship e2fsprogs-1.40.11-sun1, while 1.8.1.1 has 1.41.6.sun1.
> Hopefully that version is good for your CentOS 5.2 and kernel version.
>
> regards
>
> [...]
Edward Walter
2010-Aug-09 18:11 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
Ok, we're making progress. We updated our e2fsprogs to e2fsprogs-1.41.10.sun2-0redhat. That let us clear the MMP blocks and mount our OSTs as part of our Lustre volume. :)

We're continuing to test things and seeing weird behavior when we run an ost-survey though. It looks as though the Lustre client is getting shuffled back and forth between OSS server pairs for our OSTs. The client times out connecting to the primary server, attempts to connect to the failover server (and fails because the OST is on the primary), and then reconnects to the primary server and finishes the survey. This behavior is not isolated to one particular OST (or client) and doesn't occur with every survey.

### Here's an example of the error we see on the client when this occurs:

[root@compute-2-7 ~]# lfs check servers
data-MDT0000-mdc-ffff81041f9b2c00 active.
data-OST0000-osc-ffff81041f9b2c00 active.
data-OST0001-osc-ffff81041f9b2c00 active.
data-OST0002-osc-ffff81041f9b2c00 active.
data-OST0003-osc-ffff81041f9b2c00 active.
data-OST0004-osc-ffff81041f9b2c00 active.
data-OST0005-osc-ffff81041f9b2c00 active.
data-OST0006-osc-ffff81041f9b2c00 active.
data-OST0007-osc-ffff81041f9b2c00 active.
data-OST0008-osc-ffff81041f9b2c00 active.
data-OST0009-osc-ffff81041f9b2c00 active.
data-OST000a-osc-ffff81041f9b2c00 active.
error: check 'data-OST000b-osc-ffff81041f9b2c00': Resource temporarily unavailable (11)

### and here's the relevant dmesg info:

[root@compute-2-7 ~]# dmesg |grep Lustre
Lustre: Client data-client has started
Lustre: Request x121943 sent from data-OST000b-osc-ffff81041f9b2c00 to NID 172.16.1.25@o2ib 100s ago has timed out (limit 100s).
Lustre: Skipped 1 previous similar message
Lustre: data-OST000b-osc-ffff81041f9b2c00: Connection to service data-OST000b via nid 172.16.1.25@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 3 previous similar messages
LustreError: 11-0: an error occurred while communicating with 172.16.1.25@o2ib. The ost_connect operation failed with -16
LustreError: Skipped 11 previous similar messages
Lustre: Changing connection for data-OST000b-osc-ffff81041f9b2c00 to 172.16.1.23@o2ib/172.16.1.23@o2ib
Lustre: Skipped 11 previous similar messages
Lustre: 4264:0:(import.c:410:import_select_connection()) data-OST000b-osc-ffff81041f9b2c00: tried all connections, increasing latency to 6s
Lustre: 4264:0:(import.c:410:import_select_connection()) Skipped 4 previous similar messages
LustreError: 11-0: an error occurred while communicating with 172.16.1.25@o2ib. The ost_connect operation failed with -16
LustreError: Skipped 1 previous similar message
Lustre: Changing connection for data-OST000b-osc-ffff81041f9b2c00 to 172.16.1.23@o2ib/172.16.1.23@o2ib
Lustre: Skipped 1 previous similar message
Lustre: 4264:0:(import.c:410:import_select_connection()) data-OST000b-osc-ffff81041f9b2c00: tried all connections, increasing latency to 11s
Lustre: data-OST000b-osc-ffff81041f9b2c00: Connection restored to service data-OST000b using nid 172.16.1.25@o2ib.
Lustre: Skipped 1 previous similar message

### the ost-survey completes but it's obvious that something's not right:

[root@compute-2-7 ~]# ost-survey -s 50 /lustre/
/usr/bin/ost-survey: 08/09/10 OST speed survey on /lustre/ from 172.16.255.223@o2ib
Number of Active OST devices : 12
Worst  Read OST indx: 11 speed: 2.449542
Best   Read OST indx: 3 speed: 2.512130
Read Average: 2.480302 +/- 0.018453 MB/s
Worst  Write OST indx: 11 speed: 0.209190
Best   Write OST indx: 4 speed: 5.595996
Write Average: 4.223409 +/- 2.038925 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     2.481       5.527        20.152     9.046
1     2.464       5.484        20.294     9.118
2     2.492       5.559        20.067     8.994
3     2.512       4.413        19.903     11.330
4     2.476       5.596        20.190     8.935
5     2.485       5.444        20.117     9.184
6     2.499       5.525        20.005     9.050
7     2.468       1.387        20.260     36.047
8     2.494       5.468        20.047     9.144
9     2.491       5.398        20.071     9.263
10    2.451       0.671        20.400     74.568
11    2.450       0.209        20.412     239.017

### Sorry for the wall of text here and thanks for the help everyone.

-Ed

laotsao 老曹 wrote:
> http://downloads.lustre.org/public/tools/e2fsprogs/
>
> [...]
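A suggestion before tuning timeouts: since the client keeps being redirected to 172.16.1.23@o2ib, it may be worth confirming that the failover NIDs recorded on the flapping target are what you expect, and which NIDs each OSS is actually serving. A sketch only; the device path below is illustrative, and tunefs.lustre should be run on the OSS that owns data-OST000b with the target unmounted:

  tunefs.lustre --print /dev/sdX | grep -i failover   # failover NIDs written into the target's config, if any
  lctl list_nids                                      # NIDs this OSS is currently serving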
Andreas Dilger
2010-Aug-09 20:13 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
On 2010-08-09, at 14:11, Edward Walter wrote:
> We're continuing to test things and seeing weird behavior when we run an
> ost-survey though. It looks as though the Lustre client is getting
> shuffled back and forth between OSS server pairs for our OSTs. The
> client times out connecting to the primary server, attempts to connect to
> the failover server (and fails because the OST is on the primary) and then
> reconnects to the primary server and finishes the survey. This behavior is
> not isolated to one particular OST (or client) and doesn't occur with every survey.
>
> and here's the relevant dmesg info:
>
> [root@compute-2-7 ~]# dmesg |grep Lustre
> Lustre: Client data-client has started
> Lustre: Request x121943 sent from data-OST000b-osc-ffff81041f9b2c00 to
> NID 172.16.1.25@o2ib 100s ago has timed out (limit 100s).
> Lustre: Skipped 1 previous similar message

If you have a larger cluster (hundreds of clients) with 1.6.6 you have to increase the Lustre timeout value beyond 100s for the worst-case I/O (300s is pretty typical at 1000 clients), but this is too long for most cases.

What you really want is to upgrade to 1.8.x in order to get adaptive timeouts. This allows the clients/servers to handle varying network and storage latency, instead of having a fixed timeout.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
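For completeness, a sketch of how the fixed timeout is usually raised on 1.6.x. The 300s value and the filesystem name "data" are taken from the discussion and logs above, and the conf_param form assumes the 1.6.6 tools accept sys.timeout:

  echo 300 > /proc/sys/lustre/timeout     # temporary, repeat on every client and server
  lctl conf_param data.sys.timeout=300    # persistent, run once on the MGS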
Edward Walter
2010-Aug-09 20:32 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
Andreas Dilger wrote:
> On 2010-08-09, at 14:11, Edward Walter wrote:
>> [...]
>
> If you have a larger cluster (hundreds of clients) with 1.6.6 you have to
> increase the Lustre timeout value beyond 100s for the worst-case I/O (300s
> is pretty typical at 1000 clients), but this is too long for most cases.
>
> What you really want is to upgrade to 1.8.x in order to get adaptive
> timeouts. This allows the clients/servers to handle varying network and
> storage latency, instead of having a fixed timeout.
>
> Cheers, Andreas

Hi Andreas,

Our cluster is fairly modest in size (104 clients, 4 OSS, 12 OSTs, 1 active MDS). We have plans for upgrading to 1.8.x but those plans now include stabilizing our 1.6.6 installation so that we can do a full backup before upgrading.

For now we're doing our testing from 2-3 nodes without any of the other nodes mounting Lustre. This configuration was stable and reliable until the hard shutdown. Obviously we'd like to get back to where we were before upgrading. Our timeout on the clients (cat /proc/sys/lustre/timeout) is 100s. Shouldn't this be sufficient for 2 clients? I think something else is going on.

-Ed
laotsao 老曹
2010-Aug-10 12:05 UTC
[Lustre-discuss] OST targets not mountable after disabling/enabling MMP
hi

Could the timeouts be due to your IB network? It seems there is no harm in just increasing the timeout from 100s to 200s to see whether ost-survey finishes without any error. After the power outage, did you check that all FC paths and IB paths are good?

my 2c

On 8/9/2010 4:32 PM, Edward Walter wrote:
> Our cluster is fairly modest in size (104 clients, 4 OSS, 12 OSTs, 1
> active MDS). We have plans for upgrading to 1.8.x but those plans now
> include stabilizing our 1.6.6 installation so that we can do a full
> backup before upgrading.
>
> For now we're doing our testing from 2-3 nodes without any of the other
> nodes mounting Lustre. This configuration was stable and reliable until
> the hard shutdown. Obviously we'd like to get back to where we were
> before upgrading. Our timeout on the clients (cat
> /proc/sys/lustre/timeout) is 100s. Shouldn't this be sufficient for 2
> clients? I think something else is going on.
>
> -Ed
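A sketch of the kind of path checks meant here, assuming OFED's infiniband-diags and device-mapper-multipath are what is installed on the OSS nodes (swap in your RAID vendor's own path tools if not):

  ibstat            # local HCA port state and rate; ports should show Active/LinkUp
  ibcheckerrors     # sweep the IB fabric for ports with high error counters
  multipath -ll     # confirm every FC LUN still shows all of its expected active paths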