Bob Ball
2010-Dec-14 16:19 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
I am trying to get a Lustre client to mount the file system with one or
more OSTs disabled. This does not appear to be working. The Lustre
version is 1.8.4.
mount -o localflock,exclude=umt3-OST0019 -t lustre 10.10.1.140@tcp0:/umt3 /lustre/umt3
dmesg on this client shows the following during the umount/mount sequence:
Lustre: client ffff810c25c03800 umount complete
Lustre: Skipped 1 previous similar message
Lustre: MGC10.10.1.140@tcp: Reactivating import
Lustre: 450250:0:(obd_mount.c:1786:lustre_check_exclusion()) Excluding
umt3-OST0019 (on exclusion list)
Lustre: 450250:0:(obd_mount.c:1786:lustre_check_exclusion()) Skipped 1
previous similar message
Lustre: 5942:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1354682302740498 sent from umt3-MDT0000-mdc-ffff810628209000 to NID
10.10.1.49@tcp 0s ago has failed due to network error (5s prior to
deadline).
req@ffff810620e66400 x1354682302740498/t0
o38->umt3-MDT0000_UUID@10.10.1.49@tcp:12/10 lens 368/584 e 0 to 1 dl
1292342239 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 5942:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1
previous similar message
Lustre: Client umt3-client has started
When I check after the mount, using "lctl dl", I see the following,
and it is clear that the OST is still active as far as this client is concerned.
19 UP osc umt3-OST0018-osc-ffff810628209000
05b29472-d125-c36e-c023-e0eb76aaf353 5
20 UP osc umt3-OST0019-osc-ffff810628209000
05b29472-d125-c36e-c023-e0eb76aaf353 5
21 UP osc umt3-OST001a-osc-ffff810628209000
05b29472-d125-c36e-c023-e0eb76aaf353 5
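As a possible stopgap while I sort this out, I assume I could mark that
OSC inactive by hand on this client after the mount. This is only a
sketch: the device number 20 comes from the "lctl dl" listing above, and
as far as I know the setting does not persist across remounts:

lctl --device 20 deactivate                        # deactivate the umt3-OST0019 OSC on this client only
cat /proc/fs/lustre/osc/umt3-OST0019-osc-*/active  # should read 0 afterwards (proc path from memory)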
Two questions here. The first, obviously, is what is wrong with this
picture? Why can't I exclude this OST from activity on this client? Is
it because the OSS serving that OST still has the OST active? If the
OST were deactivated or otherwise unavailable on the OSS, would the
client mount then succeed in excluding this OST? (OK, that is more than
one question in the group....)
Second group: what is the correct syntax for excluding more than one
OST? Is it a comma-separated list of exclusions, or are separate
excludes required?
mount -o localflock,exclude=umt3-OST0019,umt3-OST0020 -t lustre 10.10.1.140@tcp0:/umt3 /lustre/umt3
or
mount -o localflock,exclude=umt3-OST0019,exclude=umt3-OST0020 -t lustre 10.10.1.140@tcp0:/umt3 /lustre/umt3
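Either way, I figure I can check which form actually took effect by
looking at the client LOV target list after the mount. A sketch only;
the exact /proc path is from memory and may differ on 1.8.4:

cat /proc/fs/lustre/lov/umt3-clilov-*/target_obd   # an excluded OST should show INACTIVE or be absent
lctl dl | grep osc                                 # list the OSC devices that actually got set up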
Thanks,
bob
Andreas Dilger
2010-Dec-14 16:57 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
The error message shows a timeout connecting to umt3-MDT0000 and not the
OST. The operation 38 is MDS_CONNECT, AFAIK.

Cheers, Andreas
Bob Ball
2010-Dec-14 20:05 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
Well, you are absolutely right, it is a timeout talking to what it THINKS
is the MDT. The thing is, it is NOT!

We were set up for HA for the MDT, with 10.10.1.48 and 10.10.1.49 watching
and talking to one another. The RedHat service was problematic, so right
now 10.10.1.48 is the MDT and has /mnt/mdt mounted, while 10.10.1.49 is
being used to do backups and has /mnt/mdt_snapshot mounted. The actual
volume is an iSCSI location.

So, somehow, the client node has found and is talking to the wrong host!
Not good. Scary. Got to do something about this.....

Suggestions appreciated....

bob
Bob Ball
2010-Dec-14 21:41 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
OK, so, we rebooted 10.10.1.49 into a different, non-Lustre kernel. Then,
to be as certain as I could be that the client did not know about
10.10.1.49, I rebooted it as well. After it was fully up (with the Lustre
file system mount in /etc/fstab) I umounted it, then mounted again with
the same exclude command as before. And the message still came back that
it was trying to contact 10.10.1.49 instead of 10.10.1.48 as it should.
To repeat, the dmesg is logging:

Lustre: MGC10.10.1.140@tcp: Reactivating import
Lustre: 10523:0:(obd_mount.c:1786:lustre_check_exclusion()) Excluding
umt3-OST0019 (on exclusion list)
Lustre: 5936:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1355139761832543 sent from umt3-MDT0000-mdc-ffff81062c82c400 to NID
10.10.1.49@tcp 0s ago has failed due to network error (5s prior to
deadline).
req@ffff81060e4ebc00 x1355139761832543/t0
o38->umt3-MDT0000_UUID@10.10.1.49@tcp:12/10 lens 368/584 e 0 to 1 dl
1292362202 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: Client umt3-client has started

I guess I need to know why in the world this client is still trying to
access 10.10.1.49. Is there something, perhaps, on the MGS machine that
is causing this misdirect? What? And, most importantly, how do I fix
this?

bob
Kevin Van Maren
2010-Dec-14 22:12 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
The clients (and servers) get the list of NIDs for each MDT/OST device
from the MGS at mount time.

Having the clients fail to connect to 10.10.1.49 is _expected_ when the
service is failed over to 10.10.1.48. However, they should succeed in
connecting to 10.10.1.48, and then you should no longer get complaints
about 10.10.1.49.

If the clients are not failing over to 10.10.1.48, then you might not
have the failover NID properly specified to allow failover. Are you sure
you properly specified the failover parameters during mkfs on the MDT,
and did you do the first mount from the correct machine?

If the NIDs are wrong, it is possible to correct them using --writeconf.
See the manual (or search the list archives).

Kevin
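For reference, the --writeconf regeneration procedure goes roughly as
follows. This is only a sketch of what the manual describes: everything
must be unmounted first, and the device paths, mount points, and failover
NID below are placeholders taken from this thread, so adjust them to the
real setup before trusting any of it:

umount /lustre/umt3                                # on every client
umount /mnt/ost*                                   # on every OSS, for each OST
umount /mnt/mdt                                    # on the MDS (10.10.1.48)
tunefs.lustre --writeconf --mgsnode=10.10.1.140@tcp0 \
    --failnode=10.10.1.49@tcp0 /dev/sde            # on the MDS, for the MDT
tunefs.lustre --writeconf /dev/sdf                 # on each OSS, for each OST device
# then remount in order: MGS, MDT, OSTs, and finally the clients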
Bob Ball
2010-Dec-15 18:33 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
And, the hole gets deeper. I was digging in the list archives, and in
the manual, and decided to look at what was stored in the file systems
using "tunefs.lustre --print".
The mgs machine is fine:
[mgs:~]# tunefs.lustre --print /dev/sdb
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: MGS
...
The individual OSSes and their OSTs are fine:
[root@umfs05 ~]# tunefs.lustre --print /dev/sdf
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: umt3-OST0000
...
But, on the MDT, not so fine:
root@lmd01 ~# tunefs.lustre --print /dev/sde
checking for existing Lustre data: not found
tunefs.lustre FATAL: Device /dev/sde has not been formatted with mkfs.lustre
tunefs.lustre: exiting with 19 (No such device)
This is, of course, not true. The partition was formatted that way once
upon a time, but somehow, over time and untrackable operations, this
history was lost. So, before I can begin to deal with the failover issue
above, it seems this little issue needs to be addressed. However, I have
no idea where to begin with this. As we have two MDS machines, originally
set up as an HA failover pair, I guess it is possible this would work
fine if the MDT were mounted on its twin at 10.10.1.49 instead of on
this machine at 10.10.1.48?
Can someone suggest a workable path to resolve this? I have not (yet)
taken the MDT offline to remount as ldiskfs and look at details.
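In the meantime, a couple of read-only checks I can run without taking
the MDT down (just a sketch: the device path is the one above, and the
debugfs invocation is from memory, so verify it before trusting it):

blkid /dev/sde                                     # confirm an ext3/ldiskfs superblock is really there
debugfs -c -R "stat CONFIGS/mountdata" /dev/sde    # -c opens the device read-only; checks that mountdata still exists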
bob
Bob Ball
2010-Dec-15 18:58 UTC
[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled
A bit more information. /dev/sde on lmd01 is an iSCSI volume. I made a
snapshot of it, mounted it with "-t ldiskfs", and found the following
information in place at the mount point:

root@lmd01 /mnt/tmp# ll
total 128
-rwx------   1 root root  1792 Apr 20  2010 CATALOGS
drwxr-xr-x   2 root root  4096 Dec 13 15:55 CONFIGS
-rw-r--r--   1 root root  4096 Apr 20  2010 health_check
-rw-r--r--   1 root root 56064 Dec  3 18:26 last_rcvd
drwxrwxrwx   2 root root  4096 Apr 20  2010 LOGS
drwx------   2 root root 16384 Apr 20  2010 lost+found
-rw-r--r--   1 root root   448 Sep 10 12:17 lov_objid
drwxrwxrwx   2 root root 20480 Dec 13 23:57 OBJECTS
drwxrwxrwx   2 root root 12288 Dec 15 11:31 PENDING
drwxr-xr-x  19 root root  4096 Dec  6 06:28 ROOT

root@lmd01 /mnt/tmp# ll CONFIGS
total 76
-rw-r--r--   1 root root 12288 May 21  2010 mountdata
-rw-r--r--   1 root root 61944 Dec 13 15:55 umt3-MDT0000

bob
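PS: If it would help, I can also dump that config log from the snapshot
to see exactly which NIDs it records. A sketch, assuming llog_reader from
the Lustre utilities can read a copied-out log file the way I remember:

cp /mnt/tmp/CONFIGS/umt3-MDT0000 /tmp/umt3-MDT0000.cfg
llog_reader /tmp/umt3-MDT0000.cfg | grep -i nid    # look for 10.10.1.48 vs 10.10.1.49 entries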