Wojciech Turek
2007-Nov-07 11:24 UTC
[Lustre-discuss] How To change server recovery timeout
Hi,

Our lustre environment is:
2.6.9-55.0.9.EL_lustre.1.6.3smp

I would like to change the recovery timeout from the default value of 250s to something longer.

I tried the example from the manual:

set_timeout <secs>   Sets the timeout (obd_timeout) for a server
                     to wait before failing recovery.

We performed that experiment on our test lustre installation with one OST.

storage02 is our OSS:

[root at storage02 ~]# lctl dl
  0 UP mgc MGC10.143.245.3 at tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
[root at storage02 ~]# lctl --device 2 set_timeout 600
set_timeout has been deprecated. Use conf_param instead.
e.g. conf_param lustre-MDT0000 obd_timeout=50
usage: conf_param obd_timeout=<secs>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
[root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600
No device found for name MGS: Invalid argument
error: conf_param: No such device

It looks like I need to run this command from the MGS node, so I moved to the MGS server, called storage03:

[root at storage03 ~]# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC10.143.245.3 at tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
  4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
  5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
[root at storage03 ~]# lctl device 5
[root at storage03 ~]# lctl conf_param obd_timeout=600
error: conf_param: Function not implemented
[root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600
error: conf_param: Function not implemented
[root at storage03 ~]# lctl help conf_param
conf_param: set a permanent config param. This command must be run on the MGS node
usage: conf_param <target.keyword=val> ...
[root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
error: conf_param: Invalid argument
[root at storage03 ~]#

I searched the whole of /proc/*/lustre for a file that could store this timeout value, but nothing was found.

Could someone advise how to change the value of the recovery timeout?

Cheers,

Wojciech Turek
Wojciech Turek wrote:
> Hi,
> Our lustre environment is:
> 2.6.9-55.0.9.EL_lustre.1.6.3smp
> I would like to change recovery timeout from default value 250s to
> something longer
> [...]
> [root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
> error: conf_param: Invalid argument
> [root at storage03 ~]#
> Could someone advise how to change value for recovery timeout?

It looks like your file system is named 'home' - you can confirm with
tunefs.lustre --print <MDS device> | grep "Lustre FS"

The correct command (run on the MGS) would be
# lctl conf_param home.sys.timeout=<val>

Example:
[root at ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
Lustre FS:  lustre
[root at ft4 ~]# cat /proc/sys/lustre/timeout
130
[root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
[root at ft4 ~]# cat /proc/sys/lustre/timeout
150

cliffw

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Wojciech Turek
2007-Nov-07 18:46 UTC
[Lustre-discuss] How To change server recovery timeout
Hi Cliff,

On 7 Nov 2007, at 17:58, Cliff White wrote:
> It looks like your file system is named 'home' - you can confirm with
> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>
> The correct command (Run on the MGS) would be
> # lctl conf_param home.sys.timeout=<val>
> [...]

Thanks for your email. I am afraid your tips aren't very helpful in this case. As stated in the subject, I am asking about the recovery timeout. You can find it, for example, in /proc/fs/lustre/obdfilter/<OST>/recovery_status whilst one of your OSTs is in the recovery state. By default this timeout is 250s. You, however, are talking about the system obd timeout (CFS documentation, chapter 4.1.2), which is not the subject of my concern.

Anyway, I tried your example just to see if it works, and again I am afraid it doesn't work for me, see below. I have a combined MGS and MDS configuration.

[root at storage03 ~]# df
Filesystem           1K-blocks       Used  Available Use% Mounted on
/dev/sda1             10317828    3452824    6340888  36% /
/dev/sda6              7605856      49788    7169708   1% /local
/dev/sda3              4127108      41000    3876460   2% /tmp
/dev/sda2              4127108     753668    3163792  20% /var
/dev/dm-2           1845747840  447502120 1398245720  25% /mnt/sdb
/dev/dm-1           6140723200 4632947344 1507775856  76% /mnt/sdc
/dev/dm-3            286696376    1461588  268850900   1% /mnt/home-md/mdt
[root at storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
Lustre FS:  home-md
Lustre FS:  home-md
[root at storage03 ~]# cat /proc/sys/lustre/timeout
100
[root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
error: conf_param: Invalid argument
[root at storage03 ~]#

Cheers,

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
Wojciech Turek wrote:
> Thanks for your email. I am afraid your tips aren't very helpful in this
> case. As stated in the subject I am asking about recovery timeout.
> [...]
> [root at storage03 ~]# cat /proc/sys/lustre/timeout
> 100
> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
> error: conf_param: Invalid argument
> [root at storage03 ~]#

Hmm, not sure why that isn't working for you, I tested the example I gave. Sorry about the mis-read. The obd recovery timeout is defined in relation to obd_timeout, and afaik is not changeable at runtime:

lustre/include/lustre_lib.h
#define OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2)

...which gives the default 250 seconds for the default obd_timeout (100 seconds).

cliffw
Nathan Rutman
2007-Nov-07 22:31 UTC
[Lustre-discuss] How To change server recovery timeout
Cliff White wrote:
> Wojciech Turek wrote:
>> [...]
>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>>> set_timeout has been deprecated. Use conf_param instead.
>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50

Sorry about this bad help message. It's wrong.

>> [...]
>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>> error: conf_param: Invalid argument
>> [root at storage03 ~]#

You need to do this on the MGS node, with the MGS running.

mgs> lctl conf_param testfs.sys.timeout=150
anynode> cat /proc/sys/lustre/timeout

> Hmm, not sure why that isn't working for you, I tested the example I
> gave. Sorry about the mis-read. The obd recovery timeout is defined in
> relation to obd_timeout, and afaik not changeable at runtime:
>
> lustre/include/lustre_lib.h
> #define OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2)
> ...which gives the default 250 seconds for the default obd_timeout
> (100 seconds)
>
> cliffw

That's correct. These are tied together before lustre 1.6.4.
Wojciech Turek
2007-Nov-07 23:56 UTC
[Lustre-discuss] How To change server recovery timeout
On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
> You need to do this on the MGS node, with the MGS running.
>
> mgs> lctl conf_param testfs.sys.timeout=150
> anynode> cat /proc/sys/lustre/timeout

This isn't working for me. In my production configuration I have the MGS combined with an MDT on the same server, and my lustre configuration consists of two file systems.

[root at mds01 ~]# tunefs.lustre --print /dev/dm-0
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     ddn-home-MDT0000
Index:      0
Lustre FS:  ddn-home
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp

Permanent disk data:
Target:     ddn-home-MDT0000
Index:      0
Lustre FS:  ddn-home
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp

exiting before disk write.

[root at mds01 ~]# tunefs.lustre --print /dev/dm-1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     ddn-data-MDT0000
Index:      0
Lustre FS:  ddn-data
Mount type: ldiskfs
Flags:      0x1 (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp

Permanent disk data:
Target:     ddn-data-MDT0000
Index:      0
Lustre FS:  ddn-data
Mount type: ldiskfs
Flags:      0x1 (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp

exiting before disk write.
[root at mds01 ~]#

As you can see above, the MGS is on /dev/dm-0, combined with the MDT of the ddn-home file system. If I try the command line from your example, I get this:

[root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
error: conf_param: Invalid argument

Server mds01 is definitely the MGS node. What is wrong here, then? The only two reasons for this problem I can think of are that the file system name contains a "-" character (although I found nothing in the documentation saying that this character is not allowed), or that the MGS is combined with the MDS.

syslog contains the following messages:

Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-home'
Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
[the same pair of messages repeats for further attempts against 'ddn-data' and 'ddn-home']

From the above it looks like only the first part of the file system name, "ddn", is recognized, and the "-home" or "-data" suffix is dropped.

Please advise.

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
Nathan Rutman
2007-Nov-08 18:54 UTC
[Lustre-discuss] How To change server recovery timeout
Wojciech Turek wrote:
> On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
>> Cliff White wrote:
>>> Wojciech Turek wrote:
>>>> Hi Cliff,
>>>>
>>>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>>>> Wojciech Turek wrote:
>>>>>> [...]
>>>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>>>>> set_timeout has been deprecated. Use conf_param instead.
>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>
>> sorry about this bad help message. It's wrong.
>>
>>>>>> [...]
>>>>>> Could someone advise how to change the value for the recovery timeout?
>>>>>
>>>>> It looks like your file system is named 'home' - you can confirm with
>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>
>>>>> The correct command (run on the MGS) would be
>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>
>>>>> Example:
>>>>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb | grep "Lustre FS"
>>>>> Lustre FS: lustre
>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>> 130
>>>>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>> 150
>>>>>
>>>> Thanks for your email. I am afraid your tips aren't very helpful in this case. As stated in the subject, I am asking about the recovery timeout. You can find it, for example, in /proc/fs/lustre/obdfilter/<OST>/recovery_status whilst one of your OSTs is in the recovery state. By default this timeout is 250s. You are talking about the system obd timeout (CFS documentation, chapter 4.1.2), which is not the subject of my concern.
>>>>
>>>> Anyway, I tried your example just to see if it works, and again I am afraid it doesn't work for me; see below. I have a combined MGS and MDS configuration.
>>>>
>>>> [root at storage03 ~]# df
>>>> Filesystem           1K-blocks       Used  Available Use% Mounted on
>>>> /dev/sda1             10317828    3452824    6340888  36% /
>>>> /dev/sda6              7605856      49788    7169708   1% /local
>>>> /dev/sda3              4127108      41000    3876460   2% /tmp
>>>> /dev/sda2              4127108     753668    3163792  20% /var
>>>> /dev/dm-2           1845747840  447502120 1398245720  25% /mnt/sdb
>>>> /dev/dm-1           6140723200 4632947344 1507775856  76% /mnt/sdc
>>>> /dev/dm-3            286696376    1461588  268850900   1% /mnt/home-md/mdt
>>>> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 | grep "Lustre FS"
>>>> Lustre FS: home-md
>>>> Lustre FS: home-md
>>>> [root at storage03 ~]# cat /proc/sys/lustre/timeout
>>>> 100
>>>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>> error: conf_param: Invalid argument
>>>> [root at storage03 ~]#
>>
>> You need to do this on the MGS node, with the MGS running.
>>
>> mgs> lctl conf_param testfs.sys.timeout=150
>> anynode> cat /proc/sys/lustre/timeout
>
> This isn't working for me. In my production configuration I have the MGS combined with an MDT on the same server. My Lustre configuration consists of two file systems.
>
> [root at mds01 ~]# tunefs.lustre --print /dev/dm-0
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target:     ddn-home-MDT0000
> Index:      0
> Lustre FS:  ddn-home
> Mount type: ldiskfs
> Flags:      0x5
>             (MDT MGS )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>
> Permanent disk data:
> [identical to the values above]
>
> exiting before disk write.
> [root at mds01 ~]# tunefs.lustre --print /dev/dm-1
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target:     ddn-data-MDT0000
> Index:      0
> Lustre FS:  ddn-data
> Mount type: ldiskfs
> Flags:      0x1
>             (MDT )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>
> Permanent disk data:
> [identical to the values above]
>
> exiting before disk write.
> [root at mds01 ~]#
>
> As you can see above, the MGS is on /dev/dm-0, combined with the MDT for the ddn-home file system.
> If I try the command line from your example I get this:
> [root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
> error: conf_param: Invalid argument
>
> Server mds01 is 100% the MGS node. What is wrong here, then? The only two reasons for the problem I can think of are that the file system name contains a "-" character (although I didn't find anything in the documentation saying that character is not allowed), or that the MGS is combined with the MDS.
>
> syslog contains the following messages:
>
> Nov 7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. cfg_device from lctl is 'ddn-home'
> Nov 7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov 7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. cfg_device from lctl is 'ddn-data'
> Nov 7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> [further identical entries for 'ddn-data' and 'ddn-home' omitted]
>
> From the above it looks like only the first part of the file system name, "ddn", is recognized; the "-home" or "-data" suffix is dropped.
>
> Please advise.
>
> Wojciech Turek

You seem to have found a bug. I just tried this myself and it doesn't work with a "-" in the name. Maybe use a '.' instead until we fix it.
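The log pattern is consistent with the target name being cut at the first "-". This split is our inference from the "No filesystem targets for ddn" messages, not a reading of the actual mgs_llog.c code:

```shell
# If the MGS parser truncated the name at the first '-', every name in
# this thread would collapse to the same prefix, which is exactly what
# the syslog errors suggest. (Our inference, not the real MGS logic.)
for fsname in ddn-home ddn-data home-md; do
    echo "$fsname -> ${fsname%%-*}"
done
```

That would explain why both ddn-home and ddn-data produce identical "targets for ddn" errors.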
Nathan Rutman
2007-Nov-08 19:04 UTC
[Lustre-discuss] How To change server recovery timeout
Nathan Rutman wrote:
> Wojciech Turek wrote:
>> [...]
>>
>> From the above it looks like only the first part of the file system name, "ddn", is recognized; the "-home" or "-data" suffix is dropped.
>>
>> Please advise.
>>
>> Wojciech Turek
>
> You seem to have found a bug. I just tried this myself and it doesn't work with a "-" in the name. Maybe use a '.' instead until we fix it.

Argh, sorry, that doesn't work with conf_param either. But an underscore '_' does. I'm filing a bug report...
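Until the fix lands, a pre-format sanity check on candidate names may help. This is a sketch based only on what the thread demonstrates (underscore works; hyphen and dot do not); the accepted character set is our conservative guess, not a documented Lustre naming rule:

```shell
# Reject file system names that conf_param in 1.6.3 is known (from this
# thread) to mishandle: anything outside letters, digits, underscore.
# The character set below is our own conservative guess.
check_fsname() {
    case "$1" in
        *[!A-Za-z0-9_]*) echo "unsafe: $1"; return 1 ;;
        "")              echo "unsafe: empty name"; return 1 ;;
        *)               echo "ok: $1"; return 0 ;;
    esac
}

check_fsname ddn_home            # underscore is accepted
check_fsname ddn-home || true    # '-' triggers "Invalid argument"
```

Running this before mkfs.lustre is cheaper than renaming a production file system afterwards.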
Wojciech Turek
2007-Nov-09 02:38 UTC
[Lustre-discuss] How To change server recovery timeout
Hi,

It is a lesson to me not to change old habits. I always used "_", and for the latest file system I made an exception, under the impression that it looks neater with "-", and here we go.

Can I change the file system name without reformatting everything? The file system with the bad name is in production, and it is essential for me to fix it without a long service downtime.

Thanks

Wojciech Turek

On 8 Nov 2007, at 19:04, Nathan Rutman wrote:
> [...]
>> You seem to have found a bug. I just tried this myself and it doesn't work with a "-" in the name. Maybe use a '.' instead until we fix it.
>
> Argh, sorry, that doesn't work with conf_param either. But an underscore '_' does. I'm filing a bug report...

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
Nathan Rutman
2007-Nov-09 23:28 UTC
[Lustre-discuss] How To change server recovery timeout
Wojciech Turek wrote:> Hi, > > It is a lesson for me to do not change old habits. I always used "_" > and for latest filesystem I did exception for the impression that it > looks neater with "-" and here we go. > Can I change file system name without reformatting everything? File > system with bad name is in production and it is essential for me to > fix it without long service downtime.Yes, but you will have to shut everything down. tunefs --writeconf all the servers, restart the MGS first. While you''re at it, you can set the timeout. (This can be overridden later with conf_param). tunefs.lustre --writeconf --param="sys.timeout=50" /dev/sda> > Thanks > > Wojciech Turek > > On 8 Nov 2007, at 19:04, Nathan Rutman wrote: > >> Nathan Rutman wrote: >>> Wojciech Turek wrote: >>> >>> >>> >>>> On 7 Nov 2007, at 22:31, Nathan Rutman wrote: >>>> >>>> >>>> >>>>> Cliff White wrote: >>>>> >>>>> >>>>> >>>>>> Wojciech Turek wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Hi Cliff, >>>>>>> >>>>>>> On 7 Nov 2007, at 17:58, Cliff White wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Wojciech Turek wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> Our lustre environment is: >>>>>>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp >>>>>>>>> I would like to change recovery timeout from default value >>>>>>>>> 250s to something longer >>>>>>>>> I tried example from manual: >>>>>>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server >>>>>>>>> to wait before failing recovery. >>>>>>>>> We performed that experiment on our test lustre installation >>>>>>>>> with one OST. >>>>>>>>> storage02 is our OSS >>>>>>>>> [root at storage02 ~]# lctl dl >>>>>>>>> 0 UP mgc MGC10.143.245.3 at tcp >>>>>>>>> 31259d9b-e655-cdc4-c760-45d3df426d86 5 >>>>>>>>> 1 UP ost OSS OSS_uuid 3 >>>>>>>>> 2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7 >>>>>>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600 >>>>>>>>> set_timeout has been deprecated. 
Use conf_param instead. >>>>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> sorry about this bad help message. It''s wrong. >>>>> >>>>> >>>>> >>>>>>>>> usage: conf_param obd_timeout=<secs> >>>>>>>>> run <command> after connecting to device <devno> >>>>>>>>> --device <devno> <command [args ...]> >>>>>>>>> [root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600 >>>>>>>>> No device found for name MGS: Invalid argument >>>>>>>>> error: conf_param: No such device >>>>>>>>> It looks like I need to run this command from MGS node so I >>>>>>>>> moved then to MGS server called storage03 >>>>>>>>> [root at storage03 ~]# lctl dl >>>>>>>>> 0 UP mgs MGS MGS 9 >>>>>>>>> 1 UP mgc MGC10.143.245.3 at tcp >>>>>>>>> f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5 >>>>>>>>> 2 UP mdt MDS MDS_uuid 3 >>>>>>>>> 3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4 >>>>>>>>> 4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5 >>>>>>>>> 5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5 >>>>>>>>> [root at storage03 ~]# lctl device 5 >>>>>>>>> [root at storage03 ~]# lctl conf_param obd_timeout=600 >>>>>>>>> error: conf_param: Function not implemented >>>>>>>>> [root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600 >>>>>>>>> error: conf_param: Function not implemented >>>>>>>>> [root at storage03 ~]# lctl help conf_param >>>>>>>>> conf_param: set a permanent config param. This command must be >>>>>>>>> run on the MGS node >>>>>>>>> usage: conf_param <target.keyword=val> ... >>>>>>>>> [root at storage03 ~]# lctl conf_param >>>>>>>>> home-md-MDT0000.obd_timeout=600 >>>>>>>>> error: conf_param: Invalid argument >>>>>>>>> [root at storage03 ~]# >>>>>>>>> I searched whole /proc/*/lustre for file that can store this >>>>>>>>> timeout value but nothing were found. >>>>>>>>> Could someone advise how to change value for recovery timeout? 
>>>>>>>>> Cheers,
>>>>>>>>> Wojciech Turek
>>>>>>>>
>>>>>>>> It looks like your file system is named 'home' - you can
>>>>>>>> confirm with:
>>>>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>>>>
>>>>>>>> The correct command (run on the MGS) would be:
>>>>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb | grep "Lustre FS"
>>>>>>>> Lustre FS: lustre
>>>>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>>>> 130
>>>>>>>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>>>> 150
>>>>>>>
>>>>>>> Thanks for your email. I am afraid your tips aren't very helpful
>>>>>>> in this case. As stated in the subject, I am asking about the
>>>>>>> recovery timeout. You can find it, for example, in
>>>>>>> /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of
>>>>>>> your OSTs is in the recovery state. By default this timeout is
>>>>>>> 250s. You are talking about the system obd timeout (CFS
>>>>>>> documentation, chapter 4.1.2), which is not the subject of my
>>>>>>> concern.
>>>>>>>
>>>>>>> Anyway, I tried your example just to see if it works, and again
>>>>>>> I am afraid it doesn't work for me; see below. I have a combined
>>>>>>> MGS and MDS configuration.
>>>>>>> [root at storage03 ~]# df
>>>>>>> Filesystem      1K-blocks        Used  Available Use% Mounted on
>>>>>>> /dev/sda1        10317828     3452824    6340888  36% /
>>>>>>> /dev/sda6         7605856       49788    7169708   1% /local
>>>>>>> /dev/sda3         4127108       41000    3876460   2% /tmp
>>>>>>> /dev/sda2         4127108      753668    3163792  20% /var
>>>>>>> /dev/dm-2      1845747840   447502120 1398245720  25% /mnt/sdb
>>>>>>> /dev/dm-1      6140723200  4632947344 1507775856  76% /mnt/sdc
>>>>>>> /dev/dm-3       286696376     1461588  268850900   1% /mnt/home-md/mdt
>>>>>>> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 | grep "Lustre FS"
>>>>>>> Lustre FS: home-md
>>>>>>> Lustre FS: home-md
>>>>>>> [root at storage03 ~]# cat /proc/sys/lustre/timeout
>>>>>>> 100
>>>>>>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>>>>> error: conf_param: Invalid argument
>>>>>>> [root at storage03 ~]#
>>>>>
>>>>> You need to do this on the MGS node, with the MGS running.
>>>>>
>>>>> mgs> lctl conf_param testfs.sys.timeout=150
>>>>> anynode> cat /proc/sys/lustre/timeout
>>>>
>>>> This isn't working for me. In my production configuration I have
>>>> the MGS combined with the MDT on the same server. My lustre
>>>> configuration consists of two file systems.
>>>> [root at mds01 ~]# tunefs.lustre --print /dev/dm-0
>>>> checking for existing Lustre data: found CONFIGS/mountdata
>>>> Reading CONFIGS/mountdata
>>>>
>>>> Read previous values:
>>>> Target:     ddn-home-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-home
>>>> Mount type: ldiskfs
>>>> Flags:      0x5
>>>>             (MDT MGS )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>>>>
>>>> Permanent disk data:
>>>> Target:     ddn-home-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-home
>>>> Mount type: ldiskfs
>>>> Flags:      0x5
>>>>             (MDT MGS )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>>>>
>>>> exiting before disk write.
>>>> [root at mds01 ~]# tunefs.lustre --print /dev/dm-1
>>>> checking for existing Lustre data: found CONFIGS/mountdata
>>>> Reading CONFIGS/mountdata
>>>>
>>>> Read previous values:
>>>> Target:     ddn-data-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-data
>>>> Mount type: ldiskfs
>>>> Flags:      0x1
>>>>             (MDT )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>>>>
>>>> Permanent disk data:
>>>> Target:     ddn-data-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-data
>>>> Mount type: ldiskfs
>>>> Flags:      0x1
>>>>             (MDT )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>>>>
>>>> exiting before disk write.
>>>> [root at mds01 ~]#
>>>>
>>>> As you can see above, the MGS is on /dev/dm-0, combined with the
>>>> MDT for the ddn-home file system.
>>>> If I try the command line from your example, I get this:
>>>> [root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
>>>> error: conf_param: Invalid argument
>>>>
>>>> Server mds01 is 100% the MGS node. What is wrong here then?
>>>> The only two reasons for this problem that I can think of are that
>>>> the file system name contains a "-" character (though I didn't find
>>>> anything in the documentation saying this character is not
>>>> allowed), or that the MGS is combined with the MDS.
>>>>
>>>> syslog contains the following messages:
>>>>
>>>> Nov  7 18:38:35 mds01 kernel: LustreError:
>>>> 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-home'
>>>> Nov  7 18:38:35 mds01 kernel: LustreError:
>>>> 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:39:46 mds01 kernel: LustreError:
>>>> 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:39:46 mds01 kernel: LustreError:
>>>> 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:39:54 mds01 kernel: LustreError:
>>>> 3275:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:39:54 mds01 kernel: LustreError:
>>>> 3275:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:40:01 mds01 kernel: LustreError:
>>>> 3282:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:40:01 mds01 kernel: LustreError:
>>>> 3282:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:41:06 mds01 kernel: LustreError:
>>>> 3305:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:41:06 mds01 kernel: LustreError:
>>>> 3305:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:41:15 mds01 kernel: LustreError:
>>>> 3306:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>>>> cfg_device from lctl is 'ddn-home'
>>>> Nov  7 18:41:15 mds01 kernel: LustreError:
>>>> 3306:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>>
>>>> From the above, it looks like only the first part of the file
>>>> system name, "ddn", is recognized, and the "-home" or "-data" part
>>>> is dropped.
>>>>
>>>> Please advise.
>>>>
>>>> Wojciech Turek
>>>
>>> You seem to have found a bug. I just tried this myself, and it
>>> doesn't work with a "-" in the name. Maybe use a '.' instead until
>>> we fix it.
>>
>> Argh, sorry - that doesn't work with conf_param either. But an
>> underscore '_' does. I'm filing a bug report...
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: wjt27 at cam.ac.uk
> tel. +441223763517
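[Editor's note] The syslog lines above suggest the MGS truncates the
conf_param target at the first "-", which would explain why "ddn-home"
is reduced to "ddn". The sketch below is a guess at that behaviour for
illustration only (it is not the actual Lustre source); the helper names
fsname_from_target and is_safe_fsname are hypothetical:

```python
def fsname_from_target(target: str) -> str:
    """Mimic a parser that treats everything after the first '-' as a
    target suffix (e.g. '-MDT0000'), as the syslog output suggests."""
    return target.split('-', 1)[0]


def is_safe_fsname(name: str) -> bool:
    """Conservative check based on this thread's conclusion: letters,
    digits and underscores survive conf_param; '-' and '.' do not."""
    return all(c.isalnum() or c == '_' for c in name)


print(fsname_from_target("ddn-home"))  # prints "ddn", matching the syslog
print(is_safe_fsname("ddn-home"))      # False
print(is_safe_fsname("ddn_home"))      # True
```

Under this reading, "home-md" would likewise be parsed as filesystem
"home" with an unexpected suffix, which matches the earlier
"Invalid argument" failures in the thread.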
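[Editor's note] Nathan's writeconf procedure, collected in one place. This
is a sketch only: the combined MGS/MDT on /dev/dm-0 and the mdt mount
point come from the thread, while the OST device (/dev/sda) and its mount
point are illustrative assumptions. All servers and clients must be
unmounted first, and these commands only make sense on a real Lustre
cluster:

```shell
# 1. Regenerate the configuration logs on every server; the timeout can
#    be set at the same time via --param:
tunefs.lustre --writeconf --param="sys.timeout=50" /dev/dm-0   # MGS/MDT
tunefs.lustre --writeconf /dev/sda                             # each OST

# 2. Restart the MGS first, then the remaining targets:
mount -t lustre /dev/dm-0 /mnt/home-md/mdt
mount -t lustre /dev/sda  /mnt/home-md/ost0

# The timeout can later be changed without a writeconf, on the MGS node
# (note the thread's conclusion that <fsname> must not contain '-'):
# lctl conf_param <fsname>.sys.timeout=150
```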