Wojciech Turek
2007-Nov-07 11:24 UTC
[Lustre-discuss] How To change server recovery timeout
Hi,

Our lustre environment is:
2.6.9-55.0.9.EL_lustre.1.6.3smp

I would like to change the recovery timeout from the default value of 250s to something longer.

I tried the example from the manual:

set_timeout <secs>   Sets the timeout (obd_timeout) for a server
                     to wait before failing recovery.

We performed that experiment on our test lustre installation with one OST.

storage02 is our OSS:

[root at storage02 ~]# lctl dl
  0 UP mgc MGC10.143.245.3 at tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
[root at storage02 ~]# lctl --device 2 set_timeout 600
set_timeout has been deprecated. Use conf_param instead.
e.g. conf_param lustre-MDT0000 obd_timeout=50
usage: conf_param obd_timeout=<secs>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
[root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600
No device found for name MGS: Invalid argument
error: conf_param: No such device

It looks like I need to run this command from the MGS node, so I moved to the MGS server, called storage03:

[root at storage03 ~]# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC10.143.245.3 at tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
  4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
  5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
[root at storage03 ~]# lctl device 5
[root at storage03 ~]# lctl conf_param obd_timeout=600
error: conf_param: Function not implemented
[root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600
error: conf_param: Function not implemented
[root at storage03 ~]# lctl help conf_param
conf_param: set a permanent config param. This command must be run on the MGS node
usage: conf_param <target.keyword=val> ...
[root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
error: conf_param: Invalid argument
[root at storage03 ~]#

I searched the whole of /proc/*/lustre for a file that could store this timeout value, but nothing was found.

Could someone advise how to change the value of the recovery timeout?

Cheers,

Wojciech Turek
Wojciech Turek wrote:
> Hi,
> Our lustre environment is:
> 2.6.9-55.0.9.EL_lustre.1.6.3smp
> I would like to change recovery timeout from default value 250s to
> something longer
> [...]
> [root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
> error: conf_param: Invalid argument
> [root at storage03 ~]#
> Could someone advise how to change value for recovery timeout?

It looks like your file system is named 'home' - you can confirm with
tunefs.lustre --print <MDS device> | grep "Lustre FS"

The correct command (run on the MGS) would be
# lctl conf_param home.sys.timeout=<val>

Example:
[root at ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
Lustre FS:  lustre
[root at ft4 ~]# cat /proc/sys/lustre/timeout
130
[root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
[root at ft4 ~]# cat /proc/sys/lustre/timeout
150

cliffw

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Wojciech Turek
2007-Nov-07 18:46 UTC
[Lustre-discuss] How To change server recovery timeout
Hi Cliff,

On 7 Nov 2007, at 17:58, Cliff White wrote:
> It looks like your file system is named 'home' - you can confirm with
> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>
> The correct command (Run on the MGS) would be
> # lctl conf_param home.sys.timeout=<val>
> [...]

Thanks for your email. I am afraid your tips aren't very helpful in this case. As stated in the subject, I am asking about the recovery timeout. You can find it, for example, in /proc/fs/lustre/obdfilter/<OST>/recovery_status whilst one of your OSTs is in the recovery state. By default this timeout is 250s. You, however, are talking about the system obd timeout (CFS documentation, chapter 4.1.2), which is not the subject of my concern.

Anyway, I tried your example just to see if it works, and again I am afraid it doesn't work for me, see below. I have a combined MGS and MDS configuration.

[root at storage03 ~]# df
Filesystem           1K-blocks       Used  Available Use% Mounted on
/dev/sda1             10317828    3452824    6340888  36% /
/dev/sda6              7605856      49788    7169708   1% /local
/dev/sda3              4127108      41000    3876460   2% /tmp
/dev/sda2              4127108     753668    3163792  20% /var
/dev/dm-2           1845747840  447502120 1398245720  25% /mnt/sdb
/dev/dm-1           6140723200 4632947344 1507775856  76% /mnt/sdc
/dev/dm-3            286696376    1461588  268850900   1% /mnt/home-md/mdt
[root at storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
Lustre FS:  home-md
Lustre FS:  home-md
[root at storage03 ~]# cat /proc/sys/lustre/timeout
100
[root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
error: conf_param: Invalid argument
[root at storage03 ~]#

Cheers,

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
Wojciech Turek wrote:
> Thanks for your email. I am afraid your tips aren't very helpful in this
> case. As stated in the subject I am asking about recovery timeout.
> [...]
> [root at storage03 ~]# cat /proc/sys/lustre/timeout
> 100
> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
> error: conf_param: Invalid argument
> [root at storage03 ~]#

Hmm, not sure why that isn't working for you, I tested the example I gave. Sorry about the mis-read. The obd recovery timeout is defined in relation to obd_timeout, and afaik is not changeable at runtime:

lustre/include/lustre_lib.h
#define OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2)

...which gives the default 250 seconds for the default obd_timeout (100 seconds).

cliffw
Nathan Rutman
2007-Nov-07 22:31 UTC
[Lustre-discuss] How To change server recovery timeout
Cliff White wrote:
> Wojciech Turek wrote:
>> [...]
>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>>> set_timeout has been deprecated. Use conf_param instead.
>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50

Sorry about this bad help message. It's wrong.

>> [...]
>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>> error: conf_param: Invalid argument
>> [root at storage03 ~]#

You need to do this on the MGS node, with the MGS running.

mgs> lctl conf_param testfs.sys.timeout=150
anynode> cat /proc/sys/lustre/timeout

> Hmm, not sure why that isn't working for you, I tested the example I
> gave. Sorry about the mis-read. The obd recovery timeout is defined in
> relation to obd_timeout, and afaik not changeable at runtime:
>
> lustre/include/lustre_lib.h
> #define OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2)
> ...which gives the default 250 seconds for the default obd_timeout
> (100 seconds)
>
> cliffw

That's correct. These are tied together before lustre 1.6.4.
Wojciech Turek
2007-Nov-07 23:56 UTC
[Lustre-discuss] How To change server recovery timeout
On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
> You need to do this on the MGS node, with the MGS running.
>
> mgs> lctl conf_param testfs.sys.timeout=150
> anynode> cat /proc/sys/lustre/timeout

This isn't working for me. In my production configuration I have the MGS combined with an MDT on the same server, and my lustre configuration consists of two file systems.

[root at mds01 ~]# tunefs.lustre --print /dev/dm-0
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     ddn-home-MDT0000
Index:      0
Lustre FS:  ddn-home
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp

Permanent disk data:
Target:     ddn-home-MDT0000
Index:      0
Lustre FS:  ddn-home
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp

exiting before disk write.

[root at mds01 ~]# tunefs.lustre --print /dev/dm-1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     ddn-data-MDT0000
Index:      0
Lustre FS:  ddn-data
Mount type: ldiskfs
Flags:      0x1 (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp

Permanent disk data:
Target:     ddn-data-MDT0000
Index:      0
Lustre FS:  ddn-data
Mount type: ldiskfs
Flags:      0x1 (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp

exiting before disk write.
[root at mds01 ~]#

As you can see above, the MGS is on /dev/dm-0, combined with the MDT of the ddn-home file system. If I try the command line from your example, I get this:

[root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
error: conf_param: Invalid argument

Server mds01 is definitely the MGS node. What is wrong here, then? The only two reasons for this problem I can think of are that the file system name contains a "-" character (although I found nothing in the documentation saying that this character is not allowed), or that the MGS is combined with the MDS.

syslog contains the following messages:

Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-home'
Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
[the same pair of messages repeats for further attempts against 'ddn-data' and 'ddn-home']

From the above it looks like only the first part of the file system name, "ddn", is recognized, and the "-home" or "-data" suffix is dropped.

Please advise.

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
Nathan Rutman
2007-Nov-08 18:54 UTC
[Lustre-discuss] How To change server recovery timeout
Wojciech Turek wrote:
> On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
>> Cliff White wrote:
>>> Wojciech Turek wrote:
>>>> Hi Cliff,
>>>>
>>>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>>>> Wojciech Turek wrote:
>>>>>> [...]
>>>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>>>>> set_timeout has been deprecated. Use conf_param instead.
>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>
>> sorry about this bad help message. It's wrong.
>>
>>>>>> [...]
>>>>>> Could someone advise how to change the value for the recovery timeout?
>>>>>
>>>>> It looks like your file system is named 'home' - you can confirm with
>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>
>>>>> The correct command (run on the MGS) would be
>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>
>>>>> Example:
>>>>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb | grep "Lustre FS"
>>>>> Lustre FS: lustre
>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>> 130
>>>>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>> 150
>>>>>
>>>> Thanks for your email. I am afraid your tips aren't very helpful in this case. As stated in the subject, I am asking about the recovery timeout. You can find it, for example, in /proc/fs/lustre/obdfilter/<OST>/recovery_status whilst one of your OSTs is in the recovery state. By default this timeout is 250s. You are talking about the system obd timeout (CFS documentation, chapter 4.1.2), which is not the subject of my concern.
>>>>
>>>> Anyway, I tried your example just to see if it works, and again I am afraid it doesn't work for me; see below. I have a combined MGS and MDS configuration.
>>>>
>>>> [root at storage03 ~]# df
>>>> Filesystem           1K-blocks       Used  Available Use% Mounted on
>>>> /dev/sda1             10317828    3452824    6340888  36% /
>>>> /dev/sda6              7605856      49788    7169708   1% /local
>>>> /dev/sda3              4127108      41000    3876460   2% /tmp
>>>> /dev/sda2              4127108     753668    3163792  20% /var
>>>> /dev/dm-2           1845747840  447502120 1398245720  25% /mnt/sdb
>>>> /dev/dm-1           6140723200 4632947344 1507775856  76% /mnt/sdc
>>>> /dev/dm-3            286696376    1461588  268850900   1% /mnt/home-md/mdt
>>>> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 | grep "Lustre FS"
>>>> Lustre FS: home-md
>>>> Lustre FS: home-md
>>>> [root at storage03 ~]# cat /proc/sys/lustre/timeout
>>>> 100
>>>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>> error: conf_param: Invalid argument
>>>> [root at storage03 ~]#
>>
>> You need to do this on the MGS node, with the MGS running.
>>
>> mgs> lctl conf_param testfs.sys.timeout=150
>> anynode> cat /proc/sys/lustre/timeout
>
> This isn't working for me. In my production configuration I have the MGS combined with an MDT on the same server. My Lustre configuration consists of two file systems.
>
> [root at mds01 ~]# tunefs.lustre --print /dev/dm-0
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target:     ddn-home-MDT0000
> Index:      0
> Lustre FS:  ddn-home
> Mount type: ldiskfs
> Flags:      0x5
>             (MDT MGS )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>
> Permanent disk data:
> [identical to the values above]
>
> exiting before disk write.
> [root at mds01 ~]# tunefs.lustre --print /dev/dm-1
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target:     ddn-data-MDT0000
> Index:      0
> Lustre FS:  ddn-data
> Mount type: ldiskfs
> Flags:      0x1
>             (MDT )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>
> Permanent disk data:
> [identical to the values above]
>
> exiting before disk write.
> [root at mds01 ~]#
>
> As you can see above, the MGS is on /dev/dm-0, combined with the MDT for the ddn-home file system.
> If I try the command line from your example I get this:
> [root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
> error: conf_param: Invalid argument
>
> Server mds01 is 100% the MGS node. What is wrong here, then? The only two reasons for the problem I can think of are that the file system name contains a "-" character (although I didn't find anything in the documentation saying that character is not allowed), or that the MGS is combined with the MDS.
>
> syslog contains the following messages:
>
> Nov 7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. cfg_device from lctl is 'ddn-home'
> Nov 7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov 7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. cfg_device from lctl is 'ddn-data'
> Nov 7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> [further identical entries for 'ddn-data' and 'ddn-home' omitted]
>
> From the above it looks like only the first part of the file system name, "ddn", is recognized; the "-home" or "-data" suffix is dropped.
>
> Please advise.
>
> Wojciech Turek

You seem to have found a bug. I just tried this myself and it doesn't work with a "-" in the name. Maybe use a '.' instead until we fix it.
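The log pattern is consistent with the target name being cut at the first "-". This split is our inference from the "No filesystem targets for ddn" messages, not a reading of the actual mgs_llog.c code:

```shell
# If the MGS parser truncated the name at the first '-', every name in
# this thread would collapse to the same prefix, which is exactly what
# the syslog errors suggest. (Our inference, not the real MGS logic.)
for fsname in ddn-home ddn-data home-md; do
    echo "$fsname -> ${fsname%%-*}"
done
```

That would explain why both ddn-home and ddn-data produce identical "targets for ddn" errors.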
Nathan Rutman
2007-Nov-08 19:04 UTC
[Lustre-discuss] How To change server recovery timeout
Nathan Rutman wrote:
> Wojciech Turek wrote:
>> [...]
>>
>> From the above it looks like only the first part of the file system name, "ddn", is recognized; the "-home" or "-data" suffix is dropped.
>>
>> Please advise.
>>
>> Wojciech Turek
>
> You seem to have found a bug. I just tried this myself and it doesn't work with a "-" in the name. Maybe use a '.' instead until we fix it.

Argh, sorry, that doesn't work with conf_param either. But an underscore '_' does. I'm filing a bug report...
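Until the fix lands, a pre-format sanity check on candidate names may help. This is a sketch based only on what the thread demonstrates (underscore works; hyphen and dot do not); the accepted character set is our conservative guess, not a documented Lustre naming rule:

```shell
# Reject file system names that conf_param in 1.6.3 is known (from this
# thread) to mishandle: anything outside letters, digits, underscore.
# The character set below is our own conservative guess.
check_fsname() {
    case "$1" in
        *[!A-Za-z0-9_]*) echo "unsafe: $1"; return 1 ;;
        "")              echo "unsafe: empty name"; return 1 ;;
        *)               echo "ok: $1"; return 0 ;;
    esac
}

check_fsname ddn_home            # underscore is accepted
check_fsname ddn-home || true    # '-' triggers "Invalid argument"
```

Running this before mkfs.lustre is cheaper than renaming a production file system afterwards.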
Wojciech Turek
2007-Nov-09 02:38 UTC
[Lustre-discuss] How To change server recovery timeout
Hi,

It is a lesson to me not to change old habits. I always used "_", and for the latest file system I made an exception, under the impression that it looks neater with "-", and here we go.

Can I change the file system name without reformatting everything? The file system with the bad name is in production, and it is essential for me to fix it without a long service downtime.

Thanks

Wojciech Turek

On 8 Nov 2007, at 19:04, Nathan Rutman wrote:
> [...]
>> You seem to have found a bug. I just tried this myself and it doesn't work with a "-" in the name. Maybe use a '.' instead until we fix it.
>
> Argh, sorry, that doesn't work with conf_param either. But an underscore '_' does. I'm filing a bug report...

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
Nathan Rutman
2007-Nov-09 23:28 UTC
[Lustre-discuss] How To change server recovery timeout
Wojciech Turek wrote:> Hi, > > It is a lesson for me to do not change old habits. I always used "_" > and for latest filesystem I did exception for the impression that it > looks neater with "-" and here we go. > Can I change file system name without reformatting everything? File > system with bad name is in production and it is essential for me to > fix it without long service downtime.Yes, but you will have to shut everything down. tunefs --writeconf all the servers, restart the MGS first. While you''re at it, you can set the timeout. (This can be overridden later with conf_param). tunefs.lustre --writeconf --param="sys.timeout=50" /dev/sda> > Thanks > > Wojciech Turek > > On 8 Nov 2007, at 19:04, Nathan Rutman wrote: > >> Nathan Rutman wrote: >>> Wojciech Turek wrote: >>> >>> >>> >>>> On 7 Nov 2007, at 22:31, Nathan Rutman wrote: >>>> >>>> >>>> >>>>> Cliff White wrote: >>>>> >>>>> >>>>> >>>>>> Wojciech Turek wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Hi Cliff, >>>>>>> >>>>>>> On 7 Nov 2007, at 17:58, Cliff White wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Wojciech Turek wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> Our lustre environment is: >>>>>>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp >>>>>>>>> I would like to change recovery timeout from default value >>>>>>>>> 250s to something longer >>>>>>>>> I tried example from manual: >>>>>>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server >>>>>>>>> to wait before failing recovery. >>>>>>>>> We performed that experiment on our test lustre installation >>>>>>>>> with one OST. >>>>>>>>> storage02 is our OSS >>>>>>>>> [root at storage02 ~]# lctl dl >>>>>>>>> 0 UP mgc MGC10.143.245.3 at tcp >>>>>>>>> 31259d9b-e655-cdc4-c760-45d3df426d86 5 >>>>>>>>> 1 UP ost OSS OSS_uuid 3 >>>>>>>>> 2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7 >>>>>>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600 >>>>>>>>> set_timeout has been deprecated. 
Use conf_param instead. >>>>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> sorry about this bad help message. It''s wrong. >>>>> >>>>> >>>>> >>>>>>>>> usage: conf_param obd_timeout=<secs> >>>>>>>>> run <command> after connecting to device <devno> >>>>>>>>> --device <devno> <command [args ...]> >>>>>>>>> [root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600 >>>>>>>>> No device found for name MGS: Invalid argument >>>>>>>>> error: conf_param: No such device >>>>>>>>> It looks like I need to run this command from MGS node so I >>>>>>>>> moved then to MGS server called storage03 >>>>>>>>> [root at storage03 ~]# lctl dl >>>>>>>>> 0 UP mgs MGS MGS 9 >>>>>>>>> 1 UP mgc MGC10.143.245.3 at tcp >>>>>>>>> f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5 >>>>>>>>> 2 UP mdt MDS MDS_uuid 3 >>>>>>>>> 3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4 >>>>>>>>> 4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5 >>>>>>>>> 5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5 >>>>>>>>> [root at storage03 ~]# lctl device 5 >>>>>>>>> [root at storage03 ~]# lctl conf_param obd_timeout=600 >>>>>>>>> error: conf_param: Function not implemented >>>>>>>>> [root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600 >>>>>>>>> error: conf_param: Function not implemented >>>>>>>>> [root at storage03 ~]# lctl help conf_param >>>>>>>>> conf_param: set a permanent config param. This command must be >>>>>>>>> run on the MGS node >>>>>>>>> usage: conf_param <target.keyword=val> ... >>>>>>>>> [root at storage03 ~]# lctl conf_param >>>>>>>>> home-md-MDT0000.obd_timeout=600 >>>>>>>>> error: conf_param: Invalid argument >>>>>>>>> [root at storage03 ~]# >>>>>>>>> I searched whole /proc/*/lustre for file that can store this >>>>>>>>> timeout value but nothing were found. >>>>>>>>> Could someone advise how to change value for recovery timeout? 
>>>>>>>>> Cheers,
>>>>>>>>> Wojciech Turek
>>>>>>>>
>>>>>>>> It looks like your file system is named 'home' - you can
>>>>>>>> confirm with:
>>>>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>>>>
>>>>>>>> The correct command (run on the MGS) would be:
>>>>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb | grep "Lustre FS"
>>>>>>>> Lustre FS: lustre
>>>>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>>>> 130
>>>>>>>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>>>> 150
>>>>>>>
>>>>>>> Thanks for your email. I am afraid your tips aren't very helpful
>>>>>>> in this case. As stated in the subject, I am asking about the
>>>>>>> recovery timeout. You can find it, for example, in
>>>>>>> /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of
>>>>>>> your OSTs is in the recovery state. By default this timeout is
>>>>>>> 250s. You are talking about the system obd timeout (CFS
>>>>>>> documentation, chapter 4.1.2), which is not the subject of my
>>>>>>> concern.
>>>>>>>
>>>>>>> Anyway, I tried your example just to see if it works, and again
>>>>>>> I am afraid it doesn't work for me; see below. I have a combined
>>>>>>> MGS and MDS configuration.
>>>>>>> [root at storage03 ~]# df
>>>>>>> Filesystem      1K-blocks        Used  Available Use% Mounted on
>>>>>>> /dev/sda1        10317828     3452824    6340888  36% /
>>>>>>> /dev/sda6         7605856       49788    7169708   1% /local
>>>>>>> /dev/sda3         4127108       41000    3876460   2% /tmp
>>>>>>> /dev/sda2         4127108      753668    3163792  20% /var
>>>>>>> /dev/dm-2      1845747840   447502120 1398245720  25% /mnt/sdb
>>>>>>> /dev/dm-1      6140723200  4632947344 1507775856  76% /mnt/sdc
>>>>>>> /dev/dm-3       286696376     1461588  268850900   1% /mnt/home-md/mdt
>>>>>>> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 | grep "Lustre FS"
>>>>>>> Lustre FS: home-md
>>>>>>> Lustre FS: home-md
>>>>>>> [root at storage03 ~]# cat /proc/sys/lustre/timeout
>>>>>>> 100
>>>>>>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>>>>> error: conf_param: Invalid argument
>>>>>>> [root at storage03 ~]#
>>>>>
>>>>> You need to do this on the MGS node, with the MGS running.
>>>>>
>>>>> mgs> lctl conf_param testfs.sys.timeout=150
>>>>> anynode> cat /proc/sys/lustre/timeout
>>>>
>>>> This isn't working for me. In my production configuration I have
>>>> the MGS combined with the MDT on the same server. My lustre
>>>> configuration consists of two file systems.
>>>> [root at mds01 ~]# tunefs.lustre --print /dev/dm-0
>>>> checking for existing Lustre data: found CONFIGS/mountdata
>>>> Reading CONFIGS/mountdata
>>>>
>>>> Read previous values:
>>>> Target:     ddn-home-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-home
>>>> Mount type: ldiskfs
>>>> Flags:      0x5
>>>>             (MDT MGS )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>>>>
>>>> Permanent disk data:
>>>> Target:     ddn-home-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-home
>>>> Mount type: ldiskfs
>>>> Flags:      0x5
>>>>             (MDT MGS )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>>>>
>>>> exiting before disk write.
>>>> [root at mds01 ~]# tunefs.lustre --print /dev/dm-1
>>>> checking for existing Lustre data: found CONFIGS/mountdata
>>>> Reading CONFIGS/mountdata
>>>>
>>>> Read previous values:
>>>> Target:     ddn-data-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-data
>>>> Mount type: ldiskfs
>>>> Flags:      0x1
>>>>             (MDT )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>>>>
>>>> Permanent disk data:
>>>> Target:     ddn-data-MDT0000
>>>> Index:      0
>>>> Lustre FS:  ddn-data
>>>> Mount type: ldiskfs
>>>> Flags:      0x1
>>>>             (MDT )
>>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>>> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>>>>
>>>> exiting before disk write.
>>>> [root at mds01 ~]#
>>>>
>>>> As you can see above, the MGS is on /dev/dm-0, combined with the
>>>> MDT for the ddn-home file system.
>>>> If I try the command line from your example, I get this:
>>>> [root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
>>>> error: conf_param: Invalid argument
>>>>
>>>> Server mds01 is 100% the MGS node. What is wrong here then?
>>>> The only two reasons for this problem that I can think of are that
>>>> the file system name contains a "-" character (though I didn't find
>>>> anything in the documentation saying this character is not
>>>> allowed), or that the MGS is combined with the MDS.
>>>>
>>>> syslog contains the following messages:
>>>>
>>>> Nov  7 18:38:35 mds01 kernel: LustreError:
>>>> 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-home'
>>>> Nov  7 18:38:35 mds01 kernel: LustreError:
>>>> 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:39:46 mds01 kernel: LustreError:
>>>> 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:39:46 mds01 kernel: LustreError:
>>>> 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:39:54 mds01 kernel: LustreError:
>>>> 3275:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:39:54 mds01 kernel: LustreError:
>>>> 3275:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:40:01 mds01 kernel: LustreError:
>>>> 3282:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:40:01 mds01 kernel: LustreError:
>>>> 3282:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:41:06 mds01 kernel: LustreError:
>>>> 3305:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for
>>>> ddn. cfg_device from lctl is 'ddn-data'
>>>> Nov  7 18:41:06 mds01 kernel: LustreError:
>>>> 3305:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>> Nov  7 18:41:15 mds01 kernel: LustreError:
>>>> 3306:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>>>> cfg_device from lctl is 'ddn-home'
>>>> Nov  7 18:41:15 mds01 kernel: LustreError:
>>>> 3306:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>>
>>>> From the above, it looks like only the first part of the file
>>>> system name, "ddn", is recognized, and the "-home" or "-data" part
>>>> is dropped.
>>>>
>>>> Please advise.
>>>>
>>>> Wojciech Turek
>>>
>>> You seem to have found a bug. I just tried this myself, and it
>>> doesn't work with a "-" in the name. Maybe use a '.' instead until
>>> we fix it.
>>
>> Argh, sorry - that doesn't work with conf_param either. But an
>> underscore '_' does. I'm filing a bug report...
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: wjt27 at cam.ac.uk
> tel. +441223763517
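[Editor's note] The syslog lines above suggest the MGS truncates the
conf_param target at the first "-", which would explain why "ddn-home"
is reduced to "ddn". The sketch below is a guess at that behaviour for
illustration only (it is not the actual Lustre source); the helper names
fsname_from_target and is_safe_fsname are hypothetical:

```python
def fsname_from_target(target: str) -> str:
    """Mimic a parser that treats everything after the first '-' as a
    target suffix (e.g. '-MDT0000'), as the syslog output suggests."""
    return target.split('-', 1)[0]


def is_safe_fsname(name: str) -> bool:
    """Conservative check based on this thread's conclusion: letters,
    digits and underscores survive conf_param; '-' and '.' do not."""
    return all(c.isalnum() or c == '_' for c in name)


print(fsname_from_target("ddn-home"))  # prints "ddn", matching the syslog
print(is_safe_fsname("ddn-home"))      # False
print(is_safe_fsname("ddn_home"))      # True
```

Under this reading, "home-md" would likewise be parsed as filesystem
"home" with an unexpected suffix, which matches the earlier
"Invalid argument" failures in the thread.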
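[Editor's note] Nathan's writeconf procedure, collected in one place. This
is a sketch only: the combined MGS/MDT on /dev/dm-0 and the mdt mount
point come from the thread, while the OST device (/dev/sda) and its mount
point are illustrative assumptions. All servers and clients must be
unmounted first, and these commands only make sense on a real Lustre
cluster:

```shell
# 1. Regenerate the configuration logs on every server; the timeout can
#    be set at the same time via --param:
tunefs.lustre --writeconf --param="sys.timeout=50" /dev/dm-0   # MGS/MDT
tunefs.lustre --writeconf /dev/sda                             # each OST

# 2. Restart the MGS first, then the remaining targets:
mount -t lustre /dev/dm-0 /mnt/home-md/mdt
mount -t lustre /dev/sda  /mnt/home-md/ost0

# The timeout can later be changed without a writeconf, on the MGS node
# (note the thread's conclusion that <fsname> must not contain '-'):
# lctl conf_param <fsname>.sys.timeout=150
```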