Roger Sersted
2010-Jul-12 15:51 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
This is a small development system: a combined MDS/MGS on a single node with a SCSI interface to a disk array, and two OSSes, each with a single 1.4TB OST on a SATA array. In all cases the entire device (/dev/sdc) is used, with no partitioning.

I upgraded my Lustre MDS and OSS servers from 1.6.6 to 1.8.3. I did this via a complete OS install and then performed a writeconf on each of the nodes. Unfortunately, each of the OSSes thinks its Lustre "Target" is "lustre1-OST0000". I've mounted the partitions via ldiskfs and the underlying data is still there. I know which OSS is supposed to be "lustre1-OST0001", but I can't find any docs that explain how to set that.

Thanks,

Roger S.
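For readers following along, the identity a target will report can be inspected without writing anything to disk; a minimal check, assuming /dev/sdc is the OST device described above:

tunefs.lustre --dryrun --print /dev/sdc    # print the Target, Index and parameters stored in CONFIGS/mountdata
e2label /dev/sdc                           # the filesystem label normally matches the target name, e.g. lustre1-OST0001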
Wojciech Turek
2010-Jul-12 17:45 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi,

Could you please post system logs that were generated during first mount after the upgrade? Did you run writeconf on MDT and all OSTs?

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
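A rough sketch of the usual writeconf sequence for regenerating the configuration logs, with all clients and server targets unmounted first; the device paths and mount points here are illustrative, not taken from this system:

tunefs.lustre --writeconf /dev/sdX      # run on the MDT and on every OST while Lustre is stopped
mount -t lustre /dev/sdX /mnt/mdt       # then remount the MGS/MDT first...
mount -t lustre /dev/sdX /mnt/ost       # ...followed by each OST, and finally the clients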
Roger Sersted
2010-Jul-12 18:38 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Thanks for the quick response. The logs on the problem server indicate the ldiskfs RPM was not installed for the first mount attempt. Lustre rejected the attempt here:

Jun 26 17:43:58 puppy7 kernel: LustreError: 3358:0:(obd_mount.c:1290:server_kernel_mount()) premount /dev/sdc:0x0 ldiskfs failed: -19, ldiskfs2 failed: -19. Is the ldiskfs module available?
Jun 26 17:43:58 puppy7 kernel: LustreError: 3358:0:(obd_mount.c:1616:server_fill_super()) Unable to mount device /dev/sdc: -19
Jun 26 17:43:58 puppy7 kernel: LustreError: 3358:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-19)
Jun 26 17:44:10 puppy7 ntpd[3082]: synchronized to 172.16.2.254, stratum 3
Jun 26 17:44:19 puppy7 kernel: LustreError: 3368:0:(obd_mount.c:1290:server_kernel_mount()) premount /dev/sdc:0x0 ldiskfs failed: -19, ldiskfs2 failed: -19. Is the ldiskfs module available?
Jun 26 17:44:19 puppy7 kernel: LustreError: 3368:0:(obd_mount.c:1616:server_fill_super()) Unable to mount device /dev/sdc: -19
Jun 26 17:44:19 puppy7 kernel: LustreError: 3368:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-19)
Jun 26 17:53:39 puppy7 kernel: LustreError: 3430:0:(obd_mount.c:1290:server_kernel_mount()) premount /dev/sdc:0x0 ldiskfs failed: -19, ldiskfs2 failed: -19. Is the ldiskfs module available?
Jun 26 17:53:39 puppy7 kernel: LustreError: 3430:0:(obd_mount.c:1616:server_fill_super()) Unable to mount device /dev/sdc: -19
Jun 26 17:53:39 puppy7 kernel: LustreError: 3430:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-19)

I then installed the ldiskfs RPM on all the Lustre nodes (and fixed my kickstart config), modprobe'd lustre and attempted again:

Jun 26 17:54:30 puppy7 kernel: init dynlocks cache
Jun 26 17:54:30 puppy7 kernel: ldiskfs created from ext4-2.6-rhel5
Jun 26 17:54:30 puppy7 kernel: LDISKFS-fs: barriers enabled
Jun 26 17:54:33 puppy7 kernel: kjournald2 starting: pid 3457, dev sdc:8, commit interval 5 seconds
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jun 26 17:54:33 puppy7 kernel: LDISKFS FS on sdc, internal journal on sdc:8
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: delayed allocation enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: file extents enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: recovery complete.
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mounted filesystem sdc with ordered data mode
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: barriers enabled
Jun 26 17:54:33 puppy7 kernel: kjournald2 starting: pid 3460, dev sdc:8, commit interval 5 seconds
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jun 26 17:54:33 puppy7 kernel: LDISKFS FS on sdc, internal journal on sdc:8
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: delayed allocation enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: file extents enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mounted filesystem sdc with ordered data mode
Jun 26 17:54:38 puppy7 kernel: Lustre: 2725:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1339651978690561 sent from MGC172.17.2.5@o2ib to NID 172.17.2.5@o2ib 5s ago has timed out (5s prior to deadline).
Jun 26 17:54:38 puppy7 kernel: req@ffff810067706400 x1339651978690561/t0 o250->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1277592878 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 26 17:54:38 puppy7 kernel: LustreError: 3445:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81013d553c00 x1339651978690563/t0 o101->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jun 26 17:54:38 puppy7 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jun 26 17:54:38 puppy7 kernel: Lustre: lustre1-OST0001: Now serving lustre1-OST0001 on /dev/sdc with recovery enabled
Jun 26 17:55:03 puppy7 kernel: Lustre: 2725:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1339651978690564 sent from MGC172.17.2.5@o2ib to NID 172.17.2.5@o2ib 5s ago has timed out (5s prior to deadline).

----------- a few timeout messages later ....

Jun 26 17:55:43 puppy7 kernel: LustreError: 3649:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810060569800 x1339651978690572/t0 o101->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jun 26 17:55:52 puppy7 kernel: Lustre: MGC172.17.2.5@o2ib: Reactivating import
Jun 26 17:55:52 puppy7 kernel: Lustre: lustre1-OST0001: received MDS connection from 172.17.2.5@o2ib
Jun 26 17:59:11 puppy7 ntpd[3082]: kernel time sync enabled 0001
Jun 26 18:03:51 puppy7 kernel: Lustre: 2724:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1339651978690598 sent from MGC172.17.2.5@o2ib to NID 172.17.2.5@o2ib 17s ago has timed out (17s prior to deadline).
Jun 26 18:03:51 puppy7 kernel: req@ffff810060bcd800 x1339651978690598/t0 o400->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 192/384 e 0 to 1 dl 1277593431 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 26 18:03:51 puppy7 kernel: Lustre: 2724:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 26 18:03:51 puppy7 kernel: LustreError: 166-1: MGC172.17.2.5@o2ib: Connection to service MGS via nid 172.17.2.5@o2ib was lost; in progress operations using this service will fail.
--------------------------------------------

According to the above, it looked like everything worked. But after waiting a while, I still couldn't mount Lustre on a client. I found a similar problem on the list; in that case, the fix was to mount the device as type ldiskfs and remove CONFIGS/<targetname>. I hope that didn't permanently corrupt Lustre?

Thanks,

Roger S.
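As an aside, the -19 (ENODEV) premount failures above are what a missing ldiskfs module looks like; a quick sanity check before mounting, assuming the 1.8.3 server packages discussed in this thread:

rpm -q lustre-ldiskfs lustre-modules       # confirm the ldiskfs and Lustre kernel-module packages are installed
modprobe ldiskfs && lsmod | grep ldiskfs   # confirm the module actually loads on the running kernel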
Roger Sersted
2010-Jul-14 19:46 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Any additional info?

Thanks,

Roger S.
Wojciech Turek
2010-Jul-14 22:57 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi Roger,

Sorry for the delay. From the ldiskfs messages it seems to me that you are using the ext4-based ldiskfs (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from ext4-2.6-rhel5). If you are upgrading from 1.6.6, your ldiskfs is ext3-based, so I think that in lustre-1.8.3 you should use the ext3-based ldiskfs RPM.

Can you also tell us a bit more about your setup? From what you wrote so far, I understand you have 2 OSS servers and each server has one OST device. In addition to that you have a third server which acts as a MGS/MDS, is that right?

The logs you provided seem to be only from one server, called puppy7, so they do not give the whole picture of the situation. The timeout messages may indicate a problem with communication between the servers, but it is really difficult to say without seeing the whole picture or at least more elements of it.

To check if you have the correct RPMs installed, can you please run 'rpm -qa | grep lustre' on both OSS servers and the MDS?

Also, please provide the output of 'lctl list_nids' run on both OSS servers, the MDS, and a client.

In addition to the above, please run the following command on all Lustre targets (OSTs and MDT) to display your current Lustre configuration:

tunefs.lustre --dryrun --print /dev/<ost_device>

If possible, please attach the syslog from each machine from the time you mounted the Lustre targets (OSTs and MDT).

Best regards,

Wojciech
Roger Sersted
2010-Jul-15 14:55 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
OK, this looks bad. It appears that I should have upgraded ext3 to ext4; I found instructions for that:

tune2fs -O extents,uninit_bg,dir_index /dev/XXX
fsck -pf /dev/XXX

Is the above correct? I'd like to move our systems to ext4. I didn't know those steps were necessary.

Other answers listed below.

Wojciech Turek wrote:
> Also please provide output from command 'lctl list_nids' run on both
> OSS servers, MDS and a client?

puppy5 (MDS/MGS)
172.17.2.5@o2ib
172.16.2.5@tcp

puppy6 (OSS)
172.17.2.6@o2ib
172.16.2.6@tcp

puppy7 (OSS)
172.17.2.7@o2ib
172.16.2.7@tcp

> In addition to above please run following command on all lustre targets
> (OSTs and MDT) to display your current lustre configuration
>
> tunefs.lustre --dryrun --print /dev/<ost_device>

puppy5 (MDS/MGS)

Read previous values:
Target:     lustre1-MDT0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x405
            (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: lov.stripesize=125K lov.stripecount=2 mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE mdt.group_upcall=NONE

Permanent disk data:
Target:     lustre1-MDT0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x405
            (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: lov.stripesize=125K lov.stripecount=2 mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE mdt.group_upcall=NONE

exiting before disk write.
----------------------------------------------------
puppy6

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Permanent disk data:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib
--------------------------------------------------
puppy7 (this is the broken OSS. The "Target" should be "lustre1-OST0001")

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Permanent disk data:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

exiting before disk write.
Wojciech Turek
2010-Jul-15 15:01 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Can you also please post the output of 'rpm -qa | grep lustre' run on puppy5-7?
Roger Sersted
2010-Jul-15 15:14 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Wojciech Turek wrote:
> can you also please post output of 'rpm -qa | grep lustre' run on puppy5-7 ?

[root@puppy5 log]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

[root@puppy6 log]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

[root@puppy7 CONFIGS]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

Thanks,

Roger S.
Wojciech Turek
2010-Jul-15 17:56 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi Roger,

Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the old-style ext3-based ldiskfs and one set for the ext4-based ldiskfs. When upgrading from 1.6.6 to 1.8.3, I think you should not try to use the ext4-based packages. Can you let us know which RPMs you have used?
Roger Sersted
2010-Jul-15 19:02 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
I am using the ext4 RPMs. I ran the following commands on the MDS and OSS nodes (Lustre was not running at the time):

tune2fs -O extents,uninit_bg,dir_index /dev/XXX
fsck -pf /dev/XXX

I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the OSSes and then the MDS. The problem still persisted. I then shut down Lustre by unmounting the Lustre filesystems on the MDS/OSS nodes.

My last and most desperate step was to "hack" the CONFIGS files. On puppy7, I did the following:

1. mount -t ldiskfs /dev/sdc /mnt
2. cd /mnt/CONFIGS
3. mv lustre1-OST0000 lustre1-OST0001
4. vim -nb lustre1-OST0001 mountdata
5. I changed OST0000 to OST0001.
6. I verified my changes by comparing an "od -c" of before and after.
7. umount /mnt
8. tunefs.lustre -writeconf /dev/sdc

The output of step 8 is:

tunefs.lustre -writeconf /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre1-OST0001
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Permanent disk data:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Writing CONFIGS/mountdata

Now part of the system seems to have the correct Target value.

Thanks for your time on this.

Roger S.
Roger Sersted
2010-Jul-15 22:50 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
I greatly appreciate the time and effort you've taken. It is difficult to diagnose and support sophisticated systems via an email exchange. Unfortunately, I am under time constraints to get this system usable. Therefore, I have started reformatting the MDS and OSSes.

Roger S.
Wojciech Turek
2010-Jul-15 22:54 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi Roger,

Where did you find this CONFIGS hack? Did you make a copy of the CONFIGS directory before following these steps?
Wojciech Turek
2010-Jul-15 22:58 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
No problem; sorry I couldn't help better. Unfortunately, the time I can spend on the list is currently very limited. Could you please point me to the place where you found this CONFIGS hack, as I would like to find out in what context those steps were recommended?

Best regards,

Wojciech
Roger Sersted
2010-Jul-16 13:43 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
I didn't find the hack anywhere. I looked at what those files contained and decided to "hack and slash". Apparently, those files are generated from data within the filesystem itself. A second run of writeconf displayed the target value as "lustre1-OST0000", which is what I didn't want. :-(

Roger S.
Andreas Dilger
2010-Jul-16 15:46 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
The use of ext3 or ext4 and the filesystem feature flags has nothing to do with the setting of the incorrect target.  I don't know how you got to that state, but there are a number of places where the OST index is stored that need to be verified and fixed.

There is the mountdata file, which you have already found.  There is the filesystem label, which you can view and change with the e2label command.  There is also the last_rcvd file, which has the OST UUID at the start and the OST index as one of its fields.  Normally I would just say to delete this file, since it can be recreated at mount time, but since the OST already has an identity crisis I'm not sure it would get it right.  You should fire up a binary editor to change the UUID in last_rcvd, and look at the lsd_index field in struct lustre_server_data (which is what is stored at the beginning of the last_rcvd file).

Cheers, Andreas

On 2010-07-16, at 7:43, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
> I didn't find the hack anywhere.  I looked at what those files contained
> and decided to "hack and slash".  Apparently, those files are generated
> from data within the filesystem itself.  A second run of writeconf showed
> the target value as "lustre1-OST0000", which is what I didn't want.  :-(
>
> Roger S.
>
> Wojciech Turek wrote:
>> Hi Roger,
>>
>> Where did you find this CONFIGS hack?
>> Did you make a copy of the CONFIGS dir before following these steps?
>>
>> On 15 July 2010 20:02, Roger Sersted <rs1 at aps.anl.gov> wrote:
>>
>>     I am using the ext4 RPMs.  I ran the following commands on the MDS
>>     and OSS nodes (Lustre was not running at the time):
>>
>>     tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>>     fsck -pf /dev/XXX
>>
>>     I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the
>>     OSSes and then the MDS.  The problem still persisted.  I then shut
>>     down Lustre by unmounting the Lustre filesystems on the MDS/OSS
>>     nodes.
>>
>>     My last and most desperate step was to "hack" the CONFIGS files.
>>     On puppy7, I did the following:
>>
>>     1. mount -t ldiskfs /dev/sdc /mnt
>>     2. cd /mnt/CONFIGS
>>     3. mv lustre1-OST0000 lustre1-OST0001
>>     4. vim -nb lustre1-OST0001 mountdata
>>     5. I changed OST0000 to OST0001.
>>     6. I verified my changes by comparing an "od -c" of before and after.
>>     7. umount /mnt
>>     8. tunefs.lustre --writeconf /dev/sdc
>>
>>     The output of step 8 is:
>>
>>     tunefs.lustre --writeconf /dev/sdc
>>     checking for existing Lustre data: found CONFIGS/mountdata
>>     Reading CONFIGS/mountdata
>>
>>     Read previous values:
>>     Target:     lustre1-OST0001
>>     Index:      0
>>     Lustre FS:  lustre1
>>     Mount type: ldiskfs
>>     Flags:      0x102
>>                 (OST writeconf )
>>     Persistent mount opts: errors=remount-ro,extents,mballoc
>>     Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>     Permanent disk data:
>>     Target:     lustre1-OST0000
>>     Index:      0
>>     Lustre FS:  lustre1
>>     Mount type: ldiskfs
>>     Flags:      0x102
>>                 (OST writeconf )
>>     Persistent mount opts: errors=remount-ro,extents,mballoc
>>     Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>     Writing CONFIGS/mountdata
>>
>>     Now part of the system seems to have the correct Target value.
>>
>>     Thanks for your time on this.
>>
>>     Roger S.
>>
>>     Wojciech Turek wrote:
>>
>>         Hi Roger,
>>
>>         Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the
>>         old-style ext3-based ldiskfs and one set for the ext4-based
>>         ldiskfs.  When upgrading from 1.6.6 to 1.8.3 I think you should
>>         not try to use the ext4-based packages.  Can you let us know
>>         which RPMs you have used?
>> --
>> Wojciech Turek
>> Assistant System Manager
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
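For anyone hitting the same problem, Andreas's checklist spelled out as commands -- a minimal sketch only, assuming the affected OST is /dev/sdc on puppy7, Lustre is stopped on that node, and /mnt is free for an ldiskfs mount. The byte offset of lsd_index inside last_rcvd is deliberately not shown; check struct lustre_server_data in the Lustre 1.8 source before touching that file with a binary editor:

# Filesystem label: view it, then set it to the intended target name.
e2label /dev/sdc
e2label /dev/sdc lustre1-OST0001

# Mount as ldiskfs and see which files still carry the old identity.
mount -t ldiskfs /dev/sdc /mnt
strings /mnt/CONFIGS/mountdata | grep OST    # target name recorded in mountdata
strings /mnt/last_rcvd | head -1             # OST UUID stored at the start of last_rcvd
od -A d -c /mnt/last_rcvd | head             # raw view, useful before/after a binary edit
umount /mnt

# Once every copy of the identity agrees, regenerate the config logs.
tunefs.lustre --writeconf /dev/sdc

Whether the label and mountdata alone are enough is not settled in the thread -- Roger's second writeconf still came back with lustre1-OST0000 -- so treat the last_rcvd edit Andreas describes as the remaining step, and keep copies of the files before changing anything.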