Roger Sersted
2010-Jul-12 15:51 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
This is a small development system: a combined MDS/MGS on a single node with a SCSI interface to a disk array, and two OSSes, each with a single 1.4TB OST on a SATA array. In all cases the entire device (/dev/sdc) is used, with no partitioning.

I upgraded my Lustre MDS and OSS servers from 1.6.6 to 1.8.3. I did this via a complete OS install and then performed a writeconf on each of the nodes. Unfortunately, each of the OSSes thinks its Lustre "Target" is "lustre1-OST0000". I've mounted the partitions via ldiskfs and the underlying data is still there. I know which OSS is supposed to be "lustre1-OST0001", but I can't find any docs that explain how to set that.

Thanks,

Roger S.
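For readers following along, the identity a target will report can be inspected without writing anything to disk; a minimal check, assuming /dev/sdc is the OST device described above:

tunefs.lustre --dryrun --print /dev/sdc    # print the Target, Index and parameters stored in CONFIGS/mountdata
e2label /dev/sdc                           # the filesystem label normally matches the target name, e.g. lustre1-OST0001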
Wojciech Turek
2010-Jul-12 17:45 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi,

Could you please post system logs that were generated during first mount after the upgrade? Did you run writeconf on MDT and all OSTs?

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
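A rough sketch of the usual writeconf sequence for regenerating the configuration logs, with all clients and server targets unmounted first; the device paths and mount points here are illustrative, not taken from this system:

tunefs.lustre --writeconf /dev/sdX      # run on the MDT and on every OST while Lustre is stopped
mount -t lustre /dev/sdX /mnt/mdt       # then remount the MGS/MDT first...
mount -t lustre /dev/sdX /mnt/ost       # ...followed by each OST, and finally the clients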
Roger Sersted
2010-Jul-12 18:38 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Thanks for the quick response. The logs on the problem server indicate the ldiskfs RPM was not installed for the first mount attempt. Lustre rejected the attempt here:

Jun 26 17:43:58 puppy7 kernel: LustreError: 3358:0:(obd_mount.c:1290:server_kernel_mount()) premount /dev/sdc:0x0 ldiskfs failed: -19, ldiskfs2 failed: -19. Is the ldiskfs module available?
Jun 26 17:43:58 puppy7 kernel: LustreError: 3358:0:(obd_mount.c:1616:server_fill_super()) Unable to mount device /dev/sdc: -19
Jun 26 17:43:58 puppy7 kernel: LustreError: 3358:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-19)
Jun 26 17:44:10 puppy7 ntpd[3082]: synchronized to 172.16.2.254, stratum 3
Jun 26 17:44:19 puppy7 kernel: LustreError: 3368:0:(obd_mount.c:1290:server_kernel_mount()) premount /dev/sdc:0x0 ldiskfs failed: -19, ldiskfs2 failed: -19. Is the ldiskfs module available?
Jun 26 17:44:19 puppy7 kernel: LustreError: 3368:0:(obd_mount.c:1616:server_fill_super()) Unable to mount device /dev/sdc: -19
Jun 26 17:44:19 puppy7 kernel: LustreError: 3368:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-19)
Jun 26 17:53:39 puppy7 kernel: LustreError: 3430:0:(obd_mount.c:1290:server_kernel_mount()) premount /dev/sdc:0x0 ldiskfs failed: -19, ldiskfs2 failed: -19. Is the ldiskfs module available?
Jun 26 17:53:39 puppy7 kernel: LustreError: 3430:0:(obd_mount.c:1616:server_fill_super()) Unable to mount device /dev/sdc: -19
Jun 26 17:53:39 puppy7 kernel: LustreError: 3430:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-19)

I then installed the ldiskfs RPM on all the Lustre nodes (and fixed my kickstart config), modprobe'd lustre and attempted again:

Jun 26 17:54:30 puppy7 kernel: init dynlocks cache
Jun 26 17:54:30 puppy7 kernel: ldiskfs created from ext4-2.6-rhel5
Jun 26 17:54:30 puppy7 kernel: LDISKFS-fs: barriers enabled
Jun 26 17:54:33 puppy7 kernel: kjournald2 starting: pid 3457, dev sdc:8, commit interval 5 seconds
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jun 26 17:54:33 puppy7 kernel: LDISKFS FS on sdc, internal journal on sdc:8
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: delayed allocation enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: file extents enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: recovery complete.
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mounted filesystem sdc with ordered data mode
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: barriers enabled
Jun 26 17:54:33 puppy7 kernel: kjournald2 starting: pid 3460, dev sdc:8, commit interval 5 seconds
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jun 26 17:54:33 puppy7 kernel: LDISKFS FS on sdc, internal journal on sdc:8
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: delayed allocation enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: file extents enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mballoc enabled
Jun 26 17:54:33 puppy7 kernel: LDISKFS-fs: mounted filesystem sdc with ordered data mode
Jun 26 17:54:38 puppy7 kernel: Lustre: 2725:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1339651978690561 sent from MGC172.17.2.5@o2ib to NID 172.17.2.5@o2ib 5s ago has timed out (5s prior to deadline).
Jun 26 17:54:38 puppy7 kernel: req@ffff810067706400 x1339651978690561/t0 o250->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1277592878 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 26 17:54:38 puppy7 kernel: LustreError: 3445:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81013d553c00 x1339651978690563/t0 o101->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jun 26 17:54:38 puppy7 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jun 26 17:54:38 puppy7 kernel: Lustre: lustre1-OST0001: Now serving lustre1-OST0001 on /dev/sdc with recovery enabled
Jun 26 17:55:03 puppy7 kernel: Lustre: 2725:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1339651978690564 sent from MGC172.17.2.5@o2ib to NID 172.17.2.5@o2ib 5s ago has timed out (5s prior to deadline).

----------- a few timeout messages later ....

Jun 26 17:55:43 puppy7 kernel: LustreError: 3649:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810060569800 x1339651978690572/t0 o101->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jun 26 17:55:52 puppy7 kernel: Lustre: MGC172.17.2.5@o2ib: Reactivating import
Jun 26 17:55:52 puppy7 kernel: Lustre: lustre1-OST0001: received MDS connection from 172.17.2.5@o2ib
Jun 26 17:59:11 puppy7 ntpd[3082]: kernel time sync enabled 0001
Jun 26 18:03:51 puppy7 kernel: Lustre: 2724:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1339651978690598 sent from MGC172.17.2.5@o2ib to NID 172.17.2.5@o2ib 17s ago has timed out (17s prior to deadline).
Jun 26 18:03:51 puppy7 kernel: req@ffff810060bcd800 x1339651978690598/t0 o400->MGS@MGC172.17.2.5@o2ib_0:26/25 lens 192/384 e 0 to 1 dl 1277593431 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 26 18:03:51 puppy7 kernel: Lustre: 2724:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 26 18:03:51 puppy7 kernel: LustreError: 166-1: MGC172.17.2.5@o2ib: Connection to service MGS via nid 172.17.2.5@o2ib was lost; in progress operations using this service will fail.
--------------------------------------------

According to the above, it looked like everything worked. But after waiting a while, I still couldn't mount Lustre on a client. I found a similar problem on the list; in that case, the fix was to mount the device as type ldiskfs and remove CONFIGS/<targetname>. I hope that didn't permanently corrupt Lustre?

Thanks,

Roger S.
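As an aside, the -19 (ENODEV) premount failures above are what a missing ldiskfs module looks like; a quick sanity check before mounting, assuming the 1.8.3 server packages discussed in this thread:

rpm -q lustre-ldiskfs lustre-modules       # confirm the ldiskfs and Lustre kernel-module packages are installed
modprobe ldiskfs && lsmod | grep ldiskfs   # confirm the module actually loads on the running kernel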
Roger Sersted
2010-Jul-14 19:46 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Any additional info?

Thanks,

Roger S.
Wojciech Turek
2010-Jul-14 22:57 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi Roger,

Sorry for the delay. From the ldiskfs messages it seems to me that you are using the ext4-based ldiskfs (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from ext4-2.6-rhel5). If you are upgrading from 1.6.6, your ldiskfs is ext3-based, so I think that in lustre-1.8.3 you should use the ext3-based ldiskfs RPM.

Can you also tell us a bit more about your setup? From what you wrote so far, I understand you have 2 OSS servers and each server has one OST device. In addition to that you have a third server which acts as a MGS/MDS, is that right?

The logs you provided seem to be only from one server, called puppy7, so they do not give the whole picture of the situation. The timeout messages may indicate a problem with communication between the servers, but it is really difficult to say without seeing the whole picture or at least more elements of it.

To check if you have the correct RPMs installed, can you please run 'rpm -qa | grep lustre' on both OSS servers and the MDS?

Also, please provide the output of 'lctl list_nids' run on both OSS servers, the MDS, and a client.

In addition to the above, please run the following command on all Lustre targets (OSTs and MDT) to display your current Lustre configuration:

tunefs.lustre --dryrun --print /dev/<ost_device>

If possible, please attach the syslog from each machine from the time you mounted the Lustre targets (OSTs and MDT).

Best regards,

Wojciech
Roger Sersted
2010-Jul-15 14:55 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
OK, this looks bad. It appears that I should have upgraded ext3 to ext4; I found instructions for that:

tune2fs -O extents,uninit_bg,dir_index /dev/XXX
fsck -pf /dev/XXX

Is the above correct? I'd like to move our systems to ext4. I didn't know those steps were necessary.

Other answers listed below.

Wojciech Turek wrote:
> Also please provide output from command 'lctl list_nids' run on both
> OSS servers, MDS and a client?

puppy5 (MDS/MGS)
172.17.2.5@o2ib
172.16.2.5@tcp

puppy6 (OSS)
172.17.2.6@o2ib
172.16.2.6@tcp

puppy7 (OSS)
172.17.2.7@o2ib
172.16.2.7@tcp

> In addition to above please run following command on all lustre targets
> (OSTs and MDT) to display your current lustre configuration
>
> tunefs.lustre --dryrun --print /dev/<ost_device>

puppy5 (MDS/MGS)

Read previous values:
Target:     lustre1-MDT0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x405
            (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: lov.stripesize=125K lov.stripecount=2 mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE mdt.group_upcall=NONE

Permanent disk data:
Target:     lustre1-MDT0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x405
            (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: lov.stripesize=125K lov.stripecount=2 mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE mdt.group_upcall=NONE

exiting before disk write.
----------------------------------------------------
puppy6

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Permanent disk data:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib
--------------------------------------------------
puppy7 (this is the broken OSS. The "Target" should be "lustre1-OST0001")

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Permanent disk data:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

exiting before disk write.
Wojciech Turek
2010-Jul-15 15:01 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Can you also please post the output of 'rpm -qa | grep lustre' run on puppy5-7?
Roger Sersted
2010-Jul-15 15:14 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Wojciech Turek wrote:
> can you also please post output of 'rpm -qa | grep lustre' run on puppy5-7 ?

[root@puppy5 log]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

[root@puppy6 log]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

[root@puppy7 CONFIGS]# rpm -qa | grep -i lustre
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3

Thanks,

Roger S.
Wojciech Turek
2010-Jul-15 17:56 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi Roger,

Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the old-style ext3-based ldiskfs and one set for the ext4-based ldiskfs. When upgrading from 1.6.6 to 1.8.3, I think you should not try to use the ext4-based packages. Can you let us know which RPMs you have used?
Roger Sersted
2010-Jul-15 19:02 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
I am using the ext4 RPMs. I ran the following commands on the MDS and OSS nodes (Lustre was not running at the time):

tune2fs -O extents,uninit_bg,dir_index /dev/XXX
fsck -pf /dev/XXX

I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the OSSes and then the MDS. The problem still persisted. I then shut down Lustre by unmounting the Lustre filesystems on the MDS/OSS nodes.

My last and most desperate step was to "hack" the CONFIGS files. On puppy7, I did the following:

1. mount -t ldiskfs /dev/sdc /mnt
2. cd /mnt/CONFIGS
3. mv lustre1-OST0000 lustre1-OST0001
4. vim -nb lustre1-OST0001 mountdata
5. I changed OST0000 to OST0001.
6. I verified my changes by comparing an "od -c" of before and after.
7. umount /mnt
8. tunefs.lustre -writeconf /dev/sdc

The output of step 8 is:

tunefs.lustre -writeconf /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre1-OST0001
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Permanent disk data:
Target:     lustre1-OST0000
Index:      0
Lustre FS:  lustre1
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.17.2.5@o2ib

Writing CONFIGS/mountdata

Now part of the system seems to have the correct Target value.

Thanks for your time on this.

Roger S.
Roger Sersted
2010-Jul-15 22:50 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
I greatly appreciate the time and effort you've taken. It is difficult to diagnose and support sophisticated systems via an email exchange. Unfortunately, I am under time constraints to get this system usable. Therefore, I have started reformatting the MDS and OSSes.

Roger S.
Wojciech Turek
2010-Jul-15 22:54 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
Hi Roger,

Where did you find this CONFIGS hack? Did you make a copy of the CONFIGS directory before following these steps?
Wojciech Turek
2010-Jul-15 22:58 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
No problem; sorry I couldn't help better. Unfortunately, the time I can spend on the list is currently very limited. Could you please point me to the place where you found this CONFIGS hack, as I would like to find out in what context those steps were recommended?

Best regards,

Wojciech
Roger Sersted
2010-Jul-16 13:43 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
I didn't find the hack anywhere. I looked at what those files contained and decided to "hack and slash". Apparently, those files are generated from data within the filesystem itself. A second run of writeconf displayed the target value as "lustre1-OST0000", which is what I didn't want. :-(

Roger S.
Andreas Dilger
2010-Jul-16 15:46 UTC
[Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value
The use of ext3 or ext4 and the filesystem feature flags has nothing to do with the setting of the incorrect target.  I don't know how you got to that state, but there are a number of places where the OST index is stored that need to be verified and fixed.

There is the mountdata file, which you have already found.  There is the filesystem label, which you can view and change with the e2label command.  There is also the last_rcvd file, which has the OST UUID at the start and the OST index as one of its fields.  Normally I would just say to delete this file, since it can be recreated at mount time, but since the OST already has an identity crisis I'm not sure it would get it right.  You should fire up a binary editor to change the UUID in last_rcvd, and look at the lsd_index field in struct lustre_server_data (which is what is stored at the beginning of the last_rcvd file).

Cheers, Andreas

On 2010-07-16, at 7:43, Roger Sersted <rs1 at aps.anl.gov> wrote:
>
> I didn't find the hack anywhere.  I looked at what those files contained
> and decided to "hack and slash".  Apparently, those files are generated
> from data within the filesystem itself.  A second run of writeconf showed
> the target value as "lustre1-OST0000", which is what I didn't want.  :-(
>
> Roger S.
>
> Wojciech Turek wrote:
>> Hi Roger,
>>
>> Where did you find this CONFIGS hack?
>> Did you make a copy of the CONFIGS dir before following these steps?
>>
>> On 15 July 2010 20:02, Roger Sersted <rs1 at aps.anl.gov> wrote:
>>
>>     I am using the ext4 RPMs.  I ran the following commands on the MDS
>>     and OSS nodes (Lustre was not running at the time):
>>
>>     tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>>     fsck -pf /dev/XXX
>>
>>     I then started Lustre ("mount -t lustre /dev/XXX /lustre") on the
>>     OSSes and then the MDS.  The problem still persisted.  I then shut
>>     down Lustre by unmounting the Lustre filesystems on the MDS/OSS
>>     nodes.
>>
>>     My last and most desperate step was to "hack" the CONFIGS files.
>>     On puppy7, I did the following:
>>
>>     1. mount -t ldiskfs /dev/sdc /mnt
>>     2. cd /mnt/CONFIGS
>>     3. mv lustre1-OST0000 lustre1-OST0001
>>     4. vim -nb lustre1-OST0001 mountdata
>>     5. I changed OST0000 to OST0001.
>>     6. I verified my changes by comparing an "od -c" of before and after.
>>     7. umount /mnt
>>     8. tunefs.lustre --writeconf /dev/sdc
>>
>>     The output of step 8 is:
>>
>>     tunefs.lustre --writeconf /dev/sdc
>>     checking for existing Lustre data: found CONFIGS/mountdata
>>     Reading CONFIGS/mountdata
>>
>>     Read previous values:
>>     Target:     lustre1-OST0001
>>     Index:      0
>>     Lustre FS:  lustre1
>>     Mount type: ldiskfs
>>     Flags:      0x102
>>                 (OST writeconf )
>>     Persistent mount opts: errors=remount-ro,extents,mballoc
>>     Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>     Permanent disk data:
>>     Target:     lustre1-OST0000
>>     Index:      0
>>     Lustre FS:  lustre1
>>     Mount type: ldiskfs
>>     Flags:      0x102
>>                 (OST writeconf )
>>     Persistent mount opts: errors=remount-ro,extents,mballoc
>>     Parameters: mgsnode=172.17.2.5 at o2ib
>>
>>     Writing CONFIGS/mountdata
>>
>>     Now part of the system seems to have the correct Target value.
>>
>>     Thanks for your time on this.
>>
>>     Roger S.
>>
>>     Wojciech Turek wrote:
>>
>>         Hi Roger,
>>
>>         Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the
>>         old-style ext3-based ldiskfs and one set for the ext4-based
>>         ldiskfs.  When upgrading from 1.6.6 to 1.8.3 I think you should
>>         not try to use the ext4-based packages.  Can you let us know
>>         which RPMs you have used?
>> --
>> Wojciech Turek
>> Assistant System Manager
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
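For anyone hitting the same problem, Andreas's checklist spelled out as commands -- a minimal sketch only, assuming the affected OST is /dev/sdc on puppy7, Lustre is stopped on that node, and /mnt is free for an ldiskfs mount. The byte offset of lsd_index inside last_rcvd is deliberately not shown; check struct lustre_server_data in the Lustre 1.8 source before touching that file with a binary editor:

# Filesystem label: view it, then set it to the intended target name.
e2label /dev/sdc
e2label /dev/sdc lustre1-OST0001

# Mount as ldiskfs and see which files still carry the old identity.
mount -t ldiskfs /dev/sdc /mnt
strings /mnt/CONFIGS/mountdata | grep OST    # target name recorded in mountdata
strings /mnt/last_rcvd | head -1             # OST UUID stored at the start of last_rcvd
od -A d -c /mnt/last_rcvd | head             # raw view, useful before/after a binary edit
umount /mnt

# Once every copy of the identity agrees, regenerate the config logs.
tunefs.lustre --writeconf /dev/sdc

Whether the label and mountdata alone are enough is not settled in the thread -- Roger's second writeconf still came back with lustre1-OST0000 -- so treat the last_rcvd edit Andreas describes as the remaining step, and keep copies of the files before changing anything.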