huangql
2009-Sep-10 08:17 UTC
[Lustre-discuss] Upgrade 32-bit kernel to 64-bit kernel for MDS
Dear list,

We have been seeing frequent crashes of our MDS node these days in our Lustre system, but it's not clear what is causing them. In the current filesystem, all 20 OSS nodes run a 64-bit kernel, while the MDS runs a 32-bit kernel, so we would like to switch the MDS to a 64-bit kernel as well.

However, we are worried that the metadata may become corrupted after we upgrade the MDS from a 32-bit to a 64-bit kernel. Can you give us some pointers?

Thank you for your help in advance.

Thanks,
Sarea

2009-09-10
huangql
Andreas Dilger
2009-Sep-10 13:26 UTC
[Lustre-discuss] Upgrade 32-bit kernel to 64-bit kernel for MDS
On Sep 10, 2009 16:17 +0800, huangql wrote:
> We have been seeing frequent crashes of our MDS node these days in our
> Lustre system, but it's not clear what is causing them. In the current
> filesystem, all 20 OSS nodes run a 64-bit kernel, while the MDS runs a
> 32-bit kernel, so we would like to switch the MDS to a 64-bit kernel as
> well.
>
> However, we are worried that the metadata may become corrupted after we
> upgrade the MDS from a 32-bit to a 64-bit kernel. Can you give us some
> pointers?

The on-disk format for 32-bit and 64-bit filesystems is the same. The only thing you need to ensure is that both the 64-bit kernel+modules and the 64-bit userspace Lustre RPMs are installed. Otherwise there can be problems with the user->kernel ioctl commands.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
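As a sanity check along those lines, the installed architectures can be compared with standard uname/rpm tooling after booting the new kernel (a minimal sketch; exact package names vary by Lustre version):

  uname -m
  # should report x86_64 once the 64-bit kernel is running
  rpm -qa --queryformat '%{NAME}-%{VERSION} %{ARCH}\n' | grep -i lustre
  # every lustre/lustre-modules package should likewise report x86_64,
  # not i386/i686, so that userspace matches the kernel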
Hi all,

I'm running into a rather strange problem where new files cannot be created (Lustre 1.6.6), but existing files can be modified and even deleted. The MDS reports:

LustreError: X:0:(mds_open.c:431:mds_create_objects()) error creating objects for inode Y:

for a large subset of inodes. The OSTs and MDT all show as 'healthy', and an e2fsck had recently been run on all of the Lustre storage components before the problem presented itself. Also, no partition is anywhere close to full (all have much more than 5% free).

Is this fairly common? Is the solution as trivial as a writeconf or tunefs call? I couldn't find any reference to this problem in the lustre-discuss archives, so I figured I'd ask whether anyone else has seen it.

Thank you,
Adam
Hello!

On Sep 10, 2009, at 10:33 AM, Adam wrote:
> I'm running into a rather strange problem where new files cannot be
> created (Lustre 1.6.6), but existing files can be modified and even
> deleted. The MDS reports:
> LustreError: X:0:(mds_open.c:431:mds_create_objects()) error creating
> objects for inode Y:
> for a large subset of inodes.

What is the rc value reported in that message? Are there any messages on the OSTs at the same time?

Bye,
Oleg
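For anyone digging for that value: the full LustreError line in the MDS syslog carries the return code after the inode number, so a grep along these lines should surface it (a sketch assuming messages land in the default /var/log/messages):

  grep 'mds_create_objects' /var/log/messages | tail -5
  # the trailing "rc = -N" is a negative errno: -2 is ENOENT, -5 is EIO, ...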
-5. Sorry, I took a far better look than simply checking the health status. (Part of) the problem is that the MDS doesn't see the OSTs, e.g.:

[root@MDS ~]# lctl dl
  0 UP mgc MGC10.29.48.1@o2ib eeb1eccc-727e-f3f8-824a-9862a42e3b08 5
  1 UP mdt MDS MDS_uuid 3
  2 UP lov lfs-mdtlov lfs-mdtlov_UUID 4
  3 UP mds lfs-MDT0000 lfs-MDT0000_UUID 671
[root@MDS ~]#

The OSSs seem okay:

[root@OSS5 ~]# lctl dl
  0 UP mgc MGC10.29.48.1@o2ib af5bdde6-7fe2-a944-05cd-28459ef91385 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter lfs-OST0002 lfs-OST0002_UUID 671
  3 UP obdfilter lfs-OST000a lfs-OST000a_UUID 671
  4 UP obdfilter lfs-OST0012 lfs-OST0012_UUID 671
  5 UP obdfilter lfs-OST001a lfs-OST001a_UUID 671
[root@OSS5 ~]#

writeconf?

Thanks,
Adam

On Thu, 2009-09-10 at 10:45 -0400, Oleg Drokin wrote:
> What is the rc value reported in that message?
> Are there any messages on OSTs at the same time?
On Sep 10, 2009 11:00 -0400, Adam wrote:
> -5. Sorry, I took a far better look than simply checking the health
> status. (Part of) the problem is that the MDS doesn't see the OSTs, e.g.:
>
> [root@MDS ~]# lctl dl
>   0 UP mgc MGC10.29.48.1@o2ib eeb1eccc-727e-f3f8-824a-9862a42e3b08 5
>   1 UP mdt MDS MDS_uuid 3
>   2 UP lov lfs-mdtlov lfs-mdtlov_UUID 4
>   3 UP mds lfs-MDT0000 lfs-MDT0000_UUID 671
> [root@MDS ~]#

You need to look at the MDS startup logs and/or just shut down the MDS and start it up again, to see why it isn't connecting to the OSS nodes. If it can't connect to any of them, I would suspect a network problem. Try "ping", "lctl ping", and "telnet OSS5 988" to check whether the connection is working.

> writeconf?

Well, that is a last resort; it probably isn't needed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
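Spelled out, those three checks might look as follows (a sketch: OSS5 and port 988, the standard Lustre service port, come from the advice above; the OSS's o2ib NID is a placeholder to fill in from its own "lctl list_nids" output):

  ping OSS5                    # plain IP reachability
  lctl ping <OSS5-NID>@o2ib    # LNET-level reachability over InfiniBand
  telnet OSS5 988              # is the Lustre service port accepting connections?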
Unmounting worked, but remounting resulted in an LBUG (edited):

X=25043:0
X:(tracefile.c:450:libcfs_assertion_failed()) LBUG
X:(handler.c:2049:mds_setup()) ASSERTION(! lvfs_check_rdonly(lvfs_sbdev(mnt->mnt_sb))) failed
X:(tracefile.c:450:libcfs_assertion_failed()) LBUG

After rebooting, an attempted mount of the MDS results in (heavily edited):

lfs-MDT0000: denying duplicate export for 91134603-5957-7699-8c87-6305e1e508d5
(class_hash.c:190:lustre_hash_additem_unique()) Already found the key in hash [UUID_HASH]
(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x7360029:0x992d6208: rc -2
(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0x7360029:992d6208: rc -2
(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0x7360029
(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Failing over lfs-MDT0000
setting obd lfs-MDT0000 device 'unknown-block(253,1)' read-only
*** Turning device dm-1 (0xfd00001) read-only

and the new mount eventually fails:

[root@MDS ~]# mount -t lustre /dev/mapper/mpath1 /mnt/mds
mount.lustre: mount /dev/mapper/mpath1 at /mnt/mds failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

e2fsck shows the MDT as clean -- any idea as to what's tripping it?

Thanks,
Adam

On Thu, 2009-09-10 at 21:41 +0200, Andreas Dilger wrote:
> You need to look at the MDS startup logs and/or just shut down the MDS
> and start it up again, to see why it isn't connecting to the OSS nodes.
> If it can't connect to any of them, I would suspect a network problem.
Okay, solved! I can't take credit for this, but it should work for anyone in the same 1.6.6 situation. Documented here to make it easier for others to find this solution.

Problem: corrupt CATALOGS file.

Solution (I also unmounted all the OSTs and the MGT -- left the clients mounted):

From a location where you can mount the MDT:

mkdir /tmp/lfs_mdt
mount -t ldiskfs /dev/mdtpath /tmp/lfs_mdt
mv /tmp/lfs_mdt/CATALOGS /tmp/lfs_mdt/CATALOGS.old
touch /tmp/lfs_mdt/CATALOGS
umount /tmp/lfs_mdt

and then, Boo-Yeah!:

[root@MDS /]# pdsh -a 'df | grep path | wc -l; cat /proc/fs/lustre/heal*; lctl dl | wc -l' | sort
OSS10: 4
OSS10: 6
OSS10: healthy
MGS: 1
MGS: 2
MGS: healthy
MDS: 1
MDS: 36
MDS: healthy
OSS3: 4
OSS3: 6
OSS3: healthy
OSS4: 4
OSS4: 6
OSS4: healthy
OSS5: 4
OSS5: 6
OSS5: healthy
OSS6: 4
OSS6: 6
OSS6: healthy
OSS7: 4
OSS7: 6
OSS7: healthy
OSS8: 4
OSS8: 6
OSS8: healthy
OSS9: 4
OSS9: 6
OSS9: healthy

Thank you for the advice, Andreas; in this case the Lustre logging was indeed key to determining the true nature of the problem.

Thank you,
Adam

On Thu, 2009-09-10 at 20:51 -0400, Adam wrote:
> Unmounting worked, but remounting resulted in an LBUG (edited):
>
> X:(tracefile.c:450:libcfs_assertion_failed()) LBUG
> X:(handler.c:2049:mds_setup()) ASSERTION(! lvfs_check_rdonly(lvfs_sbdev(mnt->mnt_sb))) failed
> X:(tracefile.c:450:libcfs_assertion_failed()) LBUG
On Sep 10, 2009 20:50 -0400, Adam wrote:
> Unmounting worked, but remounting resulted in an LBUG (edited):
>
> X:(tracefile.c:450:libcfs_assertion_failed()) LBUG
> X:(handler.c:2049:mds_setup()) ASSERTION(! lvfs_check_rdonly(lvfs_sbdev(mnt->mnt_sb))) failed
> X:(tracefile.c:450:libcfs_assertion_failed()) LBUG

This means it didn't unmount cleanly.

> After rebooting, an attempted mount of the MDS results in (heavily edited):
>
> (llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x7360029:0x992d6208: rc -2
> (llog_cat.c:176:llog_cat_id2handle()) error opening log id 0x7360029:992d6208: rc -2
> (llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0x7360029
> (llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
> Failing over lfs-MDT0000
> setting obd lfs-MDT0000 device 'unknown-block(253,1)' read-only
> *** Turning device dm-1 (0xfd00001) read-only

You should be able to mount with "-o abort_recov" to clean up the above problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
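Combined with the device path from Adam's earlier message, the suggested recovery-abort mount would look something like this (a sketch; abort_recov tells mount.lustre to skip waiting for client recovery at startup):

  # remount the MDT, aborting recovery instead of waiting for clients to reconnect
  mount -t lustre -o abort_recov /dev/mapper/mpath1 /mnt/mds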