Dear list,

Our MDS lost 5 of 7 osc devices after a reconfiguration:

1. umount all OSTs
2. tunefs.lustre --writeconf --mgs --mdt /dev/sda1
3. mount -t ldiskfs mdtdevice mountpoint
4. rm CONFIGS/*
5. mount -t lustre mdtdevice mountpoint
6. mount all OSTs

We can see all 7 osc devices on a mounted client, but only 2 of 7 osc devices on the MDS. When I create a file with a stripe count of 7, the MDS server crashes. Any ideas?

lctl dl
0 UP mgs MGS MGS 707
1 UP mgc MGC192.168.50.50@tcp 43fc0787-3580-ce63-5019-94a7903f2fb0 5
2 UP mdt MDS MDS_uuid 3
3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 803
5 UP ost OSS OSS_uuid 3
6 UP obdfilter besfs2-OST0000 besfs2-OST0000_UUID 803
7 UP obdfilter besfs2-OST0001 besfs2-OST0001_UUID 805
8 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
9 UP obdfilter besfs2-OST0002 besfs2-OST0002_UUID 805
10 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5

This is a small Lustre installation. We have only 2 servers: 3 OSTs and 1 MDT/MGS on one server, 4 OSTs on the other. The Lustre version is 2.6.9-67.0.22.EL_lustre.1.6.6smp on x86-64.

Best Regards
Lu Wang
--------------------------------------------------------------
Computing Center IHEP       Office: Computing Center, 123
19B Yuquan Road             Tel: (+86) 10 88236012-607
P.O. Box 918-7              Fax: (+86) 10 8823 6839
Beijing 100049, China       Email: Lu.Wang at ihep.ac.cn
--------------------------------------------------------------
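As a quick cross-check, the same `lctl dl` listing can be filtered for osc devices on either node; on a healthy MDS there should be one osc per OST (7 here). A sketch using a trimmed copy of the output above:

```shell
# Count the MDS-side osc devices in saved `lctl dl` output.
# The sample below is trimmed from the listing in this message;
# in a real session you would pipe `lctl dl` directly into awk.
sample='0 UP mgs MGS MGS 707
3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
8 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
10 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5'

# third field of each `lctl dl` line is the device type
printf '%s\n' "$sample" | awk '$3 == "osc" { print $4 }'
osc_count=$(printf '%s\n' "$sample" | awk '$3 == "osc"' | wc -l)
echo "osc devices: $osc_count"   # 2 for this trimmed sample
```

Running the same filter on the client would show 7 osc lines, which makes the mismatch easy to see at a glance.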
I had a similar problem: after a reconfiguration, running tunefs.lustre on an OST I got a kernel panic. I solved it by running fsck on the damaged OST.

Antonio

Lu Wang wrote:
> Dear list,
> Our MDS lost 5 of 7 osc devices after a reconfiguration: [...]
On 2009-11-11, at 05:59, Lu Wang wrote:
> Our MDS lost 5 of 7 osc devices after a reconfiguration:
> 1. umount all OSTs
> 2. tunefs.lustre --writeconf --mgs --mdt /dev/sda1
> 3. mount -t ldiskfs mdtdevice mountpoint
> 4. rm CONFIGS/*
> 5. mount -t lustre mdtdevice mountpoint

Where is it documented to delete all of the files in CONFIGS? This undoes the effect of step 2 above, and isn't a good idea. Presumably there was also a step 4b to unmount the filesystem from type ldiskfs?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
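Read literally, the objection above suggests a shorter sequence: `--writeconf` alone marks the configuration logs for regeneration at the next mount, so deleting CONFIGS/* by hand should not be needed. A dry-run sketch (the stub only prints commands; the mountpoint name is hypothetical):

```shell
# Dry-run sketch of the writeconf sequence without `rm CONFIGS/*`.
# The run() stub prints each command instead of executing it, so this
# can be read (and tested) without a Lustre system present.
run() { printf 'would run: %s\n' "$*"; }

MDTDEV=/dev/sda1    # from the original post
MDTMNT=/mnt/mdt     # hypothetical mountpoint

run umount "$MDTMNT"                                 # after unmounting all OSTs
run tunefs.lustre --writeconf --mgs --mdt "$MDTDEV"  # flag logs for regeneration
run mount -t lustre "$MDTDEV" "$MDTMNT"              # first mount rewrites CONFIGS
# then writeconf and remount each OST the same way
```

To actually execute the sequence, replace the `run` stub with the real commands on the MGS/MDS node.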
Hi dear list,

We have done steps 1 to 5, but we can still see only 2 of 7 osc devices on the MDS, while we can see all 7 osc devices on a mounted client. We then ran e2fsck on the OSTs, but got the same result.

We also hit a problem after doing steps 1 to 4 and unmounting the filesystem from ldiskfs: when we then ran step 5, we got the logs below. According to the logs, we have no idea why there are 63 clients, or where the MDS got the client information, given that we had removed CONFIGS/* and the links were down.

Nov 12 09:02:37 beshome01 kernel: Lustre: 2474:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service besfs2-MDT0000, 63 recoverable clients, last_transno 118077355
Nov 12 09:02:37 beshome01 kernel: Lustre: MDT besfs2-MDT0000 now serving dev (besfs2-MDT0000/5ceb6ad6-e810-9fae-4862-8ed0913bf7e7), but will be in recovery for at least 5:00, or until 63 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/besfs2-MDT0000/recovery_status.

2009-11-12
huangql
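The recovery_status file named in the second log line can be polled to see how far recovery has progressed. A small sketch (the /proc path is taken from the log; the parser assumes the 1.6-style "status:" line, which is an assumption about the file format):

```shell
# Sketch: report the MDT recovery state from the recovery_status file
# mentioned in the log. STATUS_FILE is kept as a parameter so the
# snippet can be exercised against any file, not only the live /proc one.
STATUS_FILE=${STATUS_FILE:-/proc/fs/lustre/mds/besfs2-MDT0000/recovery_status}

recovery_state() {
    # print the value of the "status:" line, e.g. RECOVERING or COMPLETE
    awk '$1 == "status:" { print $2 }' "$STATUS_FILE"
}

# only attempt a read if the file actually exists on this node
[ -r "$STATUS_FILE" ] && recovery_state
```

On the MDS itself, something like `watch cat /proc/fs/lustre/mds/besfs2-MDT0000/recovery_status` shows the same information interactively, including the client reconnect count.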
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Hi list,

We have tried again to recover the system to a consistent state with the following steps:

1. Pulled out the 10 Gbit Ethernet links connecting to the computing cluster, and connected the 2 servers with a direct Ethernet link. This isolated the 2 servers from the computing cluster, to avoid interference from running clients (which may not have unmounted cleanly).
2. Unmounted all the OSTs.
3. Unmounted the MDT.
4. Mounted the MDT as ldiskfs, removed all files under CONFIGS (these files were confirmed to be wrong), and unmounted.
5. Ran tunefs.lustre --erase-params --mgs --mdt --fsname=besfs2 --writeconf /dev/sda1 on the MDT device. This command returned a fatal error: it assumed this was an upgrade from 1.4 to 1.6, tried to copy the client file from /tmp/****/LOG/ but failed, and so made a log file from "last_rcvd".
6. We ignored the error and mounted the MDT successfully. lctl dl showed 5 devices.
7. Mounted every OST as ldiskfs, removed all files under CONFIGS, and unmounted.
8. Ran tunefs.lustre --erase-params --ost --mgsnode=192.168.50.50 --index=<old index> --fsname=besfs2 --writeconf /dev/sd* on each OST. This command also returned the 1.4-to-1.6 upgrade assumption and made a log file from "last_rcvd", but there was no fatal error.
9. Mounted the OSTs one by one.
10. This time we could see an osc for every OST; however, only 2 oscs are UP, and the other 5 are IN.
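Step 8 above has to be repeated per OST with each device's original index. As a dry-run sketch of that loop (the stub only prints; the device names and index pairs below are hypothetical, since each OST must keep the index it was originally formatted with):

```shell
# Dry-run sketch of step 8 as a loop over device:index pairs.
# run() prints instead of executing, so nothing is written to any disk.
run() { printf 'would run: %s\n' "$*"; }

# hypothetical device:index pairs for the OSTs on one OSS
cmds=$(for pair in /dev/sdb:0 /dev/sdc:1 /dev/sdd:2; do
    dev=${pair%%:*}   # device path, before the colon
    idx=${pair##*:}   # original OST index, after the colon
    run tunefs.lustre --erase-params --ost --mgsnode=192.168.50.50 \
        --index="$idx" --fsname=besfs2 --writeconf "$dev"
done)
printf '%s\n' "$cmds"
```

Writing the wrong index to an OST would make the problem worse, which is why the index is carried per device here instead of being reused.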
0 UP mgs MGS MGS 7
1 UP mgc MGC192.168.50.50@tcp 26aae9d0-202e-abf3-3cb0-746eea59d7a4 5
2 UP mdt MDS MDS_uuid 3
3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 3
5 IN osc besfs2-OST0000-osc besfs2-mdtlov_UUID 5
6 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
7 IN osc besfs2-OST0003-osc besfs2-mdtlov_UUID 5
8 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5
9 IN osc besfs2-OST0004-osc besfs2-mdtlov_UUID 5
10 IN osc besfs2-OST0005-osc besfs2-mdtlov_UUID 5
11 IN osc besfs2-OST0006-osc besfs2-mdtlov_UUID 5
12 UP ost OSS OSS_uuid 3
13 UP obdfilter besfs2-OST0000 besfs2-OST0000_UUID 5
14 UP obdfilter besfs2-OST0001 besfs2-OST0001_UUID 5
15 UP obdfilter besfs2-OST0002 besfs2-OST0002_UUID 5

We find these errors in /var/log/messages:

kernel: LustreError: 2407:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xbc28013:0xf77a298: rc -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2407:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xbc28013:f77a298: rc -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2407:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xbc28013
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(osc_request.c:3675:osc_llog_init()) osc 'besfs2-OST0000-osc' tgt 'besfs2-MDT0000' cnt 1 catid 00000104110f1ce8 rc=-2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(osc_request.c:3677:osc_llog_init()) logid 0xbc28002:0x9a60e39f
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'besfs2-OST0000-osc' tgt 'besfs2-MDT0000' (rc=-2)
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(mds_lov.c:916:__mds_lov_synchronize()) besfs2-OST0000_UUID failed at update_mds: -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(mds_lov.c:959:__mds_lov_synchronize()) besfs2-OST0000_UUID sync failed -2, deactivating

Any ideas?

------------------
Lu Wang
2009-11-12
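For the devices shown as IN (inactive) in a listing like the one above, `lctl --device <n> activate` is the usual way to bring an osc back up. The sketch below only generates those commands from saved `lctl dl` text rather than running them, since activation will likely not stick until the underlying llog errors are resolved:

```shell
# From saved `lctl dl` output, emit the lctl commands that would
# reactivate each osc stuck in the IN state. The sample is trimmed
# from the listing above; pipe live `lctl dl` output in the same way.
sample='5 IN osc besfs2-OST0000-osc besfs2-mdtlov_UUID 5
6 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
7 IN osc besfs2-OST0003-osc besfs2-mdtlov_UUID 5'

cmds=$(printf '%s\n' "$sample" |
    awk '$2 == "IN" && $3 == "osc" { printf "lctl --device %s activate\n", $1 }')
printf '%s\n' "$cmds"
```

Each emitted line can then be run on the MDS; if an osc immediately drops back to IN, the "sync failed -2, deactivating" llog failure above is still the root cause.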
We took the 2 servers back to the cluster. After 15 hours of running, we get these errors in /var/log/messages:

Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) Skipped 2 previous similar messages
Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) Skipped 16 previous similar messages
Nov 13 10:52:43 beshome01 kernel: LustreError: 2332:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:52:43 beshome01 kernel: LustreError: 2332:0:(llog_obd.c:211:llog_add()) Skipped 265 previous similar messages
Nov 13 10:53:46 beshome01 kernel: LustreError: 2346:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:53:46 beshome01 kernel: LustreError: 2346:0:(llog_obd.c:211:llog_add()) Skipped 105 previous similar messages
Nov 13 10:54:38 beshome01 kernel: LustreError: 2335:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:54:38 beshome01 kernel: LustreError: 2335:0:(llog_obd.c:211:llog_add()) Skipped 4 previous similar messages
Nov 13 10:56:04 beshome01 kernel: LustreError: 2356:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:56:04 beshome01 kernel: LustreError: 2356:0:(llog_obd.c:211:llog_add()) Skipped 5 previous similar messages
Nov 13 10:59:26 beshome01 kernel: LustreError: 2357:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:59:26 beshome01 kernel: LustreError: 2357:0:(llog_obd.c:211:llog_add()) Skipped 3 previous similar messages

Since we still have 2 "UP" oscs on the MDS, and all the oscs are "UP" on the Lustre clients, users feel the system is back to normal. However, new objects can only be created on 2 OSTs.
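Because of the kernel's rate-limiting, the printed "No ctxt" lines understate how often llog_add() is actually failing: each "Skipped N previous similar messages" line hides N more occurrences. A sketch that totals both, using a trimmed copy of the log above:

```shell
# Total the llog_add() failures: count each printed "No ctxt" line,
# plus the N from each "Skipped N previous similar messages" line.
# The sample is the first four lines of the log above.
log='Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) Skipped 2 previous similar messages
Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) Skipped 16 previous similar messages'

total=$(printf '%s\n' "$log" | awk '
    /llog_add\(\)\) No ctxt/  { n++ }
    /llog_add\(\)\) Skipped/  { for (i = 1; i <= NF; i++)
                                    if ($i == "Skipped") n += $(i + 1) }
    END { print n + 0 }')
echo "total llog_add failures in sample: $total"   # 1 + 2 + 1 + 16 = 20
```

Run against the full /var/log/messages, this gives a truer failure rate than eyeballing the rate-limited lines.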
If the write I/Os increase, we get:

Nov 12 18:50:28 beshome01 kernel: Lustre: 2599:0:(filter_io_26.c:714:filter_commitrw_write()) besfs2-OST0002: slow i_mutex 30s

------------------
Lu Wang
2009-11-13