Dear list,
Our MDS lost 5 of its 7 OSC devices after a reconfiguration:
1. umount all OSTs
2. tunefs.lustre --writeconf --mgs --mdt /dev/sda1
3. mount -t ldiskfs mdtdevice mountpoint
4. rm CONFIGS/*
5. mount -t lustre mdtdevice mountpoint
6. mount all OSTs
We can see all 7 OSC devices on a mounted client, but only 2 of the 7 on the MDS. When we create a file with a stripe count of 7, the MDS crashes. Any ideas?
Output of lctl dl:
0 UP mgs MGS MGS 707
1 UP mgc MGC192.168.50.50@tcp 43fc0787-3580-ce63-5019-94a7903f2fb0 5
2 UP mdt MDS MDS_uuid 3
3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 803
5 UP ost OSS OSS_uuid 3
6 UP obdfilter besfs2-OST0000 besfs2-OST0000_UUID 803
7 UP obdfilter besfs2-OST0001 besfs2-OST0001_UUID 805
8 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
9 UP obdfilter besfs2-OST0002 besfs2-OST0002_UUID 805
10 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5
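For comparison, here is a quick way to count how many OSC devices the MDS actually set up. This is only a sketch: the sample data is the lctl dl listing above, embedded as a string; on a live MDS one would pipe lctl dl in directly instead.

```shell
# Count OSC devices in an `lctl dl` listing; field 3 is the device type.
# Sample input is the MDS listing from this message.
lctl_out='0 UP mgs MGS MGS 707
1 UP mgc MGC192.168.50.50@tcp 43fc0787-3580-ce63-5019-94a7903f2fb0 5
2 UP mdt MDS MDS_uuid 3
3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 803
5 UP ost OSS OSS_uuid 3
6 UP obdfilter besfs2-OST0000 besfs2-OST0000_UUID 803
7 UP obdfilter besfs2-OST0001 besfs2-OST0001_UUID 805
8 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
9 UP obdfilter besfs2-OST0002 besfs2-OST0002_UUID 805
10 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5'

# On a healthy MDS this should match the number of OSTs (7 here).
printf '%s\n' "$lctl_out" | awk '$3 == "osc"' | wc -l
```

Here it prints 2, which is exactly the mismatch being reported: 5 of the 7 expected MDS-side OSCs are missing.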
This is a small Lustre installation. We have only 2 servers: 3 OSTs plus the MDT/MGS on one server, and 4 OSTs on the other. The Lustre version is 2.6.9-67.0.22.EL_lustre.1.6.6smp on x86-64.
Best Regards
Lu Wang
--------------------------------------------------------------
Computing Center, IHEP
19B Yuquan Road, P.O. Box 918-7
Beijing 100049, China
Office: Computing Center, 123
Tel: (+86) 10 88236012-607
Fax: (+86) 10 8823 6839
Email: Lu.Wang@ihep.ac.cn
--------------------------------------------------------------
I had a similar problem: after a reconfiguration, running tunefs.lustre on an OST I got a kernel panic. I solved it by running fsck on the damaged OST.
Antonio
On 2009-11-11, at 05:59, Lu Wang wrote:
> Our MDS lost 5 of its 7 OSC devices after a reconfiguration:
> 1. umount all OSTs
> 2. tunefs.lustre --writeconf --mgs --mdt /dev/sda1
> 3. mount -t ldiskfs mdtdevice mountpoint
> 4. rm CONFIGS/*
> 5. mount -t lustre mdtdevice mountpoint
Where is it documented to delete all of the files in CONFIGS? This undoes the writeconf from step 2 above, and isn't a good idea. Presumably there was also a step 4b to unmount the ldiskfs filesystem?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
From: huangql
Sent: 2009-11-12 09:25:44
To: Andreas Dilger; Lu Wang
Cc: lustre-discuss
Subject: Re: Re: [Lustre-discuss] osc lost on MDS server
Hi dear list,
We have done steps 1 to 5, but we can still see only 2 of the 7 OSC devices on the MDS, while a mounted client sees all 7. We then ran e2fsck on the OSTs, but got the same result.
We also hit a problem after doing steps 1 to 4 and unmounting the ldiskfs filesystem: when we ran step 5, we got the logs below. According to the logs, we have no idea why there are 63 clients, or where the MDS got this client information, given that we had removed CONFIGS/* and the link was down.
Nov 12 09:02:37 beshome01 kernel: Lustre: 2474:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service besfs2-MDT0000, 63 recoverable clients, last_transno 118077355
Nov 12 09:02:37 beshome01 kernel: Lustre: MDT besfs2-MDT0000 now serving dev (besfs2-MDT0000/5ceb6ad6-e810-9fae-4862-8ed0913bf7e7), but will be in recovery for at least 5:00, or until 63 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/besfs2-MDT0000/recovery_status.
huangql
2009-11-12
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
We have taken the 2 servers back to the cluster. After 15 hours of running, we
get these errors in /var/log/messages:
Nov 13 10:37:04 beshome01 kernel: LustreError:
2359:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:37:04 beshome01 kernel: LustreError:
2359:0:(llog_obd.c:211:llog_add()) Skipped 2 previous similar messages
Nov 13 10:39:49 beshome01 kernel: LustreError:
2360:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:39:49 beshome01 kernel: LustreError:
2360:0:(llog_obd.c:211:llog_add()) Skipped 16 previous similar messages
Nov 13 10:52:43 beshome01 kernel: LustreError:
2332:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:52:43 beshome01 kernel: LustreError:
2332:0:(llog_obd.c:211:llog_add()) Skipped 265 previous similar messages
Nov 13 10:53:46 beshome01 kernel: LustreError:
2346:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:53:46 beshome01 kernel: LustreError:
2346:0:(llog_obd.c:211:llog_add()) Skipped 105 previous similar messages
Nov 13 10:54:38 beshome01 kernel: LustreError:
2335:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:54:38 beshome01 kernel: LustreError:
2335:0:(llog_obd.c:211:llog_add()) Skipped 4 previous similar messages
Nov 13 10:56:04 beshome01 kernel: LustreError:
2356:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:56:04 beshome01 kernel: LustreError:
2356:0:(llog_obd.c:211:llog_add()) Skipped 5 previous similar messages
Nov 13 10:59:26 beshome01 kernel: LustreError:
2357:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:59:26 beshome01 kernel: LustreError:
2357:0:(llog_obd.c:211:llog_add()) Skipped 3 previous similar messages
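The kernel rate-limits repeated LustreError lines, so each "Skipped N previous similar messages" line stands for N suppressed occurrences. Summing them over the excerpt above (a small awk sketch, using the Skipped lines as embedded sample input) shows the failure is far more frequent than the 7 printed lines suggest:

```shell
# Total the suppressed "No ctxt" llog_add() failures from the excerpt.
log='Skipped 2 previous similar messages
Skipped 16 previous similar messages
Skipped 265 previous similar messages
Skipped 105 previous similar messages
Skipped 4 previous similar messages
Skipped 5 previous similar messages
Skipped 3 previous similar messages'

# Field 2 of each "Skipped N ..." line is the suppressed count.
printf '%s\n' "$log" | awk '/Skipped/ { total += $2 } END { print total }'
```

That is 400 suppressed messages on top of the 7 printed ones, i.e. roughly 407 llog_add() failures in about 22 minutes.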
Since we still have 2 OSCs "UP" on the MDS, and all the OSCs are "UP" on the
Lustre clients, users feel the system is back to normal. However, new objects
can only be created on 2 OSTs, and when the write I/O load increases we get:
Nov 12 18:50:28 beshome01 kernel: Lustre:
2599:0:(filter_io_26.c:714:filter_commitrw_write()) besfs2-OST0002: slow i_mutex
30s
------------------
Lu Wang
2009-11-13
-------------------------------------------------------------
From: Lu Wang
Sent: 2009-11-12 17:52:05
To: lustre-discuss
Subject: Re: [Lustre-discuss] osc lost on MDS server
Hi list,
We have tried again to recover the system to a consistent state, with the
following steps:
1. Pulled out the 10Gbit Ethernet links connecting to the computing cluster, and connected the 2 servers with a direct Ethernet link. This isolated the 2 servers from the computing cluster, to avoid interference from running clients (which may have been unmounted uncleanly).
2. umount all the OSTs.
3. umount the MDT.
4. mount the MDT as ldiskfs, rm all files under CONFIGS (these files were confirmed to be wrong), and unmount.
5. Ran tunefs.lustre --erase-params --mgs --mdt --fsname=besfs2 --writeconf /dev/sda1 on the MDT device. This command returned a fatal error saying it assumed this was an upgrade from 1.4 to 1.6; it tried to copy the client file from /tmp/****/LOG/ but failed, so it made a log file from "last_rcvd".
6. We ignored the error and mounted the MDT successfully. lctl dl showed 5 devices.
7. Mounted every OST as ldiskfs, removed all files under CONFIGS, and unmounted.
8. Ran tunefs.lustre --erase-params --ost --mgsnode=192.168.50.50 --index=<old index> --fsname=besfs2 --writeconf /dev/sd* on each OST. This command also returned the assumption about a 1.4-to-1.6 upgrade and then made a log file from "last_rcvd", but there was no fatal error.
9. We mounted the OSTs one by one.
10. This time we could see an OSC for every OST; however, only 2 OSCs are UP, and the other 5 are IN:
0 UP mgs MGS MGS 7
1 UP mgc MGC192.168.50.50@tcp 26aae9d0-202e-abf3-3cb0-746eea59d7a4 5
2 UP mdt MDS MDS_uuid 3
3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 3
5 IN osc besfs2-OST0000-osc besfs2-mdtlov_UUID 5
6 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
7 IN osc besfs2-OST0003-osc besfs2-mdtlov_UUID 5
8 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5
9 IN osc besfs2-OST0004-osc besfs2-mdtlov_UUID 5
10 IN osc besfs2-OST0005-osc besfs2-mdtlov_UUID 5
11 IN osc besfs2-OST0006-osc besfs2-mdtlov_UUID 5
12 UP ost OSS OSS_uuid 3
13 UP obdfilter besfs2-OST0000 besfs2-OST0000_UUID 5
14 UP obdfilter besfs2-OST0001 besfs2-OST0001_UUID 5
15 UP obdfilter besfs2-OST0002 besfs2-OST0002_UUID 5
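For what it's worth, the per-OST tunefs.lustre invocation in step 8 can be scripted so that every OST gets its own original index. The sketch below only prints the commands so they can be reviewed before anything is run; the device names and sequential index values are illustrative placeholders, not this installation's real ones.

```shell
# Print (do not run) one tunefs.lustre command per OST device.
# /dev/sdb../dev/sdd and indexes 0..2 are placeholders; each real OST
# must be given the index it originally had.
cmds=$(
  MGSNODE=192.168.50.50
  i=0
  for dev in /dev/sdb /dev/sdc /dev/sdd; do
    echo "tunefs.lustre --erase-params --ost --mgsnode=$MGSNODE --index=$i --fsname=besfs2 --writeconf $dev"
    i=$((i + 1))
  done
)
printf '%s\n' "$cmds"
```

Once the printed commands are checked against the real device-to-index mapping, the echo can be dropped to execute them.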
We find these errors in /var/log/messages:
kernel: LustreError: 2407:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking
up logfile 0xbc28013:0xf77a298: rc -2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2407:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id
0xbc28013:f77a298: rc -2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2407:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xbc28013
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb
failed: -2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(osc_request.c:3675:osc_llog_init()) osc
'besfs2-OST0000-osc' tgt 'besfs2-MDT0000'
cnt 1 catid 00000104110f1ce8 rc=-2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(osc_request.c:3677:osc_llog_init()) logid 0xbc28002:0x9a60e39f
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc
'besfs2-OST0000-osc' tgt 'besfs2-MDT0000'
(rc=-2)
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(mds_lov.c:916:__mds_lov_synchronize()) besfs2-OST0000_UUID failed at
update_mds: -2
Nov 12 16:57:03 beshome01 kernel: LustreError:
2398:0:(mds_lov.c:959:__mds_lov_synchronize()) besfs2-OST0000_UUID sync failed
-2, deactivating
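The "deactivating" at the end of the log matches the IN entries in the device list above: each OSC whose llog sync failed with -2 was deactivated. A quick filter to pull out the deactivated OSCs (a sketch using the lctl dl excerpt above as embedded sample input; on the MDS one would pipe lctl dl in directly):

```shell
# List OSC devices the MDS has deactivated (state "IN").
# Sample input is the OSC portion of the `lctl dl` listing above.
lctl_osc='5 IN osc besfs2-OST0000-osc besfs2-mdtlov_UUID 5
6 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
7 IN osc besfs2-OST0003-osc besfs2-mdtlov_UUID 5
8 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5
9 IN osc besfs2-OST0004-osc besfs2-mdtlov_UUID 5
10 IN osc besfs2-OST0005-osc besfs2-mdtlov_UUID 5
11 IN osc besfs2-OST0006-osc besfs2-mdtlov_UUID 5'

# Field 2 is the device state, field 3 the type, field 4 the name.
printf '%s\n' "$lctl_osc" | awk '$2 == "IN" && $3 == "osc" { print $4 }'
```

This prints the 5 OSCs (OST0000, OST0003-OST0006) that the MDS will not create new objects on, which is consistent with new files landing only on OST0001 and OST0002.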
Any ideas?
------------------
Lu Wang
2009-11-12