Hello,

I have been trying to configure Lustre for three days, but I keep getting stuck at the error "fsck failed. Please repair manually and reboot."

Snapshot of the error:

    Loading required kernel modules
    Activating swap-devices in etc/fstab...
    Adding 803240k swap on /dev/sda3. Priority:-1 extents:1 across 803240k
    bootsplash: status on console 0 changed to off
    blogd: no message logging because /var filesystem is not accessible
    ehci-hcd ohci-hcd uhci-hcd usb-ohci usb-uhci:
    Loading console font lat9w-16.psfu -m trivial GO: Loadable
    Loading keymap i386/querty/uk.map.gz
    rm: cannot remove '/var/run/numlock-on': read-only filesystem
    Start Unicode mode
    fsck failed. Please repair manually and reboot. The root
    file system is currently mounted read-only. To remount it
    read write do:

        bash# mount -n -o remount,rw /

    Attention: Only CONTROL+D will reboot the system in this
    maintenance mode. shutdown or reboot will not work.

    Give root password for login:

When I tried repairing it manually with e2fsck:

    # e2fsck -c -f -y /dev/sdb
    e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2_attr_index_prefix

Thanking you,

-- 
Aman Agarwal
7th Semester
Indian Institute of Information Technology, Allahabad
http://profile.iiita.ac.in/RIT2007054
+91-9956125558
"If you can dream it, you can do it": Walt Disney
On 2010-06-12, at 00:58, Aman (neshu) Agarwal wrote:
> # e2fsck -c -f -y /dev/sdb
> e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2_attr_index_prefix

e2fsck should print out the version when it runs. It looks like you have different versions of the e2fsck binary and the libext2fs libraries installed. This might happen if you installed two different versions of e2fsprogs at the same time.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
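One way to check for the mismatch Andreas describes is to compare the two version numbers that `e2fsck -V` reports: the version of the binary itself and the version of the EXT2FS library it loaded. The sketch below assumes the usual two-line `-V` output ("e2fsck X (date)" followed by "Using EXT2FS Library version Y, date"); the function name `compare_e2fsck_versions` is made up for illustration, and a badly broken install may fail before printing anything.

```shell
#!/bin/sh
# Hypothetical helper: reads `e2fsck -V` output on stdin and reports whether
# the binary version and the EXT2FS library version agree.
# Assumed input format (two lines):
#   e2fsck 1.41.12 (17-May-2010)
#           Using EXT2FS Library version 1.41.12, 17-May-2010
compare_e2fsck_versions() {
    awk '
        # First line: program name followed by its version.
        $1 == "e2fsck" && NF >= 2 { prog = $2 }
        # Second line: the fifth field is the library version plus a comma.
        /EXT2FS Library version/  { lib = $5; sub(/,$/, "", lib) }
        END {
            if (prog == "" || lib == "") { print "unknown"; exit 2 }
            if (prog == lib)             { print "match " prog; exit 0 }
            print "mismatch binary=" prog " library=" lib
            exit 1
        }'
}

# Usage (e2fsck -V writes its banner to stderr, hence 2>&1):
#   e2fsck -V 2>&1 | compare_e2fsck_versions
```

If the versions differ, `ldd "$(command -v e2fsck)"` shows which libext2fs shared object the binary resolves to, and on an RPM-based system (an assumption here) `rpm -qa 'e2fsprogs*'` lists the installed packages so that a duplicate can be spotted.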
Hello, quick question about the manual.

Under the recovery section, the manual states that a client needs to invalidate all locks, or flush its saved state, in order to reconnect to a particular osc/mdc that has evicted it.

We've found that one of our 1.8 clients will frequently get into a state where many of the oscs report a 'Resource temporarily unavailable' state after an outage to a 1.6 LFS server. The LFS can be accessed again on the client by remounting the LFS, but it does not auto-recover.

My question is: how does one go about manually flushing the saved client state? I couldn't figure this one out from the manual, unless 'lctl set_param ldlm.namespaces.<OSC>.lru_size=clear' does it.

Thank you,
Adam

-- 
Adam Munro
System Administrator | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453
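For reference, the command mentioned in the question can be wrapped so that it is only printed unless explicitly enabled. This is a minimal sketch, not a confirmed fix: nothing in this thread establishes that clearing the lock LRU brings an evicted client back. The function name `clear_osc_lock_lru`, the `DRY_RUN` variable, and the default `*osc*` namespace pattern are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper around the lctl command quoted in the question.
# With DRY_RUN=1 (the default) it only prints the command it would run.
clear_osc_lock_lru() {
    # Namespace pattern; "*osc*" (all OSC lock namespaces) is an assumed default.
    pattern=${1:-*osc*}
    param="ldlm.namespaces.${pattern}.lru_size=clear"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "lctl set_param ${param}"
    else
        # Quoted so the shell passes the wildcard to lctl unexpanded.
        lctl set_param "${param}"
    fi
}

# Usage:
#   clear_osc_lock_lru               # print the command for all OSC namespaces
#   DRY_RUN=0 clear_osc_lock_lru     # actually run it (as root, on the client)
```

Printing first makes it easy to confirm which namespaces the pattern covers before any locks are dropped.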
Yikes, sorry about the previous subject line!

-- 
Adam Munro
System Administrator | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453
On Jun 22, 2010, at 12:56 PM, Adam wrote:
> Hello, quick question about the manual.
>
> Under the recovery section the manual states that a client needs to
> invalidate all locks, or flush its saved state in order to reconnect to
> a particular osc/mdc that has evicted it.
>
> We've found that one of our 1.8 clients will frequently get into a state
> where many of the oscs report a 'Resource temporarily unavailable' state
> after an outage to a 1.6 LFS server. The LFS can be accessed again on
> the client by remounting the LFS, but it does not auto-recover.

That sounds familiar. Are you using IB? There's a problem with LNet peer health detection when used with 1.8 clients and 1.6 servers. See bug 23076. I haven't tried the patch, but bug 23076 and my comments in bug 22920 describe the problem we saw at our site.

Disabling peer health detection by setting ko2iblnd's peer_timeout option to zero works around the problem. If you're going to upgrade the servers to 1.8 at some point, it's ok to leave it at the default of 180 on the servers and set it to zero on the clients until all of the 1.6 servers have been upgraded. Then, you can reboot your clients with the default value of peer_timeout at will, allowing you to take advantage of the feature without an outage on the servers.

We tested that approach at our site. It worked for us, and that's how we'll be rolling it out over the next month.

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
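The workaround above comes down to a kernel module option on the clients. As a rough sketch, the helper below generates the corresponding modprobe options line and reads back the value a loaded module is using. The function names are hypothetical, and the file the line belongs in (for example /etc/modprobe.conf or a file under /etc/modprobe.d/) depends on the distribution, so treat that as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the modprobe options line for ko2iblnd's
# peer_timeout. 0 disables peer health detection; 180 is the default
# mentioned in the message above.
ko2iblnd_options_line() {
    timeout=${1:-0}
    case $timeout in
        ''|*[!0-9]*)
            echo "peer_timeout must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'options ko2iblnd peer_timeout=%s\n' "$timeout"
}

# Hypothetical helper: show the value in use if the module is loaded and
# exposes the parameter under /sys.
show_loaded_peer_timeout() {
    param=/sys/module/ko2iblnd/parameters/peer_timeout
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "ko2iblnd not loaded or parameter not exposed"
    fi
}

# Usage:
#   ko2iblnd_options_line 0      # line for 1.8 clients talking to 1.6 servers
#   ko2iblnd_options_line 180    # line restoring the default after the upgrade
```

Module options are read when the module loads, which is why the message talks about rebooting clients to pick up the new value.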
Ah, excellent. I'm upgrading the servers tonight, so if successful the problem will vanish without any changes on the clients.

Thanks Jason!

Adam

-- 
Adam Munro
System Administrator | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453
Confirmed: if anyone else runs into this problem with a 1.8.2 client using a 1.6.6 server, upgrading the server to 1.8.3 will restore connections without requiring a umount or any other changes to the client (at least in my case).

Cheers,
Adam

-- 
Adam Munro
System Administrator | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453