thr3ads.net - Lustre discuss - [Lustre-discuss] "fsck failed : SLES11 X86

Aman (neshu) Agarwal

2010-Jun-12 06:58 UTC

[Lustre-discuss] "fsck failed : SLES11 X86_64

Hello,
I have been trying to configure Lustre for three days, but I keep
getting stuck at the error "fsck failed. Please repair manually and
reboot".

Snapshot of the error:

Loading required kernel modules
Activating swap-devices in etc/fstab...
Adding 803240k swap on /dev/sda3. Priority:-1 extents:1 across 803240k
bootsplash: status on console 0 changed to off
blogd: no message logging because /var filesystem is not accessible
ehci-hcd ohci-hcd uhci-hcd usb-ohci usb-uhci: Loading console font
lat9w-16.psfu -m trivial GO: Loadable
Loading keymap i386/qwerty/uk.map.gz
rm: cannot remove '/var/run/numlock-on': read-only filesystem
Start Unicode mode

fsck failed. Please repair manually and reboot. The root
file system is currently mounted read-only. To remount it
read write do:

bash# mount -n -o remount,rw /

Attention: Only CONTROL+D will reboot the system in this
maintenance mode. shutdown or reboot will not work.

Give root password for login:

When I tried to repair it manually with e2fsck:

# e2fsck -c -f -y /dev/sdb
e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2_attr_index_prefix

Thanking You
--
Aman Agarwal
7th Semester
Indian Institute of Information Technology,Allahabad
http://profile.iiita.ac.in/RIT2007054
+91-9956125558

"If you can dream it, you can do it": Walt Disney

Andreas Dilger

2010-Jun-14 20:39 UTC

[Lustre-discuss] "fsck failed : SLES11 X86_64

On 2010-06-12, at 00:58, Aman (neshu) Agarwal wrote:
> # e2fsck -c -f -y /dev/sdb
> e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2_attr_index_prefix

e2fsck should print out the version when it runs.
It looks like you have different versions of the e2fsck binary and the libext2fs
libraries installed.  This might happen if you installed 2 different versions of
e2fsprogs at the same time.
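
A minimal sketch of how the suspected mismatch could be checked from the
maintenance shell. The helper name report_linkage is hypothetical and not
part of e2fsprogs; e2fsck -V, ldd, and rpm are standard tools, but their
output on this particular system is not shown in the thread, so none is
reproduced here.

```shell
#!/bin/sh
# Sketch only: list the shared libraries a given binary would load.
# report_linkage is an illustrative helper, not an e2fsprogs command.
report_linkage() {
    bin="$1"
    if [ ! -x "$bin" ]; then
        echo "not an executable: $bin" >&2
        return 1
    fi
    # ldd prints each library the dynamic linker resolves for this binary.
    ldd "$bin"
}

# Possible use on the affected node (commented out so the sketch has no
# side effects when sourced):
#   e2fsck -V                        # version of the binary and of libext2fs, if it starts at all
#   report_linkage "$(command -v e2fsck)" | grep -E 'ext2fs|e2p|com_err'
#   rpm -qa | grep -i e2fsprogs      # more than one version listed would point to a mixed install
```

If the library path reported by ldd belongs to a different e2fsprogs
package than the e2fsck binary itself, that would be consistent with the
explanation above.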

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

Adam

2010-Jun-22 19:56 UTC

[Lustre-discuss] "fsck failed : SLES11 X86_64

Hello, quick question about the manual.

Under the recovery section the manual states that a client needs to
invalidate all locks, or flush its saved state, in order to reconnect to
a particular osc/mdc that has evicted it.

We've found that one of our 1.8 clients will frequently get into a state
where many of the oscs report a 'Resource temporarily unavailable' state
after an outage to a 1.6 LFS server. The LFS can be accessed again on
the client by remounting the LFS, but it does not auto-recover.

My question is, how does one go about manually flushing the saved client
state? I couldn't figure this one out from the manual, unless
'lctl set_param ldlm.namespaces.<OSC>.lru_size=clear' does it.

Thank you,
Adam

-- 
Adam Munro
System Administrator  | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453

Adam

2010-Jun-22 19:59 UTC

[Lustre-discuss] flushing a client state

Yikes, sorry about the previous subject line!

-- 
Adam Munro
System Administrator  | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453

Jason Rappleye

2010-Jun-22 20:24 UTC

[Lustre-discuss] "fsck failed : SLES11 X86_64

On Jun 22, 2010, at 12:56 PM, Adam wrote:
> Hello, quick question about the manual.
>
> Under the recovery section the manual states that a client needs to
> invalidate all locks, or flush its saved state, in order to
> reconnect to a particular osc/mdc that has evicted it.
>
> We've found that one of our 1.8 clients will frequently get into a
> state where many of the oscs report a 'Resource temporarily
> unavailable' state after an outage to a 1.6 LFS server. The LFS can
> be accessed again on the client by remounting the LFS, but it does
> not auto-recover.

That sounds familiar. Are you using IB? There's a problem with LNet
peer health detection when used with 1.8 clients and 1.6 servers. See
bug 23076. I haven't tried the patch, but bug 23076 and my comments in
bug 22920 describe the problem we saw at our site.

Disabling peer health detection by setting ko2iblnd's peer_timeout
option to zero works around the problem. If you're going to upgrade
the servers to 1.8 at some point, it's ok to leave it at the default
of 180 on the servers and set it to zero on the clients until all of
the 1.6 servers have been upgraded. Then, you can reboot your clients
with the default value of peer_timeout at will, allowing you to take
advantage of the feature without an outage on the servers.
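
As a rough sketch of the client-side workaround described above, the
module option might be set as follows. The option name and the value 0
come from this message; the file location is an assumption
(/etc/modprobe.conf.local is the usual place on SLES, other distributions
typically use a file under /etc/modprobe.d/). Module options are read
when ko2iblnd is loaded, so the change would only take effect after the
module is reloaded or the client is rebooted.

```
# Client nodes only, while any 1.6 servers remain.
# File location is an assumption; adjust for the distribution in use.
options ko2iblnd peer_timeout=0
```

Removing the line again later, once all servers run 1.8, would return
the clients to the default of 180 at their next reboot.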

We tested that approach at our site. It worked for us, and that's how
we'll be rolling it out over the next month.

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035

Adam

2010-Jun-23 01:38 UTC

[Lustre-discuss] "fsck failed : SLES11 X86_64

Ah excellent. I'm upgrading the servers tonight, so if successful the
problem will vanish without any changes on the clients.

Thanks Jason!

Adam

-- 
Adam Munro
System Administrator  | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453

Adam

2010-Jun-23 03:36 UTC

[Lustre-discuss] Resource "always" unavailable

Confirmed -- if anyone else runs into this problem with a 1.8.2 client
using a 1.6.6 server, upgrading the server to 1.8.3 will restore
connections without requiring a umount or any other changes to the
client (at least in my case).

Cheers,
Adam

-- 
Adam Munro
System Administrator  | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453
