thr3ads.net - Lustre discuss - [Lustre-discuss] OST error [Dec 2010]

Bob Ball

2010-Dec-02 21:00 UTC

[Lustre-discuss] OST error

We were getting errors thrown by an OST.  /var/log/messages contained a 
lot of these:
2010-11-28T17:05:34-05:00 umfs06.aglt2.org kernel: [2102640.735927] 
LDISKFS-fs error (device sdk): ldiskfs_mb_check_ondisk_bitmap: on-disk 
bitmap for group 639corrupted: 440 blocks free in bitmap, 439 - in gd
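
To see which block groups are affected and how often, the messages can be tallied with a short script. This is only a minimal sketch, assuming each message is on a single unwrapped line as in /var/log/messages; the function names are made up for illustration. Note that the kernel prints the group number and "corrupted" with no space between them.

```python
import re
from collections import Counter

# Matches the ldiskfs message as printed in the logs above; the group number
# and the word "corrupted" may be run together, hence the optional whitespace.
BITMAP_RE = re.compile(
    r"LDISKFS-fs error \(device (?P<dev>\w+)\): "
    r"ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group "
    r"(?P<group>\d+)\s*corrupted: (?P<bitmap>\d+) blocks free in bitmap, "
    r"(?P<gd>\d+) - in gd"
)

def parse_bitmap_errors(lines):
    """Return (device, group, free_in_bitmap, free_in_gd) for each matching line."""
    hits = []
    for line in lines:
        m = BITMAP_RE.search(line)
        if m:
            hits.append((m["dev"], int(m["group"]), int(m["bitmap"]), int(m["gd"])))
    return hits

def count_by_group(hits):
    """Count how often each (device, group) pair is reported."""
    return Counter((dev, group) for dev, group, _, _ in hits)

# The first message from this thread, rejoined into one line.
sample = [
    "2010-11-28T17:05:34-05:00 umfs06.aglt2.org kernel: [2102640.735927] "
    "LDISKFS-fs error (device sdk): ldiskfs_mb_check_ondisk_bitmap: on-disk "
    "bitmap for group 639corrupted: 440 blocks free in bitmap, 439 - in gd",
]
hits = parse_bitmap_errors(sample)
counts = count_by_group(hits)
```

In practice the input would be the open log file rather than the sample list.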

So, I turned off (most) access to the disk via lctl (we have a LOT of 
client machines, some were missed) and got problems.  Had to use the 
alternate superblock to e2fsck the disk.  When back online, I still saw 
similar messages.  Updated to e2fsprogs 1.41.12 as suggested elsewhere.  
Repeated e2fsck.

Still seeing these.  Users report some files corrupted, coming up with 
bad md5sum....  Any other thoughts on what to do about this problem?

[2440763.879143] LDISKFS-fs error (device sdk): 
ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted: 
1318 blocks free in bitmap, 1317 - in gd
[2440763.879796]
[2440763.882724] LustreError: 
1651027:0:(fsfilt-ldiskfs.c:1333:fsfilt_ldiskfs_write_record()) can't 
read/create block: -28
[2440763.882736] LustreError: 
1651027:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log 
record: rc -28
[2440763.882789] LustreError: 
1651002:0:(mgc_request.c:1089:mgc_copy_llog()) Failed to copy remote log 
umt3-OST0019 (-28)
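
The rc values in the LustreError lines are negative errno codes, so -28 can be decoded with the standard library. A minimal sketch; describe_rc is a made-up helper name, and the mapping shown is the one for the platform the script runs on (on Linux, 28 is ENOSPC).

```python
import errno
import os

def describe_rc(rc):
    """Translate a negative kernel-style return code into its errno name and text."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

name, text = describe_rc(-28)
print(name, text)
```

So the llog write is failing because block allocation on the OST device reported no space.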

Rebooted to make system clean as a whole, and found the same kind of 
thing repeating.
[  285.834864] LDISKFS-fs (sdk): warning: mounting fs with errors, 
running e2fsck is recommended
[  285.852559] LDISKFS-fs (sdk): mounted filesystem with ordered data mode
[  286.079065] LDISKFS-fs (sdk): warning: mounting fs with errors, 
running e2fsck is recommended
[  286.096316] LDISKFS-fs (sdk): mounted filesystem with ordered data mode
[  286.940872] LDISKFS-fs error (device sdk): 
ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted: 
1318 blocks free in bitmap, 1317 - in gd
[  286.941693]
[  286.945224] LustreError: 
5790:0:(fsfilt-ldiskfs.c:1333:fsfilt_ldiskfs_write_record()) can't 
read/create block: -28
[  286.945233] LustreError: 
5790:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log 
record: rc -28
[  286.945448] LustreError: 5763:0:(mgc_request.c:1089:mgc_copy_llog()) 
Failed to copy remote log umt3-OST0019 (-28)

All help appreciated.

bob

Colin Faber

2010-Dec-02 21:05 UTC

Hi Bob,

If you're seeing the same errors on the same disk after an e2fsck run,
and it's not catching them, it's possible that you're hitting an edge
case which isn't handled properly within e2fsck. However, if you're
experiencing different errors and e2fsck did catch them before, chances
are you're looking at a hardware failure somewhere.

If this is a single disk and you have SMART monitoring enabled, check
your error counters. If it's a RAID device, verify the error counters
on that.

-cf


Bob Ball

2010-Dec-02 21:35 UTC

It is a Dell PERC6 RAID array.  OMSA monitoring is enabled and is not 
throwing errors.  Hmmmm, mptctl is old though, so maybe that is a 
contributing factor.  I guess I need to update that.  Shoot, 
megaraid_sas is also not up to date.  dkms....

OK, guess I need some driver updates.

Later.
bob

Bob Ball

2010-Dec-03 21:41 UTC

Just to cleanly end this thread: the mptctl driver was out of date.  We 
also updated megaraid_sas and the PERC6 firmware.  e2fsck found some 
Block bitmap differences (fixed) at this point, but the OST mounted 
cleanly and the errors stopped.

Unfortunately, there are now corrupted files in the system that remain 
corrupted, and we'll probably never be able to come up with a complete 
list of them.
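
Where known-good checksums happen to exist for some files, those files at least can be rechecked. A minimal sketch, assuming a manifest mapping each path to its expected MD5 digest; the manifest and the function names are hypothetical, and nothing in this thread says such a record exists.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatches(manifest):
    """manifest maps path -> expected MD5 hex digest (hypothetical input).

    Returns the paths that are missing or whose current digest differs.
    """
    bad = []
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file() or md5_of(p) != expected.lower():
            bad.append(path)
    return bad
```

Files without a recorded checksum cannot be judged this way, which is why the list stays incomplete.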

bob


Colin Faber

2010-Dec-03 21:48 UTC

Hi Bob,

Good to hear you've identified and resolved the issue. Sorry to hear 
you'll have to restore from backup though.

-cf

