Christopher Walker
2010-Nov-04 02:22 UTC
[Lustre-discuss] Serious error: objid already exists; is this filesystem corrupt?
We recently had a hardware failure on one of our OSTs, which has caused some major problems for our 1.6.6-based array.

We're now getting the error:

Serious error: objid 517386 already exists; is this filesystem corrupt?

on one of our OSTs. If I mount this OST as ldiskfs and look in O/0/d*, the highest objid I see is 870397, considerably higher than 517386. We've taken this OST through a round of e2fsck and ll_recover_lost_found_objs, during which it restored a lot of lost files, and e2fsck on this OST and on the MDT don't currently show any problems. Can I simply edit O/0/LAST_ID, set it to 870397, and expect files with objid between 517386 and 870397 to come back?

Also, I could be wrong, but it looks like ll_recover_lost_found_objs.c only looks for lost files up to LAST_ID -- if I reset LAST_ID to 870397, should I rerun ll_recover_lost_found_objs?

Many thanks in advance,
Chris
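The check described above (looking in O/0/d* for the highest objid) can be scripted. This is only a rough sketch: it assumes the OST is mounted as ldiskfs at a path you pass in, and that object files are named by their decimal objid inside subdirectories called d0, d1, and so on. The function name highest_objid is made up for this illustration.

```python
import os
import re
import tempfile


def highest_objid(ost_root):
    """Return the largest numeric object name under <ost_root>/O/0/d*, or None."""
    group_dir = os.path.join(ost_root, "O", "0")
    highest = None
    for entry in os.listdir(group_dir):
        # Only descend into the hashed object directories (d0, d1, ...);
        # this skips LAST_ID and anything else sitting in O/0.
        if not re.fullmatch(r"d[0-9]+", entry):
            continue
        subdir = os.path.join(group_dir, entry)
        if not os.path.isdir(subdir):
            continue
        for name in os.listdir(subdir):
            # Assumption: object files are named by their decimal objid.
            if re.fullmatch(r"[0-9]+", name):
                objid = int(name)
                if highest is None or objid > highest:
                    highest = objid
    return highest


if __name__ == "__main__":
    # Demo on a throwaway directory tree, not on a real OST.
    with tempfile.TemporaryDirectory() as root:
        for subdir, name in (("d10", "517386"), ("d29", "870397")):
            path = os.path.join(root, "O", "0", subdir)
            os.makedirs(path, exist_ok=True)
            open(os.path.join(path, name), "w").close()
        print(highest_objid(root))
```

On a real OST you would pass the ldiskfs mount point and compare the result with the objid in the error message.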
Alexey Lyashkov
2010-Nov-04 07:01 UTC
[Lustre-discuss] Serious error: objid already exists; is this filesystem corrupt?
Hi Christopher,

you need to delete the lov_objid file on the MDS and set LAST_ID on the OST to 870397. The MDS will then reread last_id from the OSTs and refill the lov_objid file, which avoids possible file corruption.
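For the LAST_ID part of the manual method, a minimal sketch follows. It rests on an assumption this thread does not state: that O/0/LAST_ID holds a single unsigned 64-bit little-endian integer. Please verify that against the Lustre manual for your version before touching a real OST, and only do this with the OST mounted as ldiskfs, not as Lustre. The helper names read_last_id and write_last_id are hypothetical.

```python
import os
import struct
import tempfile

# Assumption: LAST_ID is one unsigned 64-bit little-endian integer.
LAST_ID_FORMAT = "<Q"
LAST_ID_SIZE = struct.calcsize(LAST_ID_FORMAT)


def read_last_id(path):
    """Read the objid stored in a LAST_ID file."""
    with open(path, "rb") as f:
        data = f.read(LAST_ID_SIZE)
    if len(data) != LAST_ID_SIZE:
        raise ValueError("LAST_ID is shorter than %d bytes" % LAST_ID_SIZE)
    return struct.unpack(LAST_ID_FORMAT, data)[0]


def write_last_id(path, objid):
    """Overwrite the stored objid in place and return the previous value."""
    previous = read_last_id(path)
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(struct.pack(LAST_ID_FORMAT, objid))
    return previous


if __name__ == "__main__":
    # Demo on a scratch file, not on a real OST.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "LAST_ID")
        with open(path, "wb") as f:
            f.write(struct.pack(LAST_ID_FORMAT, 517386))
        old = write_last_id(path, 870397)
        print(old, read_last_id(path))
```

Returning the previous value makes it easy to record what was there in case the change has to be reverted.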
Bernd Schubert
2010-Nov-04 12:35 UTC
[Lustre-discuss] Serious error: objid already exists; is this filesystem corrupt?
Hello Christopher, hello Alex,

the alternative is to let e2fsck correct LAST_ID. Patches are here:

https://bugzilla.lustre.org/show_bug.cgi?id=22734

and included in our e2fsprogs releases:

http://eu.ddn.com:8080/lustre/lustre/RHEL5/tools/e2fsprogs/

Unfortunately, the patches are not yet in the Oracle e2fsprogs version.

To let e2fsck correct it, you will need to create an mdsdb file (the hdr is sufficient) and then run

    e2fsck --mdsdb mdsdb.hdr --ostdb some_irrelevant_file /dev/device

The procedure is similar to the lfsck preparations, although one usually runs those with "-n". To let e2fsck (pass6, the db part) correct the LAST_ID, it must *not* run in read-only mode, though.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
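If the e2fsck route is scripted, the one easy mistake is carrying over "-n" from an lfsck-style run. The sketch below only assembles the command line quoted in the message above and refuses "-n". It does not run anything. The function name build_e2fsck_command and the example paths are hypothetical, and the --mdsdb/--ostdb options are only available in a Lustre-patched e2fsprogs.

```python
import shlex


def build_e2fsck_command(device, mdsdb_hdr, ostdb_path, extra_args=()):
    """Assemble the e2fsck argument list; reject read-only mode."""
    for arg in extra_args:
        # With -n, e2fsck opens the filesystem read-only and so
        # cannot rewrite LAST_ID.
        if arg == "-n":
            raise ValueError("-n is read-only; e2fsck could not correct LAST_ID")
    return ["e2fsck", "--mdsdb", mdsdb_hdr, "--ostdb", ostdb_path,
            *extra_args, device]


if __name__ == "__main__":
    cmd = build_e2fsck_command("/dev/device", "mdsdb.hdr", "some_irrelevant_file")
    # Print for review instead of executing.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the command rather than executing it leaves the final decision, and the check that the OST is unmounted, with the administrator.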
Chris Walker
2010-Nov-05 03:04 UTC
[Lustre-discuss] Serious error: objid already exists; is this filesystem corrupt?
Thanks very much to both of you -- the manual method worked perfectly.

Best,
Chris