Hi,

on one of our OSTs, I get the message

[15065.001307] LustreError: 12549:0:(filter.c:2546:filter_precreate()) ost4-2: Serious error: objid 9413629 already exists; is this filesystem corrupt?
[15065.016643] LustreError: 12549:0:(filter.c:2547:filter_precreate()) LBUG

which brings Lustre client access and the OST to a halt. Any idea what can be done to remove the troubling objid? An e2fsck on both MDS and OST didn't show any corruption. We use Kernel 2.6.15.7 with Lustre 1.4.6.1.

Thanks,

Roland
On Dec 06, 2006 17:23 +0100, Roland Fehrenbacher wrote:
> on one of our OSTs, I get the message
>
> [15065.001307] LustreError: 12549:0:(filter.c:2546:filter_precreate()) ost4-2: Serious error: objid 9413629 already exists; is this filesystem corrupt?
> [15065.016643] LustreError: 12549:0:(filter.c:2547:filter_precreate()) LBUG
>
> which brings Lustre client access and the OST to a halt. Any idea what
> can be done to remove the troubling objid? An e2fsck on both MDS and
> OST didn't show any corruption.

You need to mount the OST directly (with Lustre stopped) to fix this problem. The problem is that there is an inconsistency between the file that tracks the last created object and the objects themselves. This shouldn't be possible with ext3, unless you are using write cache and the system crashes in the middle of committing a transaction that is later rolled back.

    mount -t ldiskfs /dev/{ostdev} /mnt/ost
    od -td8 /mnt/ost/O/0/LAST_ID    # should be highest object ID
    ls /mnt/ost/O/0/d*              # find highest object ID
    xxd /mnt/ost/O/0/LAST_ID /tmp/li.asc
    {edit /tmp/li.asc to record highest-found object ID}
    xxd -r /tmp/li.asc /tmp/li
    od -td8 /tmp/li                 # verify this is correct
    cp /tmp/li /mnt/ost/O/0/LAST_ID
    umount /mnt/ost

Then restart the OST and it should recover correctly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
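On an OST with many objects, picking the highest ID out of raw ls output by eye is error-prone. The following is a minimal sketch of the "find highest object ID" step, not part of the original advice. It assumes that objects are regular files directly under the d* subdirectories and that each file name is the decimal object ID; the function name highest_objid is hypothetical.

```shell
# Sketch only: print the highest numeric file name found under <objdir>/d*/.
# Assumption: object files are named by their decimal object ID.
highest_objid() {
    objdir=$1
    max=0
    for f in "$objdir"/d*/*; do
        [ -f "$f" ] || continue              # skip unmatched globs and non-files
        id=${f##*/}                          # file name without the directory part
        case $id in
            ''|*[!0-9]*) continue ;;         # ignore names that are not plain decimal
        esac
        if [ "$id" -gt "$max" ]; then
            max=$id
        fi
    done
    echo "$max"
}
```

With the OST mounted as above, this could be called as `highest_objid /mnt/ost/O/0` and the result compared against the value printed by `od -td8 /mnt/ost/O/0/LAST_ID`. It prints 0 if no numeric object names are found.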
> On Dec 06, 2006 17:23 +0100, Roland Fehrenbacher wrote:
>> on one of our OSTs, I get the message
>>
>> [15065.001307] LustreError: 12549:0:(filter.c:2546:filter_precreate()) ost4-2: Serious error: objid 9413629 already exists; is this filesystem corrupt?
>> [15065.016643] LustreError: 12549:0:(filter.c:2547:filter_precreate()) LBUG
>>
>> which brings Lustre client access and the OST to a halt. Any idea what
>> can be done to remove the troubling objid? An e2fsck on both MDS and
>> OST didn't show any corruption.
>
> You need to mount the OST directly (with Lustre stopped) to fix this problem.
> The problem is that there is an inconsistency between the file that tracks
> the last created object and the objects themselves. This shouldn't be
> possible with ext3, unless you are using write cache and the system crashes
> in the middle of committing a transaction that is later rolled back.
>
>     mount -t ldiskfs /dev/{ostdev} /mnt/ost
>     od -td8 /mnt/ost/O/0/LAST_ID    # should be highest object ID
>     ls /mnt/ost/O/0/d*              # find highest object ID
>     xxd /mnt/ost/O/0/LAST_ID /tmp/li.asc
>     {edit /tmp/li.asc to record highest-found object ID}
>     xxd -r /tmp/li.asc /tmp/li
>     od -td8 /tmp/li                 # verify this is correct
>     cp /tmp/li /mnt/ost/O/0/LAST_ID
>     umount /mnt/ost
>
> Then restart the OST and it should recover correctly.

Thanks for your fast reply. I did what you suggested, and the highest object ID found by the ls /mnt/ost/O/0/d* was 9413754. So I changed /mnt/ost/O/0/LAST_ID accordingly but obtain the following error when restarting the OST:

[27965.722794] LustreError: 28694:0:(filter.c:2409:filter_should_precreate()) ASSERTION(diff >= 0)failed:ost4-2: 9413692 - 9413754 = -62
[27965.722803] LustreError: 28694:0:(filter.c:2409:filter_should_precreate()) LBUG

What did I do wrong?

Roland
On Dec 07, 2006 00:48 +0100, Roland Fehrenbacher wrote:
> Thanks for your fast reply. I did what you suggested, and the highest
> object ID found by the ls /mnt/ost/O/0/d* was 9413754. So I changed
> /mnt/ost/O/0/LAST_ID accordingly but obtain the following error when
> restarting the OST:
>
> [27965.722794] LustreError: 28694:0:(filter.c:2409:filter_should_precreate()) ASSERTION(diff >= 0)failed:ost4-2: 9413692 - 9413754 = -62
> [27965.722803] LustreError: 28694:0:(filter.c:2409:filter_should_precreate()) LBUG

Delete all of the objects > 9413692. They should all be zero-length objects. Also update LAST_ID to be the same value.

You didn't do anything wrong, it's just that this can happen one of two ways, and I picked the wrong way to fix it the first time.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
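Before deleting anything, it may be worth checking that every object above the threshold really is zero-length, as expected. The following is a minimal sketch that only lists candidates and deletes nothing. It assumes object files sit directly under the d* subdirectories and are named by their decimal object ID; the function name objects_above is hypothetical.

```shell
# Sketch only: print "<id> <size-in-bytes>" for each object under <objdir>/d*/
# whose numeric name is greater than <last_id>. Nothing is deleted.
# Assumption: object files are named by their decimal object ID.
objects_above() {
    objdir=$1
    last_id=$2
    for f in "$objdir"/d*/*; do
        [ -f "$f" ] || continue              # skip unmatched globs and non-files
        id=${f##*/}                          # file name without the directory part
        case $id in
            ''|*[!0-9]*) continue ;;         # ignore names that are not plain decimal
        esac
        if [ "$id" -gt "$last_id" ]; then
            size=$(wc -c < "$f" | tr -d ' ') # byte count, padding stripped
            echo "$id $size"
        fi
    done
}
```

With the OST mounted via ldiskfs and Lustre stopped, `objects_above /mnt/ost/O/0 9413692` would list the objects in question. Any line whose second column is not 0 is an object holding data and deserves a closer look before removal.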
>>>>> "Andreas" == Andreas Dilger <adilger@clusterfs.com> writes:

    Andreas> On Dec 07, 2006 00:48 +0100, Roland Fehrenbacher wrote:
    >> Thanks for your fast reply. I did what you suggested, and the
    >> highest object ID found by the ls /mnt/ost/O/0/d* was
    >> 9413754. So I changed /mnt/ost/O/0/LAST_ID accordingly but
    >> obtain the following error when restarting the OST:
    >>
    >> [27965.722794] LustreError: 28694:0:(filter.c:2409:filter_should_precreate()) ASSERTION(diff >= 0)failed:ost4-2: 9413692 - 9413754 = -62
    >> [27965.722803] LustreError: 28694:0:(filter.c:2409:filter_should_precreate()) LBUG

    Andreas> Delete all of the objects > 9413692. They should all be
    Andreas> zero-length objects. Also update LAST_ID to be the same
    Andreas> value.

Interesting: After I rebooted the node (the OST wouldn't umount anymore) I mounted the device again using

$ mount -t ldiskfs /dev/{ostdev} /mnt/ost

To my surprise, when checking /mnt/ost/O/0/LAST_ID, it had decreased automatically to 9413628 (the objid - 1 it was originally complaining about), and also using ls /mnt/ost/O/0/d* showed that all obj > 9413628 were gone. So Lustre seemed to have autocleaned up things correctly by itself. I then umounted and restarted Lustre, after which everything was running fine til now.

The OST is sitting on a LVM LV with its VG on a Software RAID6 device. Could the original problem have had anything to do with bug 11313?

Thanks a lot once more,

Roland

    Andreas> You didn't do anything wrong, it's just that this can
    Andreas> happen one of two ways, and I picked the wrong way to fix
    Andreas> it the first time.

    Andreas> Cheers, Andreas -- Andreas Dilger Principal Software
    Andreas> Engineer Cluster File Systems, Inc.
On Dec 07, 2006 12:08 +0100, Roland Fehrenbacher wrote:
> To my surprise, when checking /mnt/ost/O/0/LAST_ID, it had decreased
> automatically to 9413628 (the objid - 1 it was originally complaining
> about), and also using ls /mnt/ost/O/0/d* showed that all obj >
> 9413628 were gone. So Lustre seemed to have autocleaned up things
> correctly by itself. I then umounted and restarted Lustre, after which
> everything was running fine til now.

Very odd.

> The OST is sitting on a LVM LV with its VG on a Software RAID6 device.
> Could the original problem have had anything to do with bug 11313?

RAID6 isn't a well-supported configuration. There may be bad interactions between the RAID6 code and the Lustre IO. It does sound similar to 11313, but I'm not sure it is the same thing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.