Hi all,

This morning we saw that some clients could not connect to one of our
nodes.

I ran fsck on the node and remounted it. Fsck found a lot of errors.

After this I see this in the logs again:

Sep 16 18:16:08 node1 kernel: LustreError: 2538:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 80132: rc -2
Sep 16 18:16:08 node1 kernel: LustreError: 2538:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 15 previous similar messages
Sep 16 18:27:15 node1 kernel: LustreError: 2487:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 232490: rc -2
Sep 16 18:27:15 node1 kernel: LustreError: 2487:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 16 previous similar messages
Sep 16 18:30:08 node1 kernel: LDISKFS-fs error (device sdb1): ldiskfs_ext_find_extent: bad header in inode #58262056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Sep 16 18:30:08 node1 kernel: Remounting filesystem read-only
Sep 16 18:30:16 node1 kernel: Lustre: Skipped 1 previous similar message
Sep 16 18:30:16 node1 kernel: LustreError: 3986:0:(fsfilt-ldiskfs.c:1318:fsfilt_ldiskfs_write_record()) can't start transaction for 37 blocks (128 bytes)
Sep 16 18:30:16 node1 kernel: LustreError: 3986:0:(fsfilt-ldiskfs.c:1318:fsfilt_ldiskfs_write_record()) Skipped 53 previous similar messages
Sep 16 18:30:16 node1 kernel: LustreError: 3986:0:(filter.c:360:filter_client_free()) zeroing out client 2bee00d4-c421-4c8e-bf27-8dd131e0bc55 at idx 51 (14720) in last_rcvd rc -30
Sep 16 18:34:32 node1 kernel: LustreError: 2426:0:(fsfilt-ldiskfs.c:281:fsfilt_ldiskfs_start()) error starting handle for op 8 (71 credits): rc -30
Sep 16 18:34:32 node1 kernel: LustreError: 2426:0:(fsfilt-ldiskfs.c:281:fsfilt_ldiskfs_start()) Skipped 36 previous similar messages
Sep 16 18:37:16 node1 kernel: LustreError: 2416:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 212576: rc -2
Sep 16 18:37:16 node1 kernel: LustreError: 2416:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 15 previous similar messages
Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(fsfilt-ldiskfs.c:281:fsfilt_ldiskfs_start()) error starting handle for op 8 (71 credits): rc -30
Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(filter.c:273:filter_client_add()) unable to start transaction: rc -30
Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(filter.c:273:filter_client_add()) Skipped 36 previous similar messages
Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(filter.c:294:filter_client_add()) error writing last_rcvd client idx 34: rc -30
Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(filter.c:294:filter_client_add()) Skipped 36 previous similar messages
Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-30) req at ffff81000276de00 x8/t0 o8-><?>@<?>:-1 lens 240/144 ref 0 fl Interpret:/0/0 rc -30/0
Sep 16 18:43:21 node1 kernel: LustreError: 2341:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 5 previous similar messages
Sep 16 18:44:10 node1 kernel: LustreError: 2399:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-30) req at ffff81005f5d0400 x38/t0 o8-><?>@<?>:-1 lens 240/144 ref 0 fl Interpret:/0/0 rc -30/0
Sep 16 18:44:10 node1 kernel: LustreError: 2399:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 1 previous similar message
Sep 16 18:44:35 node1 kernel: LustreError: 2341:0:(filter.c:273:filter_client_add()) unable to start transaction: rc -30
Sep 16 18:44:35 node1 kernel: LustreError: 2341:0:(filter.c:273:filter_client_add()) Skipped 2 previous similar messages
Sep 16 18:44:35 node1 kernel: LustreError: 2341:0:(filter.c:294:filter_client_add()) error writing last_rcvd client idx 51: rc -30
Sep 16 18:44:35 node1 kernel: LustreError: 2341:0:(filter.c:294:filter_client_add()) Skipped 2 previous similar messages

The errno -2 messages came right after fsck, so that is OK. But why is
-30 here? I hoped it would disappear after fsck, but I see it again.
What could cause this problem? How can I solve it?

Thank you,
tamas
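A log excerpt like the one in this message can be summarised by function and return code, which makes it easier to see which code paths report -2 and which report -30. The following is a minimal Python sketch, not a Lustre tool: the name `tally_rc` and the two regular expressions are assumptions written to fit the message layout shown in this thread, and other Lustre versions may format messages differently. Lines that only say "Skipped N previous similar messages" carry no return code and are ignored, so the counts are a lower bound.

```python
import re
from collections import Counter

# "(file.c:line:function())" tag that prefixes each LustreError message.
TAG = re.compile(r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)")
# "rc -30", "rc: -30" or "rc -30/0"; only the first number is captured.
RC = re.compile(r"\brc:? (?P<rc>-?\d+)")

def tally_rc(lines):
    """Count (function, rc) pairs in an iterable of kernel log lines."""
    counts = Counter()
    for line in lines:
        tag = TAG.search(line)
        rc = RC.search(line)
        if tag and rc:  # lines without both parts (e.g. "Skipped ...") are ignored
            counts[(tag.group("func"), int(rc.group("rc")))] += 1
    return counts

# Three lines copied from the log in this message.
SAMPLE = [
    "Sep 16 18:16:08 node1 kernel: LustreError: 2538:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 80132: rc -2",
    "Sep 16 18:16:08 node1 kernel: LustreError: 2538:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 15 previous similar messages",
    "Sep 16 18:43:20 node1 kernel: LustreError: 2341:0:(filter.c:273:filter_client_add()) unable to start transaction: rc -30",
]

if __name__ == "__main__":
    for (func, rc), n in sorted(tally_rc(SAMPLE).items()):
        print(f"{func}: rc {rc} seen {n} time(s)")
```

To run it against a real log, pass an open file object instead of `SAMPLE`, for example `tally_rc(open("/var/log/messages"))`; the path is only an example.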
On Tuesday 16 September 2008 18:47:43 Papp Tamás wrote:
> Hi all,
>
> This morning we saw that some clients could not connect to one of our
> nodes.
>
> I ran fsck on the node and remounted it. Fsck found a lot of errors.
>
> After this I see this in the logs again:

[...]

> Sep 16 18:30:08 node1 kernel: LDISKFS-fs error (device sdb1):
> ldiskfs_ext_find_extent: bad header in inode #58262056: invalid magic -
> magic 0, entries 0, max 0(0), depth 0(0)

Did you use the latest e2fsprogs from Sun?

-- 
Bernd Schubert
Q-Leap Networks GmbH
Bernd Schubert wrote:
> On Tuesday 16 September 2008 18:47:43 Papp Tamás wrote:
>
>> Hi all,
>>
>> This morning we saw that some clients could not connect to one of our
>> nodes.
>>
>> I ran fsck on the node and remounted it. Fsck found a lot of errors.
>>
>> After this I see this in the logs again:
>
> [...]
>
>> Sep 16 18:30:08 node1 kernel: LDISKFS-fs error (device sdb1):
>> ldiskfs_ext_find_extent: bad header in inode #58262056: invalid magic -
>> magic 0, entries 0, max 0(0), depth 0(0)
>
> Did you use the latest e2fsprogs from Sun?

No, I use 1.40.4.cfs1, but anyway it's a good idea. However, it was the
current e2fsprogs version when Lustre 1.6.4.3 was the current release.

Thanks for the advice, I'll try it.

tamas
On Sep 16, 2008 18:47 +0200, Papp Tamás wrote:
> I ran fsck on the node and remounted it. Fsck found a lot of errors.
>
> After this I see this in the logs again:
> Sep 16 18:30:08 node1 kernel: LDISKFS-fs error (device sdb1):
> ldiskfs_ext_find_extent: bad header in inode #58262056: invalid magic -
> magic 0, entries 0, max 0(0), depth 0(0)
> Sep 16 18:30:08 node1 kernel: Remounting filesystem read-only
>
> But why is -30 here? I hoped it would disappear after fsck, but I
> see it again. What could cause this problem? How can I solve it?

-30 = -EROFS, caused by the extent header error. This was fixed in
very recent Lustre e2fsprogs; do you have the latest released version?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
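The mapping from the numeric `rc` values in the log to symbolic names can be checked with Python's standard `errno` module. This is a small sketch under the assumption that it runs on a Linux host, where the kernel's error numbering matches what the module reports; the helper name `describe_rc` is made up for illustration.

```python
import errno
import os

def describe_rc(rc):
    """Return (symbolic name, message) for a negative kernel-style return code."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

if __name__ == "__main__":
    # The two return codes that appear in the log in this thread.
    for rc in (-2, -30):
        name, text = describe_rc(rc)
        print(f"rc {rc}: {name} ({text})")
```

On Linux, -2 resolves to ENOENT and -30 to EROFS, which fits the "Remounting filesystem read-only" line that precedes the first -30 in the log.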
Andreas Dilger wrote:
> On Sep 16, 2008 18:47 +0200, Papp Tamás wrote:
>
>> I ran fsck on the node and remounted it. Fsck found a lot of errors.
>>
>> After this I see this in the logs again:
>> Sep 16 18:30:08 node1 kernel: LDISKFS-fs error (device sdb1):
>> ldiskfs_ext_find_extent: bad header in inode #58262056: invalid magic -
>> magic 0, entries 0, max 0(0), depth 0(0)
>> Sep 16 18:30:08 node1 kernel: Remounting filesystem read-only
>>
>> But why is -30 here? I hoped it would disappear after fsck, but I
>> see it again. What could cause this problem? How can I solve it?
>
> -30 = -EROFS, caused by the extent header error. This was fixed in
> very recent Lustre e2fsprogs; do you have the latest released version?

Well, the recent e2fsprogs from Sun did not help.

So I tried to move the files off the node, but it's not so simple, and I
have some questions.

1.
$ lfs df|grep OST0002
cubefs-OST0002_UUID 1845110624 1512955404 332155220 81% /W[OST:2]
$ lctl dl|grep OST0002
  4 UP osc cubefs-OST0002-osc-ffff81002b2b5000 345f312a-51e9-b9de-b462-35a56ae76341 5

Which one should I use?

Anyway:

$ lfs find --obd cubefs-OST0002-osc-ffff81002b2b5000 -r .
error: setup_obd_uuids: unknown obduuid: cubefs-OST0002-osc-ffff81002b2b5000
./1 2
./1 23
./1 234
./1 2345

$ lfs find --obd cubefs-OST0002_UUID -r .
error: setup_obd_uuids: unknown obduuid: cubefs-OST0002_UUID
./1 2
./1 23
./1 234
./1 2345

But:

$ lfs getstripe .
OBDS:
. has no stripe info
./1 2
    obdidx     objid      objid    group
         3    455101    0x6f1bd        0

./1 23
    obdidx     objid      objid    group
         3    455125    0x6f1d5        0

./1 234
    obdidx     objid      objid    group
         4    448480    0x6d7e0        0

./1 2345
    obdidx     objid      objid    group
         2    455201    0x6f221        0

2.
# mount|grep lustre
/dev/sdb1 on /mnt/cubefs/ost-1 type lustre (rw)

# grep lustre /proc/mounts
/dev/*sdb */mnt/cubefs/ost-1 lustre ro 0 0

Why don't I see sdb1 in /proc?
Also, why do I see ro in /proc?

3.
samba:~$ cat /proc/fs/lustre/lov/cubefs-clilov-ffff8100330c4800/target_obd
samba:~$

I have another cluster, and there it shows the right values (on the same
machine, at the same time).

Lustre 1.6.4.3

Thank you,
tamas
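Since both `lfs find --obd` attempts in this message fail with "unknown obduuid", one fallback is to filter the `lfs getstripe` output by the `obdidx` column. The Python sketch below does that under several assumptions: the text layout is the one shown in this message (a path line, a header line starting with `obdidx`, then one row per stripe, with a blank line between files), `obdidx` 2 is the index of OST0002 as the `[OST:2]` column of `lfs df` suggests, and the function name `files_on_ost` is hypothetical. Other Lustre versions print getstripe output differently, so treat it as an illustration only.

```python
def files_on_ost(getstripe_text, ost_index):
    """Return paths that have at least one object on the given OST index."""
    hits = []
    current = None    # path whose stripe table is being read
    previous = None   # last ordinary (non-table) line seen
    in_table = False
    for raw in getstripe_text.splitlines():
        line = raw.strip()
        if not line:              # a blank line ends the current stripe table
            in_table = False
            continue
        fields = line.split()
        if fields[0] == "obdidx":  # header: the path is the line just before it
            current, in_table = previous, True
            continue
        if in_table and len(fields) == 4 and fields[0].isdigit():
            if int(fields[0]) == ost_index and current not in hits:
                hits.append(current)
            continue
        in_table = False
        previous = line
    return hits

# Output copied from the message above, with column spacing simplified.
SAMPLE = """OBDS:
. has no stripe info
./1 2
obdidx objid objid group
3 455101 0x6f1bd 0

./1 23
obdidx objid objid group
3 455125 0x6f1d5 0

./1 234
obdidx objid objid group
4 448480 0x6d7e0 0

./1 2345
obdidx objid objid group
2 455201 0x6f221 0
"""

if __name__ == "__main__":
    for path in files_on_ost(SAMPLE, 2):
        print(path)
```

With the sample from this message, only `./1 2345` has an object on index 2. The sketch only lists candidate files; how to migrate them off the OST is a separate question that this thread does not answer.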
On Wed, Sep 17, 2008 at 04:08:12PM +0200, Papp Tamas wrote:
>
> Well, the recent e2fsprogs from Sun did not help.

Some more information.

It's strange, but there is one client (backupgwm; it's a gateway client
between two clusters: one is the main cluster and the other is a backup
system). This one works great without any error.

There was another one, but I unmounted it. After I remounted it, it was
working like the others.

Any idea what's happening?

Thanks,
tamas
Papp Tamas wrote:
> Andreas Dilger wrote:
>
>> On Sep 16, 2008 18:47 +0200, Papp Tamás wrote:
>>
>>> I ran fsck on the node and remounted it. Fsck found a lot of errors.
>>>
>>> After this I see this in the logs again:
>>> Sep 16 18:30:08 node1 kernel: LDISKFS-fs error (device sdb1):
>>> ldiskfs_ext_find_extent: bad header in inode #58262056: invalid magic -
>>> magic 0, entries 0, max 0(0), depth 0(0)
>>> Sep 16 18:30:08 node1 kernel: Remounting filesystem read-only
>>>
>>> But why is -30 here? I hoped it would disappear after fsck, but I
>>> see it again. What could cause this problem? How can I solve it?
>>
>> -30 = -EROFS, caused by the extent header error. This was fixed in
>> very recent Lustre e2fsprogs; do you have the latest released version?
>
> Well, the recent e2fsprogs from Sun did not help.
>
> So I tried to move the files off the node, but it's not so simple, and I
> have some questions.
>
> 1.
> $ lfs df|grep OST0002
> cubefs-OST0002_UUID 1845110624 1512955404 332155220 81% /W[OST:2]
> $ lctl dl|grep OST0002
>   4 UP osc cubefs-OST0002-osc-ffff81002b2b5000
> 345f312a-51e9-b9de-b462-35a56ae76341 5
>
> Which one should I use?
>
> Anyway:
>
> $ lfs find --obd cubefs-OST0002-osc-ffff81002b2b5000 -r .
> error: setup_obd_uuids: unknown obduuid: cubefs-OST0002-osc-ffff81002b2b5000
> ./1 2
> ./1 23
> ./1 234
> ./1 2345
>
> $ lfs find --obd cubefs-OST0002_UUID -r .
> error: setup_obd_uuids: unknown obduuid: cubefs-OST0002_UUID
> ./1 2
> ./1 23
> ./1 234
> ./1 2345
>
> But:
>
> $ lfs getstripe .
> OBDS:
> . has no stripe info
> ./1 2
>     obdidx     objid      objid    group
>          3    455101    0x6f1bd        0
>
> ./1 23
>     obdidx     objid      objid    group
>          3    455125    0x6f1d5        0
>
> ./1 234
>     obdidx     objid      objid    group
>          4    448480    0x6d7e0        0
>
> ./1 2345
>     obdidx     objid      objid    group
>          2    455201    0x6f221        0
>
> 2.
> # mount|grep lustre
> /dev/sdb1 on /mnt/cubefs/ost-1 type lustre (rw)
>
> # grep lustre /proc/mounts
> /dev/*sdb */mnt/cubefs/ost-1 lustre ro 0 0
>
> Why don't I see sdb1 in /proc?
> Also, why do I see ro in /proc?
>
> 3.
> samba:~$ cat /proc/fs/lustre/lov/cubefs-clilov-ffff8100330c4800/target_obd
> samba:~$
>
> I have another cluster, and there it shows the right values (on the same
> machine, at the same time).

Is there no answer to any of these issues?

Did I do something wrong?

tamas
On Thu, 2008-09-18 at 19:39 +0200, Papp Tamás wrote:
>
> Is there no answer to any of these issues?
>
> Did I do something wrong?

No, but you have to understand that we all have tasks that we must
complete in order to meet our goals and that responding to questions
here is done on an as-time-permits basis. Some days there is time and
some there is not. You will just have to bear with us and be patient.

I really don't want this to come off like a sales pitch, but just so you
know, if your operations require that you get timely responses to
problems and questions, Sun does offer support contracts that guarantee
response times.

Cheers,
b.
Brian J. Murrell wrote:
> No, but you have to understand that we all have tasks that we must
> complete in order to meet our goals and that responding to questions
> here is done on an as-time-permits basis. Some days there is time and
> some there is not. You will just have to bear with us and be patient.

I'm really sorry if it looks like I want to rush anybody for an answer.
I just got used to getting relatively quick responses.
I searched the web and the archives and saw a mail about a failure like
this without any answer. So I thought the problem was with the
questions, or with how the cluster was set up.

> I really don't want this to come off like a sales pitch, but just so you
> know, if your operations require that you get timely responses to
> problems and questions, Sun does offer support contracts that guarantee
> response times.

Actually I tried, without any feedback :)

Thank you for any help,
tamas

> > # grep lustre /proc/mounts
> > /dev/*sdb */mnt/cubefs/ost-1 lustre ro 0 0
> >
> > Why don't I see sdb1 in /proc?
> > Also, why do I see ro in /proc?
> >
> > 3.
> > samba:~$ cat /proc/fs/lustre/lov/cubefs-clilov-ffff8100330c4800/target_obd
> > samba:~$
> >
> > I have another cluster, and there it shows the right values (on the same
> > machine, at the same time).
>
> Is there no answer to any of these issues?
>
> Did I do something wrong?

Someone needs to be motivated to answer your questions, either because
someone is paying them, or because they get some benefit out of spending
the time.

What's great about open source is that anyone has access to the code,
and you can exchange something other than money to get help with
software. But there still needs to be something the person answering
your question gets in return for spending the time and effort on it.
On Thu, 2008-09-18 at 13:23 -0500, Troy Benjegerdes wrote:
>
> Someone needs to be motivated to answer your questions, either because
> someone is paying them, or because they get some benefit out of spending
> the time.

Indeed. The benefit for most of us here is purely the satisfaction one
gets from helping somebody else. Unfortunately that satisfaction
doesn't go very far when one has to explain why one has not met one's
objectives. :-/

Now, where in this pile did that list of objectives get to...

b.
On Thu, 2008-09-18 at 20:20 +0200, Papp Tamás wrote:
>
> Actually I tried, without any feedback :)

Ahhhh. That's not good. Did you try
http://www.sun.com/software/products/lustre/support.xml?

b.
Brian J. Murrell wrote:
> On Thu, 2008-09-18 at 20:20 +0200, Papp Tamás wrote:
>
>> Actually I tried, without any feedback :)
>
> Ahhhh. That's not good. Did you try
> http://www.sun.com/software/products/lustre/support.xml?

If I remember correctly, yes.
I think the problem is that here in Hungary there is no support for
Lustre at Sun; perhaps nobody can do it? I don't know.

It was about half a year to a year ago, right after Sun bought Clusterfs.

Maybe I should try again :)

tamas

P.S.: Again, I didn't want to demand anything.
On Thu, 2008-09-18 at 21:14 +0200, Papp Tamás wrote:
> It was about half a year to a year ago, right after Sun bought Clusterfs.
>
> Maybe I should try again :)

Indeed, please do. A lot has happened in terms of integration into Sun
in the last year.

> P.S.: Again, I didn't want to demand anything.

Understood. No worries. I just wanted to explain why you might not be
seeing any responses and that it was nothing you did wrong.

b.
WRT paid Lustre support, I cannot speak with regard to the country of
Hungary, but even within the U.S. Lustre support from Sun is difficult
to get. I, on behalf of my company, contacted Sun for Lustre support
(one person was Mr. Mike McClain). I received a price quote. I
responded that my company would agree to pay that U.S. dollar amount and
asked whether it would be possible to see what the dollar amount actually
covered in terms of service support. That was two months ago. I have
never heard anything further from Sun.

...but the List does tend to answer... ;-)

Megan Larko
Center for Research on Environment & Water (CREW)
Beltsville, MD U.S.A.