John White
2010-Apr-15 20:10 UTC
[Lustre-discuss] odd kernel crash after a heartbeat failover
Hello Folks, We just had a very odd crash after a heartbeat failover that may or may not be related to each other. I''m not specifically sure if this was an IO error on the disk (I see no actual EIO, just the journal commit crash). Any ideas? The FS went through recovery just fine and doesn''t appear to have any corruption: [...] Apr 15 11:49:32 n0007.lustre heartbeat: [10940]: info: mach_down takeover complete. Apr 15 12:09:55 n0007.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: Apr 15 12:09:55 n0007.lustre kernel: [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP Apr 15 12:09:55 n0007.lustre kernel: last sysfs file: /block/dm-3/dev Apr 15 12:09:55 n0007.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4890 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4807 Apr 15 12:13:26 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358806.3725 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3714 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4796 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3740 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4991 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.3727 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5109 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5072 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.4812 Apr 15 12:13:40 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358820.3720 Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.3732 Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.5047 Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.4862 Apr 15 12:24:09 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271359449.3725 logs available upon request. ---------------- John White High Performance Computing Services (HPCS) (510) 486-7307 One Cyclotron Rd, MS: 50B-3209C Lawrence Berkeley National Lab Berkeley, CA 94720
John White
2010-Apr-15 21:53 UTC
[Lustre-discuss] odd kernel crash after a heartbeat failover
This is actually happening repeatedly, any idea if this is a lustre-side error? Apr 15 14:37:57 n0008.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: Apr 15 14:37:57 n0008.lustre kernel: <2>LDISKFS-fs error (device dm-7) in ldiskfs_reserve_inode_write: Journal has aborted Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP Apr 15 14:37:57 n0008.lustre kernel: last sysfs file: /block/dm-7/dev Apr 15 14:37:57 n0008.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 ---------------- John White High Performance Computing Services (HPCS) (510) 486-7307 One Cyclotron Rd, MS: 50B-3209C Lawrence Berkeley National Lab Berkeley, CA 94720 On Apr 15, 2010, at 1:10 PM, John White wrote:> Hello Folks, > We just had a very odd crash after a heartbeat failover that may or may not be related to each other. I''m not specifically sure if this was an IO error on the disk (I see no actual EIO, just the journal commit crash). Any ideas? The FS went through recovery just fine and doesn''t appear to have any corruption: > [...] > Apr 15 11:49:32 n0007.lustre heartbeat: [10940]: info: mach_down takeover complete. > Apr 15 12:09:55 n0007.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > Apr 15 12:09:55 n0007.lustre kernel: [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e > Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP > Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP > Apr 15 12:09:55 n0007.lustre kernel: last sysfs file: /block/dm-3/dev > Apr 15 12:09:55 n0007.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e > Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 > Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 > Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4890 > Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 > Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 > Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4807 > Apr 15 12:13:26 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358806.3725 > Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3714 > Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4796 > Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3740 > Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4991 > Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.3727 > Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5109 > Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5072 > Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.4812 > Apr 15 12:13:40 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358820.3720 > Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.3732 > Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.5047 > Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.4862 > Apr 15 12:24:09 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271359449.3725 > > > logs available upon request. > ---------------- > John White > High Performance Computing Services (HPCS) > (510) 486-7307 > One Cyclotron Rd, MS: 50B-3209C > Lawrence Berkeley National Lab > Berkeley, CA 94720 > > > > > > > >
Cliff White
2010-Apr-16 04:45 UTC
[Lustre-discuss] odd kernel crash after a heartbeat failover
John White wrote:> This is actually happening repeatedly, any idea if this is a lustre-side error? > Apr 15 14:37:57 n0008.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > Apr 15 14:37:57 n0008.lustre kernel: <2>LDISKFS-fs error (device dm-7) in ldiskfs_reserve_inode_write: Journal has aborted > Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP > Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP > Apr 15 14:37:57 n0008.lustre kernel: last sysfs file: /block/dm-7/dev > Apr 15 14:37:57 n0008.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e > Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 > Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 >Looks more like a messed-up-disk problem. I would sugest checking the health of your journal device. cliffw> ---------------- > John White > High Performance Computing Services (HPCS) > (510) 486-7307 > One Cyclotron Rd, MS: 50B-3209C > Lawrence Berkeley National Lab > Berkeley, CA 94720 > > > > > > > > > On Apr 15, 2010, at 1:10 PM, John White wrote: > >> Hello Folks, >> We just had a very odd crash after a heartbeat failover that may or may not be related to each other. I''m not specifically sure if this was an IO error on the disk (I see no actual EIO, just the journal commit crash). Any ideas? The FS went through recovery just fine and doesn''t appear to have any corruption: >> [...] >> Apr 15 11:49:32 n0007.lustre heartbeat: [10940]: info: mach_down takeover complete. >> Apr 15 12:09:55 n0007.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: >> Apr 15 12:09:55 n0007.lustre kernel: [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e >> Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP >> Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP >> Apr 15 12:09:55 n0007.lustre kernel: last sysfs file: /block/dm-3/dev >> Apr 15 12:09:55 n0007.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e >> Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 >> Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 >> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4890 >> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 >> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 >> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4807 >> Apr 15 12:13:26 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358806.3725 >> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3714 >> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4796 >> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3740 >> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4991 >> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.3727 >> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5109 >> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5072 >> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.4812 >> Apr 15 12:13:40 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358820.3720 >> Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.3732 >> Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.5047 >> Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.4862 >> Apr 15 12:24:09 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271359449.3725 >> >> >> logs available upon request. >> ---------------- >> John White >> High Performance Computing Services (HPCS) >> (510) 486-7307 >> One Cyclotron Rd, MS: 50B-3209C >> Lawrence Berkeley National Lab >> Berkeley, CA 94720 >> >> >> >> >> >> >> >> > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
John White
2010-Apr-16 18:29 UTC
[Lustre-discuss] odd kernel crash after a heartbeat failover
Just to follow-up, after enabling netconsole to get some meaningful logging out of these OSSs, it is clear that there''s a problem with the backend storage communication and that this certainly isn''t a lustre issue. Thanks folks. ---------------- John White High Performance Computing Services (HPCS) (510) 486-7307 One Cyclotron Rd, MS: 50B-3209C Lawrence Berkeley National Lab Berkeley, CA 94720 On Apr 15, 2010, at 9:45 PM, Cliff White wrote:> John White wrote: >> This is actually happening repeatedly, any idea if this is a lustre-side error? >> Apr 15 14:37:57 n0008.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: Apr 15 14:37:57 n0008.lustre kernel: <2>LDISKFS-fs error (device dm-7) in ldiskfs_reserve_inode_write: Journal has aborted Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP Apr 15 14:37:57 n0008.lustre kernel: last sysfs file: /block/dm-7/dev Apr 15 14:37:57 n0008.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 > > Looks more like a messed-up-disk problem. I would sugest checking the health of your journal device. > cliffw > >> ---------------- >> John White >> High Performance Computing Services (HPCS) >> (510) 486-7307 >> One Cyclotron Rd, MS: 50B-3209C >> Lawrence Berkeley National Lab >> Berkeley, CA 94720 >> On Apr 15, 2010, at 1:10 PM, John White wrote: >>> Hello Folks, >>> We just had a very odd crash after a heartbeat failover that may or may not be related to each other. I''m not specifically sure if this was an IO error on the disk (I see no actual EIO, just the journal commit crash). Any ideas? The FS went through recovery just fine and doesn''t appear to have any corruption: >>> [...] >>> Apr 15 11:49:32 n0007.lustre heartbeat: [10940]: info: mach_down takeover complete. Apr 15 12:09:55 n0007.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: Apr 15 12:09:55 n0007.lustre kernel: [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP Apr 15 12:09:55 n0007.lustre kernel: last sysfs file: /block/dm-3/dev Apr 15 12:09:55 n0007.lustre kernel: RIP [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4890 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4807 Apr 15 12:13:26 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358806.3725 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3714 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4796 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3740 Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4991 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.3727 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5109 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5072 Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.4812 Apr 15 12:13:40 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358820.3720 Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.3732 Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.5047 Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.4862 Apr 15 12:24:09 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271359449.3725 >>> >>> >>> logs available upon request. >>> ---------------- >>> John White >>> High Performance Computing Services (HPCS) >>> (510) 486-7307 >>> One Cyclotron Rd, MS: 50B-3209C >>> Lawrence Berkeley National Lab >>> Berkeley, CA 94720 >>> >>> >>> >>> >>> >>> >>> >>> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >
Andreas Dilger
2010-Apr-16 23:44 UTC
[Lustre-discuss] odd kernel crash after a heartbeat failover
On 2010-04-16, at 11:29, John White wrote:> Just to follow-up, after enabling netconsole to get some meaningful > logging out of these OSSs, it is clear that there''s a problem with > the backend storage communication and that this certainly isn''t a > lustre issue. Thanks folks. > > On Apr 15, 2010, at 9:45 PM, Cliff White wrote: >> John White wrote: >>> This is actually happening repeatedly, any idea if this is a >>> lustre-side error? >>> kernel: Unable to handle NULL pointer dereference at >>> 0000000000000000 >>> kernel: LDISKFS-fs error (device dm-7) in >>> ldiskfs_reserve_inode_write: Journal has aborted >>> kernel: Oops: 0002 [1] SMP >>> kernel: RIP jbd:journal_commit_transaction+0xc33/0x132eCould you please decode the line for journal_commit_transaction+0xc33 to see what line it is. This Oops shouldn''t be happening, even if the journal has aborted. Cheers, Andreas -- Andreas Dilger Principal Engineer, Lustre Group Oracle Corporation Canada Inc.
Kit Westneat
2010-Apr-17 06:30 UTC
[Lustre-discuss] odd kernel crash after a heartbeat failover
This looks like bug 18235/19025... since it only occurs with flaky hardware, the fix was only landed in 2.0. - Kit On 4/16/2010 7:44 PM, Andreas Dilger wrote:> On 2010-04-16, at 11:29, John White wrote: > >> Just to follow-up, after enabling netconsole to get some meaningful >> logging out of these OSSs, it is clear that there''s a problem with >> the backend storage communication and that this certainly isn''t a >> lustre issue. Thanks folks. >> >> On Apr 15, 2010, at 9:45 PM, Cliff White wrote: >> >>> John White wrote: >>> >>>> This is actually happening repeatedly, any idea if this is a >>>> lustre-side error? >>>> kernel: Unable to handle NULL pointer dereference at >>>> 0000000000000000 >>>> kernel: LDISKFS-fs error (device dm-7) in >>>> ldiskfs_reserve_inode_write: Journal has aborted >>>> kernel: Oops: 0002 [1] SMP >>>> kernel: RIP jbd:journal_commit_transaction+0xc33/0x132e >>>> > Could you please decode the line for journal_commit_transaction+0xc33 > to see what line it is. This Oops shouldn''t be happening, even if the > journal has aborted. > > Cheers, Andreas > -- > Andreas Dilger > Principal Engineer, Lustre Group > Oracle Corporation Canada Inc. > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-- --- Kit Westneat kwestneat at datadirectnet.com 812-484-8485