I have hit a bug on my system when OCFS2 tries to do a journal_wipe.

When it makes the call, it gets back an error of -22.

The problem is that the functions that called it seem to leave things in an
inconsistent state when they exit on this error, so bad things happen
later on :(

This diff simulates the error that I received. I am trying to figure out
what has been partially initialized by the time this gets called, but I am
having a bit of difficulty tracking it all down :(

John


Index: journal.c
===================================================================
--- journal.c	(revision 766)
+++ journal.c	(working copy)
@@ -1261,8 +1261,11 @@
 	if (!journal)
 		BUG();
 
-	status = journal_wipe(journal->k_journal, full);
+// FIXME: Simulate BUG
+//	status = journal_wipe(journal->k_journal, full);
+	status = -22;
 
+
 	LOG_EXIT_STATUS(status);
 	return(status);
 }
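For reference, here is a minimal user-space sketch of the kind of cleanup that avoids leaving partially initialized state behind when a step such as the journal wipe fails. It is only an illustration: every `demo_*` name is hypothetical and none of it comes from the OCFS2 source. It assumes a Linux-style errno, where `-EINVAL` is the -22 seen above, and it uses the common goto-unwind pattern so that each step that succeeded is undone in reverse order.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-volume state; not the OCFS2 structure. */
struct demo_volume {
	void *hash_table;	/* non-NULL once the lookup hash is allocated */
	void *journal;		/* non-NULL once the journal is loaded */
	int thread_running;	/* 1 once the worker thread would be started */
};

/* Simulates the failing step: returns -EINVAL (-22 on Linux) on request. */
int demo_journal_wipe(struct demo_volume *vol, int fail)
{
	/* Mirrors the "if (!journal) BUG();" check in the diff above. */
	assert(vol != NULL && vol->journal != NULL);
	return fail ? -EINVAL : 0;
}

/*
 * Mounts the demo volume. On failure, every step that already succeeded
 * is undone in reverse order, so the caller never sees half-set state.
 */
int demo_mount(struct demo_volume *vol, int fail_wipe)
{
	int status;

	vol->hash_table = malloc(64);
	if (!vol->hash_table) {
		status = -ENOMEM;
		goto out;
	}

	vol->journal = malloc(64);
	if (!vol->journal) {
		status = -ENOMEM;
		goto free_hash;
	}

	status = demo_journal_wipe(vol, fail_wipe);
	if (status < 0)
		goto free_journal;

	/* Only start background work once everything it needs exists. */
	vol->thread_running = 1;
	return 0;

free_journal:
	free(vol->journal);
	vol->journal = NULL;
free_hash:
	free(vol->hash_table);
	vol->hash_table = NULL;
out:
	return status;
}

/* Tears down a successfully mounted demo volume. */
void demo_umount(struct demo_volume *vol)
{
	vol->thread_running = 0;
	free(vol->journal);
	vol->journal = NULL;
	free(vol->hash_table);
	vol->hash_table = NULL;
}
```

With this shape, a failed `demo_mount()` leaves the structure exactly as it found it, so nothing that runs later can trip over a stale pointer. Whether the real mount path has the same ordering of steps is not something this sketch claims.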
At what point did you first see the error? Was it a 1st mount of a fresh
file system or just a normal mount? I assume the file system failed to mount
because of this error... Can you be more specific as to what Bad Things (TM)
were happening? :) Did it crash or what?
	--Mark

On Tue, Mar 09, 2004 at 10:39:52AM -0800, John L. Villalovos wrote:
> I have encountered on my system a bug when OCFS2 tries to do a journal_wipe.
>
> At the time that it does the call it gets back an error of -22.
>
> The problem is that it seems to leave stuff in an inconsistent state when
> it exits out of the functions that have called it. So later on bad things
> happen :(
>
> This diff simulates the error that I received. I am trying to figure out
> what is the stuff that has been partially initialized when this gets called
> but I am having a bit of difficulty and tracking it all down :(
>
> John
>
>
> Index: journal.c
> ===================================================================
> --- journal.c	(revision 766)
> +++ journal.c	(working copy)
> @@ -1261,8 +1261,11 @@
>  	if (!journal)
>  		BUG();
>
> -	status = journal_wipe(journal->k_journal, full);
> +// FIXME: Simulate BUG
> +//	status = journal_wipe(journal->k_journal, full);
> +	status = -22;
>
> +
>  	LOG_EXIT_STATUS(status);
>  	return(status);
>  }

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com
> At what point did you first see the error? Was it a 1st mount of a fresh
> file system or just a normal mount? I assume the file system failed to mount
> because of this error... Can you be more specific as to what Bad Things (TM)
> were happening? :) Did it crash or what?

It was NOT a 1st mount. It was a disk that had been previously used.

It appears that the mount fails but then some globals are probably in a
partially set state. Here is what I saw after it happened:

# mount -t ocfs2 /dev/sda1 /ocfs2
JBD: no valid journal superblock found
(1856) ERROR: status = -22, /root/ocfs/ocfs2/src/osb.c, 424
(1856) ERROR: status = -22, /root/ocfs/ocfs2/src/super.c, 1047
mount: wrong fs type, bad option, bad superblock on /dev/sda1,
       or too many mounted file systems
[root@linuxjohn2 load_ocfs]# Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
d10bb369
*pde = 0e9b5067
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[<d10bb369>]    Tainted: GF
EFLAGS: 00010286
EIP is at ocfs_bh_sem_lookup+0x29/0x650 [ocfs2]
eax: 00000000   ebx: cbc57984   ecx: 000000f9   edx: 000007f9
esi: cbc57984   edi: cbc57984   ebp: 00000800   esp: cc1d5eac
ds: 007b   es: 007b   ss: 0068
Process ocfs2nm-0 (pid: 1857, threadinfo=cc1d4000 task=ccfbd660)
Stack: cbc6a374 cc1d5ed0 0000001f cba16e90 00000010 00000010 ce654a00 cfa0a200
       cfa0a200 00000000 00000000 00000000 00000000 ccfbea40 00000000 cbc57984
       00000000 cc1d5f3c c035cd80 ccdc87b4 00000010 ce6549f0 00011c00 cfa0a200
Call Trace:
 [<d10bb9a1>] ocfs_bh_sem_lock+0x11/0x60 [ocfs2]
 [<d10c6267>] ocfs_read_bhs+0x227/0x930 [ocfs2]
 [<d10bbd6a>] ocfs_bh_sem_hash_prune+0x19a/0x390 [ocfs2]
 [<d10d5d6e>] ocfs_volume_thread+0x29e/0x930 [ocfs2]
 [<d10d5ad0>] ocfs_volume_thread+0x0/0x930 [ocfs2]
 [<c0109295>] kernel_thread_helper+0x5/0x10

Code: 8b 00 89 c3 d3 e3 8d 4d f6 d3 e0 31 c3 88 d1 89 5c 24 34 8b

After this point I couldn't unload OCFS2 anymore.

John
> Ok, could you update from latest SVN and let me know if that fixed it?
> I wasn't getting the NULL pointer error in ocfs_bh_sem_lookup like you, but
> I was definitely seeing one in ocfs_inode_hash_prune_all where we were
> assuming that an inode existed on the inum when in fact it didn't :) The fix,
> of course, was to check for its existence before acting on it!
>
> Alternatively, if you don't want to update from SVN, you can apply this
> patch.

I will give that a try, though I reformatted my partition, so I may not be
able to reproduce it.

Just a note: I am doing this on a 2.6.3 kernel.

Where I was having it crash was on:

ocfs_bh_sem * ocfs_bh_sem_lookup(struct buffer_head *bh)
{
        int depth, bucket;
        struct list_head *head, *iter = NULL;
        ocfs_bh_sem *sem = NULL, *newsem = NULL;

        bucket = ocfs_bh_sem_hash_fn(bh);   <<<<<<<<<----

#define ocfs_bh_sem_hash_fn(_b)   \
        (_hashfn((unsigned int)BH_GET_DEVICE((_b)), (_b)->b_blocknr) & ocfs_bh_hash_shift)

This macro is where the NULL reference occurs:

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
#define BH_GET_DEVICE(bh) ((bh->b_bdev)->bd_dev)  <<<<-------------
#else
#define BH_GET_DEVICE(bh) (bh->b_dev)
#endif
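
To make the failure mode concrete, here is a small user-space sketch of a macro with the same shape as the 2.6 `BH_GET_DEVICE` above, plus a guarded lookup. The `demo_*` structures and the `demo_bh_get_device()` helper are hypothetical stand-ins that model only the two fields involved; this is not a proposed patch, and it assumes the oops comes from `bh->b_bdev` being NULL, which the thread does not confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures; only the fields the macro
 * touches are modeled here. */
struct demo_block_device {
	unsigned int bd_dev;
};

struct demo_buffer_head {
	struct demo_block_device *b_bdev;
	unsigned long b_blocknr;
};

/* Same shape as the 2.6 macro quoted above: this faults if bh or
 * bh->b_bdev is NULL, because it dereferences both unconditionally. */
#define DEMO_BH_GET_DEVICE(bh) (((bh)->b_bdev)->bd_dev)

/*
 * Hypothetical guarded variant: returns 0 and stores the device number in
 * *dev on success, or -1 if the buffer head has no block device attached.
 */
int demo_bh_get_device(const struct demo_buffer_head *bh, unsigned int *dev)
{
	assert(dev != NULL);

	if (bh == NULL || bh->b_bdev == NULL)
		return -1;

	*dev = DEMO_BH_GET_DEVICE(bh);
	return 0;
}
```

A guard like this only hides the symptom. If the buffer head reaches the lookup with no device because the failed mount left stale state behind, the real fix belongs in the mount error path.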