thr3ads.net - Ocfs2 devel - [Ocfs2-devel] Bug in error handling [Mar 2004]


John L. Villalovos

2004-Mar-09 12:40 UTC

[Ocfs2-devel] Bug in error handling

I have encountered a bug on my system when OCFS2 tries to do a journal_wipe.

When it makes the call, it gets back an error of -22 (-EINVAL).

The problem is that it seems to leave things in an inconsistent state when it
exits out of the functions that called it.  So later on bad things happen
:(

This diff simulates the error that I received.  I am trying to figure out what
has been partially initialized by the time this gets called, but I am
having a bit of difficulty tracking it all down :(

John


Index: journal.c
===================================================================
--- journal.c	(revision 766)
+++ journal.c	(working copy)
@@ -1261,8 +1261,11 @@
  	if (!journal)
  		BUG();

-	status = journal_wipe(journal->k_journal, full);
+// FIXME: Simulate BUG
+//	status = journal_wipe(journal->k_journal, full);
+	status = -22;

+
  	LOG_EXIT_STATUS(status);
  	return(status);
  }
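
For readers following the thread: one common way to avoid leaving things half set up when a late step such as journal_wipe fails is to unwind in reverse order and to publish global state only after every step has succeeded. The following is a minimal user-space sketch, not the OCFS2 code. All names in it (demo_volume, demo_mount, demo_journal_wipe, g_volume) are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from the OCFS2 source. */
struct demo_volume {
    int *bitmap;      /* allocated in step 1 */
    int *journal;     /* allocated in step 2 */
    int  registered;  /* set in the final step, once the volume is usable */
};

/* Global that other code (for example a worker thread) might look at. */
static struct demo_volume *g_volume;

/* Simulates journal_wipe() failing with -22 (-EINVAL), as in the diff. */
static int demo_journal_wipe(int fail)
{
    return fail ? -EINVAL : 0;
}

static int demo_mount(int fail_wipe)
{
    int status;
    struct demo_volume *vol;

    assert(g_volume == NULL);  /* this sketch handles one volume at a time */

    vol = calloc(1, sizeof(*vol));
    if (!vol)
        return -ENOMEM;

    vol->bitmap = calloc(16, sizeof(int));
    if (!vol->bitmap) {
        status = -ENOMEM;
        goto free_vol;
    }

    vol->journal = calloc(16, sizeof(int));
    if (!vol->journal) {
        status = -ENOMEM;
        goto free_bitmap;
    }

    status = demo_journal_wipe(fail_wipe);
    if (status < 0)
        goto free_journal;

    /* Publish only after every step has succeeded. */
    vol->registered = 1;
    g_volume = vol;
    return 0;

    /* Undo the completed steps in reverse order. */
free_journal:
    free(vol->journal);
free_bitmap:
    free(vol->bitmap);
free_vol:
    free(vol);
    return status;
}
```

In this sketch, demo_mount(1) returns -EINVAL and leaves g_volume NULL, so nothing that runs later can trip over a partially built volume.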

Mark Fasheh

2004-Mar-09 15:13 UTC


[Ocfs2-devel] Bug in error handling

At what point did you first see the error? Was it a 1st mount of a fresh
file system or just a normal mount? I assume the file system failed to mount
because of this error... Can you be more specific as to what Bad Things (TM)
were happening?  :) Did it crash or what?
	--Mark

On Tue, Mar 09, 2004 at 10:39:52AM -0800, John L. Villalovos wrote:
> I have encountered on my system a bug when OCFS2 tries to do a journal_wipe.
> 
> At the time that it does the call it gets back an error of -22.
> 
> The problem is that it seems to leave stuff in an inconsistent state when 
> it exits out of the functions that have called it.  So later on bad things 
> happen :(
> 
> This diff simulates the error that I received.  I am trying to figure out 
> what has been partially initialized by the time this gets called, but I am
> having a bit of difficulty tracking it all down :(
> 
> John
> 
> 
> Index: journal.c
> ===================================================================
> --- journal.c	(revision 766)
> +++ journal.c	(working copy)
> @@ -1261,8 +1261,11 @@
>  	if (!journal)
>  		BUG();
> 
> -	status = journal_wipe(journal->k_journal, full);
> +// FIXME: Simulate BUG
> +//	status = journal_wipe(journal->k_journal, full);
> +	status = -22;
> 
> +
>  	LOG_EXIT_STATUS(status);
>  	return(status);
>  }
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com

Villalovos, John L

2004-Mar-09 15:18 UTC


[Ocfs2-devel] Bug in error handling

> At what point did you first see the error? Was it a 1st mount of a fresh
> file system or just a normal mount? I assume the file system failed to mount
> because of this error... Can you be more specific as to what Bad Things (TM)
> were happening?  :) Did it crash or what?

It was NOT a 1st mount.  It was a disk that had been previously used.

It appears that the mount fails but then some globals are probably in a
partially set state.

Here is what I saw after it happened:

# mount -t ocfs2 /dev/sda1 /ocfs2
JBD: no valid journal superblock found
(1856) ERROR: status = -22, /root/ocfs/ocfs2/src/osb.c, 424
(1856) ERROR: status = -22, /root/ocfs/ocfs2/src/super.c, 1047
mount: wrong fs type, bad option, bad superblock on /dev/sda1,
       or too many mounted file systems
[root@linuxjohn2 load_ocfs]# Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
d10bb369
*pde = 0e9b5067
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[<d10bb369>]    Tainted: GF
EFLAGS: 00010286
EIP is at ocfs_bh_sem_lookup+0x29/0x650 [ocfs2]
eax: 00000000   ebx: cbc57984   ecx: 000000f9   edx: 000007f9
esi: cbc57984   edi: cbc57984   ebp: 00000800   esp: cc1d5eac
ds: 007b   es: 007b   ss: 0068
Process ocfs2nm-0 (pid: 1857, threadinfo=cc1d4000 task=ccfbd660)
Stack: cbc6a374 cc1d5ed0 0000001f cba16e90 00000010 00000010 ce654a00 cfa0a200
       cfa0a200 00000000 00000000 00000000 00000000 ccfbea40 00000000 cbc57984
       00000000 cc1d5f3c c035cd80 ccdc87b4 00000010 ce6549f0 00011c00 cfa0a200
Call Trace:
 [<d10bb9a1>] ocfs_bh_sem_lock+0x11/0x60 [ocfs2]
 [<d10c6267>] ocfs_read_bhs+0x227/0x930 [ocfs2]
 [<d10bbd6a>] ocfs_bh_sem_hash_prune+0x19a/0x390 [ocfs2]
 [<d10d5d6e>] ocfs_volume_thread+0x29e/0x930 [ocfs2]
 [<d10d5ad0>] ocfs_volume_thread+0x0/0x930 [ocfs2]
 [<c0109295>] kernel_thread_helper+0x5/0x10

Code: 8b 00 89 c3 d3 e3 8d 4d f6 d3 e0 31 c3 88 d1 89 5c 24 34 8b



After this point I couldn't unload OCFS2 anymore.

John

Villalovos, John L

2004-Mar-09 19:14 UTC


[Ocfs2-devel] Bug in error handling

> Ok, could you update from latest SVN and let me know if that fixed it?
> I wasn't getting the NULL pointer error in ocfs_bh_sem_lookup like you, but
> I was definitely seeing one in ocfs_inode_hash_prune_all where we were
> assuming that an inode existed on the inum when in fact it didn't :) The fix,
> of course, was to check for its existence before acting on it!
> 
> Alternatively, if you don't want to update from SVN, you can apply this
> patch.

I will give that a try.  Though I reformatted my partition, so I
may not be able to reproduce it.

Just a note.  I am doing this on a 2.6.3 kernel.

Where I was having it crash was on:

ocfs_bh_sem * ocfs_bh_sem_lookup(struct buffer_head *bh)
{
        int depth, bucket;
        struct list_head *head, *iter = NULL;
        ocfs_bh_sem *sem = NULL, *newsem = NULL;

        bucket = ocfs_bh_sem_hash_fn(bh);   /* <<<<<<<<<---- */



#define ocfs_bh_sem_hash_fn(_b)   \
        (_hashfn((unsigned int)BH_GET_DEVICE((_b)), (_b)->b_blocknr) & \
         ocfs_bh_hash_shift)


This macro is where the NULL dereference occurs:

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
#define BH_GET_DEVICE(bh) ((bh->b_bdev)->bd_dev)   /* <<<<------------- */
#else
#define BH_GET_DEVICE(bh) (bh->b_dev)
#endif
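
To illustrate the failure mode: if a buffer head reaches the hash function with no block device attached, the 2.6 form of the macro dereferences a NULL pointer. The sketch below is a guarded lookup using mock user-space types, not the kernel's struct buffer_head. The names demo_bh, demo_bdev, demo_bh_bucket and DEMO_HASH_MASK are assumptions for illustration, and the XOR hash is only a placeholder for _hashfn.

```c
#include <errno.h>
#include <stddef.h>

/* Mock types for illustration only; the real struct buffer_head and
 * struct block_device are defined in the kernel headers. */
struct demo_bdev {
    unsigned int bd_dev;
};

struct demo_bh {
    struct demo_bdev *b_bdev;    /* may be NULL if setup never completed */
    unsigned long     b_blocknr;
};

#define DEMO_HASH_MASK 0xffu     /* hypothetical table mask */

/* Stores the bucket and returns 0 on success.
 * Returns -EINVAL instead of crashing when bh has no device attached. */
static int demo_bh_bucket(const struct demo_bh *bh, unsigned int *bucket)
{
    if (bh == NULL || bh->b_bdev == NULL)
        return -EINVAL;

    /* Placeholder hash: mix device number and block number, then mask. */
    *bucket = (bh->b_bdev->bd_dev ^ (unsigned int)bh->b_blocknr)
              & DEMO_HASH_MASK;
    return 0;
}
```

A check like this only turns the crash into an error return. The caller still has to handle that error, and the underlying cause (state left behind by the failed mount) would still need fixing.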
