From reading the archives I can see this issue has been hit before, but I haven't found a resolution.

I have a 50gb partition... I have formatted it at 10gb. I have it set for 4 cluster members and am using 3 of those slots.

I fill the partition to 66% and voila... no space left on device. I have tried it with big files and with lots of small files, and both ways I hit this error at 66% usage.

I am using ubuntu-server with ocfs2-tools 1.4.2-1.

If anyone has ideas/solutions I would be most grateful... this FS is awesome :P

--
Todd Freeman, Ext 6103              Don't fear the penguins!
Programming Department              http://www.linux.org/
Andrews University                  http://www.debian.org/
http://www.andrews.edu/~freeman/
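[Editor's note: a minimal sketch of the fill test described above, for anyone wanting to reproduce it. The device /dev/sdb1, the label, and the mount point /mnt/ocfs2 are assumptions, and the o2cb cluster stack is assumed to be configured and online.]

    # format with 4 node slots, as described above, then mount
    mkfs.ocfs2 -N 4 -L ocfs2test /dev/sdb1
    mount -t ocfs2 /dev/sdb1 /mnt/ocfs2

    # write 1 GB files until a write fails, watching usage climb
    i=0
    while dd if=/dev/zero of=/mnt/ocfs2/fill.$i bs=1M count=1024 2>/dev/null; do
        i=$((i+1))
        df -h /mnt/ocfs2
    done
    df -h /mnt/ocfs2   # reported failure point is around 66% ("No space left on device")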
Which kernel are you using? We have fixed this issue in mainline. We will soon have the same fix for production kernels.

On 09/07/2010 02:06 PM, Todd Freeman wrote:
> From reading the archives I can see this issue has been hit before but
> I haven't found a resolution.
>
> I have a 50gb partition... I have formatted it at 10gb. I have it set
> for 4 cluster members and am using 3 of those slots.
>
> I fill the partition to 66% and voila... no space left on device. I
> have tried it with big files and lots of small files and both ways I hit
> this error at 66% usage.
>
> I am using ubuntu-server with ocfs2-tools 1.4.2-1
>
> If anyone has ideas/solutions I would be most grateful... this FS is
> awesome :P
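[Editor's note: the version information asked for here can be gathered with standard commands; the package name assumes a Debian/Ubuntu system.]

    uname -r                    # running kernel version
    modinfo ocfs2 | head -n 5   # ocfs2 module belonging to that kernel
    dpkg -l ocfs2-tools         # installed ocfs2-tools package version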
On 09/29/2010 05:13 PM, Alexander Barton wrote:
> Hello again!
>
> On 21.09.2010 at 11:04, Tao Ma wrote:
>
>> On 09/21/2010 04:52 PM, Alexander Barton wrote:
>>
>>> So kernel 2.6.35.4 would be ok?
>>
>> It should work.
>>
>>> And OCFS2 tools from the GIT master branch? Or a special tag? There is no archive or release, right?
>>
>> I have already committed the patches to ocfs2-tools.
>> So you can get from
>> git clone git://oss.oracle.com/git/ocfs2-tools.git
>
> We upgraded both of our cluster nodes last Friday to
>
> - Debian Linux kernel "2.6.35-trunk-amd64"
>   (linux-image-2.6.35-trunk-amd64_2.6.35-1~experimental.3_amd64.deb),
>   which is 2.6.35.4 plus Debian patches
>
> - OCFS2 tools 1.6.3 from GIT
>
> Since then, our cluster is VERY unstable, we get lots of "general protection faults" and hard lockups. "Lots" as in "often more than 2 times a day".

Sorry for the trouble.

> Our scenario is OCFS2 on top of DRBD. It looks like the "crash pattern" is the following:
>
> On Node 2:
>
> cl1-n2 kernel: [ 4006.829327] general protection fault: 0000 [#21] SMP
> cl1-n2 kernel: [ 4006.829487] last sysfs file: /sys/devices/platform/coretemp.7/temp1_label
> cl1-n2 kernel: [ 4006.829558] CPU 1
> cl1-n2 kernel: [ 4006.829611] Modules linked in: ocfs2 jbd2 quota_tree tun xt_tcpudp iptable_filter hmac sha1_generic ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iptable_nat nf_nat configfs nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 coretemp drbd lru_cache cn loop hed tpm_tis snd_pcm snd_timer snd soundcore psmouse snd_page_alloc processor tpm pcspkr evdev joydev tpm_bios dcdbas serio_raw i5k_amb button rng_core shpchp pci_hotplug i5000_edac edac_core ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod usbhid hid sg sr_mod cdrom ata_generic sd_mod ses crc_t10dif enclosure ata_piix ehci_hcd uhci_hcd usbcore bnx2 libata nls_base megaraid_sas scsi_mod e1000e thermal fan thermal_sys [last unloaded: scsi_wait_scan]
> cl1-n2 kernel: [ 4006.833215]
> cl1-n2 kernel: [ 4006.833215] Pid: 7699, comm: apache2 Tainted: G D 2.6.35-trunk-amd64 #1 0H603H/PowerEdge 2950
> cl1-n2 kernel: [ 4006.833215] RIP: 0010:[<ffffffff810e1886>] [<ffffffff810e1886>] __kmalloc+0xd3/0x136
> cl1-n2 kernel: [ 4006.833215] RSP: 0018:ffff88012e277cd8 EFLAGS: 00010006
> cl1-n2 kernel: [ 4006.833215] RAX: 0000000000000000 RBX: 6f635f6465727265 RCX: ffffffffa0686032
> cl1-n2 kernel: [ 4006.833215] RDX: 0000000000000000 RSI: ffff88012e277da8 RDI: 0000000000000004
> cl1-n2 kernel: [ 4006.833215] RBP: ffffffff81625520 R08: ffff880001a52510 R09: 0000000000000003
> cl1-n2 kernel: [ 4006.833215] R10: ffff88009a561b40 R11: ffff88022d62f400 R12: 000000000000000b
> cl1-n2 kernel: [ 4006.833215] R13: 0000000000008050 R14: 0000000000008050 R15: 0000000000000246
> cl1-n2 kernel: [ 4006.833215] FS: 00007f9199715740(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
> cl1-n2 kernel: [ 4006.833215] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> cl1-n2 kernel: [ 4006.833215] CR2: 00000000402de9d0 CR3: 00000001372a1000 CR4: 00000000000406e0
> cl1-n2 kernel: [ 4006.833215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> cl1-n2 kernel: [ 4006.833215] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> cl1-n2 kernel: [ 4006.833215] Process apache2 (pid: 7699, threadinfo ffff88012e276000, task ffff88009a561b40)
> cl1-n2 kernel: [ 4006.833215] Stack:
> cl1-n2 kernel: [ 4006.833215]  ffff8801b1af9c20 ffffffffa0686032 ffff8801b1a2da20 ffff88018f5f30c0
> cl1-n2 kernel: [ 4006.833215] <0> ffff88012e277e88 000000000000000a ffff88018d105300 ffff88009a561b40
> cl1-n2 kernel: [ 4006.833215] <0> ffff88012e277da8 ffffffffa0686032 ffff88012e277e88 ffff88012e277da8
> cl1-n2 kernel: [ 4006.833215] Call Trace:
> cl1-n2 kernel: [ 4006.833215] [<ffffffffa0686032>] ? ocfs2_fast_follow_link+0x166/0x284 [ocfs2]
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f29fa>] ? do_follow_link+0xdb/0x24c
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f2d55>] ? link_path_walk+0x1ea/0x482
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f311f>] ? path_walk+0x63/0xd6
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f27ba>] ? path_init+0x46/0x1ab
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f3288>] ? do_path_lookup+0x20/0x85
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f3cd9>] ? user_path_at+0x46/0x78
> cl1-n2 kernel: [ 4006.833215] [<ffffffff81038bac>] ? pick_next_task_fair+0xe6/0xf6
> cl1-n2 kernel: [ 4006.833215] [<ffffffff81305101>] ? schedule+0x4d4/0x530
> cl1-n2 kernel: [ 4006.833215] [<ffffffff81060526>] ? prepare_creds+0x87/0x9c
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810e8649>] ? sys_faccessat+0x96/0x15b
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
> cl1-n2 kernel: [ 4006.833215] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
> cl1-n2 kernel: [ 4006.833215] RIP [<ffffffff810e1886>] __kmalloc+0xd3/0x136
> cl1-n2 kernel: [ 4006.833215] RSP <ffff88012e277cd8>
> cl1-n2 kernel: [ 4006.833215] ---[ end trace b1eead7c8752b710 ]---

Interesting, this seems to be unrelated to discontig bg. Please file a bug for it.

> This fault then repeats over and over again ...
> Sometimes the node seems to lock up completely; then Node 1 logs:
>
> cl1-n1 kernel: [ 7314.523473] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> cl1-n1 kernel: [ 7314.523485] block drbd0: asender terminated
> cl1-n1 kernel: [ 7314.523488] block drbd0: Terminating drbd0_asender
> cl1-n1 kernel: [ 7314.523495] block drbd0: short read receiving data: read 2960 expected 4096
> cl1-n1 kernel: [ 7314.523502] block drbd0: Creating new current UUID
> cl1-n1 kernel: [ 7314.523509] block drbd0: error receiving Data, l: 4120!
> cl1-n1 kernel: [ 7314.523705] block drbd0: Connection closed
> cl1-n1 kernel: [ 7314.523710] block drbd0: conn( NetworkFailure -> Unconnected )
> cl1-n1 kernel: [ 7314.523718] block drbd0: receiver terminated
> cl1-n1 kernel: [ 7314.523721] block drbd0: Restarting drbd0_receiver
> cl1-n1 kernel: [ 7314.523723] block drbd0: receiver (re)started
> cl1-n1 kernel: [ 7314.523730] block drbd0: conn( Unconnected -> WFConnection )
>
> ... which looks "fine" because of the 1st node not responding any more.
> But then we get:
>
> cl1-n1 kernel: [ 7319.136065] o2net: connection to node cl1-n2 (num 1) at 10.0.1.2:6999 has been idle for 10.0 seconds, shutting it down.
> cl1-n1 kernel: [ 7319.136079] (swapper,0,0):o2net_idle_timer1O:ee n ee n ee n ee n ee n ee n ee n ee ee ee ee ee ee ee ee ee ee ee ee ee sending message 506 (key 0xf5dfae8c) to node 1
>
> And endless messages like the following ...
>
> cl1-n1 kernel: [ 7319.196135] (TunCtl,3278,6):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196192] (IPaddr2,3416,4):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196256] (IPaddr2,3282,1):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196373] (linux,3271,3):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196622] (apache2,5474,0):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> ...
>
> ... until node 1 hangs forever, too.
>
> After rebooting both of the nodes, the cluster runs for some time ... and errors out like above again. And again. And ... :-/
>
> Two other error messages observed while trying to mount the OCFS2 filesystem on one node (but we can't reproduce any more):
>
> cl1-n2 kernel: [ 361.012172] INFO: task mount.ocfs2:4969 blocked for more than 120 seconds.
> cl1-n2 kernel: [ 361.012252] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cl1-n2 kernel: [ 361.012343] mount.ocfs2 D 00000000ffff3314 0 4969 1 0x00000000
> cl1-n2 kernel: [ 361.012347] ffff88022c8c1b40 0000000000000082 ffff880200000000 ffff88022c133e90
> cl1-n2 kernel: [ 361.012351] ffff88022ec751c0 00000000000154c0 00000000000154c0 00000000000154c0
> cl1-n2 kernel: [ 361.012354] ffff8802184b3fd8 00000000000154c0 ffff88022c8c1b40 ffff8802184b3fd8
> cl1-n2 kernel: [ 361.012356] Call Trace:
> cl1-n2 kernel: [ 361.012365] [<ffffffff813054df>] ? schedule_timeout+0x2d/0xd7
> cl1-n2 kernel: [ 361.012369] [<ffffffff81305354>] ? wait_for_common+0xd1/0x14e
> cl1-n2 kernel: [ 361.012374] [<ffffffff8103f630>] ? default_wake_function+0x0/0xf
> cl1-n2 kernel: [ 361.012394] [<ffffffffa05bc21f>] ? __ocfs2_cluster_lock+0x6e0/0x890 [ocfs2]
> cl1-n2 kernel: [ 361.012403] [<ffffffffa0551bb4>] ? dlm_register_domain+0x9e4/0xaf0 [ocfs2_dlm]
> cl1-n2 kernel: [ 361.012408] [<ffffffff81190a3e>] ? hweight_long+0x5/0x6
> cl1-n2 kernel: [ 361.012420] [<ffffffffa05bd242>] ? T.775+0x18/0x1d [ocfs2]
> cl1-n2 kernel: [ 361.012432] [<ffffffffa05bd2f3>] ? ocfs2_super_lock+0xac/0x2bd [ocfs2]
> cl1-n2 kernel: [ 361.012443] [<ffffffffa05b4d3b>] ? ocfs2_is_hard_readonly+0x10/0x23 [ocfs2]
> cl1-n2 kernel: [ 361.012455] [<ffffffffa05bd2f3>] ? ocfs2_super_lock+0xac/0x2bd [ocfs2]
> cl1-n2 kernel: [ 361.012470] [<ffffffffa05f5d66>] ? ocfs2_fill_super+0x1227/0x2101 [ocfs2]
> cl1-n2 kernel: [ 361.012475] [<ffffffff8118e2f3>] ? snprintf+0x36/0x3b
> cl1-n2 kernel: [ 361.012478] [<ffffffff810eb86e>] ? get_sb_bdev+0x137/0x19a
> cl1-n2 kernel: [ 361.012492] [<ffffffffa05f4b3f>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
> cl1-n2 kernel: [ 361.012495] [<ffffffff810eaf45>] ? vfs_kern_mount+0xa6/0x196
> cl1-n2 kernel: [ 361.012498] [<ffffffff810eb094>] ? do_kern_mount+0x49/0xe7
> cl1-n2 kernel: [ 361.012502] [<ffffffff810ff38b>] ? do_mount+0x75c/0x7d6
> cl1-n2 kernel: [ 361.012506] [<ffffffff810d90ba>] ? alloc_pages_current+0x9f/0xc2
> cl1-n2 kernel: [ 361.012508] [<ffffffff810ff48d>] ? sys_mount+0x88/0xc3
> cl1-n2 kernel: [ 361.012513] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
>
> cl1-n2 kernel: [ 361.012518] INFO: task mount.ocfs2:5481 blocked for more than 120 seconds.
> cl1-n2 kernel: [ 361.012588] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cl1-n2 kernel: [ 361.012674] mount.ocfs2 D 00000000ffff63dd 0 5481 1 0x00000000
> cl1-n2 kernel: [ 361.012677] ffff88022c7a2210 0000000000000082 ffff88022c5ae2c8 ffffffff00000000
> cl1-n2 kernel: [ 361.012680] ffff88022ec76d00 00000000000154c0 00000000000154c0 00000000000154c0
> cl1-n2 kernel: [ 361.012683] ffff880218485fd8 00000000000154c0 ffff88022c7a2210 ffff880218485fd8
> cl1-n2 kernel: [ 361.012685] Call Trace:
> cl1-n2 kernel: [ 361.012691] [<ffffffff813063b5>] ? rwsem_down_failed_common+0x97/0xcb
> cl1-n2 kernel: [ 361.012694] [<ffffffff810ea58f>] ? test_bdev_super+0x0/0xd
> cl1-n2 kernel: [ 361.012697] [<ffffffff81306405>] ? rwsem_down_write_failed+0x1c/0x25
> cl1-n2 kernel: [ 361.012700] [<ffffffff8118f2c3>] ? call_rwsem_down_write_failed+0x13/0x20
> cl1-n2 kernel: [ 361.012703] [<ffffffff81305d2e>] ? down_write+0x25/0x27
> cl1-n2 kernel: [ 361.012705] [<ffffffff810eb361>] ? sget+0x99/0x34d
> cl1-n2 kernel: [ 361.012708] [<ffffffff810ea565>] ? set_bdev_super+0x0/0x2a
> cl1-n2 kernel: [ 361.012710] [<ffffffff810eb7d5>] ? get_sb_bdev+0x9e/0x19a
> cl1-n2 kernel: [ 361.012724] [<ffffffffa05f4b3f>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
> cl1-n2 kernel: [ 361.012727] [<ffffffff810eaf45>] ? vfs_kern_mount+0xa6/0x196
> cl1-n2 kernel: [ 361.012730] [<ffffffff810eb094>] ? do_kern_mount+0x49/0xe7
> cl1-n2 kernel: [ 361.012733] [<ffffffff810ff38b>] ? do_mount+0x75c/0x7d6
> cl1-n2 kernel: [ 361.012735] [<ffffffff810d90ba>] ? alloc_pages_current+0x9f/0xc2
> cl1-n2 kernel: [ 361.012738] [<ffffffff810ff48d>] ? sys_mount+0x88/0xc3
>
> Ok, so this kernel and/or OCFS2-tools upgrade is a very unpleasant one for us :-(
> But what can we do?
>
> Is it possible for us to downgrade the kernel and ocfs2-tools after "tunefs.ocfs2 --fs-features=discontig-bg"?

Just use tunefs.ocfs2 --fs-features=nodiscontig-bg to disable it and you can go back to the old kernel and use it.

Regards,
Tao
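[Editor's note: a rough sketch of that downgrade path. The device /dev/drbd0 and the mount point /mnt/ocfs2 are assumptions, and to the best of my knowledge tunefs.ocfs2 requires the volume to be unmounted on all nodes before it will change feature flags.]

    # on every node
    umount /mnt/ocfs2

    # on one node only: verify the volume is clean, then drop the feature
    fsck.ocfs2 -n /dev/drbd0
    tunefs.ocfs2 --fs-features=nodiscontig-bg /dev/drbd0

    # confirm discontig-bg no longer appears in the incompat feature flags
    debugfs.ocfs2 -R "stats" /dev/drbd0 | grep -i feature

After that, the old kernel and ocfs2-tools should be able to mount the volume again.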
On 29.09.2010 at 11:25, Tao Ma wrote:
> On 09/29/2010 05:13 PM, Alexander Barton wrote:
>
>> cl1-n2 kernel: [ 4006.829327] general protection fault: 0000 [#21] SMP
>> [... full oops in __kmalloc via ocfs2_fast_follow_link, quoted in full above ...]
>> cl1-n2 kernel: [ 4006.833215] ---[ end trace b1eead7c8752b710 ]---
>
> Interesting, this seems to be unrelated to discontig bg. Please file a bug for it.

Done. Bug #1292: <http://oss.oracle.com/bugzilla/show_bug.cgi?id=1292>

>> This fault then repeats over and over again ...
>> Sometimes the node seems to lock up completely; then Node 1 logs:
>>
>> [... DRBD connection loss, o2net idle-timeout shutdown, and endless
>>      dlm_send_remote_unlock_request "Error -107" messages, quoted in full above ...]
>>
>> ... until node 1 hangs forever, too.
>>
>> After rebooting both of the nodes, the cluster runs for some time ... and errors out like above again. And again. And ... :-/
>>
>> Two other error messages observed while trying to mount the OCFS2 filesystem on one node (but we can't reproduce any more):
>>
>> cl1-n2 kernel: [ 361.012172] INFO: task mount.ocfs2:4969 blocked for more than 120 seconds.
>> [... two "blocked for more than 120 seconds" traces for mount.ocfs2, quoted in full above ...]

Do you have any comments on these errors and traces? Or anybody else?

>> Is it possible for us to downgrade the kernel and ocfs2-tools after "tunefs.ocfs2 --fs-features=discontig-bg"?
>
> Just use tunefs.ocfs2 --fs-features=nodiscontig-bg to disable it and you can go back to the old kernel and use it.

Ok, we will most probably try this to get a stable system again. But then we will encounter the "no space left" problem again. Hmpf :-(

Thanks for your support!
Alex
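[Editor's note: for whoever picks the original "no space left" problem up again, a hedged starting point for checking how much space the allocators actually see at the ENOSPC point; the device /dev/drbd0 and mount point /mnt/ocfs2 are assumptions.]

    df -h /mnt/ocfs2    # block usage as reported when the write fails
    df -i /mnt/ocfs2    # inode usage, to rule out inode exhaustion

    # inspect the global bitmap system file directly (// addresses system files)
    debugfs.ocfs2 -R "stat //global_bitmap" /dev/drbd0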