Smith, Darren (Academic)
2013-Dec-19 21:25 UTC
[Ocfs2-users] Deadlocks on OCFS2 file I/O with Ruby Application
Hello: We are running a Ruby/Passenger app called Canvas (by Instructure). It runs from 3 application servers that use shared storage (iSCSI, EqualLogic Volume, LVM2, OCFS2) for user uploaded files. We have been experiencing strange behavior where a ruby and jbd2 process will go into a 'D' (I/O wait) state randomly with some I/O operations on the shared storage (a file upload most commonly). I turned on some kernel debugging options for one of the application servers and captured the following 'possible circular locking dependency' error during one of these times. Would anyone be able to provide any insight into possible ways to troubleshoot this further? Our distro is Gentoo Linux using hardened kernel sources v3.11.2 on this particular app server. We do not have GrSecurity or PaX enabled. The other app servers are running kernel 3.2.30. Was testing newer kernel to see if it helped - which it doesn't appear to have. Thanks for any help. [ 858.570825] =====================================================[ 858.570878] [ INFO: possible circular locking dependency detected ] [ 858.570932] 3.11.2-hardenedcanvas-2013-12-16 #4 Not tainted [ 858.570980] ------------------------------------------------------- [ 858.571033] ruby/3710 is trying to acquire lock: [ 858.571073] (&sc->sc_send_lock){+.+...}, at: [<ffffffff81355567>] o2net_send_message_vec+0x543/0x8dd [ 858.571165] but task is already holding lock: [ 858.571213] (jbd2_handle){+.+...}, at: [<ffffffff8123986e>] start_this_handle+0x4e8/0x57d [ 858.571294] which lock already depends on the new lock. [ 858.571361] the existing dependency chain (in reverse order) is: [ 858.571423] -> #4 (jbd2_handle){+.+...}: [ 858.571464] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe [ 858.571539] [<ffffffff812398e0>] start_this_handle+0x55a/0x57d [ 858.571615] [<ffffffff81239b0c>] jbd2__journal_start+0x9f/0x15c [ 858.571691] [<ffffffff8121feab>] __ext4_journal_start_sb+0x89/0xa6 [ 858.571771] [<ffffffff812005c1>] ext4_dirty_inode+0x25/0x54 [ 858.571844] [<ffffffff8117a338>] __mark_inode_dirty+0x54/0x210 [ 858.571920] [<ffffffff8116d0c7>] update_time+0x9c/0xa7 [ 858.571989] [<ffffffff8116d16e>] file_update_time+0x9c/0xbd [ 858.572060] [<ffffffff8120077f>] ext4_page_mkwrite+0x50/0x2a1 [ 858.572133] [<ffffffff81130997>] __do_fault+0x125/0x2f5 [ 858.572203] [<ffffffff81132d9a>] handle_pte_fault+0x1d4/0x4f2 [ 858.572277] [<ffffffff81133459>] handle_mm_fault+0x1af/0x1c9 [ 858.572349] [<ffffffff81029a3c>] __do_page_fault+0x317/0x35e [ 858.572423] [<ffffffff81029aab>] do_page_fault+0x9/0xb [ 858.572491] [<ffffffff816e4e45>] page_fault+0x25/0x30 [ 858.572560] -> #3 (sb_pagefaults){.+.+..}: [ 858.572643] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe [ 858.572712] [<ffffffff81159900>] __sb_start_write+0x111/0x161 [ 858.572786] [<ffffffff81200773>] ext4_page_mkwrite+0x44/0x2a1 [ 858.572860] [<ffffffff81130997>] __do_fault+0x125/0x2f5 [ 858.572928] [<ffffffff81132d9a>] handle_pte_fault+0x1d4/0x4f2 [ 858.573002] [<ffffffff81133459>] handle_mm_fault+0x1af/0x1c9 [ 858.573075] [<ffffffff81029a3c>] __do_page_fault+0x317/0x35e [ 858.573148] [<ffffffff81029aab>] do_page_fault+0x9/0xb [ 858.573216] [<ffffffff816e4e45>] page_fault+0x25/0x30 [ 858.573284] -> #2 (&mm->mmap_sem){++++++}: [ 858.573361] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe [ 858.573430] [<ffffffff81130bc9>] might_fault+0x62/0x85 [ 858.573498] [<ffffffff8163e41b>] skb_do_copy_data_nocache+0x155/0x168 [ 858.573578] [<ffffffff8163f72a>] tcp_sendmsg+0x398/0x7e1 [ 858.573648] [<ffffffff8165ee8b>] inet_sendmsg+0x5e/0x67 [ 858.573718] [<ffffffff815f76d3>] sock_aio_write+0x101/0x111 [ 858.573790] [<ffffffff81157544>] do_sync_write+0x5d/0x89 [ 858.573860] [<ffffffff81158035>] vfs_write+0x105/0x14c [ 858.573929] [<ffffffff8115832d>] SyS_write+0x57/0x83 [ 858.573996] [<ffffffff816e5424>] system_call_fastpath+0x1c/0x21 [ 858.574072] -> #1 (sk_lock-AF_INET){+.+.+.}: [ 858.574150] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe [ 858.574219] [<ffffffff815fc22e>] lock_sock_nested+0x69/0x79 [ 858.574291] [<ffffffff8163d8bc>] lock_sock+0xb/0xd [ 858.574356] [<ffffffff8163efe4>] tcp_sendpage+0x55/0x3da [ 858.574426] [<ffffffff8165ef7a>] inet_sendpage+0x82/0x97 [ 858.574496] [<ffffffff81354b02>] o2net_sendpage+0xae/0x21b [ 858.574568] [<ffffffff81356d71>] o2net_accept_many+0x383/0x40a [ 858.574642] [<ffffffff81099dc6>] process_one_work+0x1e9/0x358 [ 858.574716] [<ffffffff8109a31f>] worker_thread+0x159/0x1ee [ 858.574787] [<ffffffff810a02b1>] kthread+0xac/0xb4 [ 858.574853] [<ffffffff816e5372>] ret_from_fork+0x72/0xa0 [ 858.574924] -> #0 (&sc->sc_send_lock){+.+...}: [ 858.575003] [<ffffffff810c6e3c>] __lock_acquire+0xb65/0xe65 [ 858.575077] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe [ 858.575146] [<ffffffff816e1732>] mutex_lock_nested+0x3f/0x353 [ 858.575219] [<ffffffff81355567>] o2net_send_message_vec+0x543/0x8dd [ 858.575296] [<ffffffff81355925>] o2net_send_message+0x24/0x26 [ 858.575369] [<ffffffff81369390>] dlm_do_master_request+0xca/0x630 [ 858.575446] [<ffffffff8136c727>] dlm_get_lock_resource+0xa58/0xe1d [ 858.575523] [<ffffffff813747d7>] dlmlock+0x865/0x11d3 [ 858.575593] [<ffffffff81349ad8>] o2cb_dlm_lock+0x59/0x69 [ 858.575665] [<ffffffff81349029>] ocfs2_dlm_lock+0x28/0x2a [ 858.575736] [<ffffffff812ecef9>] ocfs2_lock_create+0xfa/0x270 [ 858.575811] [<ffffffff812ed68d>] ocfs2_create_new_lock+0x7a/0x89 [ 858.575887] [<ffffffff812ed81d>] ocfs2_create_new_inode_locks+0x181/0x2ae [ 858.575988] [<ffffffff813068fc>] __ocfs2_mknod_locked.isra.21+0x3b1/0x4d4 [ 858.576088] [<ffffffff81306b39>] ocfs2_mknod_locked.isra.22+0x11a/0x12d [ 858.576186] [<ffffffff813072d2>] ocfs2_mknod+0x786/0xe09 [ 858.576257] [<ffffffff813079d2>] ocfs2_mkdir+0x7d/0x11f [ 858.576326] [<ffffffff81162ba7>] vfs_mkdir+0x5d/0x9e [ 858.576393] [<ffffffff811650f9>] SyS_mkdirat+0x8f/0xe2 [ 858.576462] [<ffffffff81165160>] SyS_mkdir+0x14/0x16 [ 858.576528] [<ffffffff816e5424>] system_call_fastpath+0x1c/0x21 [ 858.576604] other info that might help us debug this: [ 858.576721] Chain exists of: &sc->sc_send_lock --> sb_pagefaults --> jbd2_handle [ 858.576847] Possible unsafe locking scenario: [ 858.576931] CPU0 CPU1 [ 858.576986] ---- ---- [ 858.577042] lock(jbd2_handle); [ 858.577090] lock(sb_pagefaults); [ 858.577161] lock(jbd2_handle); [ 858.577229] lock(&sc->sc_send_lock); [ 858.577282] *** DEADLOCK *** [ 858.577383] 11 locks held by ruby/3710: [ 858.577432] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff811714f9>] mnt_want_write+0x1f/0x46 [ 858.577551] #1: (&type->i_mutex_dir_key#3/1){+.+.+.}, at: [<ffffffff8116279e>] kern_path_create+0x7f/0x14a [ 858.577681] #2: (Meta){++++++}, at: [<ffffffff81306c68>] ocfs2_mknod+0x11c/0xe09 [ 858.577789] #3: (&ocfs2_sysfile_lock_key[args->fi_sysfile_type]#3){+.+.+.}, at: [<ffffffff8131e853>] ocfs2_reserve_suballoc_bits+0xc3/0x12ce [ 858.577942] #4: (inode_alloc:%04d){+.+.+.}, at: [<ffffffff8131e869>] ocfs2_reserve_suballoc_bits+0xd9/0x12ce [ 858.578071] #5: (&ocfs2_sysfile_lock_key[args->fi_sysfile_type]#4){+.+.+.}, at: [<ffffffff8131e853>] ocfs2_reserve_suballoc_bits+0xc3/0x12ce [ 858.578225] #6: (extent_alloc:%04d){+.+.+.}, at: [<ffffffff8131e869>] ocfs2_reserve_suballoc_bits+0xd9/0x12ce [ 858.578354] #7: (&ocfs2_sysfile_lock_key[args->fi_sysfile_type]#5){+.+...}, at: [<ffffffff8130162c>] ocfs2_reserve_local_alloc_bits+0xad/0xc93 [ 858.578509] #8: (sb_internal#2){.+.+..}, at: [<ffffffff812fcb30>] ocfs2_start_trans+0xa7/0x1be [ 858.578632] #9: (&journal->j_trans_barrier){.+.+..}, at: [<ffffffff812fcb40>] ocfs2_start_trans+0xb7/0x1be [ 858.578760] #10: (jbd2_handle){+.+...}, at: [<ffffffff8123986e>] start_this_handle+0x4e8/0x57d [ 858.578880] stack backtrace: [ 858.578952] CPU: 3 PID: 3710 Comm: ruby Not tainted 3.11.2-hardenedcanvas-2013-12-16 #4 [ 858.579054] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/09/2012 [ 858.579178] ffffffff85305fa0 ffff880516e27348 ffffffff816dd690 0000000000000006 [ 858.579283] ffffffff8531ace0 ffff880516e27398 ffffffff816d9328 0000000000000002 [ 858.579387] ffffffff856d9850 ffff8804ffa8a190 ffff8804ffa89600 ffff8804ffa8a190 [ 858.579492] Call Trace: [ 858.580924] [<ffffffff816dd690>] dump_stack+0x46/0x58 [ 858.580994] [<ffffffff816d9328>] print_circular_bug+0x1f8/0x209 [ 858.581071] [<ffffffff810c6e3c>] __lock_acquire+0xb65/0xe65 [ 858.581145] [<ffffffff810c60e4>] ? mark_lock+0x2d/0x220 [ 858.581215] [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd [ 858.581296] [<ffffffff810c761c>] lock_acquire+0xbf/0xfe [ 858.581366] [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd [ 858.581447] [<ffffffff810c7c21>] ? debug_check_no_locks_freed+0x132/0x147 [ 858.581531] [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd [ 858.581612] [<ffffffff816e1732>] mutex_lock_nested+0x3f/0x353 [ 858.581687] [<ffffffff81355567>] ? o2net_send_message_vec+0x543/0x8dd [ 858.581770] [<ffffffff810c45bd>] ? lockdep_init_map+0xc2/0x147 [ 858.581848] [<ffffffff81009715>] ? paravirt_read_tsc+0x9/0xd [ 858.581922] [<ffffffff810097bc>] ? read_tsc+0x9/0x19 [ 858.581991] [<ffffffff810bac88>] ? timekeeping_get_ns.constprop.8+0x13/0x3a [ 858.582077] [<ffffffff81355567>] o2net_send_message_vec+0x543/0x8dd [ 858.582163] [<ffffffff810c60e4>] ? mark_lock+0x2d/0x220 [ 858.582233] [<ffffffff810c792d>] ? mark_held_locks+0x6d/0x95 [ 858.582309] [<ffffffff810f2470>] ? rcu_irq_exit+0x7c/0x90 [ 858.582381] [<ffffffff81355925>] o2net_send_message+0x24/0x26 [ 858.582457] [<ffffffff81369390>] dlm_do_master_request+0xca/0x630 [ 858.582536] [<ffffffff81009a93>] ? sched_clock+0x9/0xd [ 858.582606] [<ffffffff8136c727>] dlm_get_lock_resource+0xa58/0xe1d [ 858.582684] [<ffffffff810c7539>] ? lock_release+0x1ad/0x1d1 [ 858.582758] [<ffffffff813d0718>] ? do_raw_spin_unlock+0x8f/0x98 [ 858.582835] [<ffffffff81373ee0>] ? dlm_new_lock+0xbc/0x14e [ 858.582908] [<ffffffff813495fc>] ? ocfs2_cluster_connect_agnostic+0x3b/0x3b [ 858.582993] [<ffffffff813747d7>] dlmlock+0x865/0x11d3 [ 858.583062] [<ffffffff8134960d>] ? o2dlm_lock_ast_wrapper+0x11/0x11 [ 858.583142] [<ffffffff810c60e4>] ? mark_lock+0x2d/0x220 [ 858.583211] [<ffffffff81349ad8>] o2cb_dlm_lock+0x59/0x69 [ 858.583282] [<ffffffff813495fc>] ? ocfs2_cluster_connect_agnostic+0x3b/0x3b [ 858.583369] [<ffffffff8134960d>] ? o2dlm_lock_ast_wrapper+0x11/0x11 [ 858.583448] [<ffffffff810c7aa9>] ? trace_hardirqs_on_caller+0x154/0x18b [ 858.583531] [<ffffffff81349029>] ocfs2_dlm_lock+0x28/0x2a [ 858.583603] [<ffffffff812ecef9>] ocfs2_lock_create+0xfa/0x270 [ 858.583679] [<ffffffff816e47be>] ? _raw_spin_unlock_irqrestore+0x40/0x4d [ 858.583763] [<ffffffff810c7aa9>] ? trace_hardirqs_on_caller+0x154/0x18b [ 858.583846] [<ffffffff812ed68d>] ocfs2_create_new_lock+0x7a/0x89 [ 858.583924] [<ffffffff812ed81d>] ocfs2_create_new_inode_locks+0x181/0x2ae [ 858.584010] [<ffffffff813068fc>] __ocfs2_mknod_locked.isra.21+0x3b1/0x4d4 [ 858.584095] [<ffffffff81306b39>] ocfs2_mknod_locked.isra.22+0x11a/0x12d [ 858.584178] [<ffffffff813072d2>] ocfs2_mknod+0x786/0xe09 [ 858.584249] [<ffffffff813079d2>] ocfs2_mkdir+0x7d/0x11f [ 858.584319] [<ffffffff81162ba7>] vfs_mkdir+0x5d/0x9e [ 858.584386] [<ffffffff811650f9>] SyS_mkdirat+0x8f/0xe2 [ 858.584456] [<ffffffff81165160>] SyS_mkdir+0x14/0x16 [ 858.584524] [<ffffffff816e5424>] system_call_fastpath+0x1c/0x21 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://oss.oracle.com/pipermail/ocfs2-users/attachments/20131219/e31bc176/attachment-0001.html