Hi all, and particularly Andrew and Stephen,

I recently "upgraded" one of my NFS fileservers from (patched) 2.4.18
to 2.4.23-pre7 (in order to resolve a HIGHMEM-related memory pressure
problem).  Unfortunately I have experienced what appears to be a
deadlock.  The one I will describe was experienced while running
2.4.23-pre7, though I had a very similar problem in 2.4.22 (but didn't
collect enough data to be able to give a useful report at the time).

An ext3 filesystem is mounted with data=journal, with an external
journal, and is accessed mostly via NFS.  v1 quotas are enabled.

The stack traces below are from a lightly edited
"echo t > /proc/sysrq-trigger" output.

A substantial number of nfsd threads are waiting to "start a handle":

  nfsd          Call Trace:    [sleep_on+75/124]
    [start_this_handle+205/368] [journal_start+149/196]
    [ext3_dirty_inode+116/268] [__mark_inode_dirty+50/168]
    [update_atime+75/80] [do_generic_file_read+1158/1172]
    [generic_file_read+147/400] [file_read_actor+0/224]
    [nfsd_read+477/636] [nfsd3_proc_read+295/388]
    [nfsd_dispatch+207/406] [svc_process+655/1264] [nfsd+566/944]
    [arch_kernel_thread+40/56]

This sleep_on is the one at line 136 of transaction.c
(start_this_handle), where it is waiting for t_state to stop being
T_LOCKED.

The related kjournald is at:

  kjournald     Call Trace:    [sleep_on+75/124]
    [journal_commit_transaction+357/4044] [do_IRQ+221/236]
    [.text.lock.sched+131/471] [kjournald+326/540]
    [commit_timeout+0/12] [arch_kernel_thread+40/56]

This sleep_on is at line 87 in commit.c (journal_commit_transaction),
where it is waiting for t_updates to be 0.  At this point, t_state is
T_LOCKED, so presumably those nfsd threads above are waiting on
kjournald.  But what is kjournald really waiting for?

My first thought was the two nfsd threads in:

  nfsd          Call Trace:    [sleep_on+75/124]
    [log_wait_commit+74/136] [journal_stop+408/432]
    [journal_force_commit+78/128] [ext3_force_commit+66/112]
    [ext3_sync_file+128/144] [nfsd_sync_dir+49/72]
    [nfsd_unlink+455/480] [nfsd_proc_remove+122/140]
    [nfsd_dispatch+207/406] [svc_process+655/1264] [nfsd+566/944]
    [arch_kernel_thread+40/56]

These are waiting on j_wait_done_commit.  However, they are doing that
from journal_stop *after* journal_stop has decremented t_updates, so
it doesn't seem likely that kjournald is waiting on them.

Outside of nfsd, there is an rquotad program (locally written, not the
standard one) that is at:

  rquotad       Call Trace:    [sleep_on+75/124]
    [start_this_handle+205/368] [journal_start+149/196]
    [ext3_dirty_inode+116/268] [__mark_inode_dirty+50/168]
    [update_atime+75/80] [do_generic_file_read+1158/1172]
    [generic_file_read+147/400] [file_read_actor+0/224]
    [do_get_write_access+1382/1420] [v1_read_dqblk+121/196]
    [read_dqblk+76/128] [dqget+344/484] [vfs_get_dqblk+21/64]
    [v1_get_dqblk+39/172] [link_path_walk+2680/2956]
    [do_compat_quotactl+417/688] [resolve_dev+89/108]
    [sys_quotactl+166/275] [system_call+51/56]

So it is trying to start a transaction to update the atime on the
quota file, and it holds a lock on some quota structures thanks to
"read_dqblk".

At the same time, "sync" is running:

  sync          Call Trace:    [__down+109/208] [__down_failed+8/12]
    [.text.lock.dquot+73/286] [ext3_sync_dquot+337/462]
    [vfs_quota_sync+102/372] [sync_dquots_dev+194/260]
    [fsync_dev+66/128] [sys_sync+7/16] [system_call+51/56]

It has started an ext3 transaction (in ext3_sync_dquot) and is trying
to get the lock that rquotad holds.
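For reference, the two waits involved look roughly like this in JBD
(an abridged paraphrase of 2.4's fs/jbd, illustrative rather than
literal):

    /* transaction.c: start_this_handle() -- a new handle must wait
     * while commit has the running transaction locked. */
    while (transaction->t_state == T_LOCKED)
            sleep_on(&journal->j_wait_transaction_locked);
    transaction->t_updates++;          /* handle is now open */

    /* commit.c: journal_commit_transaction() -- commit locks the
     * transaction, then waits for every open handle to journal_stop(). */
    commit_transaction->t_state = T_LOCKED;
    while (commit_transaction->t_updates)
            sleep_on(&journal->j_wait_updates);

Once commit sets T_LOCKED it cannot proceed until every open handle has
been stopped, and no new handle can start until commit moves on.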
Presumably the transaction that sync has started is keeping t_updates
greater than 0, thus preventing kjournald from progressing, and that
in turn prevents anyone else, including rquotad, from starting a new
transaction.  Hence a deadlock.

The incomplete trace that I have from the first time this problem
happened also shows sync blocked in the same place, and an nfsd
thread, rather than rquotad, holding the lock in read_dqblk.

I note that ext3_sync_dquot didn't exist in 2.4.18 but does now, which
correlates well with the observation that this problem was introduced
some time after 2.4.18.

If my analysis is correct, I can largely avoid the problem by not
running "sync" as often (I currently run it every 5 minutes, which
isn't needed any longer (it's a long story :-)), but that doesn't fix
the bug.

My guess is that ext3_sync_dquot doesn't need ext3_journal_start at
all, but that isn't a well-informed guess.

Help appreciated.

NeilBrown
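P.S. To make the inversion concrete, here is a toy userspace model of
the two lock orderings: plain pthread mutexes standing in for dqio_sem
and the journal handle (a crude approximation, since a journal admits
many concurrent handles).  All names are made up; builds with
"cc -pthread abba.c" and will almost always hang.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t dqio_sem = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t journal  = PTHREAD_MUTEX_INITIALIZER;

    static void *rquotad_path(void *unused)
    {
            pthread_mutex_lock(&dqio_sem);  /* A: read_dqblk() takes dqio_sem */
            usleep(100000);                 /* widen the race window */
            pthread_mutex_lock(&journal);   /* B: ext3_dirty_inode() starts a
                                             * handle; blocks here */
            pthread_mutex_unlock(&journal);
            pthread_mutex_unlock(&dqio_sem);
            return NULL;
    }

    static void *sync_path(void *unused)
    {
            pthread_mutex_lock(&journal);   /* B: ext3_sync_dquot() opens a
                                             * handle first */
            usleep(100000);
            pthread_mutex_lock(&dqio_sem);  /* A: commit path wants dqio_sem;
                                             * blocks here: AB-BA deadlock */
            pthread_mutex_unlock(&dqio_sem);
            pthread_mutex_unlock(&journal);
            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, rquotad_path, NULL);
            pthread_create(&t2, NULL, sync_path, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            puts("no deadlock this run");   /* almost never reached */
            return 0;
    }

Run it and both threads stop dead, one holding each lock: the same
AB-BA cycle as rquotad and sync above.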
Christopher Li
2003-Oct-27 06:20 UTC
Re: EXT3 deadlock in 2.4.22 and 2.4.23-pre7 - quota related?
On Mon, Oct 27, 2003 at 05:06:16PM +1100, Neil Brown wrote:
>
> The related kjournald is at:
>
>   kjournald     Call Trace:    [sleep_on+75/124]
>     [journal_commit_transaction+357/4044] [do_IRQ+221/236]
>     [.text.lock.sched+131/471] [kjournald+326/540]
>     [commit_timeout+0/12] [arch_kernel_thread+40/56]
>
> This sleep_on is at line 87 in commit.c (journal_commit_transaction),
> where it is waiting for t_updates to be 0.  At this point, t_state is
> T_LOCKED, so presumably those nfsd threads above are waiting on
> kjournald.  But what is kjournald really waiting for?

kjournald is waiting for the updates on the current pending
transaction to finish.

> My first thought was the two nfsd threads in:
>
>   nfsd          Call Trace:    [sleep_on+75/124]
>     [log_wait_commit+74/136] [journal_stop+408/432]
>     [journal_force_commit+78/128] [ext3_force_commit+66/112]
>     [ext3_sync_file+128/144] [nfsd_sync_dir+49/72]
>     [nfsd_unlink+455/480] [nfsd_proc_remove+122/140]
>     [nfsd_dispatch+207/406] [svc_process+655/1264] [nfsd+566/944]
>     [arch_kernel_thread+40/56]
>
> These are waiting on j_wait_done_commit.  However, they are doing that
> from journal_stop *after* journal_stop has decremented t_updates, so
> it doesn't seem likely that kjournald is waiting on them.

That is right.

> Outside of nfsd, there is an rquotad program (locally written, not the
> standard one) that is at:
>
>   rquotad       Call Trace:    [sleep_on+75/124]
>     [start_this_handle+205/368] [journal_start+149/196]
>     [ext3_dirty_inode+116/268] [__mark_inode_dirty+50/168]
>     [update_atime+75/80] [do_generic_file_read+1158/1172]
>     [generic_file_read+147/400] [file_read_actor+0/224]
>     [do_get_write_access+1382/1420] [v1_read_dqblk+121/196]
>     [read_dqblk+76/128] [dqget+344/484] [vfs_get_dqblk+21/64]
>     [v1_get_dqblk+39/172] [link_path_walk+2680/2956]
>     [do_compat_quotactl+417/688] [resolve_dev+89/108]
>     [sys_quotactl+166/275] [system_call+51/56]
>
> So it is trying to start a transaction to update the atime on the
> quota file, and it holds a lock on some quota structures thanks to
> "read_dqblk".

This guy is waiting for the journal commit to finish; it seems
harmless to me.

> At the same time, "sync" is running:
>
>   sync          Call Trace:    [__down+109/208] [__down_failed+8/12]
>     [.text.lock.dquot+73/286] [ext3_sync_dquot+337/462]
>     [vfs_quota_sync+102/372] [sync_dquots_dev+194/260]
>     [fsync_dev+66/128] [sys_sync+7/16] [system_call+51/56]
>
> It has started an ext3 transaction (in ext3_sync_dquot) and is trying
> to get the lock that rquotad holds.

That seems wrong to me.  It should get the lock before it starts the
transaction.  For the same reason that you can't lock_page inside a
journal transaction, it is a ranking error.

BTW, it seems that in the current bk tree, truncate still does
lock_page inside a journal transaction.

> Presumably the transaction that sync has started is keeping t_updates
> greater than 0, thus preventing kjournald from progressing, and that
> in turn prevents anyone else, including rquotad, from starting a new
> transaction.  Hence a deadlock.

That is right.

> My guess is that ext3_sync_dquot doesn't need ext3_journal_start at
> all, but that isn't a well-informed guess.

I think you want ext3_sync_dquot to be atomic across a power failure.
The journal handle can be obtained from ext3_current_journal_handle,
which is used by writepage etc.

Chris
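P.S. A rough sketch of that idea, with the function name as given
above and the credit count left as a placeholder "nblocks"; untested
pseudocode against 2.4 ext3, not a patch:

    /* Untested sketch: reuse an already-open handle in this context
     * instead of calling journal_start() again from inside
     * ext3_sync_dquot().  Error handling elided. */
    handle_t *handle = ext3_current_journal_handle();
    int started_here = 0;

    if (!handle) {
            handle = ext3_journal_start(inode, nblocks);
            started_here = 1;
    }
    /* ... write the dquot block under this handle ... */
    if (started_here)
            ext3_journal_stop(handle, inode);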
Andrew Morton
2003-Oct-27 10:01 UTC
Re: EXT3 deadlock in 2.4.22 and 2.4.23-pre7 - quota related?
Neil Brown <neilb@cse.unsw.edu.au> wrote:
>
> Hi all, and particularly Andrew and Stephen,
>
> I recently "upgraded" one of my NFS fileservers from (patched) 2.4.18
> to 2.4.23-pre7 (in order to resolve a HIGHMEM-related memory pressure
> problem).
>
> Unfortunately I have experienced what appears to be a deadlock.

I don't recall a time when quotas on ext3 were not deadlocky :(

> ...
>
>   rquotad       Call Trace:    [sleep_on+75/124]
>     [start_this_handle+205/368] [journal_start+149/196]
>     [ext3_dirty_inode+116/268] [__mark_inode_dirty+50/168]
>     [update_atime+75/80] [do_generic_file_read+1158/1172]
>     [generic_file_read+147/400] [file_read_actor+0/224]
>     [do_get_write_access+1382/1420] [v1_read_dqblk+121/196]
>     [read_dqblk+76/128] [dqget+344/484] [vfs_get_dqblk+21/64]
>     [v1_get_dqblk+39/172] [link_path_walk+2680/2956]
>     [do_compat_quotactl+417/688] [resolve_dev+89/108]
>     [sys_quotactl+166/275] [system_call+51/56]

read_dqblk() took dqio_sem, then ext3_dirty_inode() did journal_start().

> At the same time, "sync" is running:
>
>   sync          Call Trace:    [__down+109/208] [__down_failed+8/12]
>     [.text.lock.dquot+73/286] [ext3_sync_dquot+337/462]
>     [vfs_quota_sync+102/372] [sync_dquots_dev+194/260]
>     [fsync_dev+66/128] [sys_sync+7/16] [system_call+51/56]

ext3_sync_dquot() did journal_start(), then someone (commit_dqblk?)
tried to take dqio_sem.

Probably it is not too hard to flip the ordering in the latter case,
but all the other code paths need to be checked, and perhaps even
documented.

Jan, didn't we fix this very problem in 2.6 recently?
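For concreteness, flipping the ordering in ext3_sync_dquot() might
look something like this; the lock and helper names are approximations
taken from the thread, an untested sketch rather than a patch:

    /* Untested sketch: take dqio_sem before opening the journal
     * handle, so this path matches read_dqblk()'s dqio_sem -> journal
     * ordering instead of inverting it.  "nblocks" is a placeholder
     * for the credits needed; error handling elided. */
    down(&sb_dqopt(sb)->dqio_sem);                /* A: quota I/O lock */
    handle = ext3_journal_start(inode, nblocks);  /* B: journal handle */
    /* ... commit the dquot block to disk ... */
    ext3_journal_stop(handle, inode);
    up(&sb_dqopt(sb)->dqio_sem);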