Zhang, Sonic
2004-Jul-20 23:23 UTC
[Ocfs2-devel] A patch to improve the metadata reading throughput (against svn1267)
Hi Mark, Our objective to send notification messages in routine ocfs_release_lock() for some lock types listed bellow is to keep the data consistency in the other nodes. As you know, after we change to cacheable metadata and turn off checkpoint in journal transaction, the metadata in cache may not consistent among different nodes in some cases. The metadata inconsistency in cache is only allowed when the access to the metadata is protected by exclusive locks. So, after metadata is changed on one node and the exclusive lock is released, the metadata in other nodes should invalidate their cached copy. This is done through sending a RELEASE notification to all other nodes when the lock of type FLAG_FILE_CREATE, FLAG_FILE_DELETE, FLAG_FILE_TRUNCATE and FLAG_FILE_EXTEND is released. We think the RELEASE notification and cached buffer reading should work together to improve the performance while keep the consistency. Regards -----Original Message----- From: Mark Fasheh [mailto:mark.fasheh@oracle.com] Sent: Saturday, July 17, 2004 1:24 AM To: Zhang, Sonic Cc: Ocfs2-Devel; Fu, Michael; Yang, Elton Subject: Re: [Ocfs2-devel] A patch to improve the metadata reading throughput (a gainst svn1267) Hi, On Fri, Jul 16, 2004 at 05:59:55PM +0800, Zhang, Sonic wrote:> Hi, > > This is a new patch to increase the metadata reading throughput > for OCFS2.Would you mind regenerating the patch against the latest svn release? We've made some changes which you've already gone ahead and rolled into that one, and I noticed you're missing at least one bug fix which you want to pick up in ocfs_populate_inode.> Our approach is to make the metadata reading operation via > routine ocfs_read_bh() cacheable and keep consistency through routine > invalidate_bdev() when receiving lock release notifications, such as > FLAG_FILE_CREATE, FLAG_FILE_DELETE, FLAG_FILE_TRUNCATE, > FLAG_FILE_EXTEND, etc. These notification is sent in routine > ocfs_release_lock(), which is invoked in the journal commit thread.Yeah, for a while now we've been meaning to go through all calls to ocfs_read_bh[s] and basically make as many of them cached as possible. At this point in our development, things should be cached unless you're *absolutely* certain you want to sync off disk. Another thing that's on the list which you might be interested in looking at is not sending all lock release messages. Some of them do basically nothing on the other end in process_vote, so there's really no reason to send them to the nodes at all. This should help alot when you've batched up a ton of locks to release in commit_cache. I already started some of this by not sendning the final delete message in ocfs_delete_inode, though that one's special as the lockres accounting information is about to be thrown out so we don't have to update it. Others require more work as you at least have to decrement the lock_holders variable when you're done with it.> We also did some benchmark. > The directory creating rate is about 4000/s with this patch and > 1400/s without. > The directory removing rate is about 740/s with this patch and > 10/s without. > > Please refer to the attachment. It is against svn 1267. > > This patch is not stable yet. Known issues: > 1. We still need to check all code which read the metadata. Figure out > if that metadata is suitable to be cached. > 3. Current OCFS lock mechanism may cause dead lock in some cases after > we change to cache metadata and release lock in journalasynchronously.> We plan to fix it in a new patch.So are you planning to turn off immediate checkpointing for all the other journal transactions? This is also on the list :) The only one that *may* be troublesome I believe is truncate. Otherwise, the ones that are left are: link, symlink, and rename.> 4. readdir() may get old data after the data is written back to diskin> journal asynchronously. It is not a bug. But which way is better, sync > the new data to disk when other nodes notify READONLY message or just > let them get old data?No, we consider it a bug :) The other nodes should be getting up to date directory contents.> > We are working to improve this patch. Any suggestion?Sure :) I'll paste your patch below: Index: src/journal.c ==================================================================--- src/journal.c (revision 1267) +++ src/journal.c (working copy) @@ -148,6 +148,8 @@ } spin_unlock(&journal->cmt_lock); + if (osb->needs_flush) + ocfs_sync_blockdev(osb->sb); Is this necessary? It seems awfully heavy, and since we journal *all* metadata (so it should be synced up to disk via the journal_flush just a couple lines above that), I don't see the point... I was actually meaning to take the other call to sync_blockdev out as it's never used :) Index: src/dlm.c ==================================================================--- src/dlm.c (revision 1267) +++ src/dlm.c (working copy) @@ -1038,9 +1038,7 @@ ocfs2_dinode *fe = NULL; struct buffer_head *tmpbh = NULL, **b = NULL; __s16 curr_master; - int lockflags - (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE) ? - 0 : OCFS_BH_CACHED; + int lockflags = OCFS_BH_CACHED; int clear_tmp = 0; ocfs_lock_res *lockres = GET_INODE_LOCKRES(inode); this is good... <snip> Index: src/inode.c ==================================================================--- src/inode.c (revision 1267) +++ src/inode.c (working copy) @@ -447,8 +447,6 @@ break; case S_IFDIR: if (inode->i_nlink < 2) { - LOG_ERROR_ARGS("inlink=%d for %llu\n", inode->i_nlink, - fe->i_blkno); inode->i_nlink = 2; } This is wrong. The latest SVN doesn't include this block as nlink on a directory can very well be 0 if it's been moved to the orphan dir. <snip> Index: src/nm.c ==================================================================--- src/nm.c (revision 1267) +++ src/nm.c (working copy) <snip> @@ -813,8 +814,7 @@ } } - lockflags = (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE) - ? 0 : OCFS_BH_CACHED; + lockflags = OCFS_BH_CACHED; This is good... <snip> @@ -910,7 +928,7 @@ inode->i_nlink); /* verify_update_inode does a dirty read which * might set i_nlink to an old value. */ - if (inode->i_nlink) { + if ( !S_ISDIR(inode->i_mode) && inode->i_nlink) { LOG_ERROR_ARGS("Orphaned inode %lu has link " "count = %d!\n", inode->i_ino, inode->i_nlink); and if you include the fix to ocfs_populate_inode, this is no longer necessary :) <snip> Index: src/buffer_head_io.c ==================================================================--- src/buffer_head_io.c (revision 1267) +++ src/buffer_head_io.c (working copy) @@ -217,13 +217,6 @@ !(OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE)) flags |= OCFS_BH_CACHED; - if (inode && (flags & OCFS_BH_CACHED) && - (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE)) { - LOG_TRACE_STR("hey bozo you are trying to read " - "a system thingy cached!"); - flags &= ~OCFS_BH_CACHED; - } - sb = osb->sb; blocknum = off >> sb->s_blocksize_bits; excellent. This and the other places where you did something like: - int lockflags - (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE) ? - 0 : OCFS_BH_CACHED; + int lockflags = OCFS_BH_CACHED; are exactly what we've needed to do for a while now. Thanks! --Mark -- Mark Fasheh Software Developer, Oracle Corp mark.fasheh@oracle.com