Jeremy Schneider
2004-Mar-12 10:11 UTC
[Ocfs-users] Node hangs when trying to create/delete file
Here's a basic overview of the bug and a workaround for any DBAs or sysadmins reading this list. I'm sure there will be an official fix soon; this is just an FYI in case you run into the problem I had in the meantime. As soon as you install the updated ocfs-*.rpm the problem will go away. (You won't even need to fsck or anything... aren't they such nice guys?)

SYMPTOM:
When you try to create or delete a file in a directory with more than 254 files, the process hangs indefinitely. When you try to kill the process (via CTRL-C or /bin/kill) it stays stuck in the 'D' (disk wait) state.

SHORT-TERM WORKAROUND:
You need to know which directory you were trying to create the file in. One of the other nodes has that directory locked. It's real easy in a 2-node cluster: just go to that directory on the other node and delete any file. This will release the lock on that directory. You might need to create a file first so you can delete it. :)

LONG-TERM WORKAROUND:
If this happens once, it will likely happen again. You can fix the directory permanently so that the bug won't happen anymore, but this requires downtime if there are any database files in that directory. The fix is basically to move all the files to a new directory, then delete the old directory and rename the new one. Step 2 sounds kinda weird, but it's actually the crucial step that prevents the bug: it changes "file_lock" (the value you check in step 3) from OCFS_DLM_ENABLE_CACHE_LOCK to OCFS_DLM_NO_LOCK.

1. Create a new directory.
2. Create a file in the new directory and /bin/cat the file from a different node than the one where you created the directory. Delete the file.
3. debugocfs -D /relative/path/to/newdir/from/mountpoint/ /dev/device -- confirm that "file_lock = OCFS_DLM_NO_LOCK"
4. /bin/mv all the datafiles to the new directory.
5. /bin/rmdir the old directory.
6. Rename the new directory to the same name as the old.

Happy hacking, everyone...
;)

/js

Jeremy Schneider
Lansing, MI

>>> Sunil Mushran <Sunil.Mushran@oracle.com> 03/11/2004 9:57:03 PM >>>
Wow... I am impressed. I still need to test it... but it looks good otherwise. BTW, it's mainly the oracle developers who are responding on this list. :-)
<<<<...>>>>
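
For anyone who wants to script steps 3-6 of the long-term workaround above, here is a minimal sketch, not a tested procedure. It assumes steps 1 and 2 were already done by hand (step 2 has to happen on a different node), that the database is down, and that the debugocfs output contains the "file_lock = OCFS_DLM_NO_LOCK" string exactly as quoted in the post. The function names lock_is_clear and swap_directory are made up for illustration.

```shell
#!/bin/sh
# Sketch of steps 3-6 only. Steps 1-2 (create the new directory, then
# create/cat/delete a file in it from a different node) are NOT done here.

# lock_is_clear: reads debugocfs output on stdin and succeeds only if it
# reports file_lock = OCFS_DLM_NO_LOCK (output layout is an assumption).
lock_is_clear() {
    grep -q 'file_lock *= *OCFS_DLM_NO_LOCK'
}

# swap_directory OLD NEW: move every non-hidden entry from OLD into NEW,
# remove OLD, then rename NEW to OLD. Stops at the first failure.
# Hidden (dot) files are not moved, so rmdir fails safely if any remain.
swap_directory() {
    swap_old=$1
    swap_new=$2
    for swap_f in "$swap_old"/*; do
        [ -e "$swap_f" ] || continue          # empty directory: nothing to move
        mv "$swap_f" "$swap_new"/ || return 1 # step 4
    done
    rmdir "$swap_old" || return 1             # step 5
    mv "$swap_new" "$swap_old"                # step 6
}

# Example call (placeholder paths from the post; adjust before use):
#   debugocfs -D /relative/path/to/newdir/from/mountpoint/ /dev/device \
#       | lock_is_clear && swap_directory /path/to/olddir /path/to/newdir
```

Chaining the two with && means the files are only moved if step 3 confirmed the lock state, which is the whole point of step 2.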