Derek Suzuki
2004-Mar-07 00:13 UTC
[Ocfs-users] A couple more minor questions about OCFS and RHEL3
Oracle appears to have Wim chained in the basement, forced to answer mailing list questions at all hours. I do appreciate it. Our cluster has been stable since we installed RAC, but a few minor issues have me concerned.

First, our storage array seems to maintain continuous low-level activity even when the database is shut down. The CPUs spend a modest amount of time in iowait state while this is going on. I figure this might be related to the I/O fencing and inter-node communication features of OCFS, but I want to verify that this is expected.

Next, I saw a Metalink thread which suggests that async I/O is not supported on OCFS with RHAS 2.1. It doesn't say anything about RHEL3. We've been using async in our testing with no problems so far, and plan to use it in production unless Oracle feels the combination is not yet trustworthy.

The last issue is that sometimes we see messages such as the following from dmesg:

(11637) ERROR: status = -16, Common/ocfsgendlm.c, 1220
(11637) ERROR: status = -16, Common/ocfsgendlm.c, 1285
(11637) ERROR: status = -16, Common/ocfsgendlm.c, 1586
(11637) ERROR: status = -16, Common/ocfsgencreate.c, 1027
(11637) ERROR: status = -16, Common/ocfsgencreate.c, 1770
(12717) ERROR: status = -16, Common/ocfsgendlm.c, 1220
(12717) ERROR: status = -16, Common/ocfsgendlm.c, 1285
(12717) ERROR: status = -16, Common/ocfsgendlm.c, 1586
(12717) ERROR: status = -16, Common/ocfsgencreate.c, 1027
(12717) ERROR: status = -16, Common/ocfsgencreate.c, 1770
(12717) ERROR: status = -16, Common/ocfsgendlm.c, 1220
(12717) ERROR: status = -16, Common/ocfsgendlm.c, 1285
(12717) ERROR: status = -16, Common/ocfsgendlm.c, 1586
(12717) ERROR: status = -16, Common/ocfsgencreate.c, 1027
(12717) ERROR: status = -16, Common/ocfsgencreate.c, 1770

I think these mostly come up around boot time, so maybe they're related to mounting cluster filesystems when the other node is down. The messages do not come continuously, and the systems behave properly, so I'm just trying to make sure that this isn't the sign of some subtle error.

Derek
Wim Coekaerts
2004-Mar-07 00:35 UTC
[Ocfs-users] A couple more minor questions about OCFS and RHEL3
heh...> Our cluster has been stable since we installed RAC, but a few minor issues > have me concerned. First, our storage array seems to maintain continuous > low-level activity even when the database is shut down. The CPUs spend a > modest amount of time in iowait state while this is going on. I figure this > might be related to the I/O fencing and inter-node communication features of > OCFS, but I want to verify that this is expected.ocfs does about 1k wite and 32kb read / second per mounted volume. if nothing goes on, it write a heartbeat and reads everryone elses (32 sectors worth)> Next, I saw a Metalink thread which suggests that async I/O is not > supported on OCFS with RHAS 2.1. It doesn't say anything about RHEL3. > We've been using async in our testing with no problems so far, and plan to > use it in production unless Oracle feels the combination is not yet > trustworthy.well - tough one, it works, but the big issue is that you rredologfile need to be contiguous on disk, otherwise you might have failures, exact same goes for rhel3 as rhas21. you can see that by running debugocfs eg : /ocfs/log/foo1.dbf -> debugocfs -f /log/foo1.dbf /dev/sdXXX that will show how many offsets (should only have one) in the extents if its more than 1, dd it over with a very large blocksize and see if that ends up being 1 contig file. if you do that, everything should work, however, there just hasn't been enough real testing with aio, need to ggather more evidence. the reason the logfiles are annoying is because he way aio is implemented and how we call it, it cannto handle short io's or non contig aio submits.> The last issue is that sometimes we see messages such as the following > from dmesg: > > (11637) ERROR: status = -16, Common/ocfsgendlm.c, 1220 > (11637) ERROR: status = -16, Common/ocfsgendlm.c, 1285 > (11637) ERROR: status = -16, Common/ocfsgendlm.c, 1586 > (11637) ERROR: status = -16, Common/ocfsgencreate.c, 1027 > (11637) ERROR: status = -16, Common/ocfsgencreate.c, 1770 > (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1220 > (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1285 > (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1586 > (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1027 > (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1770 > (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1220 > (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1285 > (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1586 > (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1027 > (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1770 > > I think these mostly come up around boot time, so maybe they're related to > mounting cluster filesystems when the other node is down. The messages do > not come continuously, and the systems behave properly, so I'm just trying > to make sure that this isn't the sign of some subtle error.hmm have to look at the code for this , get ebusy, sounds like dlm and trying to get access to a file thats in use you know when things are serious yoy really ought to call support, don't rely on this maillist for production problems ;) mileage may vary ;)