Daniel McDonald
2010-Dec-09 01:07 UTC
[Ocfs2-users] Extremely poor write performance, but read appears to be okay
Hello,

I'm writing from the other side of the world from where my systems are, so details are coming in slowly. We have a 6TB OCFS2 volume shared across 20 or so nodes, all running OEL 5.4 with ocfs2-1.4.4. The system has worked fairly well for the last 6-8 months, but something has happened over the last few weeks that has driven write performance nearly to a halt. I'm not sure how to proceed, and very poor internet access is hindering me further. I've verified that the disk array is in good health. I'm seeing a few odd kernel log messages; an example follows below. I have not been able to check all nodes due to limited time and the slow internet in my present location. Any assistance would be greatly appreciated. I should be able to provide log files in about 12 hours. At the moment, load averages on each node are 0.00 to 0.09.

Here is a test write and the associated iostat -xm 5 output. Previously I was obtaining > 90 MB/s:

$ dd if=/dev/zero of=/home/testdump count=1000 bs=1024k

...and the associated iostat output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.43   12.25    0.00   87.22

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    1.80    0.00   8.40   0.00   0.04     9.71     0.01   0.64   0.05   0.04
sda1       0.00    0.00    0.00   0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda2       0.00    0.00    0.00   0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda3       0.00    1.80    0.00   8.40   0.00   0.04     9.71     0.01   0.64   0.05   0.04
sdc        0.00    0.00  115.80   0.60   0.46   0.00     8.04     0.99   8.48   8.47  98.54
sdc1       0.00    0.00  115.80   0.60   0.46   0.00     8.04     0.99   8.48   8.47  98.54

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    0.55   12.25    0.00   87.13

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.40    0.00   0.80   0.00   0.00    12.00     0.00   2.00   1.25   0.10
sda1       0.00    0.00    0.00   0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda2       0.00    0.00    0.00   0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda3       0.00    0.40    0.00   0.80   0.00   0.00    12.00     0.00   2.00   1.25   0.10
sdc        0.00    0.00  112.80   0.40   0.44   0.00     8.03     0.98   8.68   8.69  98.38
sdc1       0.00    0.00  112.80   0.40   0.44   0.00     8.03     0.98   8.68   8.69  98.38
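(A side note for anyone reproducing this test: the dd above writes through the page cache, so on a healthy system the reported rate can be inflated by memory rather than reflecting the array. The following is only a sketch, not the command actually run here; /home/testdump is the same placeholder path, and conv=fsync / oflag=direct are standard GNU dd options that force the data to storage before a rate is reported.)

# Flush to disk before dd prints its throughput figure
dd if=/dev/zero of=/home/testdump bs=1024k count=1000 conv=fsync

# Or bypass the page cache entirely with direct I/O
dd if=/dev/zero of=/home/testdump bs=1024k count=1000 oflag=direct

# Watch the backing device while the test runs (extended stats, MB units, 5 s interval)
iostat -xm 5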
Here is a test read and the associated iostat output. I'm intentionally reading from a different test file to avoid caching effects:

$ dd if=/home/someothertestdump of=/dev/null bs=1024k

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    3.60   10.85    0.00   85.45

Device:  rrqm/s  wrqm/s     r/s    w/s   rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    3.79    0.00   1.40    0.00   0.02    29.71     0.00   1.29   0.43   0.06
sda1       0.00    0.00    0.00   0.00    0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda2       0.00    0.00    0.00   0.00    0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda3       0.00    3.79    0.00   1.40    0.00   0.02    29.71     0.00   1.29   0.43   0.06
sdc        7.98    0.20  813.17   1.00  102.50   0.00   257.84     1.92   2.34   1.19  96.71
sdc1       7.98    0.20  813.17   1.00  102.50   0.00   257.84     1.92   2.34   1.19  96.67

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    3.67   10.22    0.00   86.03

Device:  rrqm/s  wrqm/s     r/s    w/s   rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.20    0.00   0.40    0.00   0.00    12.00     0.00   0.50   0.50   0.02
sda1       0.00    0.00    0.00   0.00    0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda2       0.00    0.00    0.00   0.00    0.00   0.00     0.00     0.00   0.00   0.00   0.00
sda3       0.00    0.20    0.00   0.40    0.00   0.00    12.00     0.00   0.50   0.50   0.02
sdc        6.60    0.20  829.00   1.00  104.28   0.00   257.32     1.90   2.31   1.17  97.28
sdc1       6.60    0.20  829.00   1.00  104.28   0.00   257.32     1.90   2.31   1.17  97.28

I'm seeing a few odd kernel messages, such as:

Dec  7 14:07:50 growler kernel: (dlm_wq,4793,4):dlm_deref_lockres_worker:2344 ERROR: 84B7C6421A6C4280AB87F569035C5368:O0000000000000016296ce900000000: node 14 trying to drop ref but it is already dropped!
Dec  7 14:07:50 growler kernel: lockres: O0000000000000016296ce900000000, owner=0, state=0
Dec  7 14:07:50 growler kernel: last used: 0, refcnt: 6, on purge list: no
Dec  7 14:07:50 growler kernel: on dirty list: no, on reco list: no, migrating pending: no
Dec  7 14:07:50 growler kernel: inflight locks: 0, asts reserved: 0
Dec  7 14:07:50 growler kernel: refmap nodes: [ 21 ], inflight=0
Dec  7 14:07:50 growler kernel: granted queue:
Dec  7 14:07:50 growler kernel: type=3, conv=-1, node=21, cookie=21:213370, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
Dec  7 14:07:50 growler kernel: converting queue:
Dec  7 14:07:50 growler kernel: blocked queue:

Here is the df output:

root@growler:~$ df
Filesystem            1K-blocks       Used  Available Use% Mounted on
/dev/sda3             245695888   29469416  203544360  13% /
/dev/sda1                101086      15133      80734  16% /boot
tmpfs                  33005580          0   33005580   0% /dev/shm
/dev/sdc1            5857428444 5234400436  623028008  90% /home

Thanks,
-Daniel
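(Since not every node could be checked yet, a loop along these lines would show how widespread the dlm_deref_lockres_worker error above is. This is only a sketch: the hostnames are placeholders, not the real node names, and /var/log/messages is where OEL5's syslog puts kernel messages by default.)

# Placeholder node names; substitute the real cluster members
for n in node01 node02 node03; do
    echo "== $n =="
    ssh "$n" 'grep -c dlm_deref_lockres_worker /var/log/messages'
done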
Sunil Mushran
2010-Dec-09 01:49 UTC
[Ocfs2-users] Extremely poor write performance, but read appears to be okay
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commitdiff;h=1f667766cb67ed05b4d706aa82e8ad0b12eaae8b

That specific error has been addressed in the upcoming 1.4.8. Attach the logs and all other info to a bugzilla.
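(As a rough sketch of how to confirm what each node is actually running before filing the bug; the command names below are the ones shipped with ocfs2-tools on OEL5, and /dev/sdc1 matches the device in the df output above.)

# Installed ocfs2 userspace and kernel-module packages
rpm -qa | grep -i ocfs2

# Version information reported by the loaded kernel module
modinfo ocfs2

# Cluster stack status, and which nodes have the volume mounted
service o2cb status
mounted.ocfs2 -f /dev/sdc1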