Hi,

We have a storage server (HP DL360G5 + MSA20 (12 disks in RAID 6) on a
SmartArray6400).
10 directories are exported through nfs to 10 clients
(rsize=32768,wsize=32768,soft,intr,nosuid,proto=udp,vers=3).
The server is apparently not doing much, but... we have very high waiting
IOs.

dstat shows very little activity, but high 'wai'...

# dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0  88  12   0   0| 413k   98k|   0     0 |   0     0 | 188   132
  0   1  46  53   0   0| 716k   48k|  19k  420k|   0     0 |1345   476
  0   1  49  50   0   1| 492k   32k|  12k  181k|   0     0 |1269   482
  0   1  63  37   0   0| 316k  159k|  58k  278k|   0     0 |1789  1562
  0   0  74  26   0   0|  84k  512k|1937B 6680B|   0     0 |1200   106
  0   1  44  55   0   1| 612k   80k|  14k  221k|   0     0 |1378   538
  1   1  52  47   0   0| 628k    0 |  17k  318k|   0     0 |1327   520
  0   1  50  49   0   0| 484k   60k|  14k  178k|   0     0 |1303   494
  0   0  87  13   0   0| 124k    0 |7745B  116k|   0     0 |1083   139
  0   1  59  41   0   0| 316k   60k|4828B   67k|   0     0 |1179   346

top shows that one nfsd is usually in state 'D' (waiting).

# top -i (sorted by cpu usage)
top - 18:11:28 up 207 days, 7:13, 2 users, load average: 0.99, 1.07, 1.00
Tasks: 124 total, 1 running, 123 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 54.3%id, 45.3%wa, 0.2%hi, 0.0%si, 0.0%st
Mem:   3089252k total,  3068112k used,    21140k free,   928468k buffers
Swap:  2008116k total,      164k used,  2007952k free,   293716k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
16571 root  15  0 12708 1076  788 R    1  0.0  0:00.02 top
 2580 root  15  0     0    0    0 D    0  0.0  2:36.70 nfsd

# cat /proc/net/rpc/nfsd
rc 8872 34768207 38630969
fh 142 0 0 0 0
io 2432226534 884662242
th 32 394 4851.311 2437.416 370.949 238.432 542.241 4.942 2.239 1.000 0.427 0.541
ra 64 3876274 5025 3724 2551 2030 2036 1506 1607 1219 1154 1136249
net 73410453 73261524 0 0
rpc 73408119 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 33 9503937 1315066 11670859 7139862 0 5033349 28129122 3729031 0 0 0 487614 0 1116215 0 0 2054329 21225 66 0 2351744
proc4 2 0 0
proc4ops 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Do you think nfs is the problem here?
If so, is there something wrong with our config?
Is it too much to have 10 dirs x 10 clients, even if there is almost no
traffic?

Thx,
JD
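PS: in case it matters, each client mounts its directory roughly like
this (server name and paths made up, the options are as above):

# mount -t nfs -o rsize=32768,wsize=32768,soft,intr,nosuid,proto=udp,vers=3 server:/export/dirN /mnt/dirN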
John Doe wrote:
> Hi,
>
> We have a storage server (HP DL360G5 + MSA20 (12 disks in RAID 6) on a
> SmartArray6400).
> 10 directories are exported through nfs to 10 clients
> (rsize=32768,wsize=32768,soft,intr,nosuid,proto=udp,vers=3).
> The server is apparently not doing much, but... we have very high waiting
> IOs.

How about running iostat -x? Sounds like the system is doing a lot more
than you think it is.
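Something along these lines, and watch it over several intervals
(iostat -x comes with the sysstat package if it's not already installed;
the 5 is just the sample interval in seconds):

# iostat -x 5

nate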
> Do you think nfs is the problem here?
> If so, is there something wrong with our config?
> Is it too much to have 10 dirs x 10 clients, even if there is almost no
> traffic?

This has very little to do with the number of clients or directories,
except that in general terms more users = more work. If the workload is
highly random and not easily cached, you could be seeing low throughput
but also IOPS starvation.

Does 12 disks in RAID 6 mean 10 data disks? That's not a lot at all,
especially if it is an array that doesn't do much abstraction; any given
block will only exist on a single spindle, creating potential for hot
spots if we are talking large disks.

Can you do any analysis of the array, specifically looking for hot disks,
busy disks, high latencies, etc.? Do you know what your baseline should
be for this array versus your 10-client workload?
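If you have HP's hpacucli utility on that box (I'm assuming it supports
the Smart Array 6400; check HP's docs), it will at least show you the
state of the controller, the logical drive, and each physical disk:

# hpacucli ctrl all show config detail

It won't give you per-spindle latencies though; the controller hides the
individual disks from the OS, so for latency you're stuck watching the
logical device with iostat or similar.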
John Doe wrote:
> Hi,
>
> We have a storage server (HP DL360G5 + MSA20 (12 disks in RAID 6) on a SmartArray6400).
> 10 directories are exported through nfs to 10 clients (rsize=32768,wsize=32768,soft,intr,nosuid,proto=udp,vers=3).
> The server is apparently not doing much, but... we have very high waiting IOs.
>
> dstat shows very little activity, but high 'wai'...

As others have said, run iostat -x N for an N like 5 (5 seconds). Ignore
the first sample, as it's the average since boot; instead, look at the
ongoing 5-second interval samples. Look for high await, svctm, and %util,
as well as the rrqm/s and wrqm/s numbers, rather than sec/s, since
sequential access is likely the least of your problems. RAID 6 does
pretty poorly on heavy random write workloads.

Also, IBM has a neat freeware system analysis tool called NMON
(originally for AIX, ported to Linux):

http://www.ibm.com/developerworks/aix/library/au-analyze_aix/

It works sorta like a souped-up 'top' but has per-filesystem IO stats and
such. It can also accumulate stats over a long period into a CSV file,
and they have an Excel spreadsheet that loads said CSV file and cranks
out a lot of fairly useful graphs. Dunno if the Excel spreadsheet works
in OOcalc or not.
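From memory, capture mode is something like this (double-check with
nmon -h: -f writes the spreadsheet-format file, -s is seconds between
snapshots, -c is the number of snapshots):

# nmon -f -s 30 -c 120

That should leave a hostname_date.nmon file behind for the spreadsheet to
load. Interactively, just run nmon and press 'd' for the disk stats view.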
John Doe wrote:
> Hi,
>
> We have a storage server (HP DL360G5 + MSA20 (12 disks in RAID 6) on a SmartArray6400).
> 10 directories are exported through nfs to 10 clients (rsize=32768,wsize=32768,soft,intr,nosuid,proto=udp,vers=3).
> The server is apparently not doing much, but... we have very high waiting IOs.

Probably not connected, but personally I would use 'hard' and 'proto=tcp'
instead of 'soft' and 'proto=udp' on the clients.
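i.e. something like this in each client's /etc/fstab (server and mount
point names made up):

server:/export/dir1  /mnt/dir1  nfs  rsize=32768,wsize=32768,hard,intr,nosuid,proto=tcp,vers=3  0 0

With 'soft', a server that stops responding eventually causes I/O errors
to be returned to applications, which can mean silent data corruption;
'hard' just blocks until the server comes back.

James Pearson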