Sergey Koposov
2014-Jul-21 20:35 UTC
[Gluster-users] glusterfs, striped volume x8, poor sequential read performance, good write performance
Hi,

I have an HPC installation with 8 nodes. Each node has a software RAID1 built from two NL-SAS disks, and the disks from the 8 nodes are combined into a large shared striped 20 TB glusterfs volume, which shows abnormally slow sequential read performance while the write performance is good.

What I basically see is that the write performance is quite decent, ~ 500 MB/s (tested using dd):

[root at XXXX bigstor]# dd if=/dev/zero of=test2 bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 186.393 s, 563 MB/s

And this is not just sitting in the cache of each node, as I can see the data being flushed to the disks at approximately the right speed.

At the same time the read performance (tested using dd after dropping the caches beforehand) is really bad:

[root at XXXX bigstor]# dd if=/data/bigstor/test of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 309.821 s, 33.8 MB/s

While this runs, the glusterfs processes only take ~ 10-15% of the CPU at most, so it is not CPU starvation. The underlying devices do not seem to be loaded at all:

Device:  rrqm/s  wrqm/s    r/s    w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00  73.00   0.00  9344.00    0.00   256.00     0.11   1.48   1.47  10.70

To check that the disks are not the problem, I did a separate test of the read speed of the RAIDed disks on all machines; they read at ~ 180 MB/s (uncached), so they are not the problem either. I also tried increasing the readahead on the RAID devices:

echo 2048 > /sys/block/md126/queue/read_ahead_kb

but that does not seem to help at all.

Does anyone have any advice on what to do here? Which knobs should I adjust? Honestly, to me this looks like a bug, but I would be happy to learn there is a magic switch I forgot to turn on :)

Here are more details about my system:

OS: CentOS 6.5
glusterfs: 3.4.4
Kernel: 2.6.32-431.20.3.el6.x86_64

mount options and df output:

[root at XXXX bigstor]# cat /etc/mtab
/dev/md126p4 /data/glvol/brick1 xfs rw 0 0
node1:/glvol /data/bigstor fuse.glusterfs rw,default_permissions,allow_other,max_read=131072 0 0

[root at XXXX bigstor]# df
Filesystem       1K-blocks        Used   Available Use% Mounted on
/dev/md126p4    2516284988  2356820844   159464144  94% /data/glvol/brick1
node1:/glvol   20130279808 18824658688  1305621120  94% /data/bigstor

brick info:

xfs_info /data/glvol/brick1
meta-data=/dev/md126p4           isize=512    agcount=4, agsize=157344640 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=629378560, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=307313, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Here is the gluster info:

[root at XXXXXX bigstor]# gluster
gluster> volume info glvol

Volume Name: glvol
Type: Stripe
Volume ID: 53b2f6ad-46a6-4359-acad-dc5b6687d535
Status: Started
Number of Bricks: 1 x 8 = 8
Transport-type: tcp
Bricks:
Brick1: node1:/data/glvol/brick1/brick
Brick2: node2:/data/glvol/brick1/brick
Brick3: node3:/data/glvol/brick1/brick
Brick4: node4:/data/glvol/brick1/brick
Brick5: node5:/data/glvol/brick1/brick
Brick6: node6:/data/glvol/brick1/brick
Brick7: node7:/data/glvol/brick1/brick
Brick8: node8:/data/glvol/brick1/brick

The network I use is IP over InfiniBand with very high throughput.

I also saw the discussion here of a similar issue:
http://supercolony.gluster.org/pipermail/gluster-users/2013-February/035560.html
but there it was blamed on ext4.
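For completeness, this is the kind of volume-level tuning I am considering trying next. It is only a sketch: I am assuming that these option names and the "gluster volume set" syntax are valid for glusterfs 3.4.x, and I have not verified that any of them actually help with the slow reads.

# larger stripe chunk, so each sequential MB touches fewer bricks
# (my assumption that this is related to the slow reads)
gluster volume set glvol cluster.stripe-block-size 2MB

# more aggressive client-side read-ahead and io-cache
gluster volume set glvol performance.read-ahead-page-count 16
gluster volume set glvol performance.cache-size 256MB

# confirm which options were actually applied
gluster volume info glvol

If someone knows which of these (if any) matter for a striped volume, I would appreciate a pointer.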
Thanks in advance,
Sergey

PS I also looked at the contents of /var/lib/glusterd/vols/glvol/glvol-fuse.vol and saw this; I don't know whether it is relevant or not:

volume glvol-client-0
    type protocol/client
    option transport-type tcp
    option remote-subvolume /data/glvol/brick1/brick
    option remote-host node1
end-volume

......

volume glvol-stripe-0
    type cluster/stripe
    subvolumes glvol-client-0 glvol-client-1 glvol-client-2 glvol-client-3 glvol-client-4 glvol-client-5 glvol-client-6 glvol-client-7
end-volume

volume glvol-dht
    type cluster/distribute
    subvolumes glvol-stripe-0
end-volume

volume glvol-write-behind
    type performance/write-behind
    subvolumes glvol-dht
end-volume

volume glvol-read-ahead
    type performance/read-ahead
    subvolumes glvol-write-behind
end-volume

volume glvol-io-cache
    type performance/io-cache
    subvolumes glvol-read-ahead
end-volume

volume glvol-quick-read
    type performance/quick-read
    subvolumes glvol-io-cache
end-volume

volume glvol-open-behind
    type performance/open-behind
    subvolumes glvol-quick-read
end-volume

volume glvol-md-cache
    type performance/md-cache
    subvolumes glvol-open-behind
end-volume

volume glvol
    type debug/io-stats
    option count-fop-hits off
    option latency-measurement off
    subvolumes glvol-md-cache
end-volume

*****************************************************
Sergey E. Koposov, PhD, Senior Research Associate
Institute of Astronomy, University of Cambridge
Madingley Road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551
Web: http://www.ast.cam.ac.uk/~koposov/