Hi all,

although I'm running all this on Sol10u5 X4500s, I hope I may ask this question here. If not, please let me know where to head to.

We are running several X4500s with only 3 raidz2 vdevs per pool since we want quite a bit of storage space[*], but the performance we get when using zfs send is sometimes really lousy. Of course this depends on what's in the file system, but when doing a few backups today I saw the following:

receiving full stream of atlashome/XXX@2008-10-13T115649 into atlashome/BACKUP/XXX@2008-10-13T115649
in @ 11.1 MB/s, out @ 11.1 MB/s, 14.9 GB total, buffer 0% full
summary: 14.9 GByte in 45 min 42.8 sec - average of 5708 kB/s

So, a mere 15 GB were transferred in 45 minutes; another user's home, which is quite large (7 TB), took more than 42 hours to be transferred. Since all this is going over a 10 Gb/s network and the CPUs are all idle, I would really like to know why

* zfs send is so slow, and
* how I can improve the speed.

Thanks a lot for any hint.

Cheers

Carsten

[*] We have done quite a few tests with more vdevs but were not able to improve the speeds substantially. For this particular bad file system I still need to histogram the file sizes.

--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
Carsten Aulbert wrote:
> Since all this is going over a 10 Gb/s network and the CPUs are all idle,
> I would really like to know why

What are you using to transfer the data over the network ?

--
Darren J Moffat
Carsten Aulbert schrieb:
> So, a mere 15 GB were transferred in 45 minutes; another user's home,
> which is quite large (7 TB), took more than 42 hours to be transferred.

Carsten,

the summary looks like you are using mbuffer. Can you elaborate on what
options you are passing to mbuffer? Maybe changing the blocksize to be
consistent with the recordsize of the zpool could improve performance.
Is the buffer running full or is it empty most of the time? Are you sure
that the network connection is 10Gb/s all the way through from machine
to machine?

- Thomas
Hi,

Darren J Moffat wrote:
> What are you using to transfer the data over the network ?

Initially just plain ssh, which was way too slow; now we use mbuffer on
both ends and transfer the data over a socket via socat - I know that
mbuffer already allows this, but in a few tests socat seemed to be
faster. Sorry for not writing this into the first email.

Cheers

Carsten
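For readers who want to reproduce a pipeline like the one just described, a minimal sketch might look as follows; the port number, buffer sizes and the exact socat options are placeholders and assumptions, not necessarily what was used here (the dataset names are the ones from the first post):

receiver$ socat -u TCP-LISTEN:9090,reuseaddr - | mbuffer -s 128k -m 2048M | zfs receive atlashome/BACKUP/XXX
sender$   zfs send atlashome/XXX@2008-10-13T115649 | mbuffer -s 128k -m 2048M | socat -u - TCP:receiver:9090

The mbuffer instances decouple the bursty zfs send/receive from the network, while socat only moves bytes between stdin/stdout and the TCP socket.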
Hi Thomas,

Thomas Maier-Komor wrote:
> Can you elaborate on what options you are passing to mbuffer? [...]
> Are you sure that the network connection is 10Gb/s all the way through
> from machine to machine?

Well spotted :)

Right now plain mbuffer with plenty of buffer (-m 2048M) on both ends,
and I have not seen any buffer exceeding the 10% watermark level. The
network connections are via Neterion XFrame II Sun Fire NICs, then via
CX4 cables to our core switch where both boxes are directly connected
(Woven Systems EFX 1000). netperf tells me that the TCP performance is
close to 7.5 Gbit/s duplex, and if I use

cat /dev/zero | mbuffer | socat ---> socat | mbuffer > /dev/null

I easily see speeds of about 350-400 MB/s, so I think the network is fine.

Cheers

Carsten
Carsten Aulbert schrieb:
> Right now plain mbuffer with plenty of buffer (-m 2048M) on both ends,
> and I have not seen any buffer exceeding the 10% watermark level.
> [...]
> I easily see speeds of about 350-400 MB/s, so I think the network is fine.

I don't know socat or what benefit it gives you, but have you tried
using mbuffer to send and receive directly (options -I and -O)?

Additionally, try to set the block size of mbuffer to the recordsize of
zfs (usually 128k):

receiver$ mbuffer -I sender:10000 -s 128k -m 2048M | zfs receive
sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:10000

As transmitting from /dev/zero to /dev/null is at a rate of 350 MB/s, I
guess you are really hitting the maximum speed of your zpool. From my
understanding, I'd guess sending is always slower than receiving,
because reads are random and writes are sequential. So it should be
quite normal that mbuffer's buffer doesn't really see a lot of usage.

Cheers,
Thomas
Hi again,

Thomas Maier-Komor wrote:
> I don't know socat or what benefit it gives you, but have you tried
> using mbuffer to send and receive directly (options -I and -O)?

I thought we tried that in the past and with socat it seemed faster, but
I just made a brief test and I got (/dev/zero -> remote /dev/null) 330
MB/s with mbuffer+socat and 430 MB/s with mbuffer alone.

> Additionally, try to set the block size of mbuffer to the recordsize of
> zfs (usually 128k):
> receiver$ mbuffer -I sender:10000 -s 128k -m 2048M | zfs receive
> sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:10000

We are using 32k since many of our users have tiny files (and then I need
to reduce the buffer size because of this 'funny' error):

mbuffer: fatal: Cannot address so much memory
(32768*65536=2147483648>1544040742911).

Does this qualify for a bug report?

Thanks for the hint of looking into this again!

Cheers

Carsten
Carsten Aulbert schrieb:
> We are using 32k since many of our users have tiny files (and then I need
> to reduce the buffer size because of this 'funny' error):
>
> mbuffer: fatal: Cannot address so much memory
> (32768*65536=2147483648>1544040742911).
>
> Does this qualify for a bug report?

Yes, this qualifies for a bug report. As a workaround for now, you can
compile in 64-bit mode, i.e.:

$ ./configure CFLAGS="-g -O -m64"
$ make && make install

This works for Sun Studio 12 and gcc. For older versions of Sun Studio,
you need to pass -xarch=v9 instead of -m64.

I am planning to release an updated version of mbuffer this week. I'll
include a patch for this issue.

Cheers,
Thomas
Hi,

I'm just doing my first proper send/receive over the network and I'm
getting just 9.4 MB/s over a gigabit link. Would you be able to provide
an example of how to use mbuffer / socat with ZFS for a Solaris beginner?

thanks,

Ross
Ross schrieb:
> I'm just doing my first proper send/receive over the network and I'm
> getting just 9.4 MB/s over a gigabit link. Would you be able to provide
> an example of how to use mbuffer / socat with ZFS for a Solaris beginner?

receiver> mbuffer -I sender:10000 -s 128k -m 512M | zfs receive

sender> zfs send mypool/myfilesystem@mysnapshot | mbuffer -s 128k -m 512M -O receiver:10000

BTW: I release a new version of mbuffer today.

HTH,
Thomas
Thanks, that got it working. I'm still only getting 10 MB/s, so it
hasn't solved my problem - I've still got a bottleneck somewhere, but
mbuffer is a huge improvement over standard zfs send / receive. It makes
such a difference when you can actually see what's going on.

> receiver> mbuffer -I sender:10000 -s 128k -m 512M | zfs receive
>
> sender> zfs send mypool/myfilesystem@mysnapshot | mbuffer -s 128k -m
> 512M -O receiver:10000

Ross
Hi Ross,

Ross Smith wrote:
> Thanks, that got it working. I'm still only getting 10 MB/s, so it
> hasn't solved my problem - I've still got a bottleneck somewhere, but
> mbuffer is a huge improvement over standard zfs send / receive.

I'm currently trying to investigate this a bit. One of our users' home
directories is extremely slow to 'zfs send'. It started yesterday
afternoon at about 1600+0200, is still running and has only copied less
than 50% of the whole tree. On the receiving side zfs get tells me:

atlashome/BACKUP/XXX  used           193G   -
atlashome/BACKUP/XXX  available      17.2T  -
atlashome/BACKUP/XXX  referenced     193G   -
atlashome/BACKUP/XXX  compressratio  1.81x  -

So close to 350 GB have been transferred and about 500 GB are still to go.

More later.

Carsten
Thomas Maier-Komor schrieb:
> BTW: I release a new version of mbuffer today.

WARNING!!!

Sorry people!!!

The latest version of mbuffer has a regression that can CORRUPT output
if stdout is used. Please fall back to the last version. A fix is on the
way...

- Thomas
I'm using 2008-05-07 (latest stable), am I right in assuming that one is ok?

> The latest version of mbuffer has a regression that can CORRUPT output
> if stdout is used. Please fall back to the last version. A fix is on the
> way...
Hi all, Carsten Aulbert wrote:> More later.OK, I''m completely puzzled right now (and sorry for this lengthy email). My first (and currently only idea) was that the size of the files is related to this effect, but that does not seem to be the case: (1) A 185 GB zfs file system was transferred yesterday with a speed of about 60 MB/s to two different servers. The histogram of files looks like: 2822 files were investigated, total size is: 185.82 Gbyte Summary of file sizes [bytes]: zero: 2 1 -> 2 0 2 -> 4 1 4 -> 8 3 8 -> 16 26 16 -> 32 8 32 -> 64 6 64 -> 128 29 128 -> 256 11 256 -> 512 13 512 -> 1024 17 1024 -> 2k 33 2k -> 4k 45 4k -> 8k 9044 ************ 8k -> 16k 60 16k -> 32k 41 32k -> 64k 19 64k -> 128k 22 128k -> 256k 12 256k -> 512k 5 512k -> 1024k 1218 ** 1024k -> 2M 16004 ********************* 2M -> 4M 46202 ************************************************************ 4M -> 8M 0 8M -> 16M 0 16M -> 32M 0 32M -> 64M 0 64M -> 128M 0 128M -> 256M 0 256M -> 512M 0 512M -> 1024M 0 1024M -> 2G 0 2G -> 4G 0 4G -> 8G 0 8G -> 16G 1 (2) Currently a much larger file system is being transferred, the same script (even the same incarnation, i.e. process) is now running close to 22 hours: 28549 files were investigated, total size is: 646.67 Gbyte Summary of file sizes [bytes]: zero: 4954 ************************** 1 -> 2 0 2 -> 4 0 4 -> 8 1 8 -> 16 1 16 -> 32 0 32 -> 64 0 64 -> 128 1 128 -> 256 0 256 -> 512 9 512 -> 1024 71 1024 -> 2k 1 2k -> 4k 1095 ****** 4k -> 8k 8449 ********************************************* 8k -> 16k 2217 ************ 16k -> 32k 503 *** 32k -> 64k 1 64k -> 128k 1 128k -> 256k 1 256k -> 512k 0 512k -> 1024k 0 1024k -> 2M 0 2M -> 4M 0 4M -> 8M 16 8M -> 16M 0 16M -> 32M 0 32M -> 64M 11218 ************************************************************ 64M -> 128M 0 128M -> 256M 0 256M -> 512M 0 512M -> 1024M 0 1024M -> 2G 0 2G -> 4G 5 4G -> 8G 1 8G -> 16G 3 16G -> 32G 1 When watching zpool iostat I get this (30 second average, NOT the first output): capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- atlashome 3.54T 17.3T 137 0 4.28M 0 raidz2 833G 6.00T 1 0 30.8K 0 c0t0d0 - - 1 0 2.38K 0 c1t0d0 - - 1 0 2.18K 0 c4t0d0 - - 0 0 1.91K 0 c6t0d0 - - 0 0 1.76K 0 c7t0d0 - - 0 0 1.77K 0 c0t1d0 - - 0 0 1.79K 0 c1t1d0 - - 0 0 1.86K 0 c4t1d0 - - 0 0 1.97K 0 c5t1d0 - - 0 0 2.04K 0 c6t1d0 - - 1 0 2.25K 0 c7t1d0 - - 1 0 2.31K 0 c0t2d0 - - 1 0 2.21K 0 c1t2d0 - - 0 0 1.99K 0 c4t2d0 - - 0 0 1.99K 0 c5t2d0 - - 1 0 2.38K 0 raidz2 1.29T 5.52T 67 0 2.09M 0 c6t2d0 - - 58 0 143K 0 c7t2d0 - - 58 0 141K 0 c0t3d0 - - 53 0 131K 0 c1t3d0 - - 53 0 130K 0 c4t3d0 - - 58 0 143K 0 c5t3d0 - - 58 0 145K 0 c6t3d0 - - 59 0 147K 0 c7t3d0 - - 59 0 146K 0 c0t4d0 - - 59 0 145K 0 c1t4d0 - - 58 0 145K 0 c4t4d0 - - 58 0 145K 0 c6t4d0 - - 58 0 143K 0 c7t4d0 - - 58 0 143K 0 c0t5d0 - - 58 0 145K 0 c1t5d0 - - 58 0 144K 0 raidz2 1.43T 5.82T 69 0 2.16M 0 c4t5d0 - - 62 0 141K 0 c5t5d0 - - 60 0 138K 0 c6t5d0 - - 59 0 135K 0 c7t5d0 - - 60 0 138K 0 c0t6d0 - - 62 0 142K 0 c1t6d0 - - 61 0 138K 0 c4t6d0 - - 59 0 135K 0 c5t6d0 - - 60 0 138K 0 c6t6d0 - - 62 0 142K 0 c7t6d0 - - 61 0 138K 0 c0t7d0 - - 58 0 134K 0 c1t7d0 - - 60 0 137K 0 c4t7d0 - - 62 0 142K 0 c5t7d0 - - 61 0 139K 0 c6t7d0 - - 58 0 134K 0 c7t7d0 - - 60 0 138K 0 ---------- ----- ----- ----- ----- ----- ----- Odd things: (1) The zpool is not equally striped across the raidz2-pools (2) The disks should be able to perform much much faster than they currently output data at, I believe it;s 2008 and not 1995. 
(3) The four cores of the X4500 are dying of boredom, i.e. idle >95% all
the time.

Does anyone have a good idea where the bottleneck could be? I'm running
out of ideas.

Cheers

Carsten
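As an aside, for anyone who wants to produce a similar power-of-two file-size histogram, a rough sketch along these lines should work; the path is a placeholder, nawk with its log() function is assumed to be available, and the bucket boundaries are only approximate:

find /atlashome/XXX -type f -exec ls -ld {} \; | nawk '
    { size = $5
      if (size == 0) { zero++; next }
      b = int(log(size) / log(2) + 1e-9)      # power-of-two bucket index
      count[b]++ }
    END {
      printf("zero-length files: %d\n", zero)
      for (b = 0; b <= 40; b++)
          if (count[b] > 0)
              printf("%14.0f -> %14.0f bytes: %d\n", 2^b, 2^(b+1), count[b]) }'

This walks the file system once, reads the size column from ls -l, and counts files per size bucket, which is enough to reproduce the tables shown above.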
Ross Smith schrieb:
> I'm using 2008-05-07 (latest stable), am I right in assuming that one is ok?

Yes, this one is OK. The regression appeared in 20081014.

- Thomas
Hello all,

I think in SS 11 it should be -xarch=amd64.

Leal.
comments below...

Carsten Aulbert wrote:
> [file size histograms and 30-second zpool iostat output snipped]
>
> Odd things:
>
> (1) The zpool is not equally striped across the raidz2-pools

Since you are reading, it depends on where the data was written.
Remember, ZFS dynamic striping != RAID-0. I would expect something like
this if the pool was expanded at some point in time.

> (2) The disks should be able to perform much much faster than they
> currently output data at, I believe it's 2008 and not 1995.

X4500? Those disks are good for about 75-80 random iops, which seems to
be about what they are delivering. The dtrace tool, iopattern, will show
the random/sequential nature of the workload.

> (3) The four cores of the X4500 are dying of boredom, i.e. idle >95% all
> the time.
>
> Does anyone have a good idea where the bottleneck could be? I'm running
> out of ideas.

I would suspect the disks. 30 second samples are not very useful to try
and debug such things -- even 1 second samples can be too coarse. But
you should take a look at 1 second samples to see if there is a
consistent I/O workload.
 -- richard
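For reference, a hypothetical session following this advice might look like the commands below; the pool name is the one from this thread, while the DTraceToolkit install path is an assumption (iopattern ships with the DTraceToolkit, not with stock Solaris):

# 1-second samples of the pool while the send is running
zpool iostat -v atlashome 1

# random vs. sequential breakdown of the disk I/O, 1-second samples, 10 rows
/opt/DTT/iopattern 1 10

iopattern prints the percentage of random versus sequential I/O per interval, which is exactly the distinction Richard is pointing at for a seek-bound zfs send.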
Hi Richard, Richard Elling wrote:> Since you are reading, it depends on where the data was written. > Remember, ZFS dynamic striping != RAID-0. > I would expect something like this if the pool was expanded at some > point in time.No, the RAID was set-up in one go right after jumpstarting the box.>> (2) The disks should be able to perform much much faster than they >> currently output data at, I believe it;s 2008 and not 1995. >> > > X4500? Those disks are good for about 75-80 random iops, > which seems to be about what they are delivering. The dtrace > tool, iopattern, will show the random/sequential nature of the > workload. >I need to read about his a bit and will try to analyze it.>> (3) The four cores of the X4500 are dying of boredom, i.e. idle >95% all >> the time. >> >> Has anyone a good idea, where the bottleneck could be? I''m running out >> of ideas. >> > > I would suspect the disks. 30 second samples are not very useful > to try and debug such things -- even 1 second samples can be > too coarse. But you should take a look at 1 second samples > to see if there is a consistent I/O workload. > -- richard >Without doing too much statistics (yet, if needed I can easily do that) it looks like these: capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- atlashome 3.54T 17.3T 256 0 7.97M 0 raidz2 833G 6.00T 0 0 0 0 c0t0d0 - - 0 0 0 0 c1t0d0 - - 0 0 0 0 c4t0d0 - - 0 0 0 0 c6t0d0 - - 0 0 0 0 c7t0d0 - - 0 0 0 0 c0t1d0 - - 0 0 0 0 c1t1d0 - - 0 0 0 0 c4t1d0 - - 0 0 0 0 c5t1d0 - - 0 0 0 0 c6t1d0 - - 0 0 0 0 c7t1d0 - - 0 0 0 0 c0t2d0 - - 0 0 0 0 c1t2d0 - - 0 0 0 0 c4t2d0 - - 0 0 0 0 c5t2d0 - - 0 0 0 0 raidz2 1.29T 5.52T 133 0 4.14M 0 c6t2d0 - - 117 0 285K 0 c7t2d0 - - 114 0 279K 0 c0t3d0 - - 106 0 261K 0 c1t3d0 - - 114 0 282K 0 c4t3d0 - - 118 0 294K 0 c5t3d0 - - 125 0 308K 0 c6t3d0 - - 126 0 311K 0 c7t3d0 - - 118 0 293K 0 c0t4d0 - - 119 0 295K 0 c1t4d0 - - 120 0 298K 0 c4t4d0 - - 120 0 291K 0 c6t4d0 - - 106 0 257K 0 c7t4d0 - - 96 0 236K 0 c0t5d0 - - 109 0 267K 0 c1t5d0 - - 114 0 282K 0 raidz2 1.43T 5.82T 123 0 3.83M 0 c4t5d0 - - 108 0 242K 0 c5t5d0 - - 104 0 236K 0 c6t5d0 - - 104 0 239K 0 c7t5d0 - - 107 0 245K 0 c0t6d0 - - 108 0 248K 0 c1t6d0 - - 106 0 245K 0 c4t6d0 - - 108 0 250K 0 c5t6d0 - - 112 0 258K 0 c6t6d0 - - 114 0 261K 0 c7t6d0 - - 110 0 253K 0 c0t7d0 - - 109 0 248K 0 c1t7d0 - - 109 0 246K 0 c4t7d0 - - 108 0 243K 0 c5t7d0 - - 108 0 244K 0 c6t7d0 - - 106 0 240K 0 c7t7d0 - - 109 0 244K 0 ---------- ----- ----- ----- ----- ----- ----- the iops vary between about 70 - 140, interesting bit is that the first raidz2 does not get any hits at all :( Cheers Carsten
Hi All,

Just want to note that I had the same issue with zfs send + vdevs that
had 11 drives in them on a X4500. Reducing the count of drives per vdev
cleared this up.

One vdev is IOPS limited to the speed of one drive in that vdev,
according to this post (see comment from ptribble):
http://opensolaris.org/jive/thread.jspa?threadID=74033
On Wed, Oct 15, 2008 at 2:17 PM, Scott Williamson <scott.williamson at gmail.com> wrote:
> Just want to note that I had the same issue with zfs send + vdevs that
> had 11 drives in them on a X4500. Reducing the count of drives per vdev
> cleared this up.

Scott,

Can you tell us the configuration that you're using that is working for
you? Were you using RaidZ, or RaidZ2? I'm wondering what the "sweet spot"
is to get a good compromise in vdevs and usable space/performance.

Thanks!

--
Brent Jones
brent at servuhome.net
Hi again,

Brent Jones wrote:
> Can you tell us the configuration that you're using that is working for
> you? Were you using RaidZ, or RaidZ2? I'm wondering what the "sweet spot"
> is to get a good compromise in vdevs and usable space/performance.

Some time ago I made some tests to find this:

(1) create a new zpool
(2) copy a user's home to it (always the same ~25 GB IIRC)
(3) zfs send to /dev/null
(4) evaluate && continue loop

I did this for fully mirrored setups, raidz as well as raidz2; the
results were mixed:

https://n0.aei.uni-hannover.de/cgi-bin/twiki/view/ATLAS/ZFSBenchmarkTest#ZFS_send_performance_relevant_fo

The caveat here might be that in retrospect this seemed like a "good"
home filesystem, i.e. one which was quite fast.

If you don't want to bother with the table:

Mirrored setups never exceeded 58 MB/s and got faster the more small
mirrors we used.

RaidZ had its sweet spot with a configuration of '6 6 6 6 6 6 5 5', i.e.
6 or 5 disks per RaidZ and 8 vdevs.

RaidZ2 finally was best at '10 9 9 9 9', i.e. 5 vdevs, but not much worse
with only 3, i.e. what we are currently using to get more storage space
(it gains us about 2 TB/box).

Cheers

Carsten
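One iteration of such a test might look roughly like the sketch below; the device list, the layout shown (two of eight 6-disk raidz vdevs) and the reference data path are all placeholders, not the exact setup used for the numbers above:

# recreate the test pool with the layout under test
zpool destroy testpool
zpool create -f testpool \
    raidz c0t0d0 c1t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0 \
    raidz c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0
# (remaining vdevs of the layout listed the same way)

# copy the reference home directory and snapshot it
zfs create testpool/home
cp -pr /atlashome/reference/. /testpool/home/
zfs snapshot testpool/home@bench

# time a send to /dev/null to measure raw 'zfs send' throughput
ptime zfs send testpool/home@bench > /dev/null

Sending to /dev/null keeps the network and the receiving pool out of the measurement, so the numbers only reflect how fast the source pool layout can feed zfs send.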
On Wed, Oct 15, 2008 at 9:37 PM, Brent Jones <brent at servuhome.net> wrote:
> Can you tell us the configuration that you're using that is working for
> you? Were you using RaidZ, or RaidZ2?

I used RaidZ with 4x5 disk and 4x6 disk vdevs in one pool with two hot
spares. This is very similar to how the pre-installed OS shipped from
Sun. Also note that I am using ssh as the transfer method. I have not
tried mbuffer with this configuration, as in testing with initial home
directories of ~14 GB in size it was not needed.

This configuration seems to be similar to Carsten Aulbert's evaluation,
without mbuffer in the pipe.
Hi Carsten,

You seem to be using dd for write testing. In my testing I noted that
there was a large difference in write speed between using dd to write
from /dev/zero and copying other files. Writing from /dev/zero always
seemed to be fast, reaching the maximum of ~200 MB/s, while cp would
perform more poorly the fewer vdevs there were.

This also impacted the zfs send speed, as with fewer vdevs in RaidZ2 the
disks seemed to spend most of their time seeking during the send.

On Thu, Oct 16, 2008 at 1:27 AM, Carsten Aulbert <carsten.aulbert at aei.mpg.de> wrote:
> RaidZ2 finally was best at '10 9 9 9 9', i.e. 5 vdevs, but not much worse
> with only 3, i.e. what we are currently using to get more storage space
> (it gains us about 2 TB/box).
Ok, I'm not entirely sure this is the same problem, but it does sound
fairly similar. Apologies for hijacking the thread if this does turn out
to be something else.

After following the advice here to get mbuffer working with zfs send /
receive, I found I was only getting around 10 MB/s throughput. Thinking
it was a network problem I started the thread below in the OpenSolaris
help forum:

http://www.opensolaris.org/jive/thread.jspa?messageID=294846

Now though, I don't think it's the network at all. The end result from
that thread is that we can't see any errors in the network setup, and
using nicstat and NFS I can show that the server is capable of 50-60
MB/s over the gigabit link. Nicstat also shows clearly that both zfs
send / receive and mbuffer are only sending 1/5 of that amount of data
over the network.

I've completely run out of ideas of my own (but I do half expect there's
a simple explanation I haven't thought of). Can anybody think of a
reason why both zfs send / receive and mbuffer would be so slow?
Hi Scott,

Scott Williamson wrote:
> You seem to be using dd for write testing. In my testing I noted that
> there was a large difference in write speed between using dd to write
> from /dev/zero and copying other files.

You are right, the write benchmarks were done with dd just to have some
bulk figures, since usually zeros can be generated fast enough.

> This also impacted the zfs send speed, as with fewer vdevs in RaidZ2 the
> disks seemed to spend most of their time seeking during the send.

That seems a bit too simplistic to me. If you compare raidz with raidz2,
it seems that raidz2 is not too bad with fewer vdevs. I wish there was a
way for zfs send to avoid so many seeks. The << 1 TB file system is
still being sent, now close to 48 hours in.

Cheers

Carsten

PS: We still have a spare thumper sitting around, maybe I'll give it a
try with 5 vdevs.
Hi Ross,

Ross wrote:
> Now though, I don't think it's the network at all. [...] Can anybody
> think of a reason why both zfs send / receive and mbuffer would be so
> slow?

Try to separate the two things:

(1) Try /dev/zero -> mbuffer --- network ---> mbuffer > /dev/null

That should give you wire speed.

(2) Try zfs send | mbuffer > /dev/null

That should give you an idea how fast zfs send really is locally.

Carsten
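Spelled out with the same mbuffer conventions used earlier in the thread (host names, the port and the snapshot name are placeholders), the two tests might look like this:

(1) wire speed only:

receiver$ mbuffer -I sender:10000 -s 128k -m 512M > /dev/null
sender$   dd if=/dev/zero bs=1024k | mbuffer -s 128k -m 512M -O receiver:10000

(2) local zfs send speed only, no network involved:

sender$   zfs send atlashome/XXX@snap | mbuffer -s 128k -m 512M > /dev/null

If (1) is slow the problem is in the network path; if (2) is slow the source pool itself cannot feed the send any faster.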
Oh dear god. Sorry folks, it looks like the new hotmail really doesn''t play well with the list. Trying again in plain text:> Try to separate the two things: > > (1) Try /dev/zero -> mbuffer --- network ---> mbuffer> /dev/null > That should give you wirespeedI tried that already. It still gets just 10-11MB/s from this server. I can get zfs send / receive and mbuffer working at 30MB/s though from a couple of test servers (with much lower specs).> (2) Try zfs send | mbuffer> /dev/null > That should give you an idea how fast zfs send really is locally.Hmm, that''s better than 10MB/s, but the average is still only around 20MB/s: summary: 942 MByte in 47.4 sec - average of 19.9 MB/s I think that points to another problem though as the send mbuffer is 100% full. Certainly the pool itself doesn''t appear under any strain at all while this is going on: capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- rc-pool 732G 1.55T 171 85 21.3M 1.01M mirror 144G 320G 38 0 4.78M 0 c1t1d0 - - 6 0 779K 0 c1t2d0 - - 17 0 2.17M 0 c2t1d0 - - 14 0 1.85M 0 mirror 146G 318G 39 0 4.89M 0 c1t3d0 - - 20 0 2.50M 0 c2t2d0 - - 13 0 1.63M 0 c2t0d0 - - 6 0 779K 0 mirror 146G 318G 34 0 4.35M 0 c2t3d0 - - 19 0 2.39M 0 c1t5d0 - - 7 0 1002K 0 c1t4d0 - - 7 0 1002K 0 mirror 148G 316G 23 0 2.93M 0 c2t4d0 - - 8 0 1.09M 0 c2t5d0 - - 6 0 890K 0 c1t6d0 - - 7 0 1002K 0 mirror 148G 316G 35 0 4.35M 0 c1t7d0 - - 6 0 779K 0 c2t6d0 - - 12 0 1.52M 0 c2t7d0 - - 17 0 2.07M 0 c3d1p0 12K 504M 0 85 0 1.01M ---------- ----- ----- ----- ----- ----- ----- Especially when compared to the zfs send stats on my backup server which managed 30MB/s via mbuffer (Being received on a single virtual SATA disk): capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- rpool 5.12G 42.6G 0 5 0 27.1K c4t0d0s0 5.12G 42.6G 0 5 0 27.1K ---------- ----- ----- ----- ----- ----- ----- zfspool 431G 4.11T 261 0 31.4M 0 raidz2 431G 4.11T 261 0 31.4M 0 c4t1d0 - - 155 0 6.28M 0 c4t2d0 - - 155 0 6.27M 0 c4t3d0 - - 155 0 6.27M 0 c4t4d0 - - 155 0 6.27M 0 c4t5d0 - - 155 0 6.27M 0 ---------- ----- ----- ----- ----- ----- ----- The really ironic thing is that the 30MB/s send / receive was sending to a virtual SATA disk which is stored (via sync NFS) on the server I''m having problems with... Ross> Date: Thu, 16 Oct 2008 14:27:49 +0200 > From: carsten.aulbert at aei.mpg.de > To: myxiplx at hotmail.com > CC: zfs-discuss at opensolaris.org > Subject: Re: [zfs-discuss] Improving zfs send performance > > Hi Ross > > Ross wrote: >> Now though I don''t think it''s network at all. The end result from that thread is that we can''t see any errors in the network setup, and using nicstat and NFS I can show that the server is capable of 50-60MB/s over the gigabit link. Nicstat also shows clearly that both zfs send / receive and mbuffer are only sending 1/5 of that amount of data over the network. >> >> I''ve completely run out of ideas of my own (but I do half expect there''s a simple explanation I haven''t thought of). Can anybody think of a reason why both zfs send / receive and mbuffer would be so slow? > > Try to separate the two things: > > (1) Try /dev/zero -> mbuffer --- network ---> mbuffer> /dev/null > > That should give you wirespeed > > (2) Try zfs send | mbuffer> /dev/null > > That should give you an idea how fast zfs send really is locally. 
> > Carsten
So I am zfs sending ~450 datasets between thumpers running SOL10U5 via
ssh; most are empty except maybe 10 that have a few GB of files. I see
the following output on one that contained ~1 GB of files in my send
report.

Output from zfs receive -v:
"received 1.07Gb stream in 30 seconds (36.4Mb/sec)"

I have a few problems with this:

1. Should it not read 1.07GB for bytes?
2. Should it not read that this was done at a rate of 36.4MB/s?

The output seems to be incorrect, but makes sense if you uppercase the
b. This is an underwhelming ~292Mb/s!
Ok, just did some more testing on this machine to try to find where my bottlenecks are. Something very odd is going on here. As best I can tell there are two separate problems now: - something is throttling network output to 10MB/s - something is throttling zfs send to around 20MB/s The network throughput I''ve verified with mbuffer: 1. A quick mbuffer test from /dev/zero to /dev/null gave me 565MB/s. 2. On a test server, mbuffer sending from /dev/zero on one machine to /dev/null on another gave me 37MB/s 3. On the live server, mbuffer sending from /dev/zero to the same receiving machine gave me just under 10MB/s. This looks very much like mbuffer is throttled on this machine, but I know NFS can give me 60-80MB/s. Can anybody give me a clue as to what could be causing this? And the disk performance is just as confusing. Again I used a test server to provide a comparison, and this time used a zfs scrub with iostat to check the performance possible on the disks. Live server: 5 sets of 3 way mirrors Test server: 5 disk raid-z2 1. On the Live server, zfs send to /dev/null via mbuffer reports a speed of 21MB/s # zfs send pool at snapshot | mbuffer -s 128k -m 512M > /dev/null 2. On the Test server, zfs send to /dev/null via mbuffer reports a speed of 35MB/s 3. On the Live server, zpool scrub and iostat report a peak of 3k iops, and 283MB/s throughput. 4. On the Test server, zpool scrub and iostat report a peak of 472 iops, and 53MB/s throughput. Surely the send and scrub operations should give similar results? Why is zpool scrub running 10-15x faster than zfs send on the live server? The iostat figures on the live server are particularly telling. During a scrub (30s intervals): capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- rc-pool 734G 1.55T 2.94K 41 189M 788K mirror 144G 320G 578 6 39.2M 166K c1t1d0 - - 379 5 39.9M 166K c1t2d0 - - 379 5 39.9M 166K c2t1d0 - - 385 5 40.1M 166K mirror 147G 317G 633 2 37.8M 170K c1t3d0 - - 389 2 38.7M 171K c2t2d0 - - 393 2 38.9M 171K c2t0d0 - - 384 2 38.9M 171K mirror 147G 317G 619 6 37.3M 57.5K c2t3d0 - - 377 2 38.3M 57.9K c1t5d0 - - 377 2 38.3M 57.9K c1t4d0 - - 373 3 38.2M 57.9K mirror 148G 316G 638 10 37.6M 64.0K c2t4d0 - - 375 4 38.5M 64.4K c2t5d0 - - 386 6 38.2M 64.4K c1t6d0 - - 384 6 38.2M 64.4K mirror 149G 315G 540 6 37.4M 164K c1t7d0 - - 356 4 38.1M 164K c2t6d0 - - 362 5 38.2M 164K c2t7d0 - - 361 5 38.2M 164K c3d1p0 12K 504M 0 8 0 166K ---------- ----- ----- ----- ----- ----- ----- During a send (30s intervals): capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- rc-pool 734G 1.55T 148 55 18.6M 1.71M mirror 144G 320G 25 6 3.15M 235K c1t1d0 - - 8 3 1.02M 235K c1t2d0 - - 7 3 954K 235K c2t1d0 - - 9 3 1.19M 235K mirror 147G 317G 27 3 3.40M 203K c1t3d0 - - 8 2 1.03M 203K c2t2d0 - - 9 3 1.25M 203K c2t0d0 - - 8 2 1.11M 203K mirror 147G 317G 32 2 4.12M 205K c2t3d0 - - 11 1 1.45M 205K c1t5d0 - - 10 1 1.34M 205K c1t4d0 - - 10 1 1.34M 205K mirror 148G 316G 32 2 4.02M 201K c2t4d0 - - 10 1 1.37M 201K c2t5d0 - - 9 1 1.23M 201K c1t6d0 - - 11 1 1.43M 201K mirror 149G 315G 31 6 3.89M 180K c1t7d0 - - 11 2 1.45M 180K c2t6d0 - - 8 2 1.10M 180K c2t7d0 - - 10 2 1.35M 180K c3d1p0 12K 504M 0 34 0 727K ---------- ----- ----- ----- ----- ----- ----- Can anybody explain why zfs send could be so slow on one server? 
Is anybody else able to compare their iostat results for a zfs send and
a zpool scrub, to see if they also have such a huge difference between
the figures?

thanks,

Ross
Hi Ross,

On Fri, Oct 17, 2008 at 1:35 PM, Ross <myxiplx at googlemail.com> wrote:
> Ok, just did some more testing on this machine to try to find where my
> bottlenecks are. Something very odd is going on here. As best I can
> tell there are two separate problems now:
>
> - something is throttling network output to 10MB/s

I'll try to help you with this problem.

> 1. A quick mbuffer test from /dev/zero to /dev/null gave me 565MB/s.
> 2. On a test server, mbuffer sending from /dev/zero on one machine to
>    /dev/null on another gave me 37MB/s
> 3. On the live server, mbuffer sending from /dev/zero to the same
>    receiving machine gave me just under 10MB/s.
>
> This looks very much like mbuffer is throttled on this machine, but I
> know NFS can give me 60-80MB/s. Can anybody give me a clue as to what
> could be causing this?

Does your NFS mount go over a separate network? If not, just ignore this
advice. :)

When first testing out ZFS over NFS performance, I ran into a similar
problem. I had very nice graphs, all plateauing at 10MB/s, and was
getting frustrated at performance being so slow. It turned out that one
of my links was 100Mbit. I took a moment to breathe, learn from my
mistake (check the network links BEFORE running performance tests), and
ran my tests again.

Check your network links, make sure that it's Gigabit all the way
through, and that you're negotiating full duplex. A 100Mbit link will
give you just about 10MB/s throughput on network transfers.

- Dimitri
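A quick way to check this on the Solaris side is sketched below; interface names are whatever your driver uses, and on newer OpenSolaris builds 'dladm show-phys' replaces 'dladm show-dev':

# negotiated speed and duplex per NIC
dladm show-dev

# link state
dladm show-link

# raw interface speed as reported by the driver (kstat names vary by driver)
kstat -p | grep -i ifspeed

The switch port counters are worth checking too, since an auto-negotiation mismatch can show full gigabit on one end and 100 Mbit on the other.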
Yup, that's one of the first things I checked when it came out with
figures so close to 10MB/s. All three servers are running full duplex
gigabit though, as reported by both Solaris and the switch. And both the
NFS at 60+MB/s and the zfs send / receive are all going over the same
network link, in some cases to the same servers.
Hi All,

I have opened a ticket with Sun support #66104157 regarding zfs send /
receive and will let you know what I find out. Keep in mind that this is
for Solaris 10, not OpenSolaris.
>>>>> "r" == Ross <myxiplx at googlemail.com> writes:r> figures so close to 10MB/s. All three servers are running r> full duplex gigabit though there is one tricky way 100Mbit/s could still bite you, but it''s probably not happening to you. It mostly affects home users with unmanaged switches: http://www.smallnetbuilder.com/content/view/30212/54/ http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html because the big switch vendors all use pause frames safely: http://www.networkworld.com/netresources/0913flow2.html -- pause frames as interpreted by netgear are harmful -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081017/526f78f3/attachment.bin>
Scott Williamson wrote:
> I have opened a ticket with Sun support #66104157 regarding zfs send /
> receive and will let you know what I find out.

Thanks.

> Keep in mind that this is for Solaris 10, not OpenSolaris.

Keep in mind that any changes required for Solaris 10 will first be
available in OpenSolaris, including any changes which may have already
been implemented.
 -- richard
On Fri, Oct 17, 2008 at 2:48 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> Keep in mind that any changes required for Solaris 10 will first
> be available in OpenSolaris, including any changes which may
> have already been implemented.

For me (who uses SOL10) it is the only way I can get information about
what bugs and changes have been identified, and it helps me get stuff
from OpenSolaris into Sol10. The last support ticket resulted in a
Solaris iSCSI target to Windows initiator patch to Solaris 10 that made
iSCSI targets on ZFS actually work for us.
Hi,

Miles Nordin wrote:
> there is one tricky way 100Mbit/s could still bite you, but it's
> probably not happening to you. It mostly affects home users with
> unmanaged switches:
>
> http://www.smallnetbuilder.com/content/view/30212/54/
> http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html

That rings a bell. Ross, are you using NFS via UDP or TCP? May it be
that your network has different performance levels for different
transport types? For our network we have disabled pause frames
completely and rely only on TCP's internal mechanisms to prevent
flooding/blocking.

Carsten

PS: the job with ~25k files adding up to 800 GB is now done - zfs send
took only 52 hrs and the speed was ~4.5 MB/s :(
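For reference, the UDP-versus-TCP question can be answered on the NFS client with the stock Solaris tool; nothing is assumed here beyond an existing NFS mount:

# per mount: NFS version, proto=tcp or proto=udp, rsize/wsize and timeouts
nfsstat -m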
Richard Elling wrote:
>> Keep in mind that this is for Solaris 10, not OpenSolaris.
>
> Keep in mind that any changes required for Solaris 10 will first
> be available in OpenSolaris, including any changes which may
> have already been implemented.

Indeed. For example, less than a week ago a fix for the following two
CRs (along with some others) was put back into Solaris Nevada:

6333409 traversal code should be able to issue multiple reads in parallel
6418042 want traversal in depth-first pre-order for quicker 'zfs send'

This should have a positive impact on 'zfs send' performance.

Wbr,
victor
On Mon, Oct 20, 2008 at 1:52 AM, Victor Latushkin <Victor.Latushkin at sun.com> wrote:
> Indeed. For example, less than a week ago a fix for the following two CRs
> (along with some others) was put back into Solaris Nevada:
>
> 6333409 traversal code should be able to issue multiple reads in parallel
> 6418042 want traversal in depth-first pre-order for quicker 'zfs send'

That is helpful, Victor. Does anyone have a full list of CRs that I can
provide to Sun support? I have tried searching the bugs database, but I
didn't even find those two on my own.
Thomas, for long-latency fat links, it should be quite beneficial to set
the socket buffer on the receive side (instead of having users tune
tcp_recv_hiwat).

Throughput of a TCP connection is gated by
"receive socket buffer / round trip time".

Could that be Ross's problem?

-r

Ross Smith writes:
> Thanks, that got it working. I'm still only getting 10 MB/s, so it
> hasn't solved my problem - I've still got a bottleneck somewhere, but
> mbuffer is a huge improvement over standard zfs send / receive.
Roch schrieb:
> Thomas, for long-latency fat links, it should be quite beneficial to set
> the socket buffer on the receive side (instead of having users tune
> tcp_recv_hiwat).
>
> Throughput of a TCP connection is gated by
> "receive socket buffer / round trip time".
>
> Could that be Ross's problem?

Hmm, I'm not a TCP expert, but that sounds absolutely possible, if
Solaris 10 isn't tuning the TCP buffer automatically. The default
receive buffer seems to be 48k (at least on a V240 running 118833-33).
So if the block size is something like 128k, it would absolutely make
sense to tune the receive buffer to compensate for the round trip time...

Ross: Would you like a patch to test if this is the case? Which version
of mbuffer are you currently using?

- Thomas
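To put rough numbers on the relation Roch describes, and to show the system-wide Solaris TCP knobs that can be experimented with independently of any mbuffer patch; the values below are illustrative examples, not recommendations:

# ceiling ~= receive socket buffer / round trip time:
#   48 KB buffer, 1 ms RTT  ->  ~48 MB/s
#   48 KB buffer, 4 ms RTT  ->  ~12 MB/s

# inspect the current defaults
ndd -get /dev/tcp tcp_recv_hiwat
ndd -get /dev/tcp tcp_xmit_hiwat
ndd -get /dev/tcp tcp_max_buf

# example: raise the receive buffer to 1 MB (applies to new connections
# system-wide, and must stay <= tcp_max_buf)
ndd -set /dev/tcp tcp_recv_hiwat 1048576

Setting the buffer per socket from within the application, as Thomas proposes, is the cleaner approach since it avoids changing the default for every TCP connection on the box.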