Hi all,

The more reading and experimenting I do with ZFS, the more I like this stack of technologies. Since we all like to see real figures from real environments, I might as well share some of my numbers.

The replication was done with zfs send / zfs receive piped through mbuffer (http://www.maier-komor.de/mbuffer.html), during business hours, so this is a live environment and *not* a controlled test environment.

storageA
OpenSolaris snv_133
2 quad-core AMD CPUs
28 GB RAM
Seagate Barracuda SATA drives, 1.5 TB, 7,200 rpm (ST31500341AS) - *non-enterprise class disks*
1 RAIDZ2 pool with 6 vdevs of 3 disks each, connected to an LSI non-RAID controller

storageB
OpenSolaris snv_134
2 Intel Xeon 2.0 GHz CPUs
8 GB RAM
Seagate Barracuda SATA drives, 1 TB, 7,200 rpm (ST31000640SS) - *enterprise class disks*
1 RAIDZ2 pool with 4 vdevs of 5 disks each, connected to an Adaptec RAID controller (52445, 512 MB cache) with read and write cache enabled. The Adaptec HBA presents 20 volumes, one volume per drive - something similar to a JBOD.

Both systems are connected to a gigabit switch (a 3Com) without VLANs, and jumbo frames are disabled.

And now the results:

Dataset: around 26.5 GB in files bigger than 256 KB and smaller than 1 MB
summary: 26.6 GByte in 6 min 20.6 sec - average of *71.7 MB/s*

Dataset: around 160 GB of data with both small files (less than 20 KB) and large ones (bigger than 10 MB)
summary: 164 GByte in 34 min 41.9 sec - average of *80.6 MB/s*

I don't know about you, but to me that looks like very, very good performance :), especially considering that together these two systems cost less than 12,000 EUR.

Does anyone else have numbers to share?

Bruno
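P.S. For anyone curious about the pipeline, it was roughly of the following form (a sketch only - the dataset names, port and buffer sizes shown here are placeholders, not the exact values used):

   # on the receiving side (storageB): listen on a TCP port and feed zfs receive
   mbuffer -I 9090 -s 128k -m 1G | zfs receive -F tank/replica

   # on the sending side (storageA): stream the snapshot into mbuffer over the wire
   zfs send tank/data@snap1 | mbuffer -s 128k -m 1G -O storageB:9090

The point of mbuffer is simply to keep both ends busy by smoothing out the bursty nature of zfs send, instead of funnelling everything through ssh.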
BTW, if you download the solaris drivers for the 52445 from adaptec, you can use jbod instead of simple volumes.
Thanks for the tip. By the way, is there any advantage to JBOD over simple volumes?

Bruno

On 25-3-2010 21:08, Richard Jahnel wrote:
> BTW, if you download the solaris drivers for the 52445 from adaptec,
> you can use jbod instead of simple volumes.
On 03/26/10 08:47 AM, Bruno Sousa wrote:
> storageA
> [...]
> 1 RAIDZ2 pool with 6 vdevs of 3 disks each, connected to an LSI
> non-RAID controller

As others have already said, raidz2 with 3 drives is Not A Good Idea!

> storageB
> [...]
> Dataset: around 26.5 GB in files bigger than 256 KB and smaller than 1 MB
> summary: 26.6 GByte in 6 min 20.6 sec - average of *71.7 MB/s*
>
> Dataset: around 160 GB of data with both small files (less than 20 KB)
> and large ones (bigger than 10 MB)
> summary: 164 GByte in 34 min 41.9 sec - average of *80.6 MB/s*

Those numbers look about right for a 1 Gig link. Try a tool such as bonnie++ to see what the block read and write numbers are for your pools; if they are significantly better than these, try an aggregated link between the systems.

--
Ian.
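P.S. Something along these lines would give you a baseline (a sketch - adjust the scratch directory, and make the -s size at least twice the RAM in the box so the ARC doesn't flatter the numbers):

   # sequential block read/write baseline on the pool under test
   bonnie++ -d /tank/bench -u root -s 64g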
Hi,

Indeed the 3 disks per vdev (raidz2) seems a bad idea... but it's the system I have now.

Regarding the performance: let's assume a bonnie++ benchmark could reach 200 MB/s. Would getting the same values (or close to them) from a zfs send / zfs receive then just be a matter of putting, let's say, a 10 GbE card between both systems?

I have the impression that benchmarks are always synthetic, and that live/production environments behave quite differently. Again, it might be just me, but being able to replicate between two servers over a 1 Gb link at an average speed above 60 MB/s does seem quite good. However, like I said, I would like to see results from other people as well...

Thanks for the time.

Bruno

On 25-3-2010 21:52, Ian Collins wrote:
> On 03/26/10 08:47 AM, Bruno Sousa wrote:
> [...]
> Those numbers look about right for a 1 Gig link. Try a tool such as
> bonnie++ to see what the block read and write numbers are for your
> pools; if they are significantly better than these, try an
> aggregated link between the systems.
> --
> Ian.
On 03/26/10 10:00 AM, Bruno Sousa wrote:

[Boy top-posting sure mucks up threads!]

> Indeed the 3 disks per vdev (raidz2) seems a bad idea... but it's the
> system I have now.
> Regarding the performance: let's assume a bonnie++ benchmark could
> reach 200 MB/s. Would getting the same values (or close to them) from
> a zfs send / zfs receive then just be a matter of putting, let's say,
> a 10 GbE card between both systems?

Maybe, or a 2x1G LAG would be more cost effective (and easier to check!). The only way to know for sure is to measure. I managed to get slightly better transfers by enabling jumbo frames.

> I have the impression that benchmarks are always synthetic, and that
> live/production environments behave quite differently.

Very true, especially in the black arts of storage management!

> Again, it might be just me, but being able to replicate between two
> servers over a 1 Gb link at an average speed above 60 MB/s does seem
> quite good. However, like I said, I would like to see results from
> other people as well...

As I said, the results are typical for a 1G link. Don't forget you are measuring full copies; incremental replications may well be significantly slower.

--
Ian.
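P.S. In case it helps, on recent builds jumbo frames are just a link property (a sketch - the interface name and address are placeholders, the link has to be unplumbed before the MTU change takes effect, and the switch and the far end must accept the larger MTU too):

   ifconfig e1000g0 unplumb
   dladm set-linkprop -p mtu=9000 e1000g0
   ifconfig e1000g0 plumb 192.168.1.10/24 up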
On 25 mars 2010, at 22:00, Bruno Sousa <bsousa at epinfante.com> wrote:

> Indeed the 3 disks per vdev (raidz2) seems a bad idea... but it's the
> system I have now.
> [...]
> Again, it might be just me, but being able to replicate between two
> servers over a 1 Gb link at an average speed above 60 MB/s does seem
> quite good.

Don't forget to factor in your transport mechanism. If you're using ssh to pipe the send/recv data, your overall speed may end up being CPU bound, since I think ssh is single threaded; even on a multicore system you'll only be able to consume one core, and there raw clock speed will make the difference.

Cheers,

Erik
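P.S. For reference, the ssh form is roughly the sketch below (dataset and host names are placeholders). If you are stuck with ssh, a cheaper cipher - where the ssh builds on both ends still offer one - can take some pressure off that single core:

   # ssh transport: the whole stream funnels through one ssh process per side
   zfs send tank/data@snap1 | ssh storageB "zfs receive -F tank/replica"

   # same pipe with a lighter cipher, if both ends support it
   zfs send tank/data@snap1 | ssh -c arcfour storageB "zfs receive -F tank/replica"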
Hi,

I think that in this case the CPU is not the bottleneck, since I'm not using ssh. My 1 Gb network link, however, probably is.

Bruno

On 26-3-2010 9:25, Erik Ableson wrote:
> Don't forget to factor in your transport mechanism. If you're using
> ssh to pipe the send/recv data, your overall speed may end up being
> CPU bound [...]
>
> Cheers,
>
> Erik
Hi,

The jumbo frames in my case give me a boost of around 2 MB/s, so it's not that much. Now I will play with link aggregation and see how it goes. Of course I expect incremental replication to be slower, but since the amount of data will be much smaller it will probably still deliver good performance.

And what a relief to know that I'm not alone when I say that storage management is part science, part art and part "voodoo magic" ;)

Cheers,
Bruno

On 25-3-2010 23:22, Ian Collins wrote:
> Maybe, or a 2x1G LAG would be more cost effective (and easier to
> check!). The only way to know for sure is to measure. I managed to
> get slightly better transfers by enabling jumbo frames.
> [...]
> As I said, the results are typical for a 1G link. Don't forget you
> are measuring full copies; incremental replications may well be
> significantly slower.
> --
> Ian.
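P.S. For the aggregation test I plan to do something along these lines (a sketch only - interface names and the address are placeholders, the exact dladm syntax varies a bit between builds, and the switch ports have to be configured for LACP as well):

   # bundle two gigabit links into one LACP aggregation
   dladm create-aggr -L active -l e1000g0 -l e1000g1 aggr1
   ifconfig aggr1 plumb 192.168.1.20/24 up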
On Mar 26, 2010, at 2:34 AM, Bruno Sousa wrote:

> The jumbo frames in my case give me a boost of around 2 MB/s, so it's
> not that much.

That is about right. IIRC, the theoretical max is about a 4% improvement for an MTU of 8KB.

> Now I will play with link aggregation and see how it goes [...]

Probably won't help at all, because of the brain dead way link aggregation has to work. See "Ordering of frames" at
http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol#Link_Aggregation_Control_Protocol

If you see the workload on the wire go through regular patterns of fast/slow response, then there are some additional tricks that can be applied to increase the overall throughput and smooth the jaggies. But that is fodder for another post...

You can measure this with iostat using samples < 15 seconds, or with tcpstat. tcpstat is a handy DTrace script often located at /opt/DTT/Bin/tcpstat.d
-- richard
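P.S. For example (assuming the DTraceToolkit is installed in the usual place - adjust the path if yours lives elsewhere):

   # short samples on the pool side
   iostat -xnz 5

   # and on the wire (needs DTrace privileges)
   /opt/DTT/Bin/tcpstat.d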
Ian Collins
2010-Mar-26 22:55 UTC
[zfs-discuss] Re: zfs send/receive - actual performance
On 03/27/10 09:39 AM, Richard Elling wrote:
> On Mar 26, 2010, at 2:34 AM, Bruno Sousa wrote:
>> Now I will play with link aggregation and see how it goes [...]
>
> Probably won't help at all, because of the brain dead way link
> aggregation has to work. See "Ordering of frames" at
> http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol#Link_Aggregation_Control_Protocol

Arse, thanks for reminding me Richard! A single stream will only use one path in a LAG.

--
Ian.
Svein Skogen
2010-Mar-27 07:14 UTC
[zfs-discuss] Re: zfs send/receive - actual performance
On 26.03.2010 23:55, Ian Collins wrote:
> On 03/27/10 09:39 AM, Richard Elling wrote:
>> Probably won't help at all, because of the brain dead way link
>> aggregation has to work. See "Ordering of frames" at
>> http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol#Link_Aggregation_Control_Protocol
>
> Arse, thanks for reminding me Richard! A single stream will only use
> one path in a LAG.

Doesn't (Open)Solaris have the option of setting the aggregate up as a FEC or in round-robin mode?

//Svein
Ian Collins
2010-Mar-27 07:42 UTC
[zfs-discuss] Re: zfs send/receive - actual performance
On 03/27/10 08:14 PM, Svein Skogen wrote:
> Doesn't (Open)Solaris have the option of setting the aggregate up as
> a FEC or in round-robin mode?

Not if a switch is used: data between a pair of endpoints (IP addresses, or an address/port combination) uses one physical link. The same most likely applies to direct connections, given the IP stack doesn't know how the ports are connected. I'd be interested to know how they do this if I'm wrong!

--
Ian.
Kyle McDonald
2010-Mar-31 19:34 UTC
[zfs-discuss] Re: zfs send/receive - actual performance
On 3/27/2010 3:14 AM, Svein Skogen wrote:
> Doesn't (Open)Solaris have the option of setting the aggregate up as
> a FEC or in round-robin mode?

Solaris does offer what the Wiki describes as "L4", or port-number based, hashing. I'm not sure what FEC is, but when I asked, round-robin wasn't available because preserving packet ordering wouldn't be easy (possible?) that way.

-Kyle
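P.S. If memory serves, the hashing policy is set on the aggregation itself, something like the sketch below (the aggregation name is a placeholder for whatever you created):

   # hash on L4 (source/destination port) instead of MAC addresses, so multiple
   # TCP connections between the same two hosts can spread across the links
   dladm modify-aggr -P L4 aggr1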
"If you see the workload on the wire go through regular patterns of fast/slow response then there are some additional tricks that can be applied to increase the overall throughput and smooth the jaggies. But that is fodder for another post..." Can you pls. elaborate on what can be done here as I am seeing this. -- This message posted from opensolaris.org
On Apr 1, 2010, at 12:43 AM, tomwaters wrote:
> "If you see the workload on the wire go through regular patterns of
> fast/slow response then there are some additional tricks that can be
> applied to increase the overall throughput and smooth the jaggies.
> But that is fodder for another post..."
>
> Can you pls. elaborate on what can be done here as I am seeing this.

There are many things that can be done, but no silver bullet.
-- richard