Marc Bevand
2008-Mar-13 08:50 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
I figured the following ZFS 'success story' may interest some readers here.

I was interested to see how much sequential read/write performance could be obtained from ZFS running on commodity hardware with modern features such as PCI-E busses, SATA disks, and well-designed SATA controllers (AHCI, SiI3132/SiI3124). So I ran an experiment: build a fileserver by picking each component to be as cheap as possible while not sacrificing too much performance. I ended up spending $270 on the server itself and $1050 on seven 750GB SATA disks.

After installing snv_82, a 7-disk raidz pool on this $1320 box is capable of:
- 220-250 MByte/s sequential write throughput (dd if=/dev/zero of=file bs=1024k)
- 430-440 MByte/s sequential read throughput (dd if=file of=/dev/null bs=1024k)

I did a quick test with a 7-disk striped pool too:
- 330-390 MByte/s seq. writes
- 560-570 MByte/s seq. reads (what's really interesting here is that the bottleneck is the platter speed of one of the disks, 81 MB/s: 81*7=567, so ZFS truly "runs at platter speed", as advertised -- wow)

I used disks with 250GB platters (Samsung HD753LJ; Samsung has even higher-density 640GB and 1TB models with 334GB platters, but they are respectively impossible to find or too expensive). I put 4 disks on the motherboard's integrated AHCI controller (SB600 chipset), 2 disks on a 2-port $20 PCI-E 1x SiI3132 controller, and the 7th disk on a $65 4-port PCI-X SiI3124 controller that I scavenged from another server (it's in a PCI slot -- what a waste for a PCI-X card...). The rest is also dirt cheap: a $65 Asus M2A-VM motherboard, a $60 dual-core Athlon 64 X2 4000+, 1GB of DDR2 800, and a 400W PSU.

When testing the read throughput of individual disks with dd (roughly 81 to 97 MB/s at the beginning of the platter -- I don't know why it varies so much between different units of the same model; additional seeks caused by reallocated sectors, perhaps), I found that an important factor influencing the maximum bandwidth of a PCI Express device such as the SiI3132 is the Max_Payload_Size setting. It can be set from 128 to 4096 bytes by writing to bits 7:5 of the Device Control Register (offset 08h) in the PCI Express Capability Structure (which starts at offset 70h on the SiI3132):

$ /usr/X11/bin/pcitweak -r 2:0 -h 0x78   # read the register
0x2007

Bits 7:5 of 0x2007 are 000, which indicates a 128-byte max payload size (000=128B, 001=256B, ..., 101=4096B, 110=reserved, 111=reserved). All OSes and drivers seem to leave it at this default value of 128 bytes. However, in my tests this payload size only allowed a practical unidirectional bandwidth of about 147 MB/s (59% of the 250 MB/s theoretical peak of PCI-E 1x). I changed it to 256 bytes:

$ /usr/X11/bin/pcitweak -w 2:0 -h 0x78 0x2027

This increased the bandwidth to 175 MB/s. Better. At 512 bytes or above something strange happens: the bandwidth drops ridiculously, to below 5 or 50 MB/s depending on which PCI-E slot I use... Weird, I have no idea why. Anyway, 175 MB/s or even 145 MB/s is good enough for this 2-port SATA controller, because the I/O bandwidth consumed by ZFS in my case never exceeds 62-63 MB/s per disk.

I wanted to share this Max_Payload_Size tidbit here because I didn't find any mention on the Net of anybody manually tuning this parameter. So if some of you wonder why PCI-E devices seem limited to 60% of their theoretical peak bandwidth, now you know why. Speaking of other bottlenecks, my SiI3124 caps out at 87 MB/s per SATA port.
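A minimal sketch (Python, purely illustrative -- the helper names are mine and nothing here touches real hardware) of the bit manipulation described above, using only the encoding already given: bits 7:5 of DevCtl hold Max_Payload_Size as 128 << n bytes.

def decode_max_payload(devctl):
    """Max_Payload_Size in bytes encoded in a DevCtl register value."""
    field = (devctl >> 5) & 0x7            # bits 7:5 of the Device Control Register
    if field > 5:
        raise ValueError("110b and 111b are reserved encodings")
    return 128 << field                    # 000b=128B, 001b=256B, ..., 101b=4096B

def encode_max_payload(devctl, size):
    """Return a new DevCtl value with Max_Payload_Size set to `size` bytes."""
    field = (size // 128).bit_length() - 1     # 128 -> 0, 256 -> 1, ..., 4096 -> 5
    if size < 128 or 128 << field != size or field > 5:
        raise ValueError("size must be 128, 256, 512, 1024, 2048 or 4096")
    return (devctl & ~(0x7 << 5)) | (field << 5)

print(decode_max_payload(0x2007))            # 128 -- the default value pcitweak read back
print(hex(encode_max_payload(0x2007, 256)))  # 0x2027 -- the value written back above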
Back on the main topic, here are some system stats during 430-440 MB/s sequential reads from the ZFS raidz pool with dd (c0 is the AHCI controller, c1 = SiI3124, c2 = SiI3132).

"zpool iostat -v 2"

                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
tank          2.54T  2.17T  3.38K      0   433M      0
  raidz1      2.54T  2.17T  3.38K      0   433M      0
    c0t0d0s7      -      -  1.02K      0  61.9M      0
    c0t1d0s7      -      -  1.02K      0  61.9M      0
    c0t2d0s7      -      -  1.02K      0  62.0M      0
    c0t3d0s7      -      -  1.02K      0  62.0M      0
    c1t0d0s7      -      -  1.01K      0  61.9M      0
    c2t0d0s7      -      -  1.02K      0  62.0M      0
    c2t1d0s7      -      -  1.02K      0  61.9M      0
------------  -----  -----  -----  -----  -----  -----

"iostat -Mnx 2"

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
 1044.8    0.5   61.7    0.0   0.1  14.3     0.1    13.6   4  81  c0t0d0
 1043.3    0.0   61.7    0.0   0.1  15.4     0.1    14.7   5  84  c0t1d0
 1043.3    0.0   61.7    0.0   0.1  14.7     0.1    14.1   5  82  c0t2d0
 1044.8    0.0   61.8    0.0   0.1  13.0     0.1    12.5   4  76  c0t3d0
 1042.3    0.0   61.7    0.0  13.9   0.8    13.3     0.8  83  83  c1t0d0
 1041.8    0.0   61.7    0.0  11.5   0.7    11.1     0.7  73  73  c2t0d0
 1041.8    0.0   61.7    0.0  12.8   0.8    12.3     0.8  79  79  c2t1d0

(actv is less than 1 on c1 & c2 because the si3124 driver does not support NCQ. I wonder what's preventing ZFS from keeping the disks busy more than ~80% of the time. Maybe the CPU, see below.)

"mpstat 2"

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  7275 7075  691   99  146  354    0   321    0  89   0  10
  1    0   0    0   221    2  844   43  150  244    0   241    0  88   0  11

(The dual-core CPU is almost(?) a bottleneck at ~90% utilization on both cores. For this reason I doubt such a read throughput could have been reached pre-snv_79, because checksum verification/calculation only became multithreaded in snv_79 and later.)

-marc
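As a quick cross-check of the numbers above (all values copied from the post, nothing re-measured): the per-disk rates reported by zpool iostat sum to the pool total, and the striped-pool reads sit right at the slowest platter speed times seven.

per_disk_mb_s = [61.9, 61.9, 62.0, 62.0, 61.9, 62.0, 61.9]   # from "zpool iostat -v 2"
print(sum(per_disk_mb_s))        # ~433.6 MB/s, matching the 433M pool read bandwidth

slowest_platter_mb_s = 81        # the slowest of the seven drives
print(7 * slowest_platter_mb_s)  # 567 MB/s, matching the 560-570 MB/s striped-pool reads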
Anton B. Rang
2008-Mar-14 05:54 UTC
[zfs-discuss] Max_Payload_Size (was Re: 7-disk raidz achieves 430 MB/s reads and...)
Be careful when changing the Max_Payload_Size parameter. It needs to match, and be supported, between all PCI-E components which might communicate with each other. You can tell what values are supported by reading the Device Capabilities Register and checking the Max_Payload_Size Supported bits.

If you set a size which is too large, you might see PCI-E errors, data corruption, or hangs. (You might also start to see latency effects in audio/video applications.)

The operating system is supposed to set this register properly for you. A quick glance at the OpenSolaris code suggests that, while PCIE_DEVCAP_MAX_PAYLOAD_MASK is defined in pcie.h, it's not actually referenced yet, and in fact PCIE_DEVCAP seems to be used only for debugging. File a bug?
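A small sketch of the check Anton describes: in the standard PCI Express capability layout, the Device Capabilities Register sits at offset 04h (0x74 on the SiI3132, four bytes below the DevCtl register Marc poked), and its bits 2:0 encode Max_Payload_Size Supported with the same 128 << n scheme as DevCtl. The example value below is an assumption chosen to match the 1024-byte limit Marc reports in his follow-up, not a dump from real hardware.

def max_payload_supported(devcap):
    """Largest payload size (bytes) a device advertises in its DevCap register."""
    return 128 << (devcap & 0x7)       # bits 2:0: 000b=128B, ..., 101b=4096B

print(max_payload_supported(0x3))      # hypothetical DevCap with bits 2:0 = 011b -> 1024 bytes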
Marc Bevand
2008-Mar-14 09:58 UTC
[zfs-discuss] Max_Payload_Size (was Re: 7-disk raidz achieves 430 MB/s reads and...)
Anton B. Rang <rang <at> acm.org> writes:
> Be careful when changing the Max_Payload_Size parameter. It needs to match,
> and be supported, between all PCI-E components which might communicate with
> each other. You can tell what values are supported by reading the Device
> Capabilities Register and checking the Max_Payload_Size Supported bits.

Yes, the DevCap register of the SiI3132 indicates that the maximum supported payload size is 1024 bytes. This is confirmed by its datasheet. However, I compiled lspci for Solaris, and running it with -vv shows only 2 PCI-E devices (other than the SiI3132 and an Ethernet controller), which represent the AMD690G chipset's root PCI-E ports for my 2 PCI-E slots (I think):

00:06.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express Port 2) (prog-if 00 [Normal decode])
[...]
        Capabilities: [58] Express (v1) Root Port (Slot-), MSI 00

00:07.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express Port 3) (prog-if 00 [Normal decode])
[...]
        Capabilities: [58] Express (v1) Root Port (Slot-), MSI 00

But each shows a Max_Payload_Size of 128 bytes in both the DevCap and DevCtl registers. Clearly they are accepting 256-byte payloads, else I wouldn't notice the big perf improvement when reading data from the disks. Could it be possible that (1) an erratum in the AMD690G makes its DevCap register incorrectly report Max_Payload_Size=128 even though it supports larger ones, and that (2) the AMD690G implements PCI-E leniently and always accepts large payloads even when DevCtl defines Max_Payload_Size=128 and it is not supposed to?

> If you set a size which is too large, you might see PCI-E errors, data
> corruption, or hangs.

Ouch!

> The operating system is supposed to set this register properly for you.
> A quick glance at the OpenSolaris code suggests that, while
> PCIE_DEVCAP_MAX_PAYLOAD_MASK is defined in pcie.h, it's not actually
> referenced yet, and in fact PCIE_DEVCAP seems to be used only for debugging.

I came to the same conclusion as you after grepping through the code.

-marc
> But each shows a Max_Payload_Size of 128 bytes in both the DevCap and
> DevCtl registers. Clearly they are accepting 256-byte payloads, else I
> wouldn't notice the big perf improvement when reading data from the
> disks.

Right -- you'd see errors instead.

> Could it be possible that (1) an erratum in the AMD690G makes its
> DevCap register incorrectly report Max_Payload_Size=128 even though it
> supports larger ones, and that (2) the AMD690G implements PCI-E
> leniently and always accepts large payloads even when DevCtl defines
> Max_Payload_Size=128 and it is not supposed to?

Looking at the AMD 690 series manual (well, the family register guide), the max payload size value is deliberately set to 0 to indicate that the chip only supports 128-byte transfers. There is a bit in another register which can be set to ignore max-payload errors. Perhaps that's being set?

I can guess as to why it might accept a 256-byte transfer even when that's not truly supported. Perhaps there is a 512-byte buffer on the chip, for instance, between the PCI-E interface and the rest of the chip. If there is a flag to indicate when at least 128 bytes are available, then the chip would allow an incoming transfer whenever there's room for those 128 bytes. If the memory subsystem is not under contention, data would quickly flow out and make room for more. If the system were under heavy load, the internal buffer might wind up with, say, 384 bytes of data waiting to be stored into memory -- and if the chip allowed a new transfer to start (since there were 128 bytes free in the buffer) and received 256 bytes of data, it would drop some, or hang.

> I came to the same conclusion as you after grepping through the code.

Someone should add proper PCI-E payload size support. ;-)
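To make the hypothetical failure mode concrete (every number below is taken from Anton's guess above, none from chip documentation), here is a toy sketch of a 512-byte buffer that admits a new transfer whenever at least 128 bytes are free, while the drain side is stalled:

buffer_size = 512          # hypothetical on-chip buffer
buffer_used = 384          # bytes already waiting for the memory subsystem
payload = 256              # what the SiI3132 sends after the DevCtl change

if buffer_size - buffer_used >= 128:   # chip only checks for 128 free bytes
    buffer_used += payload             # ...so it accepts the 256-byte transfer anyway

print(buffer_used, "bytes in a", buffer_size, "byte buffer ->",
      "overflow" if buffer_used > buffer_size else "ok")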
Anton B. Rang <rang <at> acm.org> writes:
> Looking at the AMD 690 series manual (well, the family
> register guide), the max payload size value is deliberately
> set to 0 to indicate that the chip only supports 128-byte
> transfers. There is a bit in another register which can be
> set to ignore max-payload errors. Perhaps that's being set?

Perhaps. I briefly tried looking for an AMD 690 series manual or datasheet, but they don't seem to be available to the public.

I think I'll go back to the 128-byte setting. I wouldn't want to see errors happening under heavy usage, even though my stress tests were all successful (aggregate data rate of 610 MB/s generated by reading the disks for 24+ hours, 6 million head seeks performed by each disk, etc.).

Thanks for your much appreciated comments.

-- Marc Bevand
Brandon High
2008-Mar-17 21:02 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
On Thu, Mar 13, 2008 at 1:50 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> integrated AHCI controller (SB600 chipset), 2 disks on a 2-port $20 PCI-E 1x
> SiI3132 controller, and the 7th disk on a $65 4-port PCI-X SiI3124 controller

Do you have access to a SiI3726 port multiplier? I'd like to see how the SiI3132 performs with multiple drives attached via one, since if/when I build a ZFS-based NAS, I would like to use one. It's also easier to use an external disk box like the CFI 8-drive eSATA tower than to find a reasonable server case that can hold that many drives.
Tim
2008-Mar-17 21:09 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
On 3/17/08, Brandon High <bhigh at freaks.com> wrote:
> On Thu, Mar 13, 2008 at 1:50 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> > integrated AHCI controller (SB600 chipset), 2 disks on a 2-port $20 PCI-E 1x
> > SiI3132 controller, and the 7th disk on a $65 4-port PCI-X SiI3124 controller
>
> Do you have access to a SiI3726 port multiplier? I'd like to see how
> the SiI3132 performs with multiple drives attached via one, since
> if/when I build a ZFS-based NAS, I would like to use one. It's also
> easier to use an external disk box like the CFI 8-drive eSATA tower
> than to find a reasonable server case that can hold that many drives.

Whoa, why would you spend $1600 on that tower? For a LOT less money you can have more drives, all hot-swappable:

http://www.supermicro.com/products/chassis/3U/933/SC933T-R760.cfm

I paired one of those with a pair of the Supermicro 8-port SATA cards. Works like a charm.

--Tim
Brandon High
2008-Mar-17 21:58 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
On Mon, Mar 17, 2008 at 2:09 PM, Tim <tim at tcsac.net> wrote:
> On 3/17/08, Brandon High <bhigh at freaks.com> wrote:
> > easier to use an external disk box like the CFI 8-drive eSATA tower
> > than to find a reasonable server case that can hold that many drives.
>
> Whoa, why would you spend $1600 on that tower? For a LOT less money you
> can have more drives, all hot-swappable.

I'm not sure where you got that price. Newegg has the CFI-B8283 for about $350. The case you mentioned is about $750.

I also don't have a rack at home, so a pedestal or tower case is required.

-B
Marc Bevand
2008-Mar-17 22:10 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
Brandon High <bhigh <at> freaks.com> writes:
> Do you have access to a SiI3726 port multiplier?

Nope. But AFAIK OpenSolaris doesn't support port multipliers yet. Maybe FreeBSD does. Keep in mind that three modern drives (334GB/platter) are all it takes to saturate a SATA 3.0Gbps link (rough arithmetic below).

> It's also easier to use an external disk box like the CFI 8-drive eSATA tower
> than to find a reasonable server case that can hold that many drives.

If you are willing to go cheap you can get something that holds 8 drives for $70: buy a standard tower case with five internal 3.5" bays ($50), plus one of those enclosures that fit in two 5.25" bays but give you three 3.5" bays ($20).

-marc
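A rough back-of-the-envelope for the "three drives saturate a SATA 3.0Gbps link" remark above, assuming 8b/10b line coding on the link and roughly 100 MB/s sequential throughput per 334GB-platter drive (both are round assumed figures, not measurements):

link_bits_per_s = 3.0e9
usable_bytes_per_s = link_bits_per_s * 8 / 10 / 8    # 8b/10b coding, then bits -> bytes
print(usable_bytes_per_s / 1e6)                      # ~300 MB/s usable behind one link

per_drive_mb_s = 100                                 # assumed sequential rate per drive
print((usable_bytes_per_s / 1e6) / per_drive_mb_s)   # ~3 drives are enough to fill it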
Peter Schuller
2008-Mar-19 23:01 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
> If you are willing to go cheap you can get something that holds 8 drives
> for $70: buy a standard tower case with five internal 3.5" bays ($50), plus
> one of those enclosures that fit in two 5.25" bays but give you three 3.5"
> bays ($20).

I have one of these:

http://www.gtek.se/index.php?mode=item&id=2454

That 2798 SEK price is about $450, and you can fit up to 30 3.5" drives in the extreme case. This is true even when using the Supermicro SATA hot-swap bays that fit 5 drives in 3x5.25"; you can fit 6 of these in total, meaning 30 drives. Of course the cost of the bays comes on top (roughly $60-$70, I believe, for the 5-bay Supermicro; the Lian Li stuff is cheaper, but not hot-swap and such).

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com
Web: http://www.scode.org
On Sat, Mar 15, 2008 at 2:06 PM, Marc Bevand <m.bevand at gmail.com> wrote:
> I think I'll go back to the 128-byte setting. I wouldn't want to
> see errors happening under heavy usage, even though my stress
> tests were all successful (aggregate data rate of 610 MB/s
> generated by reading the disks for 24+ hours, 6 million head
> seeks performed by each disk, etc.).

A co-worker is putting together a system to host a ZFS NAS, and is using the same motherboard that you are. He had the board's manual with him, and I noticed that there are BIOS settings for the PCIe max payload size. The default value is 4096 bytes.

This doesn't explain why increasing max_payload_size over 512 causes the throughput to drop, but at least you can safely run the card with a payload greater than 128.

-B

--
Brandon High
bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
On Fri, Apr 4, 2008 at 10:53 PM, Marc Bevand <m.bevand at gmail.com> wrote:
> > with him, and I noticed that there are BIOS settings for the PCIe max
> > payload size. The default value is 4096 bytes.
>
> I noticed. But it looks like this setting has no effect on anything whatsoever.

My guess is that the hardware supports the large payload, but that the BIOS isn't representing it properly.

I was looking at some other PCIe chipset specs and came across some documents commenting that many early devices didn't support a payload over 256 bytes. So the problem could easily be that the SiI3132 chip is causing the hiccup on a larger payload, not the RS690 PCIe controller. Of course, without more detailed specs on either component this is pure conjecture, but it seems to match the behavior you observed.

-B

--
Brandon High
bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
mike
2008-Sep-15 03:36 UTC
[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
As of 4/14, I had written this after talking to some folks who maintain drivers for various OSes:

http://michaelshadle.com/2008/04/14/the-current-state-of-esata-port-multipliers/

FreeBSD probably wasn't any further along than OSOL at that time. As of right now, I am not sure; I haven't paid attention. I'd prefer to run ZFS on native Solaris -- I'm too lazy to deal with any possible issues running on a ZFS port (even if it is really good :))