Hi,

We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
Our hardware specifications are as follows:

Quad AMD G34 12-core 2.3 GHz (~110 GHz)
10 Crucial RealSSD (6Gb/s)
42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
LSI2008SAS (two 4x ports)
Mellanox InfiniBand 40 Gbit NICs
128 GB RAM

This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.

Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.

DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).

Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and don'ts, and any other information that could be of help while building and configuring this?

What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.

Let's talk moon landings.

Regards,
Arve
-- 
This message posted from opensolaris.org
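(A quick back-of-envelope check of the 640GB DDT figure above - a sketch only, assuming roughly 256 bytes per dedup-table entry; replies later in this thread quote anywhere from 160 to 270 bytes per entry, so treat the result as an estimate.)

    POOL_CAPACITY = 40 * 2**40      # ~40 TiB of usable pool space
    BLOCK_SIZE    = 16 * 2**10      # 16 KiB dedup block size
    DDT_ENTRY     = 256             # assumed bytes per DDT entry

    entries  = POOL_CAPACITY // BLOCK_SIZE      # unique blocks if the pool is full
    ddt_size = entries * DDT_ENTRY

    print(f"{entries:,} entries -> DDT of ~{ddt_size / 2**30:.0f} GiB")
    # -> 2,684,354,560 entries -> DDT of ~640 GiB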
On 6/15/2010 4:42 AM, Arve Paalsrud wrote:
> Hi,
>
> We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
> Our hardware specifications are as follows:
>
> Quad AMD G34 12-core 2.3 GHz (~110 GHz)
> 10 Crucial RealSSD (6Gb/s)
> 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
> LSI2008SAS (two 4x ports)
> Mellanox InfiniBand 40 Gbit NICs
> 128 GB RAM
>
> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
>
> Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.
>
> DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).
>
> Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and don'ts, and any other information that could be of help while building and configuring this?
>
> What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.
>
> Let's talk moon landings.
>
> Regards,
> Arve

Given that for the ZIL, random write IOPS is paramount, the RealSSD isn't a good choice. SLC SSDs still spank any MLC device, and random IOPS for something like an Intel X25-E or OCZ Vertex EX are over twice that of the RealSSD. I don't know where they got the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they get that?), but that's not what others are reporting:

http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7

Sadly, none of the current crop of SSDs support a capacitor or battery to back up their local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series are likely much more cost-effective than a RealSSD, especially considering price/performance.

Also, given the limitations of an x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. The SSDs together are likely to be able to overwhelm an x4 PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for them. For the 42 disks, it depends more on what your workload looks like. If it is mostly small or random I/O to the disks, you can get away with fewer HBAs. Large, sequential I/O to the disks is going to require more HBAs. Remember, a modern 7200RPM SATA drive can pump out well over 100MB/s sequential, but well under 10MB/s random. Do the math to see how fast that will overwhelm an x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.

I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so. Really, I'd suspect that paying for 6Gb/s SAS isn't worth it at all, as really only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.

I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed per L2ARC entry.

I.e.:

if you have a 1k average record size, for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM.

if you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
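(To make the header arithmetic above easy to rerun for other record sizes, here is a minimal sketch using the same assumed 200 bytes of main memory per L2ARC entry; that per-entry figure is the one quoted in this thread, not a measured constant.)

    L2ARC_HEADER = 200                       # assumed bytes of ARC (RAM) per L2ARC entry

    def ram_for_l2arc(l2arc_bytes, record_size):
        """GiB of main memory consumed by headers for a full L2ARC."""
        return (l2arc_bytes // record_size) * L2ARC_HEADER / 2**30

    l2arc = 600 * 2**30                      # the 600 GB example above
    for recsize in (1 * 2**10, 8 * 2**10, 128 * 2**10):
        print(f"{recsize // 2**10:>4} KiB records -> {ram_for_l2arc(l2arc, recsize):6.1f} GiB RAM")
    # ->    1 KiB records ->  117.2 GiB RAM   (the ~120 GB case above)
    #       8 KiB records ->   14.6 GiB RAM   (the ~15 GB case above)
    #     128 KiB records ->    0.9 GiB RAM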
On 15/06/2010 14:09, Erik Trimble wrote:
> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT,

The point of L2ARC is that you start adding L2ARC when you can no longer physically put in (or afford to add) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.

-- 
Darren J Moffat
On 6/15/2010 6:17 AM, Darren J Moffat wrote:
> On 15/06/2010 14:09, Erik Trimble wrote:
>> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT,
>
> The point of L2ARC is that you start adding L2ARC when you can no longer physically put in (or afford to add) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.

True. I was speaking price/performance. Those 8GB DIMMs are still pretty darned pricey...

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed per L2ARC entry.
>
> I.e.:
>
> if you have a 1k average record size, for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM.
>
> if you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

Now I'm confused. First thing I heard was that about 160 bytes were needed per DDT entry. Later, someone else told me 270. Then you say 200. Also, there should be a good way to list out the total number of blocks (zdb just crashed with a full memory on my 10TB test box). I tried browsing the source to see the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?

Vennlige hilsener / Best regards

roy
-- 
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
On 6/15/2010 6:40 AM, Roy Sigurd Karlsbakk wrote:
>> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed per L2ARC entry.
>>
>> I.e.:
>>
>> if you have a 1k average record size, for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM.
>>
>> if you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.
>
> Now I'm confused. First thing I heard was that about 160 bytes were needed per DDT entry. Later, someone else told me 270. Then you say 200. Also, there should be a good way to list out the total number of blocks (zdb just crashed with a full memory on my 10TB test box). I tried browsing the source to see the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?
>
> Vennlige hilsener / Best regards
>
> roy

A DDT entry takes up about 250 bytes, regardless of where it is stored. For every "normal" (i.e. block, metadata, etc. - NOT DDT) L2ARC entry, about 200 bytes has to be stored in main memory (ARC).

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Jun 15, 2010, at 6:40 AM, Roy Sigurd Karlsbakk wrote:
>> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed per L2ARC entry.
>>
>> I.e.:
>>
>> if you have a 1k average record size, for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM.
>>
>> if you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.
>
> Now I'm confused. First thing I heard was that about 160 bytes were needed per DDT entry. Later, someone else told me 270. Then you say 200. Also, there should be a good way to list out the total number of blocks (zdb just crashed with a full memory on my 10TB test box). I tried browsing the source to see the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?

Why read source when you can read the output of "zdb -D"? :-)
 -- richard

-- 
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
On Jun 15, 2010, at 4:42 AM, Arve Paalsrud wrote:
> Hi,
>
> We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
> Our hardware specifications are as follows:
>
> Quad AMD G34 12-core 2.3 GHz (~110 GHz)
> 10 Crucial RealSSD (6Gb/s)
> 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
> LSI2008SAS (two 4x ports)
> Mellanox InfiniBand 40 Gbit NICs
> 128 GB RAM
>
> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
>
> Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.
>
> DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).
>
> Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and don'ts, and any other information that could be of help while building and configuring this?
>
> What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.

In general, both dedup and compression gain space by trading off performance. You should take a closer look at snapshots + clones, because they gain performance by trading off systems management.

You can't size by ESX server, because ESX works (mostly) as a pass-through of the client VM workload. In your sizing calculations, think of ESX as a fancy network switch.
 -- richard

-- 
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
On Tue, 2010-06-15 at 04:42 -0700, Arve Paalsrud wrote:
> Hi,
>
> We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
> Our hardware specifications are as follows:
>
> Quad AMD G34 12-core 2.3 GHz (~110 GHz)
> 10 Crucial RealSSD (6Gb/s)
> 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
> LSI2008SAS (two 4x ports)
> Mellanox InfiniBand 40 Gbit NICs

Just recognize that those "NICs" are IB only. Solaris currently does not support 10GbE using Mellanox products, even though other operating systems do. (There are folks working on resolving this, but I think we're still a couple of months from seeing the results of that effort.)

> 128 GB RAM
>
> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
>
> Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.

I expect that you need more space for L2ARC and a lot less for the ZIL. Furthermore, you'd be better served by an even lower-latency/higher-IOPS ZIL. If you're going to spend this kind of cash, I think I'd recommend at least one or two DDRdrive X1 units or something similar. While not very big, you don't need much to get a huge benefit from the ZIL, and I think the vastly superior IOPS of these units will pay off in the end.

> DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).

Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.

 -- Garrett
On Tue, Jun 15, 2010 at 3:09 PM, Erik Trimble <erik.trimble at oracle.com> wrote:

> On 6/15/2010 4:42 AM, Arve Paalsrud wrote:
>> Hi,
>>
>> We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
>> Our hardware specifications are as follows:
>>
>> Quad AMD G34 12-core 2.3 GHz (~110 GHz)
>> 10 Crucial RealSSD (6Gb/s)
>> 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
>> LSI2008SAS (two 4x ports)
>> Mellanox InfiniBand 40 Gbit NICs
>> 128 GB RAM
>>
>> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
>>
>> Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.
>>
>> DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).
>>
>> Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and don'ts, and any other information that could be of help while building and configuring this?
>>
>> What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.
>>
>> Let's talk moon landings.
>>
>> Regards,
>> Arve
>
> Given that for the ZIL, random write IOPS is paramount, the RealSSD isn't a good choice. SLC SSDs still spank any MLC device, and random IOPS for something like an Intel X25-E or OCZ Vertex EX are over twice that of the RealSSD. I don't know where they got the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they get that?), but that's not what others are reporting:
>
> http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7

See http://www.anandtech.com/show/2944/3 and http://www.crucial.com/pdf/Datasheets-letter_C300_RealSSD_v2-5-10_online.pdf

But I agree that we should look into using the Vertex instead.

> Sadly, none of the current crop of SSDs support a capacitor or battery to back up their local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Noted.

> Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series are likely much more cost-effective than a RealSSD, especially considering price/performance.

Our other option is to use two Fusion-io ioDrive Duo SLC/MLC cards, or the SMLC when available (as well as drivers for Solaris) - so the price we're currently talking about is not an issue.

> Also, given the limitations of an x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. The SSDs together are likely to be able to overwhelm an x4 PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for them. For the 42 disks, it depends more on what your workload looks like. If it is mostly small or random I/O to the disks, you can get away with fewer HBAs. Large, sequential I/O to the disks is going to require more HBAs. Remember, a modern 7200RPM SATA drive can pump out well over 100MB/s sequential, but well under 10MB/s random. Do the math to see how fast that will overwhelm an x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.

We're talking about x4 SAS 6Gb/s lanes - about 2400MB/s per x4 wide port, 4800MB/s across the controller's two ports. See http://www.lsi.com/DistributionSystem/AssetDocument/SCG_LSISAS2008_PB_043009.pdf for the specifications of the LSI chip. In other words, it uses a PCIe 2.0 8x host interface.

> I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so. Really, I'd suspect that paying for 6Gb/s SAS isn't worth it at all, as really only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.

What about bandwidth in this scenario? Won't the ZIL be limited to the throughput of only one X25-E? The SATA disks operate at 3Gb/s through the SAS expanders, so no 6Gb/s there.

> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed per L2ARC entry.
>
> I.e.:
>
> if you have a 1k average record size, for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM.
>
> if you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

The box will store mostly large VM files, and DRAM will always be used when available, regardless of L2ARC size - it's a benefit to have more of it.

> -- 
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA

Regards,
Arve
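(A back-of-envelope bandwidth budget for the controller discussion above - a sketch only; the per-device rates of 100 MB/s per streaming disk and 250 MB/s per L2ARC SSD are assumptions for illustration, not measurements.)

    SAS_LANE = 600e6                  # 6 Gb/s SAS lane ~ 600 MB/s after 8b/10b encoding
    SAS_X4   = 4 * SAS_LANE           # one x4 wide port ~ 2.4 GB/s
    PCIE2_X8 = 8 * 500e6              # PCIe 2.0 x8 host link ~ 4 GB/s usable

    hdd_demand = 42 * 100e6           # all 42 disks streaming sequentially (assumed 100 MB/s each)
    ssd_demand = 10 * 250e6           # all 10 SSDs serving L2ARC reads (assumed 250 MB/s each)

    print(f"SAS ports: {2 * SAS_X4 / 1e9:.1f} GB/s total, host link: {PCIE2_X8 / 1e9:.1f} GB/s")
    print(f"worst-case demand: {hdd_demand / 1e9:.1f} GB/s (HDD) + {ssd_demand / 1e9:.1f} GB/s (SSD)")
    # -> SAS ports: 4.8 GB/s total, host link: 4.0 GB/s
    # -> worst-case demand: 4.2 GB/s (HDD) + 2.5 GB/s (SSD)

With roughly 6.7 GB/s of potential device demand against a ~4 GB/s host link, this is the arithmetic behind the earlier suggestion to split the SSDs onto their own HBA.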
On Tue, 2010-06-15 at 07:36 -0700, Richard Elling wrote:
>> What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.
>
> In general, both dedup and compression gain space by trading off performance. You should take a closer look at snapshots + clones, because they gain performance by trading off systems management.

It depends on the usage. Note that for some uses, compression can be a performance *win*, because CPUs are generally fast enough that the cost of decompression beats the cost of the larger I/Os required to transfer uncompressed data. Of course, that assumes you have CPU cycles to spare.

 -- Garrett
On Tue, Jun 15, 2010 at 3:33 PM, Erik Trimble <erik.trimble at oracle.com> wrote:

> On 6/15/2010 6:17 AM, Darren J Moffat wrote:
>> On 15/06/2010 14:09, Erik Trimble wrote:
>>> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT,
>>
>> The point of L2ARC is that you start adding L2ARC when you can no longer physically put in (or afford to add) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.
>
> True.
>
> I was speaking price/performance. Those 8GB DIMMs are still pretty darned pricey...
>
> -- 
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA

The motherboard has 32 DIMM slots, which makes using 32 4GB modules to reach 128GB quite affordable :)

-Arve
On Tue, Jun 15, 2010 at 4:20 PM, Erik Trimble <erik.trimble at oracle.com> wrote:

> On 6/15/2010 6:57 AM, Arve Paalsrud wrote:
>> On Tue, Jun 15, 2010 at 3:09 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
>>> I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so. Really, I'd suspect that paying for 6Gb/s SAS isn't worth it at all, as really only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.
>>
>> What about bandwidth in this scenario? Won't the ZIL be limited to the throughput of only one X25-E? The SATA disks operate at 3Gb/s through the SAS expanders, so no 6Gb/s there.
>
> Yes - though I'm not sure how the slog devices work when there is more than one. I *don't* think they work like the L2ARC devices, which work round-robin. You'd have to ask. If they're doing a true stripe, then I doubt you'll get much more performance, as weird as that sounds. Also, even with a single X25-E, you can service a huge number of IOPS - likely more small IOPS than can be pushed over even an InfiniBand interface. The place where the InfiniBand would certainly outpace the X25-E's capacity is for large writes, where a single 100MB write would suck up all the X25-E's throughput capability.

But the Intel X25-E is limited to about 200 MB/s write, regardless of IOPS. So when throwing a lot of 16k IOPS (about 13,000) at it, it will still be limited to 200 MB/s - or about 6-7% of the throughput capacity of a QDR InfiniBand link.

So I hereby officially ask: Can I have multiple slogs striped to handle higher bandwidth than a single device can - is that supported in ZFS?

> -- 
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA

- Arve
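(A quick sketch of the arithmetic behind the 6-7% figure above, using the rough numbers quoted in the thread: ~200 MB/s of write per X25-E and ~32 Gbit/s of usable data rate on a 40 Gbit/s QDR link after 8b/10b encoding - both assumptions, not measurements.)

    X25E_WRITE  = 200e6               # ~200 MB/s sequential write per X25-E (figure above)
    RECORD      = 16 * 1024           # 16 KiB synchronous writes
    QDR_IB_DATA = 3.2e9               # ~32 Gbit/s of data on a 40 Gbit/s QDR link

    print(f"one X25-E: ~{X25E_WRITE / RECORD:,.0f} x 16 KiB IOPS, "
          f"~{X25E_WRITE / QDR_IB_DATA:.0%} of a QDR link")
    print(f"slog devices needed to keep up with one link: ~{QDR_IB_DATA / X25E_WRITE:.0f}")
    # -> one X25-E: ~12,207 x 16 KiB IOPS, ~6% of a QDR link
    # -> slog devices needed to keep up with one link: ~16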
> -----Original Message-----
> From: Garrett D'Amore [mailto:garrett at nexenta.com]
> Sent: 15. juni 2010 17:43
> To: Arve Paalsrud
> Cc: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
>
> On Tue, 2010-06-15 at 04:42 -0700, Arve Paalsrud wrote:
> > Hi,
> >
> > We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
> > Our hardware specifications are as follows:
> >
> > Quad AMD G34 12-core 2.3 GHz (~110 GHz)
> > 10 Crucial RealSSD (6Gb/s)
> > 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
> > LSI2008SAS (two 4x ports)
> > Mellanox InfiniBand 40 Gbit NICs
>
> Just recognize that those "NICs" are IB only. Solaris currently does not support 10GbE using Mellanox products, even though other operating systems do. (There are folks working on resolving this, but I think we're still a couple of months from seeing the results of that effort.)
>
> > 128 GB RAM
> >
> > This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
> >
> > Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.
>
> I expect that you need more space for L2ARC and a lot less for the ZIL. Furthermore, you'd be better served by an even lower-latency/higher-IOPS ZIL. If you're going to spend this kind of cash, I think I'd recommend at least one or two DDRdrive X1 units or something similar. While not very big, you don't need much to get a huge benefit from the ZIL, and I think the vastly superior IOPS of these units will pay off in the end.

What about the ZIL bandwidth in this case? I mean, could I stripe across multiple devices to be able to handle higher throughput? Otherwise I would still be limited to the performance of the unit itself (155 MB/s).

> > DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).
>
> Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.
>
> -- Garrett

-Arve
On Tue, 2010-06-15 at 18:33 +0200, Arve Paalsrud wrote:
> What about the ZIL bandwidth in this case? I mean, could I stripe across multiple devices to be able to handle higher throughput? Otherwise I would still be limited to the performance of the unit itself (155 MB/s).

I think so.

Btw, I've gotten better performance than that with my driver (not sure about the production driver). I seem to recall about 220 MB/sec. (I was basically driving the PCIe x1 bus to its limit.) This was with large transfers (sized at 64k, IIRC). Shrinking the job size down, I could get up to 150K IOPS with 512-byte jobs. (This high IOP rate is unrealistic for ZFS -- for ZFS the bus bandwidth limitation comes into play long before you start hitting IOPS limitations.)

One issue, of course, is that each of these units occupies a PCIe x1 slot.

On another note, if your dataset and usage requirements don't require strict I/O flush/sync guarantees, you could probably get away without any ZIL at all, and just use lots of RAM to get really good performance. (You'd then disable the ZIL on filesystems that don't have this need. This is a very new feature in OpenSolaris.) Of course, you don't want to do this for data sets where loss of the data would be tragic. (But it's ideal for situations such as filesystems used for compiling, etc. -- where the data being written can be easily regenerated in the event of a failure.)

 -- Garrett

> > > DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).
> >
> > Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.
> >
> > -- Garrett
>
> -Arve
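(A quick check of the numbers quoted above - a sketch only, assuming a first-generation PCIe x1 slot with roughly 250 MB/s usable per direction; the device figures are the ones reported in the message, not independent measurements.)

    PCIE1_X1 = 250e6                  # assumed ~250 MB/s usable on a PCIe 1.x x1 slot
    SMALL_IO = 512                    # the 512-byte jobs mentioned above

    print(f"220 MB/s of 64 KiB writes is ~{220e6 / PCIE1_X1:.0%} of the x1 link")
    print(f"150K IOPS at 512 B is only ~{150_000 * SMALL_IO / 1e6:.0f} MB/s, "
          f"so tiny writes hit the IOPS ceiling long before the bus limit")
    # -> ~88% of the x1 link; ~77 MB/s for the 512-byte case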
On 15/06/2010 12:42, "Arve Paalsrud" <arve.paalsrud at gmail.com> wrote:
> Hi,
>
> We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
> Our hardware specifications are as follows:
>
> Quad AMD G34 12-core 2.3 GHz (~110 GHz)
> 10 Crucial RealSSD (6Gb/s)
> 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
> LSI2008SAS (two 4x ports)
> Mellanox InfiniBand 40 Gbit NICs

I was told that IB support in Nexenta is scheduled to be released in 3.0.4 (beginning of July).

> 128 GB RAM
>
> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
>
> Both the L2ARC and the ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution. Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site location (18km away - a round trip of 0.16ms plus equipment latency). So in case the ZIL in Site A takes a fall, or the motherboard/disk group dies - we still have safety.
>
> DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool is full (at capacity).
>
> Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and don'ts, and any other information that could be of help while building and configuring this?
>
> What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.

As VMware does not currently support NFS over RDMA, you will need to stick with IPoIB, which will suffer from some performance implications inherent to the traditional TCP/IP stack. You could also use iSER or SRP, which are both supported.

> Let's talk moon landings.
>
> Regards,
> Arve

-- 
Przem
> I mean, could I stripe across multiple devices to be able to handle higher throughput?

Absolutely. Striping four DDRdrive X1s (a 16GB dedicated log) is extremely simple. Each X1 has its own dedicated IOPS controller, which is critical for approaching linear synchronous-write scalability. The same principles and benefits of multi-core processing apply here with multiple controllers. The performance potential of NVRAM-based SSDs dictates moving away from a single/separate HBA-based controller.

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com
-- 
This message posted from opensolaris.org