Richard Connamacher
2009-Sep-28 22:08 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
I'm looking at building a high-bandwidth file server to store video for editing, as an alternative to buying a $30,000 hardware RAID and spending $2,000 per seat on Fibre Channel and specialized SAN drive software.

Uncompressed HD runs around 1.2 to 4 gigabits per second, putting it in 10 Gigabit Ethernet or Fibre Channel territory. Any file server would have to be able to move that many bits in sustained read and sustained write, and doing both simultaneously would be a plus.

If the drives were plentiful enough and fast enough, could a RAID-Z (on currently available off-the-shelf hardware) keep up with that?

Thanks!
--
This message posted from opensolaris.org
Bob Friesenhahn
2009-Sep-28 22:50 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On Mon, 28 Sep 2009, Richard Connamacher wrote:

> I'm looking at building a high-bandwidth file server to store video
> for editing, as an alternative to buying a $30,000 hardware RAID and
> spending $2,000 per seat on Fibre Channel and specialized SAN drive
> software.
>
> Uncompressed HD runs around 1.2 to 4 gigabits per second, putting it
> in 10 Gigabit Ethernet or Fibre Channel territory. Any file server
> would have to be able to move that many bits in sustained read and
> sustained write, and doing both simultaneously would be a plus.

Please see a white paper I wrote entitled "ZFS and Digital Intermediate" at http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-and-di.pdf which expounds on this topic and makes it sound like ZFS is the perfect answer for this.

Unfortunately, I have since learned that ZFS file prefetch ramps up too slowly, or becomes disabled, for certain workloads. I reported a bug. Many here are eagerly awaiting the next OpenSolaris development release, which is supposed to have fixes for the prefetch problem I encountered. I am told that a Solaris 10 IDR (customer-specific patch) will be provided to me within the next few days to resolve the performance issue.

There is another performance issue in which writes to the server cause reads to briefly stop periodically. This means that the server could not be used simultaneously for video playback while files are being updated. To date there is no proposed solution for this problem.

Linux XFS seems like the top contender for video playback and editing. From a description of XFS design and behavior, it would not surprise me if it stuttered during playback when files are updated as well. Linux XFS also buffers written data and writes it out in large batches at a time.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Richard Connamacher
2009-Sep-28 23:47 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Thanks for the detailed information. When you get the patch, I'd love to hear if it fixes the problems you're having. From my understanding, a working prefetch would keep video playback from stuttering whenever the drive head moves - is this right?

The inability to read and write simultaneously (within reason) would be frustrating for a shared video editing server. I wonder if ZFS needs more parallelism? If any software RAID ends up having a similar problem, then we might have to go with the hardware RAID setups I'm trying to avoid. I wonder if there's any way to work around that. Would a bigger write cache help? Or adding an SSD for the cache (ZFS Intent Log)? Would Linux software RAID be any better?

Assuming they fix the prefetch performance issues you talked about, do you think ZFS would be able to keep up with uncompressed 1080p HD or 2K?
--
This message posted from opensolaris.org
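On the SSD question: ZFS can take a dedicated log device (for the ZFS Intent Log, which helps synchronous writes) and a cache device (L2ARC, which helps repeated reads). A minimal sketch of how they are added - the pool name "tank" and the SSD device names are placeholders, not anything from this thread:

# Add a hypothetical SSD as a separate intent-log device (synchronous writes)
zpool add tank log c2t0d0

# Add another hypothetical SSD as an L2ARC cache device (read caching)
zpool add tank cache c2t1d0

# The new devices should appear under "logs" and "cache"
zpool status tank

Neither device is likely to help large sequential streaming much by itself: the intent log only matters for synchronous writes, and first-pass streaming reads of fresh material mostly bypass the cache.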
Bob Friesenhahn
2009-Sep-29 00:22 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On Mon, 28 Sep 2009, Richard Connamacher wrote:

> Thanks for the detailed information. When you get the patch, I'd
> love to hear if it fixes the problems you're having. From my
> understanding, a working prefetch would keep video playback from
> stuttering whenever the drive head moves - is this right?

For me, aggressive prefetch is most important in order to schedule reads from enough disks in advance to produce a high data rate. This is because I am using mirrors. When using raidz or raidz2 the situation should be a bit different because raidz is striped. The prefetch bug which is specifically fixed shows up when using thousands of files in the 5MB-8MB range, which is typical for film post-production: prefetch becomes disabled if the file had been accessed before but its data is no longer in cache. When doing video playback, it is typical to be reading from several files at once in order to avoid the potential for read "stutter".

> The inability to read and write simultaneously (within reason) would
> be frustrating for a shared video editing server. I wonder if ZFS
> needs more parallelism? If any software RAID ends up having a

ZFS has a lot of parallelism since it is optimized for large data servers.

> similar problem, then we might have to go with the hardware RAID
> setups I'm trying to avoid. I wonder if there's any way to work
> around that. Would a bigger write cache help? Or adding an SSD for
> the cache (ZFS Intent Log)? Would Linux software RAID be any better?

The problem seems to be that ZFS uses a huge write cache by default and delays flushing it (for up to 30 seconds), so that when the write cache is flushed, it maximally engages the write channel for up to 5 seconds. Decreasing the size of the write cache diminishes the size of the problem.

> Assuming they fix the prefetch performance issues you talked about,
> do you think ZFS would be able to keep up with uncompressed 1080p HD
> or 2K?

That is not clear to me yet. With my setup, I can read up to 550MB/second from a large file. That is likely the hardware limit for me. But when reading one-at-a-time from individual 5MB or 8MB files, the data rate is much less (around 130MB/second). I am using Solaris 10. OpenSolaris performance seems to be better than Solaris 10.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
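For anyone who wants to experiment with shrinking the write cache Bob describes, the usual place is /etc/system. A minimal sketch only - the tunable names and safe values vary by Solaris/OpenSolaris build, so treat them as assumptions to verify against your release notes before relying on them:

* /etc/system entries; a reboot is required for them to take effect

* Cap the dirty data buffered per transaction group
* (value is in bytes; 512MB is shown purely as an example)
set zfs:zfs_write_limit_override = 0x20000000

* Flush transaction groups more often than the old 30-second default
set zfs:zfs_txg_timeout = 5

Smaller, more frequent transaction groups trade some aggregate write throughput for steadier read latency during playback, and these are global settings that affect every pool on the host.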
Richard Connamacher
2009-Sep-29 01:25 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
> For me, aggressive prefetch is most important in order to schedule
> reads from enough disks in advance to produce a high data rate. This
> is because I am using mirrors. When using raidz or raidz2 the
> situation should be a bit different because raidz is striped. The
> prefetch bug which is specifically fixed is when using thousands of
> files in the 5MB-8MB range which is typical for film post-production.
> The bug is that prefetch becomes disabled if the file had been
> accessed before but its data is no longer in cache.

I'm planning on using RAID-Z2 if it can keep up with my bandwidth requirements. So maybe ZFS could be an option after all?

> That is not clear to me yet. With my setup, I can read up to
> 550MB/second from a large file. That is likely the hardware limit for
> me. But when reading one-at-a-time from individual 5MB or 8MB files,
> the data rate is much less (around 130MB/second).

By MB do you mean mega*byte*? If so, 550 MB is more than enough for uncompressed 1080p. If you mean mega*bit*, then that's not enough. But as you said, you're using a mirrored setup, and RAID-Z should be faster.

This might work for Final Cut editing using QuickTime files. But FX and color grading using TIFF frames at 130 MB/s would slow your setup to a crawl. Do you think RAID-Z would help here?

Thanks!
--
This message posted from opensolaris.org
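For a rough sense of scale, a back-of-the-envelope check, assuming 24 fps and common 10-bit uncompressed formats (actual rates depend on frame rate, bit depth, and chroma subsampling):

1080p 10-bit 4:2:2 @ 24 fps: 1920 x 1080 x 2.5 bytes  ~  5.2 MB/frame  ->  ~125 MB/s  (~1.0 Gbit/s)
1080p 10-bit 4:4:4 @ 24 fps: 1920 x 1080 x 3.75 bytes ~  7.8 MB/frame  ->  ~187 MB/s  (~1.5 Gbit/s)
2K DPX 10-bit RGB  @ 24 fps: 2048 x 1556 x 4 bytes    ~ 12.7 MB/frame  ->  ~306 MB/s  (~2.4 Gbit/s)

By that arithmetic, 550 MByte/s of sustained reads covers a single uncompressed 2K stream with headroom, while 130 MByte/s is marginal even for one 10-bit 1080p stream.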
Bob Friesenhahn
2009-Sep-29 02:07 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On Mon, 28 Sep 2009, Richard Connamacher wrote:

> I'm planning on using RAID-Z2 if it can keep up with my bandwidth
> requirements. So maybe ZFS could be an option after all?

ZFS certainly can be an option. If you are willing to buy Sun hardware, they have a "try and buy" program which would allow you to set up a system to evaluate if it will work for you. Otherwise you can use a high-grade Brand-X server and a decent-grade Brand-X JBOD array to test on. The Sun Storage 7000 series has OpenSolaris and ZFS inside but is configured and sold as a closed-box NAS. The X4540 server is fitted with 48 disk drives and is verified to be able to deliver 2.0GB/second to a network.

> By MB do you mean mega*byte*? If so, 550 MB is more than enough for
> uncompressed 1080p. If you mean mega*bit*, then that's not enough.
> But as you said, you're using a mirrored setup, and RAID-Z should be
> faster.

Yes, I mean megabytes. This is a 12-drive StorageTek 2540 with two 4Gbit FC links. I am getting a peak of more than one FC link can carry (550MB/second with a huge file). A JBOD SAS array would be a much better choice now, but these products had not yet come to market when I ordered my hardware.

> This might work for Final Cut editing using QuickTime files. But FX
> and color grading using TIFF frames at 130 MB/s would slow your
> setup to a crawl. Do you think RAID-Z would help here?

There is no reason why RAID-Z is necessarily faster at sequential reads than mirrors; in fact, mirrors can be faster due to fewer disk seeks. With mirrors, it is theoretically possible to schedule reads from all 12 of my disks at once. It is just a matter of the tunings/options that the ZFS implementors decide to provide.

Here are some iozone measurements (taken June 28th) with different record sizes running up to a 64GB file size:

              KB  reclen   write  rewrite     read   reread
         8388608      64  482097   595557  1851378  1879145
         8388608     128  429126   621319  1937128  1944177
         8388608     256  428197   646922  1954065  1965570
         8388608     512  489692   585971  1593610  1584573
        16777216      64  439880    41304   822968   841246
        16777216     128  443119   435886   815705   844789
        16777216     256  446006   475347   814529   687915
        16777216     512  436627   462599   787369   803182
        33554432      64  401110    41096   547065   553262
        33554432     128  404420   394838   549944   552664
        33554432     256  406367   400859   544950   553516
        33554432     512  401254   410153   554100   558650
        67108864      64  378158    40794   552623   555655
        67108864     128  379809   385453   549364   553948
        67108864     256  380286   377397   551060   550414
        67108864     512  378225   385588   550131   557150

It seems like every time I run the benchmark, the numbers have improved.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
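For anyone wanting to run the same kind of measurement on their own test pool, an iozone invocation roughly like the following produces the write/rewrite/read/reread columns shown above. The target path is a placeholder, and the file sizes should be chosen larger than RAM so the ARC does not mask the disks:

# Auto mode limited to sequential write/rewrite and read/reread tests,
# record sizes 64k-512k, file sizes 8GB-64GB, fsync included in timings
iozone -a -i 0 -i 1 -e -y 64k -q 512k -n 8g -g 64g -f /tank/iozone.tmp

Run it when the pool is otherwise idle, and delete the temporary file afterwards.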
Richard Connamacher
2009-Sep-29 03:11 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
I was thinking of custom building a server, which I think I can do for around $10,000 of hardware (using 45 SATA drives and a custom enclosure), and putting OpenSolaris on it. It's a bit of a risk compared to buying a $30,000 server, but would be a fun experiment.
--
This message posted from opensolaris.org
Bob Friesenhahn
2009-Sep-29 03:32 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On Mon, 28 Sep 2009, Richard Connamacher wrote:

> I was thinking of custom building a server, which I think I can do
> for around $10,000 of hardware (using 45 SATA drives and a custom
> enclosure), and putting OpenSolaris on it. It's a bit of a risk
> compared to buying a $30,000 server, but would be a fun experiment.

Others have done similar experiments with considerable success.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Erik Trimble
2009-Sep-29 05:15 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Bob Friesenhahn wrote:

> On Mon, 28 Sep 2009, Richard Connamacher wrote:
>
>> I was thinking of custom building a server, which I think I can do
>> for around $10,000 of hardware (using 45 SATA drives and a custom
>> enclosure), and putting OpenSolaris on it. It's a bit of a risk
>> compared to buying a $30,000 server, but would be a fun experiment.
>
> Others have done similar experiments with considerable success.
>
> Bob

Yes, but be careful of your workload on SATA disks. SATA can be very good for sequential read and write, but only under lower loads, even with a serious SSD cache. I'd want to benchmark things with your particular workload before using SATA instead of SAS.

To mention a couple of options: Sun's 7110 lists for $11k in the 2TB (SAS) disk configuration. If you have longer-term storage needs, look at an X4540 Thor (the replacement for the X4500 Thumpers). They're significantly more reliable and manageable than a custom-built solution, and reasonably cost-competitive ( >> $1/GB after discount).

Both the Thor and the 7110 are available for Try-and-Buy. Get them and test them against your workload - it's the only way to be sure (to paraphrase Ripley).

Not just for Sun kit, but I'd be very wary of using any no-service-contract hardware for something that is business critical, and I can't imagine your digital editing system isn't. Don't be penny-wise and pound-foolish.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Marc Bevand
2009-Sep-29 07:34 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Richard Connamacher <rich <at> indieimage.com> writes:

> I was thinking of custom building a server, which I think I can do for
> around $10,000 of hardware (using 45 SATA drives and a custom enclosure),
> and putting OpenSolaris on it. It's a bit of a risk compared to buying a
> $30,000 server, but would be a fun experiment.

Do you have a $2k budget to perform a cheap experiment? Because for this amount of money you can build the following server, which has 10TB of usable storage capacity and would roughly be able to sustain sequential reads between 500MByte/s and 1000MByte/s over NFS over a Myricom 10GbE NIC. This is my estimation. I am less sure about sequential writes: I think this server would be capable of at least 250-500 MByte/s.

$150  - Mobo with onboard 4-port AHCI SATA controller (e.g. any AMD 700
        chipset), and at least two x8 electrical PCI-E slots
$200  - Quad-core Phenom II X4 CPU + 4GB RAM
$150  - LSISAS1068E 8-port SAS/SATA HBA, PCI-E x8
$500  - Myri-10G NIC (10G-PCIE-8B-C), PCI-E x8
$1000 - 12 x 1TB SATA drives (4 on onboard AHCI, 8 on LSISAS1068E)

- It is important to choose an AMD platform because the PCI-E lanes will
  always come from the northbridge chipset, which is connected to the CPU
  via an HT 3.0 link. On Intel platforms, the DMI link between the ICH and
  MCH will be a bottleneck if the mobo gives you PCI-E lanes from the MCH
  (in my experience, this is the case with most desktop mobos).
- Make sure you enable AHCI in the BIOS.
- Configure the 12 drives as striped raidz vdevs:
  zpool create mytank raidz d0 d1 d2 d3 d4 d5 raidz d6 d7 d8 d9 d10 d11
- Buy drives able to sustain 120-130 MByte/s of sequential reads at the
  beginning of the platter (my recommendation: Seagate 7200.12). This way
  your 4Gbit/s requirement will be met even in the worst case, when reading
  from the end of the platters.

Thank me for saving you $28k :-) The above experiment would be a way to validate some of your ideas before building a 45-drive server...

-mrb
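Once a pool like that exists, a quick local sanity check of sequential throughput can be done with dd before involving the network at all. A rough sketch only - the mount path and sizes are placeholders, the test file should be several times larger than RAM, and compression should be left off (the default) so the zero-filled data actually hits the disks:

# Sequential write: roughly 32GB of zeroes in 1MB records
dd if=/dev/zero of=/mytank/ddtest bs=1024k count=32768

# Sequential read of the same file back
dd if=/mytank/ddtest of=/dev/null bs=1024k

# Watch per-vdev activity in another terminal while the above runs
zpool iostat -v mytank 5

If the local numbers fall short of the 4Gbit/s target, no amount of 10GbE tuning will recover it.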
Or just "try and buy" the machines from Sun for ZERO DOLLARS!!! Like Erik said...... "Both the Thor and 7110 are available for Try-and-Buy. Get them and test them against your workload - it''s the only way to be sure (to paraphrase Ripley)." Marc Bevand wrote: Richard Connamacher indieimage.com> writes: I was thinking of custom building a server, which I think I can do for around $10,000 of hardware (using 45 SATA drives and a custom enclosure), and putting OpenSolaris on it. It''s a bit of a risk compared to buying a $30,000 server, but would be a fun experiment. Do you have a $2k budget to perform a cheap experiment? Because for this amount of money you can build the following server that has 10TB of usable storage capacity, and that would be roughly able to sustain sequential reads between 500MByte/s and 1000MByte/s over NFS over a Myricom 10GbE NIC. This is my estimation. I am less sure about sequential writes: I think this server would be capable of at least 250-500 MByte/s. $150 - Mobo with onboard 4-port AHCI SATA controller (eg. any AMD 700 chipset), and at least two x8 electrical PCI-E slots $200 - Quad-core Phenom II X4 CPU + 4GB RAM $150 - LSISAS1068E 8-port SAS/SATA HBA, PCI-E x8 $500 - Myri-10G NIC (10G-PCIE-8B-C), PCI-E x8 $1000 - 12 x 1TB SATA drives (4 on onboard AHCI, 8 on LSISAS1068E) - It is important to choose an AMD platform because the PCI-E lanes will always come from the northbridge chipset which is connected to the CPU via an HT 3.0 link. On Intel platforms, the DMI link between the ICH and MCH will be a bottleneck if the mobo gives you PCI-E lanes from the MCH (in my experience, this is the case of most desktop mobos). - Make sure you enable AHCI in the BIOS. - Configure the 12 drives as striped raidz vdevs: zpool create mytank raidz d0 d1 d2 d3 d4 d5 raidz d6 d7 d8 d9 d10 d11 - Buy drives able to sustain 120-130 MByte/s of sequential reads at the beginning of the platter (my recommendation: Seagate 7200.12) this way your 4Gbit/s requirement will be met even in the worst case when reading from the end of the platters. Thank me for saving you $28k :-) The above experiment would be a way to validate some of your ideas before building a 45-drive server... -mrb _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Richard Connamacher
2009-Sep-29 20:31 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Bob, thanks for the tips. Before building a custom solution, I want to do my due diligence and make sure that, for every part that can go bad, I've got a backup ready to be swapped in at a moment's notice. But I am seriously considering the alternative as well: paying more to get something with a good support contract.
--
This message posted from opensolaris.org
Richard Connamacher
2009-Sep-29 20:38 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Marc,

Thanks for the tips! I was looking at building a smaller-scale version of it first with maybe eight 1.5TB drives, but I like your idea better. I'd probably use 1.5TB drives since the cost per gigabyte is about the same now.
--
This message posted from opensolaris.org
Richard Connamacher
2009-Sep-29 22:00 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Also, one of those drives will need to be the boot drive. (Even if it's possible, I don't want to boot from the data drive; I need to keep it focused on video storage.) So it'll end up being 11 drives in the RAID-Z.
--
This message posted from opensolaris.org
Marc Bevand
2009-Sep-30 02:23 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Richard Connamacher <rich <at> indieimage.com> writes:

> Also, one of those drives will need to be the boot drive.
> (Even if it's possible, I don't want to boot from the
> data drive; I need to keep it focused on video storage.)

But why? By allocating 11 drives instead of 12 to your data pool, you will reduce the max sequential I/O throughput by approximately 10%, which is significant... If I were you I would format every 1.5TB drive like this:
* 6GB slice for the root fs
* 1494GB slice for the data fs

And create an N-way mirror for the root fs, with N in [2..12]. I would rather lose 6/1500 = 0.4% of storage capacity than lose 10% of I/O throughput. I/O activity on the root fs will be insignificant and will have zero perf impact on your data pool.

-mrb
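A sketch of what that layout could look like in practice, assuming the disks show up as c0t0d0 through c0t11d0 (placeholder names), with slice 0 as the small root slice and slice 1 as the large data slice, both created with format beforehand:

# Data pool striped across two raidz vdevs built from the big s1 slices
zpool create tank \
    raidz c0t0d0s1 c0t1d0s1 c0t2d0s1 c0t3d0s1 c0t4d0s1 c0t5d0s1 \
    raidz c0t6d0s1 c0t7d0s1 c0t8d0s1 c0t9d0s1 c0t10d0s1 c0t11d0s1

# The installer creates rpool on c0t0d0s0; additional mirror sides are
# attached afterwards, one zpool attach per extra disk
zpool attach rpool c0t0d0s0 c0t1d0s0
zpool attach rpool c0t0d0s0 c0t2d0s0

Booting from the extra mirror sides also requires installing boot blocks on each of them (installgrub on x86), otherwise only the original disk remains bootable.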
Frank Middleton
2009-Sep-30 13:18 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On 09/29/09 10:23 PM, Marc Bevand wrote:

> If I were you I would format every 1.5TB drive like this:
> * 6GB slice for the root fs

As noted in another thread, 6GB is way too small. Based on actual experience, an upgradable rpool must be more than 20GB. I would suggest at least 32GB; out of 1.5TB that's still negligible. Recent release notes for image-update say that at least 8GB free is required for an update. snv111b, as upgraded from a CD-installed image, takes more than 11GB without any user applications like Firefox. Note also that a nominal 1.5TB drive really only has 1.36TB of actual space as reported by zfs.

Can't speak to the 12-way mirror idea, but if you go this route you might keep some slices for rpool backups. I have found having a disk with such a backup invaluable... How do you plan to do backups in general?

Cheers -- Frank
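On the backup question, one common approach is to snapshot the root pool and replicate it to a spare disk or slice with zfs send/receive. A minimal sketch, assuming a spare pool named "backup" already exists (pool and snapshot names are placeholders, and exact receive flags vary by build):

# Recursive snapshot of the root pool
zfs snapshot -r rpool@backup-20090930

# Replicate the pool, including child datasets and properties,
# without mounting the received copies
zfs send -R rpool@backup-20090930 | zfs receive -Fu backup/rpool

A received copy of an rpool is not bootable by itself - boot blocks and mountpoints still need attention - but it makes recovering from a failed image-update far less painful.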
paul at paularcher.org
2009-Sep-30 13:22 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
> Also, one of those drives will need to be the boot drive. (Even if it's
> possible, I don't want to boot from the data drive; I need to keep it
> focused on video storage.) So it'll end up being 11 drives in the raid-z.
> --
> This message posted from opensolaris.org

FWIW, most enclosures like the ones we have been discussing lately have an internal bay for a boot/OS drive - so you'll probably have all 12 hot-swap bays available for data drives.

Paul
Orvar Korvar
2009-Sep-30 15:58 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Many sysadmins recommend raidz2. The reason is that if a drive breaks and you have to rebuild your array, it will take a long time with a large drive. With a 4TB drive or larger, it could take a week to rebuild your array! During that week there will be heavy load on the rest of the drives, which may break another drive - and then all your data is lost. Are you willing to risk that? With 2TB drives, a rebuild might take 24 hours or more.

I will soon be migrating to raidz2 because I expect to keep swapping my drives for larger and larger ones - first 2TB drives, then 4TB drives - while keeping the same number of drives in my array. When I eventually have 4TB drives, I will be glad I opted for raidz2.
--
This message posted from opensolaris.org
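Applied to the 12-drive layout discussed earlier in the thread, double parity costs two more drives' worth of capacity than single parity. A sketch with placeholder device names; zpool status shows resilver progress after a drive is swapped:

# Two 6-disk raidz2 vdevs: each vdev survives two simultaneous drive failures
zpool create tank \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
    raidz2 c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0

# After physically replacing a failed disk, start the resilver and watch it
zpool replace tank c0t3d0
zpool status -v tank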
Marc Bevand
2009-Sep-30 16:59 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
Frank Middleton <f.middleton <at> apogeect.com> writes:

> As noted in another thread, 6GB is way too small. Based on
> actual experience, an upgradable rpool must be more than
> 20GB.

It depends on how minimal your install is. The OpenSolaris install instructions recommend 8GB minimum, and I have one OpenSolaris 2009.06 server using about 4GB, so I thought 6GB would be sufficient. That said, I have never upgraded the rpool of this server, so based on your comments I would recommend an rpool of 15GB to the original poster.

-mrb
Frank Middleton
2009-Sep-30 18:42 UTC
[zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On 09/30/09 12:59 PM, Marc Bevand wrote:

> It depends on how minimal your install is.

Absolutely minimalist install from live CD, subsequently updated via pkg to snv111b. This machine is an old 32-bit PC used now as an X-terminal, so it doesn't need any additional software. It now has a bigger slice of a larger pair of disks :-). snv122 also takes around 11GB after emptying /var/pkg/download.

# uname -a
SunOS host8 5.11 snv_111b i86pc i386 i86pc Solaris

# df -h
Filesystem                Size  Used  Avail  Use%  Mounted on
rpool/ROOT/opensolaris-2   34G   13G    22G   37%  /
....

There's around 765MB in /var/pkg/download that could be deleted, and 1GB's worth of snapshots left by previous image-updates, bringing it down to around 11GB - consistent with a minimalist SPARC snv122 install with /var/pkg/download emptied and all but the current BE and all snapshots deleted.

> The OpenSolaris install instructions recommend 8GB minimum, I have

It actually says 8GB free space required. This is on top of the space used by the base installation. This 8GB makes perfect sense when you consider that the baseline has to be snapshotted, and new code has to be downloaded and installed in a way that can be rolled back. I can't explain why the snv111b baseline is 11GB vs. the 6GB of the initial install, but this was a default install followed by default image-updates.

> one OpenSolaris 2009.06 server using about 4GB, so I thought 6GB
> would be sufficient. That said I have never upgraded the rpool of
> this server, but based on your comments I would recommend an rpool
> of 15GB to the original poster.

The absolute minimum for an upgradable rpool is 20GB, for both SPARC and x86. This assumes you religiously purge all unnecessary files (such as /var/pkg/download) and keep swap, /var/dump, /var/crash and /opt on another disk. You *really* don't want to run out of space doing an image-update: the result is likely to require restoring the rpool from backup, or at best losing some space that seems to vanish down a black hole. Technically, the rpool was recovered from a baseline snapshot several times onto a 20GB disk until I figured out empirically that 8GB of free space was required for the image-update. I really doubt your mileage will vary. Prudence says that 32GB is much safer...

Cheers -- Frank
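A short sketch of the cleanup steps Frank describes, using standard OpenSolaris tools; the boot environment name is a placeholder, so check beadm list for the real names on your system before destroying anything:

# See what is actually consuming the root pool
zfs list -r rpool
zfs list -t snapshot -r rpool

# Empty the IPS download cache left behind by image-update
rm -rf /var/pkg/download/*

# List boot environments and destroy the ones no longer needed
beadm list
beadm destroy opensolaris-1

Destroying old boot environments also removes their snapshots and clones, which is usually where the "vanished" rpool space turns out to be hiding.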