I think this discussion has probably come up here already, but I couldn't find much in the archives. Would you be able to comment on, or correct, whatever looks wrong?

What do people think is the more adequate option to use with Gluster in terms of the RAID underneath, for a good balance between cost, usable space and performance? I have thought about two main options, with their pros and cons.

No RAID (individual hot-swappable disks):

Each disk is a brick on its own (server:/disk1, server:/disk2, etc.), so no RAID controller is required. As the data is replicated, if one disk fails the data must exist on another disk on another node.

Pros:
- Cheaper to build, as there is no cost for an expensive RAID controller.
- Improved performance, as writes only have to go to a single disk, not across an entire RAID 5/6 array.
- Better usage of the raw space, as no disks are given up to parity as in RAID 5/6.

Cons:
- If a failed disk gets replaced, the data needs to be replicated over the network (not a big deal if using InfiniBand or a 1 Gbps+ network).
- The biggest possible file size is the size of one disk when using the Distributed volume type.

In this case, does anyone know whether a replaced disk needs to be manually formatted and mounted? (A rough sketch of this per-disk layout follows at the end of this message.)

RAID controller:

Using a RAID controller with battery backup can improve performance, especially by caching writes in the controller's memory, but in the end a single array means roughly the performance of one disk per brick. RAID also requires either 1 or 2 disks for parity. If using very cheap disks it is probably better to use RAID 6; with better-quality disks RAID 5 should be fine since, again, the data is replicated to another RAID 5 array on another node.

Pros:
- Can create a larger array as a single brick, in order to fit bigger files when using the Distributed volume type.
- Disk rebuild should be quicker (and more automated?).

Cons:
- Extra cost of the RAID controller.
- Performance of the array is equivalent to a single disk, plus the RAID controller's caching features.
- RAID doesn't scale well beyond ~16 disks.

Attaching a JBOD to a node and creating multiple RAID arrays (or using a single server with more disk slots), instead of adding a new node, can save power (no extra CPU, memory or motherboard needed). But with multiple bricks on the same node, it might happen that data is replicated within the same node, making the downtime of that node critical. Or is Gluster smart enough to replicate data to a brick on a different node?

Regards,

Fernando
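As a rough illustration of the no-RAID, one-brick-per-disk layout described above, a minimal sketch; the device names, mount points, volume name and the XFS inode-size option are assumptions for the example, and exact steps vary by distribution and Gluster version:

    # Hypothetical per-disk setup on each node: one filesystem per data disk
    mkfs.xfs -i size=512 /dev/sdb
    mkdir -p /export/disk1
    mount /dev/sdb /export/disk1        # add to /etc/fstab in practice

    # One brick per disk; with replica 2, consecutive bricks form a replica
    # pair, so order them so each pair spans the two servers:
    gluster volume create myvol replica 2 \
        server1:/export/disk1 server2:/export/disk1 \
        server1:/export/disk2 server2:/export/disk2
    gluster volume start myvol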
Hi,

Some corrections...

> Cons:
>
> Extra cost of the RAID controller.
>
> Performance of the array is equivalent a single disk + RAID controller
> caching features.
>
> RAID doesn't scale well beyond ~16 disks

Performance of the array is not equivalent to a single disk, and it doesn't depend only on cache size or spec features - it depends on the total IOPS, block sizes, access type, etc.

RAID scales well beyond 16 disks, e.g. with Adaptec controllers. Yes, it will scale; and whether it is software or hardware, array reconfiguration and growth face the same kind of problem - data needs to be reallocated. For Adaptec, for example:

Maximum number of arrays that can be created on the same set of drives: 4
Maximum logical drive size: 512 TB
Maximum number of drives in a striped array (such as RAID 0): 128
Maximum number of drives in a RAID 5 array: 32
Maximum number of drives in a RAID 50 array: 32
Maximum number of drives in a RAID 6 array: 32
Maximum number of drives in a RAID 60 array: 32
Available stripe sizes for arrays are 16, 32, 64, 128, 256, 512, or 1024 KB.
Striped RAID configurations have a default stripe size of 256 KB.

Note: A RAID 10, RAID 50, or RAID 60 array cannot have more than 32 legs when created using the Build method. Maximum disk drive count is only limited by RAID level. For instance:

- a RAID 10 array built with 32 RAID 1 legs (64 disk drives) is supported
- a RAID 50 array built with 32 RAID 5 legs (number of drives will vary) is also supported

Best regards,
George Machitidze
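The point that growing or reshaping an array means reallocating the existing data holds for software RAID as well; a minimal Linux md sketch of the same kind of operation (device names and the mount point are placeholders, not anything from this thread):

    # Add a new disk to an existing md array and reshape it to use the disk
    mdadm --add /dev/md0 /dev/sdf
    mdadm --grow /dev/md0 --raid-devices=6

    # The reshape rewrites existing data across all member disks; progress
    # is visible in /proc/mdstat and can take many hours on large arrays
    cat /proc/mdstat

    # Once the reshape completes, grow the filesystem on top (XFS here)
    xfs_growfs /export/brick1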
On 14 Jun 2012, at 15:22, "Fernando Frediani (Qube)" <fernando.frediani at qubenet.net> wrote:

> Well, as far as I know the amount of IOPS you can get from a RAID 5/6 is
> the same as you get from a single disk. The write cannot be acknowledged
> until it is written to all the data and parity disks.

It can exceed that with battery backup on the controller. With battery backup, writes are often faster than reads (in IOPS, latency and throughput alike), at least until you hit the cache size limit. Sustained writes will not get such good performance because of the limit you mention, but random writes can still do pretty well, YMMV.

If you want to scale writes properly, you need some variant of RAID 10. I've got one server with RAID 10 across 6 SSDs; it works well.

Marcus

--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
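To see how much a battery-backed write cache (or SSDs) actually helps on a given array, a quick random-write vs random-read comparison with fio is one way to check. This is a generic benchmarking sketch, not something from the thread; the file path, size, runtime and queue depth are arbitrary:

    # Random 4k writes, bypassing the page cache, against a file on the array
    fio --name=randwrite --filename=/export/brick1/fio.test --rw=randwrite \
        --bs=4k --size=2G --runtime=60 --time_based \
        --ioengine=libaio --iodepth=32 --direct=1 --group_reporting

    # Compare with random 4k reads on the same file
    fio --name=randread --filename=/export/brick1/fio.test --rw=randread \
        --bs=4k --size=2G --runtime=60 --time_based \
        --ioengine=libaio --iodepth=32 --direct=1 --group_reporting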
On Thu, Jun 14, 2012 at 11:06:32AM +0000, Fernando Frediani (Qube) wrote:

> No RAID (individual hot swappable disks):
>
> Each disk is a brick individually (server:/disk1, server:/disk2, etc)
> so no RAID controller is required. As the data is replicated, if one
> fails the data must exist on another disk on another node.
>
> Pros:
>
> Cheaper to build as there is no cost for an expensive RAID controller.

Except that software (md) RAID is free and works with an HBA.

> Improved performance as writes have to be done only on a single disk,
> not in the entire RAID 5/6 array.
>
> Make better usage of the raw space as there is no disk for parity on a
> RAID 5/6
>
> Cons:
>
> If a failed disk gets replaced the data needs to be replicated over the
> network (not a big deal if using Infiniband or 1 Gbps+ network)
>
> The biggest file size is the size of one disk if using a volume type
> Distributed.

Additional cons:

* You will probably need to write your own tools to monitor and notify you when a disk fails in the array (whereas there are easily-available existing tools for md RAID, including e-mail notifications and SNMP integration).

* The process of swapping a disk is not a simple hot-swap: you need to replace the failed drive, mkfs a new filesystem, and re-introduce it into the gluster volume. This is something you will need to document procedures for and test carefully, whereas RAID swaps are relatively a no-brainer.

* For a large configuration with hundreds of drives, it can become ungainly to have a gluster volume with hundreds of bricks.

> RAID doesn't scale well beyond ~16 disks

But you can group your disks into multiple RAID volumes.

> Attaching a JBOD to a node and creating multiple RAID arrays (or a
> single server with more disk slots) instead of adding a new node can
> save power (no need for CPU, memory, motherboard), but with multiple
> bricks on the same node it might happen that data is replicated inside
> the same node, making the downtime of a node critical - or is Gluster
> smart enough to replicate data to a brick in a different node?

It's not automatic; you configure it explicitly. If your replica count is 2 then you give it pairs of bricks, and data will be replicated onto each brick in the pair. It's your responsibility to ensure that those two bricks are on different servers, if high availability is your concern.

Another alternative to consider: RAID 10 on each node. It eliminates the performance penalty of RAID 5/6, and indeed will give you improved read performance compared to single disks, but it halves your available storage capacity. (A rough sketch of this layout follows below.)

You can of course mix and match, e.g. RAID 5 for backup volumes, RAID 10 for highly active read/write volumes, some gluster volumes replicated and some not, etc. This can become a management headache if it gets too complex, though.
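A minimal sketch of the RAID 10-per-node alternative using Linux md software RAID; the device names, paths and volume name are assumptions, and a hardware controller would simply replace the mdadm steps:

    # Build a software RAID 10 array from four data disks on each node
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # One filesystem per array, mounted as a single large brick
    mkfs.xfs -i size=512 /dev/md0
    mkdir -p /export/brick1
    mount /dev/md0 /export/brick1

    # With replica 2, consecutive bricks form a replica pair, so make sure
    # each pair spans two different servers:
    gluster volume create bigvol replica 2 \
        server1:/export/brick1 server2:/export/brick1

    # mdadm can also notify you by e-mail when a member disk fails (one of
    # the monitoring advantages mentioned above):
    mdadm --monitor --scan --daemonise --mail=admin@example.com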
On 06/14/2012 07:06 AM, Fernando Frediani (Qube) wrote:

> I think this discussion probably came up here already but I couldn't
> find much in the archives. Would you be able to comment or correct
> whatever might look wrong.
>
> What options do people think are more adequate to use with Gluster in
> terms of RAID underneath and a good balance between cost, usable space
> and performance. I have thought about two main options with their pros
> and cons.
>
> No RAID (individual hot swappable disks):
>
> Each disk is a brick individually (server:/disk1, server:/disk2, etc) so
> no RAID controller is required. As the data is replicated, if one fails
> the data must exist on another disk on another node.

For this to work well, you need the ability to mark a disk as failed and as ready for removal, or to migrate all data on a disk over to a new disk. Gluster only has the last capability, and doesn't have the rest. You still need additional support in the OS and tool sets. The tools we've developed for DeltaV and siFlash help in this regard, though I wouldn't suggest using Gluster in this mode.

> Pros:
>
> Cheaper to build as there is no cost for an expensive RAID controller.

If a $500 USD RAID adapter saves you $1000 USD of time/expense over its lifetime due to failed-disk alerts, hot-swap autoconfiguration, etc., is it "really" expensive? Of course, if you are at a university where you have infinite amounts of cheap labor, sure, it's expensive: cheaper to manage by throwing grad/undergrad students at it than with an HBA. That is, the word "expensive" has different meanings in different contexts ... and in storage, the $500 USD adapter may easily help reduce costs elsewhere in the system (usually in disk lifecycle management, as RAID's major purpose in life is to give you, the administrator, a fighting chance to replace a failed device before you lose your data).

> Improved performance as writes have to be done only on a single disk,
> not in the entire RAID 5/6 array.

Good for tiny writes. Bad for larger writes (>64 kB).

> Make better usage of the raw space as there is no disk for parity on a
> RAID 5/6
>
> Cons:
>
> If a failed disk gets replaced the data needs to be replicated over the
> network (not a big deal if using Infiniband or 1 Gbps+ network)

For a 100 MB/s pipe (a streaming disk read, which you don't normally get when you copy random files to/from disk), 1 GB = 10 seconds, so 1 TB = 10,000 seconds. This is the best-case scenario. In reality you will get some fraction of that disk read/write speed, so expect 10,000 seconds as the most optimistic (and unrealistic) estimate ... a lower bound on the time.

> The biggest file size is the size of one disk if using a volume type
> Distributed.

For some users this is not a problem, though several years ago we had users wanting to read/write *single* TB-sized files.

> In this case does anyone know if, when replacing a failed disk, it needs
> to be manually formatted and mounted?

In this model, yes. This is why the RAID adapter saves time, unless you have written/purchased "expensive" tools to do similar things.

> RAID Controller:
>
> Using a RAID controller with battery backup can improve the performance,
> especially caching the writes on the controller's memory, but at the end
> one single array means the equivalent performance of one disk for each
> brick. Also RAID requires either 1 or 2 disks for parity. If using

For large reads/writes, you typically get roughly N x single-disk performance (where N is the number of disks reduced by the number of parity disks and hot spares).
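To put rough numbers on the two estimates above; the 100 MB/s streaming figure and the 12-disk RAID 6 are illustrative assumptions, not measurements:

    # Best-case streaming rebuild of 1 TB over a single ~100 MB/s path
    awk 'BEGIN { bytes = 1e12; rate = 100e6; t = bytes / rate;
                 printf "%.0f s (~%.1f h) best case\n", t, t / 3600 }'
    # -> 10000 s (~2.8 h), before any real-world slowdown

    # Large sequential I/O on a 12-disk RAID 6: roughly (12 - 2) data spindles
    awk 'BEGIN { printf "~%d MB/s aggregate\n", (12 - 2) * 100 }'
    # -> ~1000 MB/s aggregate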
For small reads/writes you get one-disk (or less) performance. Basically, optimal reads/writes will be in multiples of the stripe width. Optimizing stripe width and chunk sizes for various applications is something of a black art, in that over-optimization for one size/app will negatively impact another.

> very cheap disks probably better use RAID 6, if using better quality
> ones should be fine RAID 5 as, again, the data is replicated to
> another RAID 5 on another node.

If you have more than 6 TB of data, use RAID 6 or RAID 10. RAID 5 shouldn't be used for TB-class storage on units with unrecoverable-error rates worse than about 1 in 10^17 bits read (you would hit a UCE during the rebuild for a failed drive, which would take out all your data ... not nice).

> Pros:
>
> Can create a larger array as a single brick in order to fit bigger files
> when using the Distributed volume type.
>
> Disk rebuild should be quicker (and more automated?)

More generally, management is nearly automatic, modulo physically replacing a drive.

> Cons:
>
> Extra cost of the RAID controller.

It's a cost-benefit analysis, and for lower-end storage units the analysis almost always comes out in favor of a reasonable RAID design.

> Performance of the array is equivalent a single disk + RAID controller
> caching features.

No ... see above.

> RAID doesn't scale well beyond ~16 disks

16 disks is the absolute maximum we would ever tie to a single RAID (or HBA). Most RAID processor chips can't handle the calculations for 16 disks: compare the performance of RAID 6 at 16 drives to that at 12 drives for similar-sized chunks and "optimal" I/O ... in most cases the performance delta isn't 16/12, 14/10, 13/9 or similar. It's typically a bit lower.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
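As a footnote to the RAID 5 / unrecoverable-read-error argument above, a back-of-envelope sketch; the 1-in-10^14-bits error rate and the 5 x 2 TB surviving drives are assumed, commonly quoted consumer-drive figures rather than numbers from this thread:

    # Probability of hitting at least one unrecoverable read error while
    # re-reading all surviving disks of a degraded RAID 5 during rebuild
    awk 'BEGIN { ber = 1e-14; bits = 5 * 2e12 * 8;
                 p = 1 - exp(bits * log(1 - ber));
                 printf "P(UCE during rebuild) ~ %.0f%%\n", p * 100 }'
    # -> roughly 55%: the rebuild is more likely than not to hit a bad sector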