I think this discussion has probably come up here already, but I couldn't find much in the archives. Would you be able to comment on, or correct, whatever looks wrong?

What do people think is the more adequate option to use with Gluster in terms of the RAID underneath, for a good balance between cost, usable space and performance? I have thought about two main options, with their pros and cons.

No RAID (individual hot-swappable disks):

Each disk is a brick on its own (server:/disk1, server:/disk2, etc.), so no RAID controller is required. As the data is replicated, if one disk fails the data must exist on another disk on another node.

Pros:
- Cheaper to build, as there is no cost for an expensive RAID controller.
- Improved performance, as writes only have to go to a single disk, not across an entire RAID 5/6 array.
- Better usage of the raw space, as no disks are given up to parity as in RAID 5/6.

Cons:
- If a failed disk gets replaced, the data needs to be replicated over the network (not a big deal if using InfiniBand or a 1 Gbps+ network).
- The biggest possible file size is the size of one disk when using the Distributed volume type.

In this case, does anyone know whether a replaced disk needs to be manually formatted and mounted? (A rough sketch of this per-disk layout follows at the end of this message.)

RAID controller:

Using a RAID controller with battery backup can improve performance, especially by caching writes in the controller's memory, but in the end a single array means roughly the performance of one disk per brick. RAID also requires either 1 or 2 disks for parity. If using very cheap disks it is probably better to use RAID 6; with better-quality disks RAID 5 should be fine since, again, the data is replicated to another RAID 5 array on another node.

Pros:
- Can create a larger array as a single brick, in order to fit bigger files when using the Distributed volume type.
- Disk rebuild should be quicker (and more automated?).

Cons:
- Extra cost of the RAID controller.
- Performance of the array is equivalent to a single disk, plus the RAID controller's caching features.
- RAID doesn't scale well beyond ~16 disks.

Attaching a JBOD to a node and creating multiple RAID arrays (or using a single server with more disk slots), instead of adding a new node, can save power (no extra CPU, memory or motherboard needed). But with multiple bricks on the same node, it might happen that data is replicated within the same node, making the downtime of that node critical. Or is Gluster smart enough to replicate data to a brick on a different node?

Regards,

Fernando
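As a rough illustration of the no-RAID, one-brick-per-disk layout described above, a minimal sketch; the device names, mount points, volume name and the XFS inode-size option are assumptions for the example, and exact steps vary by distribution and Gluster version:

    # Hypothetical per-disk setup on each node: one filesystem per data disk
    mkfs.xfs -i size=512 /dev/sdb
    mkdir -p /export/disk1
    mount /dev/sdb /export/disk1        # add to /etc/fstab in practice

    # One brick per disk; with replica 2, consecutive bricks form a replica
    # pair, so order them so each pair spans the two servers:
    gluster volume create myvol replica 2 \
        server1:/export/disk1 server2:/export/disk1 \
        server1:/export/disk2 server2:/export/disk2
    gluster volume start myvol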
Hi,

Some corrections...

> Cons:
>
> Extra cost of the RAID controller.
>
> Performance of the array is equivalent a single disk + RAID controller
> caching features.
>
> RAID doesn't scale well beyond ~16 disks

Performance of the array is not equivalent to a single disk, and it doesn't depend only on cache size or spec features - it depends on the total IOPS, block sizes, access type, etc.

RAID scales well beyond 16 disks, e.g. with Adaptec controllers. Yes, it will scale; and whether it is software or hardware, array reconfiguration and growth face the same kind of problem - data needs to be reallocated. For Adaptec, for example:

Maximum number of arrays that can be created on the same set of drives: 4
Maximum logical drive size: 512 TB
Maximum number of drives in a striped array (such as RAID 0): 128
Maximum number of drives in a RAID 5 array: 32
Maximum number of drives in a RAID 50 array: 32
Maximum number of drives in a RAID 6 array: 32
Maximum number of drives in a RAID 60 array: 32
Available stripe sizes for arrays are 16, 32, 64, 128, 256, 512, or 1024 KB.
Striped RAID configurations have a default stripe size of 256 KB.

Note: A RAID 10, RAID 50, or RAID 60 array cannot have more than 32 legs when created using the Build method. Maximum disk drive count is only limited by RAID level. For instance:

- a RAID 10 array built with 32 RAID 1 legs (64 disk drives) is supported
- a RAID 50 array built with 32 RAID 5 legs (number of drives will vary) is also supported

Best regards,
George Machitidze
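The point that growing or reshaping an array means reallocating the existing data holds for software RAID as well; a minimal Linux md sketch of the same kind of operation (device names and the mount point are placeholders, not anything from this thread):

    # Add a new disk to an existing md array and reshape it to use the disk
    mdadm --add /dev/md0 /dev/sdf
    mdadm --grow /dev/md0 --raid-devices=6

    # The reshape rewrites existing data across all member disks; progress
    # is visible in /proc/mdstat and can take many hours on large arrays
    cat /proc/mdstat

    # Once the reshape completes, grow the filesystem on top (XFS here)
    xfs_growfs /export/brick1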
On 14 Jun 2012, at 15:22, "Fernando Frediani (Qube)" <fernando.frediani at qubenet.net> wrote:

> Well, as far as I know the amount of IOPS you can get from a RAID 5/6 is
> the same as you get from a single disk. The write cannot be acknowledged
> until it is written to all the data and parity disks.

It can exceed that with battery backup on the controller. With battery backup, writes are often faster than reads (in IOPS, latency and throughput alike), at least until you hit the cache size limit. Sustained writes will not get such good performance because of the limit you mention, but random writes can still do pretty well, YMMV.

If you want to scale writes properly, you need some variant of RAID 10. I've got one server with RAID 10 across 6 SSDs; it works well.

Marcus

--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
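To see how much a battery-backed write cache (or SSDs) actually helps on a given array, a quick random-write vs random-read comparison with fio is one way to check. This is a generic benchmarking sketch, not something from the thread; the file path, size, runtime and queue depth are arbitrary:

    # Random 4k writes, bypassing the page cache, against a file on the array
    fio --name=randwrite --filename=/export/brick1/fio.test --rw=randwrite \
        --bs=4k --size=2G --runtime=60 --time_based \
        --ioengine=libaio --iodepth=32 --direct=1 --group_reporting

    # Compare with random 4k reads on the same file
    fio --name=randread --filename=/export/brick1/fio.test --rw=randread \
        --bs=4k --size=2G --runtime=60 --time_based \
        --ioengine=libaio --iodepth=32 --direct=1 --group_reporting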
On Thu, Jun 14, 2012 at 11:06:32AM +0000, Fernando Frediani (Qube) wrote:

> No RAID (individual hot swappable disks):
>
> Each disk is a brick individually (server:/disk1, server:/disk2, etc)
> so no RAID controller is required. As the data is replicated, if one
> fails the data must exist on another disk on another node.
>
> Pros:
>
> Cheaper to build as there is no cost for an expensive RAID controller.

Except that software (md) RAID is free and works with an HBA.

> Improved performance as writes have to be done only on a single disk,
> not in the entire RAID 5/6 array.
>
> Make better usage of the raw space as there is no disk for parity on a
> RAID 5/6
>
> Cons:
>
> If a failed disk gets replaced the data needs to be replicated over the
> network (not a big deal if using Infiniband or 1 Gbps+ network)
>
> The biggest file size is the size of one disk if using a volume type
> Distributed.

Additional cons:

* You will probably need to write your own tools to monitor and notify you when a disk fails in the array (whereas there are easily-available existing tools for md RAID, including e-mail notifications and SNMP integration).

* The process of swapping a disk is not a simple hot-swap: you need to replace the failed drive, mkfs a new filesystem, and re-introduce it into the gluster volume. This is something you will need to document procedures for and test carefully, whereas RAID swaps are relatively a no-brainer.

* For a large configuration with hundreds of drives, it can become ungainly to have a gluster volume with hundreds of bricks.

> RAID doesn't scale well beyond ~16 disks

But you can group your disks into multiple RAID volumes.

> Attaching a JBOD to a node and creating multiple RAID arrays (or a
> single server with more disk slots) instead of adding a new node can
> save power (no need for CPU, memory, motherboard), but with multiple
> bricks on the same node it might happen that data is replicated inside
> the same node, making the downtime of a node critical - or is Gluster
> smart enough to replicate data to a brick in a different node?

It's not automatic; you configure it explicitly. If your replica count is 2 then you give it pairs of bricks, and data will be replicated onto each brick in the pair. It's your responsibility to ensure that those two bricks are on different servers, if high availability is your concern.

Another alternative to consider: RAID 10 on each node. It eliminates the performance penalty of RAID 5/6, and indeed will give you improved read performance compared to single disks, but it halves your available storage capacity. (A rough sketch of this layout follows below.)

You can of course mix and match, e.g. RAID 5 for backup volumes, RAID 10 for highly active read/write volumes, some gluster volumes replicated and some not, etc. This can become a management headache if it gets too complex, though.
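A minimal sketch of the RAID 10-per-node alternative using Linux md software RAID; the device names, paths and volume name are assumptions, and a hardware controller would simply replace the mdadm steps:

    # Build a software RAID 10 array from four data disks on each node
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # One filesystem per array, mounted as a single large brick
    mkfs.xfs -i size=512 /dev/md0
    mkdir -p /export/brick1
    mount /dev/md0 /export/brick1

    # With replica 2, consecutive bricks form a replica pair, so make sure
    # each pair spans two different servers:
    gluster volume create bigvol replica 2 \
        server1:/export/brick1 server2:/export/brick1

    # mdadm can also notify you by e-mail when a member disk fails (one of
    # the monitoring advantages mentioned above):
    mdadm --monitor --scan --daemonise --mail=admin@example.com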
On 06/14/2012 07:06 AM, Fernando Frediani (Qube) wrote:

> I think this discussion probably came up here already but I couldn't
> find much in the archives. Would you be able to comment or correct
> whatever might look wrong.
>
> What options do people think are more adequate to use with Gluster in
> terms of RAID underneath and a good balance between cost, usable space
> and performance. I have thought about two main options with their pros
> and cons.
>
> No RAID (individual hot swappable disks):
>
> Each disk is a brick individually (server:/disk1, server:/disk2, etc) so
> no RAID controller is required. As the data is replicated, if one fails
> the data must exist on another disk on another node.

For this to work well, you need the ability to mark a disk as failed and as ready for removal, or to migrate all data on a disk over to a new disk. Gluster only has the last capability, and doesn't have the rest. You still need additional support in the OS and tool sets. The tools we've developed for DeltaV and siFlash help in this regard, though I wouldn't suggest using Gluster in this mode.

> Pros:
>
> Cheaper to build as there is no cost for an expensive RAID controller.

If a $500 USD RAID adapter saves you $1000 USD of time/expense over its lifetime due to failed-disk alerts, hot-swap autoconfiguration, etc., is it "really" expensive? Of course, if you are at a university where you have infinite amounts of cheap labor, sure, it's expensive: cheaper to manage by throwing grad/undergrad students at it than with an HBA. That is, the word "expensive" has different meanings in different contexts ... and in storage, the $500 USD adapter may easily help reduce costs elsewhere in the system (usually in disk lifecycle management, as RAID's major purpose in life is to give you, the administrator, a fighting chance to replace a failed device before you lose your data).

> Improved performance as writes have to be done only on a single disk,
> not in the entire RAID 5/6 array.

Good for tiny writes. Bad for larger writes (>64 kB).

> Make better usage of the raw space as there is no disk for parity on a
> RAID 5/6
>
> Cons:
>
> If a failed disk gets replaced the data needs to be replicated over the
> network (not a big deal if using Infiniband or 1 Gbps+ network)

For a 100 MB/s pipe (a streaming disk read, which you don't normally get when you copy random files to/from disk), 1 GB = 10 seconds, so 1 TB = 10,000 seconds. This is the best-case scenario. In reality you will get some fraction of that disk read/write speed, so expect 10,000 seconds as the most optimistic (and unrealistic) estimate ... a lower bound on the time.

> The biggest file size is the size of one disk if using a volume type
> Distributed.

For some users this is not a problem, though several years ago we had users wanting to read/write *single* TB-sized files.

> In this case does anyone know if, when replacing a failed disk, it needs
> to be manually formatted and mounted?

In this model, yes. This is why the RAID adapter saves time, unless you have written/purchased "expensive" tools to do similar things.

> RAID Controller:
>
> Using a RAID controller with battery backup can improve the performance,
> especially caching the writes on the controller's memory, but at the end
> one single array means the equivalent performance of one disk for each
> brick. Also RAID requires either 1 or 2 disks for parity. If using

For large reads/writes, you typically get roughly N x single-disk performance (where N is the number of disks reduced by the number of parity disks and hot spares).
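To put rough numbers on the two estimates above; the 100 MB/s streaming figure and the 12-disk RAID 6 are illustrative assumptions, not measurements:

    # Best-case streaming rebuild of 1 TB over a single ~100 MB/s path
    awk 'BEGIN { bytes = 1e12; rate = 100e6; t = bytes / rate;
                 printf "%.0f s (~%.1f h) best case\n", t, t / 3600 }'
    # -> 10000 s (~2.8 h), before any real-world slowdown

    # Large sequential I/O on a 12-disk RAID 6: roughly (12 - 2) data spindles
    awk 'BEGIN { printf "~%d MB/s aggregate\n", (12 - 2) * 100 }'
    # -> ~1000 MB/s aggregate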
For small reads/writes you get one-disk (or less) performance. Basically, optimal reads/writes will be in multiples of the stripe width. Optimizing stripe width and chunk sizes for various applications is something of a black art, in that over-optimization for one size/app will negatively impact another.

> very cheap disks probably better use RAID 6, if using better quality
> ones should be fine RAID 5 as, again, the data is replicated to
> another RAID 5 on another node.

If you have more than 6 TB of data, use RAID 6 or RAID 10. RAID 5 shouldn't be used for TB-class storage on units with unrecoverable-error rates worse than about 1 in 10^17 bits read (you would hit a UCE during the rebuild for a failed drive, which would take out all your data ... not nice).

> Pros:
>
> Can create a larger array as a single brick in order to fit bigger files
> when using the Distributed volume type.
>
> Disk rebuild should be quicker (and more automated?)

More generally, management is nearly automatic, modulo physically replacing a drive.

> Cons:
>
> Extra cost of the RAID controller.

It's a cost-benefit analysis, and for lower-end storage units the analysis almost always comes out in favor of a reasonable RAID design.

> Performance of the array is equivalent a single disk + RAID controller
> caching features.

No ... see above.

> RAID doesn't scale well beyond ~16 disks

16 disks is the absolute maximum we would ever tie to a single RAID (or HBA). Most RAID processor chips can't handle the calculations for 16 disks: compare the performance of RAID 6 at 16 drives to that at 12 drives for similar-sized chunks and "optimal" I/O ... in most cases the performance delta isn't 16/12, 14/10, 13/9 or similar. It's typically a bit lower.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
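As a footnote to the RAID 5 / unrecoverable-read-error argument above, a back-of-envelope sketch; the 1-in-10^14-bits error rate and the 5 x 2 TB surviving drives are assumed, commonly quoted consumer-drive figures rather than numbers from this thread:

    # Probability of hitting at least one unrecoverable read error while
    # re-reading all surviving disks of a degraded RAID 5 during rebuild
    awk 'BEGIN { ber = 1e-14; bits = 5 * 2e12 * 8;
                 p = 1 - exp(bits * log(1 - ber));
                 printf "P(UCE during rebuild) ~ %.0f%%\n", p * 100 }'
    # -> roughly 55%: the rebuild is more likely than not to hit a bad sector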