On 10/26/2016 02:12 PM, Gandalf Corvotempesta wrote:
> 2016-10-26 23:07 GMT+02:00 Joe Julian <joe at julianfamily.org>:
>> And yes, they can fail, but 20TB is small enough to heal pretty quickly.
>
> 20TB small enough to build quickly? On which network? Gluster doesn't
> have a dedicated cluster network; if the cluster is being heavily
> accessed, the healing will slow down everything else (or everything
> else will slow down the healing).

Quickly = MTTR is within tolerances to continue to meet SLA. It's just math.

As for a dedicated heal network, split-horizon DNS handles that just fine.
Clients resolve a server's hostname to the "eth1" (for example) address and
the servers themselves resolve the same hostname to the "eth0" address. We
played with bonding but decided against the complexity.

> Anyway, you can heal quickly, but I still prefer to have the data safe on
> each node. If you start with 3 servers at once, each disk probably comes
> from the same batch, so a massive disk failure is easy to get.

There's preference and there's engineering to meet requirements. If your SLA
is 5 nines and you engineer 6 nines, you may realize that the difference
between a 99.99993% uptime and a 99.99997% uptime isn't worth the added
expense of doing replication /and/ raid-1.

> If you loose only 2 disks, one on each server, from the same replica
> group, you are game over. With RAID6, you have to loose 5 disks from
> the same replica group.

I never loose my drives. They're always firmly attached. :P

With 300 drives, 60 bricks, replica 3 (across 3 racks), I have six nines
availability for any one replica subvolume. If you really want to fudge the
numbers, the reliability of any given file is not worth calculating in that
volume. The odds of all three bricks failing for any one file among 20
distribute subvolumes are statistically infinitesimal.

> In my environment, I can create 4 RAID-0 arrays on each server (3 disks in
> each RAID-0), or 2 RAID-6 arrays with 6 disks each, or 1 RAID-6 with 12
> disks, or 1 RAID-7 with 12 disks (RAID-7 with fewer than 12 disks is
> nonsense). I don't know which one is better.

Just do the reliability calculations and engineer a storage system to meet
(exceed) your obligations within the available budget.
http://www.eventhelix.com/realtimemantra/faulthandling/system_reliability_availability.htm
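
(A minimal sketch of that math, for anyone following along. The two-nines
per-brick figure and the independent-failure assumption are mine, not
measured numbers from Joe's cluster:)

    # Minimal sketch of the replica availability math.
    # Assumptions (mine): brick failures are independent and every brick
    # has the same availability; 0.99 (two nines) is a made-up figure.

    def replica_subvolume_availability(brick_availability: float,
                                       replica: int) -> float:
        """A replica set is unavailable only if ALL of its bricks are down."""
        return 1.0 - (1.0 - brick_availability) ** replica

    brick = 0.99     # hypothetical per-brick availability
    replica = 3      # Joe's layout: 60 bricks, replica 3, 20 distribute subvols
    subvol = replica_subvolume_availability(brick, replica)

    print(f"availability of any one replica-3 subvolume: {subvol:.6f}")
    # -> 0.999999, i.e. six nines; a given file is unreachable only when
    #    all three bricks of the one subvolume holding it are down at once.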
Maybe a controversial question (and hopefully not trolling), but is there any
particular reason you chose Gluster over Ceph for these larger setups, Joe?
For myself, Gluster is much easier to manage and provides better performance
on my small non-enterprise setup, plus it plays nice with ZFS. But I thought
Ceph had the edge on large, many-node, many-disk setups. It would seem it
handles adding/removing disks better than the juggling you have to do with
Gluster to keep replication triads even. Too complex/fragile maybe?
Genuinely curious.

--
Lindsay Mathieson
2016-10-26 23:38 GMT+02:00 Joe Julian <joe at julianfamily.org>:
> Quickly = MTTR is within tolerances to continue to meet SLA. It's just math.

Obviously yes. But in the real world you can have the best SLAs in the
world, and still, if you lose data, you lose customers.

> As for a dedicated heal network, split-horizon dns handles that just fine.
> Clients resolve a server's hostname to the "eth1" (for example) address and
> the servers themselves resolve the same hostname to the "eth0" address. We
> played with bonding but decided against the complexity.

Good idea, thanks. In this way the cluster network is separated from the
client network, like with Ceph. Just a question: you need two DNS
infrastructures for this, right? ns1 and ns2 used by clients, pointing to
eth0, and ns3 and ns4 used by Gluster, pointing to eth1. In a small
environment the hosts file could be used, but I prefer the DNS way.

> There's preference and there's engineering to meet requirements. If your SLA
> is 5 nines and you engineer 6 nines, you may realize that the difference
> between a 99.99993% uptime and a 99.99997% uptime isn't worth the added
> expense of doing replication and raid-1.

How do you calculate the number of nines in this environment? For example,
to have 6 nines (for availability and data consistency), which configuration
should I adopt? I could have 6 nines for the whole cluster but only 2 nines
for the data. In the first case the whole cluster can't go totally down
(tons of nodes, for example); in the second, some data could be lost
(replica 1 or 2).

> With 300 drives, 60 bricks, replica 3 (across 3 racks), I have a six nines
> availability for any one replica subvolume. If you really want to fudge the
> numbers, the reliability for any given file is not worth calculating in that
> volume. The odds of all three bricks failing for any 1 file among 20
> distribute subvolumes is statistically infinitesimal.

How many servers? 300 drives bought within a very short time are likely to
fail quickly, with multiple failures at a time. I had 2 drive failures in
less than 1 hour a few months ago. Fortunately, I was using RAID-6. Both
drives were from the same manufacturer and had sequential serial numbers.
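
(As an illustration of the hosts-file variant: something like the following
could work; all names and addresses are made up. One copy is pushed to the
clients and the other to the Gluster servers, so the same hostname resolves
to a different network on each side.)

    # /etc/hosts pushed to clients -- names resolve to the client-facing network
    192.0.2.11     gluster1.example.com gluster1
    192.0.2.12     gluster2.example.com gluster2
    192.0.2.13     gluster3.example.com gluster3

    # /etc/hosts pushed to the Gluster servers -- same names, heal/backend network
    198.51.100.11  gluster1.example.com gluster1
    198.51.100.12  gluster2.example.com gluster2
    198.51.100.13  gluster3.example.com gluster3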
2016-10-26 23:38 GMT+02:00 Joe Julian <joe at julianfamily.org>:
> Just do the reliability calculations and engineer a storage system to meet
> (exceed) your obligations within the available budget.
> http://www.eventhelix.com/realtimemantra/faulthandling/system_reliability_availability.htm

This is good for evaluating the reliability of 3 nodes in parallel (like
replica 3). With three nodes at 2 nines each, I'll get 6 nines. But how can
I calculate the number of nines for each single server?
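
(For what it's worth, a rough sketch of how the per-server number is usually
built up, along the lines of the series/parallel math in the article above.
Every MTBF/MTTR figure below is invented purely for illustration:)

    # Sketch of the series/parallel availability math (figures are made up).
    # A component's availability: MTBF / (MTBF + MTTR).
    # Components that are all required (one server) combine in series: multiply.
    # Redundant copies (replica 3) combine in parallel: 1 - product of downtimes.

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def series(*parts: float) -> float:
        a = 1.0
        for p in parts:
            a *= p
        return a

    def parallel(*parts: float) -> float:
        u = 1.0
        for p in parts:
            u *= 1.0 - p
        return 1.0 - u

    # Hypothetical per-server components: RAID set, controller, NIC, PSU.
    server = series(
        availability(200_000, 24),   # RAID-6 array (already internally redundant)
        availability(500_000, 8),    # RAID controller
        availability(1_000_000, 4),  # NIC
        availability(300_000, 4),    # power supply
    )
    print(f"single server: {server:.6f}")
    print(f"replica 3 of such servers: {parallel(server, server, server):.9f}")

The series product is the single-server number; plugging it into the
parallel formula then gives the cluster-level nines.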