Your comment actually helps me more than you think. One of the main doubts I have is whether to go for JBOD with replica 3, or SW RAID 6 with replica 2 + arbiter. Before reading your email I was leaning more towards JBOD, as reconstruction of a moderately big RAID 6 with mdadm can be painful too. Now I see a reconstruction is going to be painful either way...

For the record, the workload I am going to migrate is currently 18,314,445 MB and 34,752,784 inodes (which is not exactly the same as files, but let's use that for a rough estimate), for an average file size of about 539 KB per file.

Thanks a lot for your time and insights!

On 6/6/19 8:53, Hu Bert wrote:
> Good morning,
>
> my comment won't help you directly, but i thought i'd send it anyway...
>
> Our first glusterfs setup had 3 servers with 4 disks = 4 bricks (10 TB,
> JBOD) each. It was running fine in the beginning, but then 1 disk failed.
> The following heal took ~1 month, with bad performance (quite high IO).
> Shortly after the heal had finished, another disk failed -> same
> problems again. Not funny.
>
> For our new system we decided to use 3 servers with 10 disks (10 TB)
> each, but now with the 10 disks in SW RAID 10 (well, we split the 10
> disks into 2 SW RAID 10 arrays, each of them is a brick, so we have 2
> gluster volumes). A lot of disk space is "wasted" with this type of SW
> RAID and a replica 3 setup, but we wanted to avoid the "healing takes a
> long time with bad performance" problem. Now mdadm takes care of
> replicating data, and glusterfs should always see "good" bricks.
>
> And the decision may depend on what kind of data you have. Many small
> files, like tens of millions? Or not that many, but bigger files? I
> once watched a video (i think it was this one:
> https://www.youtube.com/watch?v=61HDVwttNYI). The recommendation there:
> RAID 6 or 10 for small files, for big files... well, the video is
> already 2 years "old" ;-)
>
> As i said, this won't help you directly. You have to identify what's
> most important for your scenario; as you said, high performance is not
> an issue - if that is still true even when you have slight performance
> issues after a disk failure, then OK. My experience so far: the bigger
> and slower the disks are and the more data you have -> healing will
> hurt -> try to avoid it. If the disks are small and fast (SSDs),
> healing will be faster -> JBOD is an option.
>
> hth,
> Hubert
>
> On Wed, 5 June 2019 at 11:33, Eduardo Mayoral <emayoral at arsys.es> wrote:
>> Hi,
>>
>> I am looking into a new gluster deployment to replace an ancient one.
>>
>> For this deployment I will be using some repurposed servers I
>> already have in stock. The disk specs are 12 * 3 TB SATA disks, no HW
>> RAID controller. They also have some SSDs which would be nice to
>> leverage as a cache or similar to improve performance, since they are
>> already there. Advice on how to leverage the SSDs would be greatly
>> appreciated.
>>
>> One of the design choices I have to make is using 3 nodes for a
>> replica-3 with JBOD, or using 2 nodes with a replica-2 and SW RAID 6
>> for the disks, maybe adding a 3rd node with a smaller amount of disk
>> as a metadata node for the replica set. I would love to hear advice
>> on the pros and cons of each setup from the gluster experts.
>>
>> The data will be accessed from 4 to 6 systems with native gluster,
>> not sure if that makes any difference.
>>
>> The amount of data I have to store there is currently 20 TB, with
>> moderate growth. IOPS are quite low, so high performance is not an
>> issue. The data will fit in either of the two setups.
>>
>> Thanks in advance for your advice!
>>
>> --
>> Eduardo Mayoral Jimeno
>> Systems engineer, platform department. Arsys Internet.
>> emayoral at arsys.es - +34 941 620 105 - ext 2153

--
Eduardo Mayoral Jimeno
Systems engineer, platform department. Arsys Internet.
emayoral at arsys.es - +34 941 620 105 - ext 2153
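As a rough cross-check of the figures in this thread, here is a back-of-the-envelope sketch in plain shell/awk. It assumes the MB/KB values above are binary (MiB/KiB), takes the 12 x 3 TB per-node disk count from the original mail, and ignores filesystem and RAID metadata overhead:

    # Sanity check of the quoted numbers (MB/KB treated as MiB/KiB).
    awk 'BEGIN {
        data_mib = 18314445            # current data set, MiB
        files    = 34752784            # inode count used as a file-count estimate

        printf "average file size : %.1f KiB\n", data_mib * 1024 / files
        printf "total data        : %.1f TiB\n", data_mib / 1024 / 1024

        # Rough usable capacity per layout, 12 x 3 TB SATA disks per node:
        printf "replica 3, JBOD   : %d TB usable (3 nodes, 1 full copy per node)\n", 12 * 3
        printf "replica 2 + RAID6 : %d TB usable (2 data nodes, 2 parity disks each)\n", (12 - 2) * 3
    }'

Both layouts comfortably hold the ~20 TB mentioned in the original mail; the difference is mostly in how failures are healed, not in capacity.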
Michael Metz-Martini
2019-Jun-06 18:46 UTC
[Gluster-users] Advice for setup: SW RAID 6 vs JBOD
Hi,

On 06.06.19 at 18:48, Eduardo Mayoral wrote:
> Your comment actually helps me more than you think. One of the main
> doubts I have is whether to go for JBOD with replica 3, or SW RAID 6
> with replica 2 + arbiter. Before reading your email I was leaning more
> towards JBOD, as reconstruction of a moderately big RAID 6 with mdadm
> can be painful too. Now I see a reconstruction is going to be painful
> either way...
>
> For the record, the workload I am going to migrate is currently
> 18,314,445 MB and 34,752,784 inodes (which is not exactly the same as
> files, but let's use that for a rough estimate), for an average file
> size of about 539 KB per file.
>
> Thanks a lot for your time and insights!

Currently we're hosting ~200 TB, split into about 3,500,000,000 files, on a distributed-replicate-2 gluster volume, with each brick running on a HW RAID 6 of 8 x 8 TB disks. As we have never had a failed drive till now, I can't tell you anything about recovery times, but rebalance is damn slow with such a high number of small files (and recovery on JBOD bricks should be similarly slow). I think RAID recovery from local disks will be much faster.

As our files are nearly 100% read-only and split-brain issues could be resolved more or less "easily", we decided against replica 3 in favor of hardware RAID 6 redundancy.

--
Kind regards
Michael Metz-Martini
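To make the replica choice being debated here concrete, below is a minimal sketch of how the two candidate layouts could be created at the gluster CLI level. The volume name, hostnames and brick paths are placeholders, peer probing and brick filesystem setup are omitted, and the `replica 3 arbiter 1` form (available in reasonably recent GlusterFS releases) is the "replica 2 + arbiter" option: the third brick of each set stores only file names and metadata, giving quorum against split-brain without a third full copy of the data.

    # Option A: replica 3 on JBOD bricks, one brick per disk, 3 full copies.
    # Bricks are grouped into replica sets in the order listed.
    gluster volume create gv0 replica 3 \
        node1:/bricks/disk01/brick node2:/bricks/disk01/brick node3:/bricks/disk01/brick \
        node1:/bricks/disk02/brick node2:/bricks/disk02/brick node3:/bricks/disk02/brick
        # ...and so on for the remaining disks

    # Option B: replica 2 + arbiter, one big SW RAID 6 brick per data node,
    # plus a small metadata-only brick on a third machine.
    gluster volume create gv0 replica 3 arbiter 1 \
        node1:/bricks/raid6/brick node2:/bricks/raid6/brick node3:/bricks/arbiter/brick

    gluster volume start gv0

With option A a failed disk means a full gluster heal of that brick; with option B a failed disk is rebuilt by mdadm underneath gluster, which is the trade-off discussed throughout this thread.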
If i remember correctly, in the video they suggested not to make a RAID 10 too big (i.e. too many (big) disks), because the RAID resync could then take a long time. They didn't mention a limit. On my 3 servers with 2 RAID 10 arrays each (1 x 4 disks, 1 x 6 disks), no disk has failed so far, but there were automatic periodic redundancy checks (mdadm checkarray) which ran for a couple of days, increasing the load on the servers and affecting the responsiveness of glusterfs on the clients. Still, almost no one even noticed that the mdadm checks were running :-)

But if i compare it with our old JBOD setup: after the disk change, the heal took about a month, resulting in really poor performance on the client side. As we didn't want to experience that period again -> throw hardware at the problem. Maybe a different setup (10 disks -> 5 RAID 1 arrays, building a distributed-replicated volume) would've been even better, but so far we're happy with the current setup.

On Thu, 6 June 2019 at 18:48, Eduardo Mayoral <emayoral at arsys.es> wrote:
> Your comment actually helps me more than you think. One of the main
> doubts I have is whether to go for JBOD with replica 3, or SW RAID 6
> with replica 2 + arbiter. Before reading your email I was leaning more
> towards JBOD, as reconstruction of a moderately big RAID 6 with mdadm
> can be painful too. Now I see a reconstruction is going to be painful
> either way...
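If the periodic mdadm checks (or a real rebuild) do start to hurt client responsiveness, the md resync bandwidth can be capped system-wide. A small sketch; the limits and the md device name are just examples and would need tuning for the actual disks:

    # Watch a running check/resync on the local arrays.
    cat /proc/mdstat

    # Throttle resync/check bandwidth (KiB/s per device) so that client
    # I/O keeps priority over the background check; example values only.
    echo 5000  > /proc/sys/dev/raid/speed_limit_min
    echo 50000 > /proc/sys/dev/raid/speed_limit_max

    # On Debian/Ubuntu the mdadm package runs checkarray periodically from
    # cron; a running check can also be cancelled by hand, e.g.:
    /usr/share/mdadm/checkarray --cancel /dev/md0

This only changes how aggressively md uses the disks for the check or rebuild; the trade-off is a longer check window in exchange for steadier gluster performance on the clients.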