Serkan Çoban
2020-Feb-13 17:38 UTC
[Gluster-users] multi petabyte gluster dispersed for archival?
Do not use EC with small files. You cannot tolerate losing a 300TB brick; reconstruction will take ages. When I was using glusterfs, the reconstruction speed of EC was 10-15MB/sec. If you do not lose bricks you will be OK.

On Thu, Feb 13, 2020 at 7:38 PM Douglas Duckworth <dod2014 at med.cornell.edu> wrote:
> Hello
>
> I am thinking of building a Gluster file system for archival data. Initially it will start as a 6-brick dispersed volume, then expand to distributed dispersed as we increase capacity.
>
> Since metadata in Gluster isn't centralized it will eventually not perform well at scale. So I am wondering if anyone can help identify that point? Ceph can scale to extremely high levels, though the complexity required for management seems much greater than Gluster.
>
> The first six bricks would be a little over 2PB of raw space. Each server will have 24 7200 RPM NL-SAS drives sans RAID. I estimate we would max out at about 100 million files within these first six servers, though that can be reduced by having users tar their small files before writing to Gluster. I/O patterns would be sequential upon initial copy with very infrequent reads thereafter. Given the demands of erasure coding, especially if we lose a brick, the CPUs will be high thread count AMD Rome. The back-end network would be EDR Infiniband, so I will mount via RDMA, while all bricks will be leaf local.
>
> Given these variables can anyone say whether Gluster would be able to operate at this level of metadata and continue to scale? If so, where could it break (4PB, 12PB), with that being defined as I/O, with all bricks still online, breaking down dramatically?
>
> Thank you!
> Doug
>
> --
> Thanks,
>
> Douglas Duckworth, MSc, LFCS
> HPC System Administrator
> Scientific Computing Unit
> Weill Cornell Medicine
> E: doug at med.cornell.edu
> O: 212-746-6305
> F: 212-746-8690
>
> ________
>
> Community Meeting Calendar:
>
> APAC Schedule -
> Every 2nd and 4th Tuesday at 11:30 AM IST
> Bridge: https://bluejeans.com/441850968
>
> NA/EMEA Schedule -
> Every 1st and 3rd Tuesday at 01:00 PM EDT
> Bridge: https://bluejeans.com/441850968
>
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
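
To put the 10-15MB/sec reconstruction figure in perspective, here is a back-of-the-envelope sketch. It assumes the whole 300TB brick has to be rebuilt at a constant rate and uses decimal units (1 TB = 1,000,000 MB); real heal behaviour will vary, and the function name `rebuild_days` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def rebuild_days(brick_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to rebuild a brick of brick_tb terabytes at a constant rate.

    Assumption: decimal units, 1 TB = 1,000,000 MB, and no pauses or slowdowns.
    """
    total_mb = brick_tb * 1_000_000
    seconds = total_mb / rate_mb_per_s
    return seconds / SECONDS_PER_DAY


# The two ends of the reported 10-15 MB/s range, for a 300 TB brick.
for rate in (10, 15):
    print(f"{rate} MB/s -> {rebuild_days(300, rate):.0f} days")
# prints:
# 10 MB/s -> 347 days
# 15 MB/s -> 231 days
```

Under these assumptions a full rebuild of one brick would take most of a year, which is what "will take ages" amounts to in practice.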
Douglas Duckworth
2020-Feb-13 17:44 UTC
[Gluster-users] multi petabyte gluster dispersed for archival?
Replication would be better, yes, but HA isn't a hard requirement, and the most likely cause of losing a brick would be a power failure. In that case we could stop the entire file system and then bring the brick back up if users complain about poor I/O performance.

Could you share more about your configuration at that time? What CPUs were you running on the bricks, how many spindles per brick, etc.?

--
Thanks,

Douglas Duckworth, MSc, LFCS
HPC System Administrator
Scientific Computing Unit <https://scu.med.cornell.edu/>
Weill Cornell Medicine
E: doug at med.cornell.edu
O: 212-746-6305
F: 212-746-8690
Serkan Çoban
2020-Feb-13 18:03 UTC
[Gluster-users] multi petabyte gluster dispersed for archival?
I was using an EC 16+4 configuration with 40 servers, each with 68x10TB JBOD disks. Glusterfs was mounted on 1000 Hadoop datanodes; we were using glusterfs as a Hadoop archive. When a disk failed we did not lose write speed, but read speed slowed down too much. I used glusterfs at the edge and it served its purpose. Beware that metadata operations will be much slower as the EC size increases. HDFS now also has EC, so we replaced glusterfs with HDFS.

Our servers had 2x12-core CPUs and 256GB RAM each, with 2x10G bonded interfaces. CPU is used heavily during reconstruction.
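
For a sense of what a 16+4 layout means in capacity terms, a small sketch follows. It assumes every one of the 40 x 68 disks of 10TB each was a brick in the dispersed volume, which the message does not state, and it ignores filesystem overhead; `usable_tb` is a hypothetical helper name.

```python
def usable_tb(raw_tb: float, data: int, redundancy: int) -> float:
    """Usable capacity of an erasure-coded volume with data+redundancy fragments.

    Each stripe stores `data` fragments of payload plus `redundancy` fragments
    of parity, so the usable share of raw space is data / (data + redundancy).
    """
    return raw_tb * data / (data + redundancy)


# Assumption: all disks on all servers belong to the EC volume.
raw = 40 * 68 * 10  # servers * disks per server * TB per disk = 27200 TB raw

print(f"raw: {raw} TB, usable at 16+4: {usable_tb(raw, 16, 4):.0f} TB")
# prints: raw: 27200 TB, usable at 16+4: 21760 TB
```

With 16+4, 80% of raw space is usable and each stripe survives the loss of up to 4 fragments; the trade-off mentioned above is that wider stripes make metadata operations slower.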