On Feb 22, 2006 10:15 +0100, Peter Kjellström wrote:
> What is the largest OST I can have? Does it depend on the dist/kernel I
> use lustre with? I use centos-4.2 (EL4u2) which can handle ext3
> filesystems >4 TiB (but <8 I think?). Lustre is 1.4.5.1 (I could install
> 1.4.6 beta if that makes any difference) and kernel is 2.6.9-22 (as
> downloaded from lustre).
>
> What I'm trying to decide is how much (or if at all) I should split the
> 4.5 TiB I have per server.

CFS currently supports only 2TiB per OST. One of the reasons is that this
is the maximum supported by 2.4 kernels. Another reason is that >2TiB ext3
filesystems (not necessarily Lustre-related) have been reported in the
past to have data corruption issues once more than 2TiB of data is written
to them, but there isn't enough evidence to pinpoint where the problem
lies (it could be ext3-internal, jbd, LVM, MD, the block layer, SCSI/ATA,
or a 32-/64-bit issue).

We will be working on ext3 scalability issues in the next few months, and
the >2TiB filesystem issues are at the top of the list.

Of course, if you are just at the early testing stage we would be happy to
get any results you care to share with such a setup. It would be important
not to get a false sense of security just by formatting the filesystem and
mounting it - you would need to actually write a large amount of data into
the filesystem and be able to verify the data integrity afterward.

There is also a known issue with the use of mballoc on filesystems >2TiB,
so that shouldn't be used for such testing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
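[The write-then-verify exercise Andreas describes could be sketched along
these lines - a minimal Python illustration, not an official CFS tool; the
file path, chunk size, and seed below are arbitrary placeholders:]

```python
# Hypothetical write-then-verify sketch (path and sizes are placeholders).
# Chunk i is derived deterministically from (seed, i), so nothing but the
# seed needs to be remembered between the write pass and the verify pass.
import hashlib
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per write, matching Lustre's 1MB RPC size

def fill_and_verify(path, total_bytes, seed=b"ost-test"):
    """Write deterministic pseudo-random data to `path`, flush it to
    disk, then re-read and compare every chunk. Returns True if all
    data reads back intact."""
    nchunks = total_bytes // CHUNK
    with open(path, "wb") as f:
        for i in range(nchunks):
            block = hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
            f.write(block * (CHUNK // len(block)))
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually hit the disk
    with open(path, "rb") as f:
        for i in range(nchunks):
            block = hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
            if f.read(CHUNK) != block * (CHUNK // len(block)):
                return False
    return True

# Smoke test on a small temporary file; a real run would target files on
# the mounted OST filesystem with total_bytes taking it well past 2 TiB.
with tempfile.NamedTemporaryFile(delete=False) as t:
    tmp_path = t.name
ok = fill_and_verify(tmp_path, 4 * CHUNK)
os.unlink(tmp_path)
```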
On Wednesday 22 February 2006 10:36, Andreas Dilger wrote:
> On Feb 22, 2006 10:15 +0100, Peter Kjellström wrote:
> > What is the largest OST I can have? Does it depend on the dist/kernel
> > I use lustre with? I use centos-4.2 (EL4u2) which can handle ext3
> > filesystems >4 TiB (but <8 I think?). Lustre is 1.4.5.1 (I could
> > install 1.4.6 beta if that makes any difference) and kernel is
> > 2.6.9-22 (as downloaded from lustre).
> >
> > What I'm trying to decide is how much (or if at all) I should split
> > the 4.5 TiB I have per server.
>
> CFS currently supports only 2TiB per OST. One of the reasons is that
> this is the maximum supported by 2.4 kernels. Another reason is that
> >2TiB ext3 filesystems (not necessarily Lustre-related) have been
> reported in the past to have data corruption issues when >2TiB of data
> is written to them, but there isn't enough evidence to figure out where
> the problem is (could be ext3-internal, jbd, LVM, MD, block layer,
> SCSI/ATA, 32-/64-bit).
>
> We will be working on ext3 scalability issues in the next few months and
> the >2TiB filesystem issues are at the top of the list.

Red Hat does already (with EL4u2) support and bless ext3 for up to 8 TiB
(see http://www.redhat.com/en_us/USA/rhel/details/limits/). LVM, block,
etc. (on centos-4.2/el4u2 x86_64) work fine with >2 TiB (I have >10
servers in production with 3.6 TiB devices).

> Of course, if you are just at the early testing stage we would be happy
> to get any results you care to share with such a setup. It would be
> important not to get a false sense of security just by formatting the
> filesystem and mounting it - you would need to actually write a large
> amount of data into the filesystem and be able to verify the data
> integrity afterward.

I am testing currently, and I will be writing, re-writing, and stressing
my entire area with verifications.

> There is also a known issue with the use of mballoc on a filesystem
> >2TiB so that shouldn't be used for such testing.

What is mballoc?

Summing it up, are you saying that if ext3 and all devices work well on a
given platform then lustre will too? No additional problems with ldiskfs?

/Peter

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
------------------------------------------------------------
  Peter Kjellström              | E-mail: cap@nsc.liu.se
  National Supercomputer Centre | Sweden | http://www.nsc.liu.se
On Feb 22, 2006 11:11 +0100, Peter Kjellström wrote:
> > CFS currently supports only 2TiB per OST. One of the reasons is that
> > this is the maximum supported by 2.4 kernels. Another reason is that
> > >2TiB ext3 filesystems (not necessarily Lustre-related) have been
> > reported in the past to have data corruption issues when >2TiB of data
> > is written to them, but there isn't enough evidence to figure out where
> > the problem is (could be ext3-internal, jbd, LVM, MD, block layer,
> > SCSI/ATA, 32-/64-bit).
>
> Red Hat does already (with EL4u2) support and bless ext3 for up to 8 TiB
> (see http://www.redhat.com/en_us/USA/rhel/details/limits/). LVM, block,
> etc. (on centos-4.2/el4u2 x86_64) work fine with >2 TiB (I have >10
> servers in production with 3.6 TiB devices).

I'm aware of Red Hat's 8TiB support, yet I still see occasional reports of
ext3 corruption on >2TiB devices. Some users report success only with
64-bit kernels and problems with 32-bit kernels. This may be the result of
a bug in some SCSI driver, but it is hard to know for sure.

I think that as part of our large-device ext3 testing we will write a
simple test program to do "fast" device verification (preferably without
writing the whole 8TiB with a test pattern, though that may be an option).

> > There is also a known issue with the use of mballoc on a filesystem
> > >2TiB so that shouldn't be used for such testing.
>
> What is mballoc?

mballoc (multi-block allocator) is a CFS extension to ext3 that allows
ext3 to allocate large chunks of the disk efficiently. Since the Lustre
clients do 1MB RPCs we also want to allocate 1MB extents on the disk
efficiently. This isn't enabled unless you configure the OSTs with the
"extents,mballoc" mount options.

> Summing it up, are you saying that if ext3 and all devices work well on
> a given platform then lustre will too? No additional problems with
> ldiskfs?

Yes, in theory it will work OK (assuming you leave mballoc out of the
picture), but we haven't tested this ourselves yet.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
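[For what it's worth, the kind of "fast" device verification Andreas
mentions could look roughly like the following - a hypothetical Python
sketch, not CFS's actual test program. Rather than filling the device, it
probes widely spaced offsets, plus the offsets straddling the 2 TiB
boundary where 32-bit sector/offset truncation bugs would show up:]

```python
# Hypothetical "fast" verification sketch: write a self-describing
# 16-byte marker at widely spaced offsets (plus offsets around the
# 2 TiB boundary), then read every marker back. A truncated or wrapped
# offset in any lower layer makes a marker land in the wrong place,
# which the read-back pass detects. Note this is destructive to
# whatever data is at the probed offsets.
import os
import tempfile

def sparse_verify(path, size_bytes, step=1 << 30):
    """Return the list of offsets whose marker did not read back
    correctly; an empty list means every probe survived."""
    two_tib = 2 << 40
    offsets = sorted(set(
        list(range(0, size_bytes, step)) +
        [o for o in (two_tib - 4096, two_tib, two_tib + 4096)
         if o < size_bytes]
    ))
    with open(path, "r+b") as f:
        for off in offsets:
            f.seek(off)
            f.write(off.to_bytes(8, "big") * 2)  # marker encodes its offset
        f.flush()
        os.fsync(f.fileno())
        bad = []
        for off in offsets:
            f.seek(off)
            if f.read(16) != off.to_bytes(8, "big") * 2:
                bad.append(off)
    return bad

# Smoke test on an 8 GiB sparse file; a real run would open the block
# device itself (e.g. the OST's LVM volume) at its full size.
with tempfile.NamedTemporaryFile(delete=False) as t:
    t.truncate(8 << 30)
    demo_path = t.name
missing = sparse_verify(demo_path, 8 << 30)
os.unlink(demo_path)
```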
On Wednesday 22 February 2006 17:43, Andreas Dilger wrote:
> On Feb 22, 2006 11:11 +0100, Peter Kjellström wrote:
> > > CFS currently supports only 2TiB per OST. One of the reasons is that
> > > this is the maximum supported by 2.4 kernels. Another reason is that
> > > >2TiB ext3 filesystems (not necessarily Lustre-related) have been
> > > reported in the past to have data corruption issues when >2TiB of
> > > data is written to them, but there isn't enough evidence to figure
> > > out where the problem is (could be ext3-internal, jbd, LVM, MD,
> > > block layer, SCSI/ATA, 32-/64-bit).
> >
> > Red Hat does already (with EL4u2) support and bless ext3 for up to
> > 8 TiB (see http://www.redhat.com/en_us/USA/rhel/details/limits/). LVM,
> > block, etc. (on centos-4.2/el4u2 x86_64) work fine with >2 TiB (I have
> > >10 servers in production with 3.6 TiB devices).
>
> I'm aware of Red Hat's 8TiB support, yet I still see occasional reports
> of ext3 corruption on >2TiB devices. Some users report success only with
> 64-bit kernels and problems with 32-bit kernels. This may be a result of
> a bug in some SCSI driver, but it is hard to know for sure.

Ah, yes, there are many potential problems. I don't have too big devices
underneath, only large LVM volumes on top... Also, everything but some
clients is 64-bit.

> I think that as part of our large-device ext3 testing we will write a
> simple test program to do "fast" device verification (preferably without
> writing the whole 8TiB with a test pattern, though that may be an
> option).

Sounds like a good idea.

> > > There is also a known issue with the use of mballoc on a filesystem
> > > >2TiB so that shouldn't be used for such testing.
> >
> > What is mballoc?
>
> mballoc (multi-block allocator) is a CFS extension to ext3 that allows
> ext3 to allocate large chunks of the disk efficiently. Since the Lustre
> clients do 1MB RPCs we also want to allocate 1MB extents on the disk
> efficiently. This isn't enabled unless you configure the OSTs with the
> "extents,mballoc" mount options.
>
> > Summing it up, are you saying that if ext3 and all devices work well
> > on a given platform then lustre will too? No additional problems with
> > ldiskfs?
>
> Yes, in theory it will work OK (assuming you leave mballoc out of the
> picture), but we haven't tested this ourselves yet.

I'll get some tests running with OST sizes up to 7TB (~6.4 TiB) and
report back.

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
------------------------------------------------------------
  Peter Kjellström              | E-mail: cap@nsc.liu.se
  National Supercomputer Centre | Sweden | http://www.nsc.liu.se
Hello folks,

What is the largest OST I can have? Does it depend on the dist/kernel I
use lustre with? I use centos-4.2 (EL4u2) which can handle ext3
filesystems >4 TiB (but <8 I think?). Lustre is 1.4.5.1 (I could install
1.4.6 beta if that makes any difference) and kernel is 2.6.9-22 (as
downloaded from lustre).

What I'm trying to decide is how much (or if at all) I should split the
4.5 TiB I have per server.

tia,
Peter

--
------------------------------------------------------------
  Peter Kjellström              |
  National Supercomputer Centre | Sweden | http://www.nsc.liu.se