On May 15, 2006 15:04 +0200, Anselm Strauss wrote:
> i noticed every time i create a file on a lustre filesystem one inode
> on the corresponding mds is used (as well as one inode on the ost
> itself). the minimum bytes per inode for ext3 is 1024 and the maximum
> block size is 4096, thus the maximum rate of inodes per block is 4.

That is only true if you use the "-i" option for mke2fs. If you specify
some absolute number of inodes using "-N {num inodes}" newer e2fsprogs
will reduce the group size to allow an increased number of inodes beyond
1 per 1024 bytes.

That said, there likely isn't a good reason to do this. Lustre uses large
inodes (at least 512 bytes by default), and there has to be space left
in the filesystem for other metadata like the journal (up to 400MB),
bitmaps, and directories. There is also a small number of regular files
that Lustre uses to maintain cluster consistency.

CFS recommends 4kB/inode on the MDS to be safe.

> needs to be at least a fourth of the size of my lustre fs (assuming a
> block size of 4K on my lustre fs and 1K bytes per inode on my mds).
> like this the mds has four times less data blocks as the lustre fs,
> but 4 inodes for each block and therefore one inode for each block on
> the lustre fs. for example, with a 10tb lustre fs i need at least
> 2.5tb metadata storage.

Firstly, you are confusing several items here:
- the MDS and OST filesystems are unrelated, so the formatting parameters
  for the two do not need to be the same
- since the MDS and OST filesystems are independent, the size of the MDS
  filesystem is purely a factor of how many inodes you want in the total
  Lustre filesystem, and not the size of the aggregate OST space
- you can have a much higher maximum number of bytes per inode in the
  filesystem, up to 128 MB per 8 inodes, which is useful for OSTs if
  you have a very large average file size

As a result, the only important factor when calculating the MDS size is
the average size of files to be stored in the filesystem.
If the average file size is, say, 5MB and you have 100TB of usable OST
space then you need at least (100 * 1024 * 1024 / 5) = 20M inodes, though
I would always recommend 2x the minimum, so 40M inodes. At the default
4kB/inode the 20M minimum works out to only 80GB of space for the MDS
(160GB with the 2x margin).

On the other end of the spectrum, if you had a very small average file
size (e.g. 4kB), it is true that Lustre isn't very efficient, since at
that point you consume as much space on the MDS as you do on the OSTs.
This is not a very common configuration for Lustre. With a 2TB MDS you
could potentially have 1kB/inode (with 512-byte inodes I wouldn't go any
lower) so 2B inodes, and this would need 2B * 4kB = 8TB of usable OST
space. Depending on your needs, you could instead just do this with a
single ext3 filesystem instead of Lustre.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
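The sizing arithmetic in the message above can be sketched in a few lines of Python. This is only an illustration: the function names, the `safety` factor argument, and the use of binary units are assumptions made here, not part of Lustre or e2fsprogs. For the 100TB / 5MB example it gives roughly 21M inodes as the minimum, which is 80GB at 4kB per inode, or 160GB once the inode count is doubled.

```python
"""Rough MDS sizing from average file size (illustrative only, not a Lustre tool)."""

KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def mds_size_estimate(ost_bytes, avg_file_bytes, bytes_per_inode=4 * KB, safety=2):
    """Return (min_inodes, recommended_inodes, mds_bytes).

    min_inodes is one inode per average-sized file that fits on the OSTs;
    safety is the multiplier applied on top of that minimum.
    """
    min_inodes = -(-ost_bytes // avg_file_bytes)  # ceiling division
    recommended = min_inodes * safety
    return min_inodes, recommended, recommended * bytes_per_inode


def ost_bytes_for_mds(mds_bytes, avg_file_bytes, bytes_per_inode):
    """OST space that a fully used MDS could describe (the small-file case)."""
    inodes = mds_bytes // bytes_per_inode
    return inodes, inodes * avg_file_bytes


if __name__ == "__main__":
    lo, rec, size = mds_size_estimate(100 * TB, 5 * MB)
    print(f"min inodes: {lo:,}  with 2x margin: {rec:,}  MDS: {size // GB} GB")

    inodes, ost = ost_bytes_for_mds(2 * TB, 4 * KB, 1 * KB)
    print(f"2TB MDS at 1kB/inode: {inodes:,} inodes, {ost // TB} TB of OST space")
```

The second helper reproduces the small-file end of the spectrum: a 2TB MDS at 1kB per inode holds about 2 billion inodes, which at 4kB per file corresponds to 8TB of OST space.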
Hi Andreas,

> CFS recommends 4kB/inode on the MDS to be safe.

Formatting with 4096-byte inodes can take a long time. For instance, here
are the results I got with a 1.1TB filesystem:

Inode size   Time to format
128B         6m29s
256B         10m33s
512B         22m15s
1kB          38m31s
2kB          73m46s
4kB          141m44s

Do you know if it's possible to speed up formatting when using large
inodes?

Cheers,
Johann
On May 16, 2006 10:02 +0200, Johann Lombardi wrote:
> > CFS recommends 4kB/inode on the MDS to be safe.
>
> Formatting with 4096-byte inodes can take a long time.

Just to be clear, we don't recommend 4096-byte inodes. Rather, we
recommend 512-byte inodes (or as needed by striping, -I 512), and 1 inode
for each 4096 bytes of filesystem space (-i 4096).

> For instance, here are the results I got with a 1.1TB filesystem:
>
> Inode size   Time to format
> 128B         6m29s
> 256B         10m33s
> 512B         22m15s
> 1kB          38m31s
> 2kB          73m46s
> 4kB          141m44s
>
> Do you know if it's possible to speed up formatting when using large
> inodes?

First rule is that you shouldn't have more inodes than you really need,
as this slows down e2fsck also.

That said, there are a couple of approaches to this that we are
investigating. Firstly, there are some patches to ext3 (posted to the
ext2-devel list by Ted Ts'o recently) to allow the filesystem to be "fast
formatted" for testing purposes, and these will hopefully be expanded to
add kernel support for "delayed formatting" of these groups so that they
can be initialized lazily as the filesystem needs them.

The second one is to format a very small filesystem and then use the
ext2online tool to resize the filesystem after the initial mount.

Neither of these is available at this time.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
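To make the difference between inode size (-I) and bytes per inode (-i) concrete, here is a small Python sketch of how much of a device ends up as inode tables. It is a simplification under stated assumptions: the helper name is made up for this example, group layout and rounding are ignored, and the 8192 bytes-per-inode value used for the loop is hypothetical, since the thread does not say which ratio was used for the timings. Presumably the format time grows with inode size because the inode tables have to be written out at format time, but that link is an assumption here too.

```python
"""Inode table footprint for a given -I / -i combination (illustrative only)."""

GB = 1024 ** 3


def inode_table_bytes(fs_bytes, inode_size=512, bytes_per_inode=4096):
    """Bytes taken by inode tables with one inode per bytes_per_inode of space."""
    if inode_size > bytes_per_inode:
        raise ValueError("inode size cannot exceed bytes per inode")
    return (fs_bytes // bytes_per_inode) * inode_size


if __name__ == "__main__":
    fs = 1100 * GB  # roughly the 1.1TB filesystem from the timing table
    ratio = 8192    # hypothetical -i value; the thread does not state it

    # The table size doubles each time the inode size doubles.
    for size in (128, 256, 512, 1024, 2048, 4096):
        table = inode_table_bytes(fs, size, ratio)
        print(f"-I {size:<5} inode tables: {table / GB:7.1f} GB")

    # The combination recommended in the reply: -I 512 -i 4096.
    rec = inode_table_bytes(fs)
    print(f"-I 512 -i 4096: {rec / GB:.1f} GB ({rec / fs:.1%} of the device)")
```

With -I 512 and -i 4096 the inode tables take 512/4096, or one eighth, of the device, whatever its size.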
thanks for the detailed answers.

On May 15, 2006, at 6:51 PM, Andreas Dilger wrote:
> On May 15, 2006 15:04 +0200, Anselm Strauss wrote:
>> i noticed every time i create a file on a lustre filesystem one inode
>> on the corresponding mds is used (as well as one inode on the ost
>> itself). the minimum bytes per inode for ext3 is 1024 and the maximum
>> block size is 4096, thus the maximum rate of inodes per block is 4.
>
> That is only true if you use the "-i" option for mke2fs. If you specify
> some absolute number of inodes using "-N {num inodes}" newer e2fsprogs
> will reduce the group size to allow an increased number of inodes beyond
> 1 per 1024 bytes.
>
> That said, there likely isn't a good reason to do this. Lustre uses large
> inodes (at least 512 bytes by default), and there has to be space left
> in the filesystem for other metadata like the journal (up to 400MB),
> bitmaps, and directories. There is also a small number of regular files
> that Lustre uses to maintain cluster consistency.
>
> CFS recommends 4kB/inode on the MDS to be safe.
>
>> needs to be at least a fourth of the size of my lustre fs (assuming a
>> block size of 4K on my lustre fs and 1K bytes per inode on my mds).
>> like this the mds has four times less data blocks as the lustre fs,
>> but 4 inodes for each block and therefore one inode for each block on
>> the lustre fs. for example, with a 10tb lustre fs i need at least
>> 2.5tb metadata storage.
>
> Firstly, you are confusing several items here:
> - the MDS and OST filesystems are unrelated, so the formatting parameters
>   for the two do not need to be the same
> - since the MDS and OST filesystems are independent, the size of the MDS
>   filesystem is purely a factor of how many inodes you want in the total
>   Lustre filesystem, and not the size of the aggregate OST space
> - you can have a much higher maximum number of bytes per inode in the
>   filesystem, up to 128 MB per 8 inodes, which is useful for OSTs if
>   you have a very large average file size
>
> As a result, the only important factor when calculating the MDS size is
> the average size of files to be stored in the filesystem. If the average
> file size is, say, 5MB and you have 100TB of usable OST space then you
> need at least (100 * 1024 * 1024 / 5) = 20M inodes, though I would always
> recommend 2x the minimum, so 40M inodes. At the default 4kB/inode the
> 20M minimum works out to only 80GB of space for the MDS (160GB with the
> 2x margin).

this is actually the point i was concerned about. for example, i
currently have an xfs filesystem of 4TB with 2600M inodes, and i used
standard parameters to format it. with your calculation i would need a
very big mds. but i also have to say that none of my filesystems has an
inode usage bigger than 1%, so your calculation seems sufficient.

still, do you think it's a good idea to format the mds space with -N,
for example use half the space for inodes and the other half for
metadata? 40GB for metadata should be sufficient, i think. or would a
lot of inodes on the mds also have some performance impact?

> On the other end of the spectrum, if you had a very small average file
> size (e.g. 4kB), it is true that Lustre isn't very efficient, since at
> that point you consume as much space on the MDS as you do on the OSTs.
> This is not a very common configuration for Lustre. With a 2TB MDS you
> could potentially have 1kB/inode (with 512-byte inodes I wouldn't go any
> lower) so 2B inodes, and this would need 2B * 4kB = 8TB of usable OST
> space. Depending on your needs, you could instead just do this with a
> single ext3 filesystem instead of Lustre.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>

anselm
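The "half the space for inodes" idea in the reply above can be turned into a candidate -N value with a little arithmetic. The following Python sketch is only a rough estimate: the helper name is hypothetical, 512-byte inodes are assumed, and the journal, bitmaps and other fixed overhead are ignored. For an 80GB MDS it comes out at about 84M inodes, which is the same as one inode per 1kB of device space, the lower bound mentioned in the quoted message for 512-byte inodes.

```python
"""Candidate -N value when a fixed share of the MDS holds inode tables (sketch)."""

GB = 1024 ** 3


def inodes_for_table_fraction(fs_bytes, fraction, inode_size=512):
    """Inode count if `fraction` of the device is given to inode tables."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(fs_bytes * fraction) // inode_size


if __name__ == "__main__":
    mds = 80 * GB
    n = inodes_for_table_fraction(mds, 0.5)
    print(f"-N {n}  (one inode per {mds // n} bytes of device space)")
```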
hi.

i noticed every time i create a file on a lustre filesystem one inode on
the corresponding mds is used (as well as one inode on the ost itself).
the minimum bytes per inode for ext3 is 1024 and the maximum block size
is 4096, thus the maximum rate of inodes per block is 4. so, to not limit
the number of files a lustre fs can have, the mds needs to be at least a
fourth of the size of my lustre fs (assuming a block size of 4K on my
lustre fs and 1K bytes per inode on my mds). like this the mds has four
times fewer data blocks than the lustre fs, but 4 inodes for each block
and therefore one inode for each block on the lustre fs. for example,
with a 10tb lustre fs i need at least 2.5tb metadata storage.

isn't this a bit big just for metadata? what else is stored in the
metadata? is there a possibility to increase the number of inodes on a
mds avoiding the ext3 limits?

anselm strauss