When creating the MDS filesystem, I used '-i 1024' on an 860GB logical
drive to provide approximately 800M inodes in the Lustre filesystem.
This was then verified with 'df -i' on the server:

  /dev/sda        860160000  130452  860029548   1%  /data/mds

Later, after completing the OST creation and mounting the full
filesystem on a client, I noticed that 'df -i' on the client mount is
only showing 108M inodes in the Lustre filesystem:

  10.18.12.1@tcp:10.18.12.2@tcp:/gulfwork
                  107454606  130452  107324154   1%  /gulfwork

A check with 'lfs df -i' shows the MDT only has 108M inodes:

  gulfwork-MDT0000_UUID
                  107454606  130452  107324154   0%  /gulfwork[MDT:0]

Is there a preallocation mechanism in play here, or did I miss something
critical in the initial setup? My concern is that the inode count cannot
be changed after the filesystem is created, so it must be correct before
the filesystem goes into production.

FYI, the filesystem was created with:

MDS/MGS on an 880G logical drive:
  mkfs.lustre --fsname gulfwork --mdt --mgs --mkfsoptions='-i 1024' \
    --failnode=10.18.12.1 /dev/sda

OSSs on 9.1TB logical drives:
  /usr/sbin/mkfs.lustre --fsname gulfwork --ost --mgsnode=10.18.12.2@tcp \
    --mgsnode=10.18.12.1@tcp /dev/cciss/c0d0

Thanks.

-- 
Gary Molenkamp                     SHARCNET
Systems Administrator              University of Western Ontario
gary@sharcnet.ca                   http://www.sharcnet.ca
(519) 661-2111 x88429              (519) 661-4000
Not sure if it has been fixed, but there was a bug in Lustre that caused
the wrong values to be returned here. If you create a bunch of files, the
number of inodes reported should go up until it reaches what you expect.

Note that the number of inodes on the OSTs also limits the number of
creatable files: each file requires an inode on at least one OST (the
exact number depends on how many OSTs the file is striped across).

Kevin
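As a minimal sketch of how to check this limit, assuming a client with
the filesystem mounted at /gulfwork (the mount point and the example
numbers below are illustrative assumptions, not output from this thread):

  # Per-target inode counts; each OST line shows its free object count.
  lfs df -i /gulfwork

  # Rough upper bound on creatable files from the OST side:
  #   (sum of free OST objects) / (default stripe count)
  # e.g. with roughly 30M free objects on each of 12 OSTs and stripe count 1:
  echo $(( 30000000 * 12 / 1 ))    # about 360M files at most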
On 2010-04-23, at 13:30, Kevin Van Maren wrote:
> Not sure if it has been fixed, but there was a bug in Lustre that caused
> the wrong values to be returned here. If you create a bunch of files,
> the number of inodes reported should go up until it reaches what you
> expect.

It depends what you mean by "wrong values". The number reported by "df"
is the number of new files you are guaranteed to be able to create in the
filesystem at that time, in the worst-case scenario. The returned value
is limited by both the number of objects on the OSTs and the number of
blocks (for widely striped files) on the MDT. As files are created in the
MDT filesystem, the number of files that can still be created (i.e.
"IFree") will usually stay constant, because the worst case is not the
common case.

Since we wanted "IFree" and "IUsed" to reflect the actual values, the
"Inodes" value had, by necessity, to be variable, because the Unix
statfs() interface only supplies "Inodes" and "IFree", not "IUsed".

> Note that the number of inodes on the OSTs also limits the number of
> creatable files: each file requires an inode on at least one OST (the
> exact number depends on how many OSTs the file is striped across).

Right. If you don't have enough OST objects, then you will never be able
to hit this limit. However, it is relatively easy to add more OSTs if you
ever get close to running out of objects. Most people run out of space
first, but adding more OSTs for space also gives you proportionately more
objects, so the available objects are rarely the issue.

Cheers, Andreas
-- 
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
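A minimal way to observe the behaviour described above, assuming a test
directory on a mounted client (the path and the file count are
assumptions chosen only for illustration):

  # Reported totals before creating anything
  lfs df -i /gulfwork

  # Create a batch of default-striped test files
  mkdir -p /gulfwork/inode-test
  for i in $(seq 1 100000); do touch /gulfwork/inode-test/file.$i; done

  # "IUsed" rises by roughly 100000, and the "Inodes" total grows too,
  # since the total is recomputed as IUsed plus the worst-case IFree.
  lfs df -i /gulfwork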
Andreas Dilger wrote:
> It depends what you mean by "wrong values". The number reported by "df"
> is the number of new files you are guaranteed to be able to create in
> the filesystem at that time, in the worst-case scenario.

By "wrong value" I meant that the values returned by "df -i" show an
increasing number of inodes as more files are created, and that number
did not decrease when the files were removed (i.e. I had more free inodes
after creating and deleting a bunch of files than I had to start with).
Thanks for the details on the inode number, but I'm still having an issue
where I'm not getting the number I expected from the MDS creation, and I
suspect it is not a reporting error from lfs.

When I created the MDS, I specified '-i 1024' and I can see (locally)
800M inodes, but only part of the available space is allocated. Also,
when the client mounts the filesystem, the MDS only has 400M blocks
available:

  gulfwork-MDT0000_UUID
                  430781784  500264  387274084   0%  /gulfwork[MDT:0]

As we were creating files for testing, I saw that each inode allocation
on the MDS was consuming 4k of space, so even though I have 800M inodes
available on the actual MDS partition, it appears that the available
space only allows about 100M inodes in the Lustre filesystem. Am I
understanding that correctly?

I tried to force the MDS creation to use a smaller size per inode, but
that produced an error:

  mkfs.lustre --fsname gulfwork --mdt --mgs --mkfsoptions='-i 1024 -I 1024' \
    --reformat --failnode=10.18.12.1 /dev/sda
  ...
  mke2fs: inode_size (1024) * inodes_count (860148736) too big for a
          filesystem with 215037184 blocks, specify higher inode_ratio (-i)
          or lower inode count (-N).
  ...

yet the actual drive has many more blocks available:

  SCSI device sda: 1720297472 512-byte hdwr sectors (880792 MB)

Is this ext4 setting the block size limit?

FYI, I am using:
  lustre-1.8.2-2.6.18_164.11.1.el5-ext4_lustre.1.8.2.x86_64.rpm
  lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5-ext4_lustre.1.8.2.x86_64.rpm
  e2fsprogs-1.41.6.sun1-0redhat.rhel5.x86_64.rpm

-- 
Gary Molenkamp                     SHARCNET
Systems Administrator              University of Western Ontario
gary@sharcnet.ca                   http://www.sharcnet.ca
(519) 661-2111 x88429              (519) 661-4000
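The arithmetic behind that mke2fs error works out exactly, assuming 4kB
blocks (which the block count in the message implies):

  215037184 blocks * 4096 bytes/block = 880,792,305,664 bytes of filesystem
  860148736 inodes * 1024 bytes/inode = 880,792,305,664 bytes of inode tables

so '-i 1024 -I 1024' asks for inode tables that would consume the entire
device, leaving no blocks for anything else, which is why mke2fs refuses.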
On 2010-04-28, at 7:44, Gary Molenkamp <gary@sharcnet.ca> wrote:
> When I created the MDS, I specified '-i 1024' and I can see (locally)
> 800M inodes, but only part of the available space is allocated.

This is to be expected. There needs to be free space on the MDS for
directories, striping, and other internal usage.

> Also, when the client mounts the filesystem, the MDS only has 400M
> blocks available:
>
>   gulfwork-MDT0000_UUID
>                   430781784  500264  387274084   0%  /gulfwork[MDT:0]
>
> As we were creating files for testing, I saw that each inode allocation
> on the MDS was consuming 4k of space,

That depends on how you are striping your files. If the striping is
larger than will fit inside the inode (13 stripes for 512-byte inodes,
IIRC), then each inode will also consume a block for the striping, plus
some step-wise fraction of a block for each directory entry. That is why
'df -i' returns min(free blocks, free inodes), though in the common case
files do not need an external xattr block for the striping (see the
stripe hint argument for mkfs.lustre), and the number of 'free' inodes
will remain constant as files are created, until the number of free
blocks exceeds the free inode count.

> so even though I have 800M inodes available on the actual MDS
> partition, it appears that the available space only allows about 100M
> inodes in the Lustre filesystem. Am I understanding that correctly?

Possibly, yes. If you are striping all files widely by default, it can
happen as you write.

> I tried to force the MDS creation to use a smaller size per inode, but
> that produced an error:
>
>   mkfs.lustre --fsname gulfwork --mdt --mgs --mkfsoptions='-i 1024 -I 1024' \
>     --reformat --failnode=10.18.12.1 /dev/sda
>   ...
>   mke2fs: inode_size (1024) * inodes_count (860148736) too big for a
>           filesystem with 215037184 blocks, specify higher inode_ratio (-i)
>           or lower inode count (-N).
>   ...

You can't fill the filesystem 100% full of inodes (one inode per 1024
bytes of space, when each inode is itself 1024 bytes in size). If you ARE
striping widely, you may try '-i 1536 -I 1024', but please make sure this
is actually needed, or it will reduce your MDS performance due to the 2x
larger inodes.
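A minimal sketch of how that suggestion might be applied, reusing the
device and failnode mentioned earlier in the thread; the dumpe2fs
verification step is an added assumption rather than something prescribed
above, and the whole thing should be validated on a scratch device first:

  # Format the MDT with 1024-byte inodes and one inode per 1536 bytes,
  # only if wide default striping really requires the larger inodes.
  mkfs.lustre --fsname gulfwork --mdt --mgs \
    --mkfsoptions='-i 1536 -I 1024' \
    --reformat --failnode=10.18.12.1 /dev/sda

  # Verify the resulting geometry before putting it into production.
  dumpe2fs -h /dev/sda | egrep 'Inode count|Inode size|Block count'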