When creating the MDS filesystem, I used '-i 1024' on an 860GB logical
drive to provide approximately 800M inodes in the Lustre filesystem.
This was then verified with 'df -i' on the server:

  /dev/sda        860160000  130452  860029548   1%  /data/mds

Later, after completing the OST creation and mounting the full
filesystem on a client, I noticed that 'df -i' on the client mount is
only showing 108M inodes in the Lustre filesystem:

  10.18.12.1@tcp:10.18.12.2@tcp:/gulfwork
                  107454606  130452  107324154   1%  /gulfwork

A check with 'lfs df -i' shows the MDT only has 108M inodes:

  gulfwork-MDT0000_UUID
                  107454606  130452  107324154   0%  /gulfwork[MDT:0]

Is there a preallocation mechanism in play here, or did I miss something
critical in the initial setup? My concern is that the inode count cannot
be changed after the filesystem is created, so it must be correct before
the filesystem goes into production.

FYI, the filesystem was created with:

MDS/MGS on an 880G logical drive:
  mkfs.lustre --fsname gulfwork --mdt --mgs --mkfsoptions='-i 1024' \
    --failnode=10.18.12.1 /dev/sda

OSSs on 9.1TB logical drives:
  /usr/sbin/mkfs.lustre --fsname gulfwork --ost --mgsnode=10.18.12.2@tcp \
    --mgsnode=10.18.12.1@tcp /dev/cciss/c0d0

Thanks.

-- 
Gary Molenkamp                     SHARCNET
Systems Administrator              University of Western Ontario
gary@sharcnet.ca                   http://www.sharcnet.ca
(519) 661-2111 x88429              (519) 661-4000
Not sure if it has been fixed, but there was a bug in Lustre that caused
the wrong values to be returned here. If you create a bunch of files, the
number of inodes reported should go up until it reaches what you expect.

Note that the number of inodes on the OSTs also limits the number of
creatable files: each file requires an inode on at least one OST (the
exact number depends on how many OSTs the file is striped across).

Kevin
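As a minimal sketch of how to check this limit, assuming a client with
the filesystem mounted at /gulfwork (the mount point and the example
numbers below are illustrative assumptions, not output from this thread):

  # Per-target inode counts; each OST line shows its free object count.
  lfs df -i /gulfwork

  # Rough upper bound on creatable files from the OST side:
  #   (sum of free OST objects) / (default stripe count)
  # e.g. with roughly 30M free objects on each of 12 OSTs and stripe count 1:
  echo $(( 30000000 * 12 / 1 ))    # about 360M files at most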
On 2010-04-23, at 13:30, Kevin Van Maren wrote:
> Not sure if it has been fixed, but there was a bug in Lustre that caused
> the wrong values to be returned here. If you create a bunch of files,
> the number of inodes reported should go up until it reaches what you
> expect.

It depends what you mean by "wrong values". The number reported by "df"
is the number of new files you are guaranteed to be able to create in the
filesystem at that time, in the worst-case scenario. The returned value
is limited by both the number of objects on the OSTs and the number of
blocks (for widely striped files) on the MDT. As files are created in the
MDT filesystem, the number of files that can still be created (i.e.
"IFree") will usually stay constant, because the worst case is not the
common case.

Since we wanted "IFree" and "IUsed" to reflect the actual values, the
"Inodes" value had, by necessity, to be variable, because the Unix
statfs() interface only supplies "Inodes" and "IFree", not "IUsed".

> Note that the number of inodes on the OSTs also limits the number of
> creatable files: each file requires an inode on at least one OST (the
> exact number depends on how many OSTs the file is striped across).

Right. If you don't have enough OST objects, then you will never be able
to hit this limit. However, it is relatively easy to add more OSTs if you
ever get close to running out of objects. Most people run out of space
first, but adding more OSTs for space also gives you proportionately more
objects, so the available objects are rarely the issue.

Cheers, Andreas
-- 
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
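A minimal way to observe the behaviour described above, assuming a test
directory on a mounted client (the path and the file count are
assumptions chosen only for illustration):

  # Reported totals before creating anything
  lfs df -i /gulfwork

  # Create a batch of default-striped test files
  mkdir -p /gulfwork/inode-test
  for i in $(seq 1 100000); do touch /gulfwork/inode-test/file.$i; done

  # "IUsed" rises by roughly 100000, and the "Inodes" total grows too,
  # since the total is recomputed as IUsed plus the worst-case IFree.
  lfs df -i /gulfwork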
Andreas Dilger wrote:
> It depends what you mean by "wrong values". The number reported by "df"
> is the number of new files you are guaranteed to be able to create in
> the filesystem at that time, in the worst-case scenario.

By "wrong value" I meant that the values returned by "df -i" show an
increasing number of inodes as more files are created, and that number
did not decrease when the files were removed (i.e. I had more free inodes
after creating and deleting a bunch of files than I had to start with).
Thanks for the details on the inode number, but I'm still having an issue
where I'm not getting the number I expected from the MDS creation, and I
suspect it is not a reporting error from lfs.

When I created the MDS, I specified '-i 1024' and I can see (locally)
800M inodes, but only part of the available space is allocated. Also,
when the client mounts the filesystem, the MDS only has 400M blocks
available:

  gulfwork-MDT0000_UUID
                  430781784  500264  387274084   0%  /gulfwork[MDT:0]

As we were creating files for testing, I saw that each inode allocation
on the MDS was consuming 4k of space, so even though I have 800M inodes
available on the actual MDS partition, it appears that the available
space only allows about 100M inodes in the Lustre filesystem. Am I
understanding that correctly?

I tried to force the MDS creation to use a smaller size per inode, but
that produced an error:

  mkfs.lustre --fsname gulfwork --mdt --mgs --mkfsoptions='-i 1024 -I 1024' \
    --reformat --failnode=10.18.12.1 /dev/sda
  ...
  mke2fs: inode_size (1024) * inodes_count (860148736) too big for a
          filesystem with 215037184 blocks, specify higher inode_ratio (-i)
          or lower inode count (-N).
  ...

yet the actual drive has many more blocks available:

  SCSI device sda: 1720297472 512-byte hdwr sectors (880792 MB)

Is this ext4 setting the block size limit?

FYI, I am using:
  lustre-1.8.2-2.6.18_164.11.1.el5-ext4_lustre.1.8.2.x86_64.rpm
  lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5-ext4_lustre.1.8.2.x86_64.rpm
  e2fsprogs-1.41.6.sun1-0redhat.rhel5.x86_64.rpm

-- 
Gary Molenkamp                     SHARCNET
Systems Administrator              University of Western Ontario
gary@sharcnet.ca                   http://www.sharcnet.ca
(519) 661-2111 x88429              (519) 661-4000
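The arithmetic behind that mke2fs error works out exactly, assuming 4kB
blocks (which the block count in the message implies):

  215037184 blocks * 4096 bytes/block = 880,792,305,664 bytes of filesystem
  860148736 inodes * 1024 bytes/inode = 880,792,305,664 bytes of inode tables

so '-i 1024 -I 1024' asks for inode tables that would consume the entire
device, leaving no blocks for anything else, which is why mke2fs refuses.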
On 2010-04-28, at 7:44, Gary Molenkamp <gary@sharcnet.ca> wrote:
> When I created the MDS, I specified '-i 1024' and I can see (locally)
> 800M inodes, but only part of the available space is allocated.

This is to be expected. There needs to be free space on the MDS for
directories, striping, and other internal usage.

> Also, when the client mounts the filesystem, the MDS only has 400M
> blocks available:
>
>   gulfwork-MDT0000_UUID
>                   430781784  500264  387274084   0%  /gulfwork[MDT:0]
>
> As we were creating files for testing, I saw that each inode allocation
> on the MDS was consuming 4k of space,

That depends on how you are striping your files. If the striping is
larger than will fit inside the inode (13 stripes for 512-byte inodes,
IIRC), then each inode will also consume a block for the striping, plus
some step-wise fraction of a block for each directory entry. That is why
'df -i' returns min(free blocks, free inodes), though in the common case
files do not need an external xattr block for the striping (see the
stripe hint argument for mkfs.lustre), and the number of 'free' inodes
will remain constant as files are created, until the number of free
blocks exceeds the free inode count.

> so even though I have 800M inodes available on the actual MDS
> partition, it appears that the available space only allows about 100M
> inodes in the Lustre filesystem. Am I understanding that correctly?

Possibly, yes. If you are striping all files widely by default, it can
happen as you write.

> I tried to force the MDS creation to use a smaller size per inode, but
> that produced an error:
>
>   mkfs.lustre --fsname gulfwork --mdt --mgs --mkfsoptions='-i 1024 -I 1024' \
>     --reformat --failnode=10.18.12.1 /dev/sda
>   ...
>   mke2fs: inode_size (1024) * inodes_count (860148736) too big for a
>           filesystem with 215037184 blocks, specify higher inode_ratio (-i)
>           or lower inode count (-N).
>   ...

You can't fill the filesystem 100% full of inodes (one inode per 1024
bytes of space, when each inode is itself 1024 bytes in size). If you ARE
striping widely, you may try '-i 1536 -I 1024', but please make sure this
is actually needed, or it will reduce your MDS performance due to the 2x
larger inodes.
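A minimal sketch of how that suggestion might be applied, reusing the
device and failnode mentioned earlier in the thread; the dumpe2fs
verification step is an added assumption rather than something prescribed
above, and the whole thing should be validated on a scratch device first:

  # Format the MDT with 1024-byte inodes and one inode per 1536 bytes,
  # only if wide default striping really requires the larger inodes.
  mkfs.lustre --fsname gulfwork --mdt --mgs \
    --mkfsoptions='-i 1536 -I 1024' \
    --reformat --failnode=10.18.12.1 /dev/sda

  # Verify the resulting geometry before putting it into production.
  dumpe2fs -h /dev/sda | egrep 'Inode count|Inode size|Block count'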