Hi All,

Long story short: We've found that operations on a directory structure
holding many dirs take ages on ext4.

The Question: Why is there such a huge difference between ext4 and btrfs?
See below test results for real values.

Background: I had to back up a Jenkins directory holding workspaces for a
few projects which were checked out from svn (which implies a lot of
extra .svn dirs). The copy takes a lot of time (at least more than I
expected) and the process was mostly in D state (disk sleep). I've dug in
more and done some extra tests to see if this is not a regression on the
block/fs side. To isolate the issue I've also performed the same tests on
btrfs.

Test environment configuration:
1) HW: HP ProLiant BL460 G6, 48 GB of memory, 2x 6-core Intel X5670 with
HT enabled, Smart Array P410i, RAID 1 on top of 2x 10K RPM SAS HDDs.
2) Kernels: All tests were done on the following kernels:
- 2.6.39.4-3 -- the build ID (3) is used here mostly for internal
tracking of config changes. In -3 we've introduced the "fix readahead
pipeline break caused by block plug" patch. Otherwise it's pure 2.6.39.4.
- 3.2.7 -- latest kernel at the time of testing (3.2.8 has been released
recently).
3) The subject of the tests, a directory holding:
- 54GB of data (measured on ext4)
- 1978149 files
- 844008 directories
4) Mount options:
- ext4 -- errors=remount-ro,noatime,data=writeback
- btrfs -- noatime,nodatacow and, for later investigation of the
compression effect: noatime,nodatacow,compress=lzo

In all tests I've been measuring time of execution. The following tests
were performed:
- find . -type d
- find . -type f
- cp -a
- rm -rf

Ext4 results:
| Type     | 2.6.39.4-3 | 3.2.7
| Dir cnt  | 17m 40sec  | 11m 20sec
| File cnt | 17m 36sec  | 11m 22sec
| Copy     | 1h 28m     | 1h 27m
| Remove   | 3m 43sec
On Wed, Feb 29, 2012 at 02:31:03PM +0100, Jacek Luczak wrote:
> Hi All,
>
> Long story short: We've found that operations on a directory structure
> holding many dirs take ages on ext4.
>
> The Question: Why is there such a huge difference between ext4 and btrfs?
> See below test results for real values.
> [...]
> 4) Mount options:
> - ext4 -- errors=remount-ro,noatime,data=writeback
> - btrfs -- noatime,nodatacow and, for later investigation of the
> compression effect: noatime,nodatacow,compress=lzo

For btrfs, nodatacow and compression don't really mix.  The compression
will just override it.  (Just FYI, not really related to these results).

> In all tests I've been measuring time of execution. The following tests
> were performed:
> - find . -type d
> - find . -type f
> - cp -a
> - rm -rf
>
> Ext4 results:
> | Type     | 2.6.39.4-3 | 3.2.7
> | Dir cnt  | 17m 40sec  | 11m 20sec
> | File cnt | 17m 36sec  | 11m 22sec
> | Copy     | 1h 28m     | 1h 27m
> | Remove   | 3m 43sec

Are the btrfs numbers missing? ;)

In order for btrfs to be faster for cp -a, the files probably didn't
change much since creation.  Btrfs maintains extra directory indexes
that help in sequential backup scans, but this usually means slower
delete performance.

But, how exactly did you benchmark it?  If you compare a fresh
mkfs.btrfs where you just copied all the data over with an ext4 FS that
has been on the disk for a long time, it isn't quite fair to ext4.

-chris
Hi All,

/* Sorry for sending an incomplete email, hit the wrong button :) I guess
I can't use Gmail */

Long story short: We've found that operations on a directory structure
holding many dirs take ages on ext4.

The Question: Why is there such a huge difference between ext4 and btrfs?
See below test results for real values.

Background: I had to back up a Jenkins directory holding workspaces for a
few projects which were checked out from svn (which implies a lot of
extra .svn dirs). The copy takes a lot of time (at least more than I
expected) and the process was mostly in D state (disk sleep). I've dug in
more and done some extra tests to see if this is not a regression on the
block/fs side. To isolate the issue I've also performed the same tests on
btrfs.

Test environment configuration:
1) HW: HP ProLiant BL460 G6, 48 GB of memory, 2x 6-core Intel X5670 with
HT enabled, Smart Array P410i, RAID 1 on top of 2x 10K RPM SAS HDDs.
2) Kernels: All tests were done on the following kernels:
- 2.6.39.4-3 -- the build ID (3) is used here mostly for internal
tracking of config changes. In -3 we've introduced the "fix readahead
pipeline break caused by block plug" patch. Otherwise it's pure 2.6.39.4.
- 3.2.7 -- latest kernel at the time of testing (3.2.8 has been released
recently).
3) The subject of the tests, a directory holding:
- 54GB of data (measured on ext4)
- 1978149 files
- 844008 directories
4) Mount options:
- ext4 -- errors=remount-ro,noatime,data=writeback
- btrfs -- noatime,nodatacow and, for later investigation of the
compression effect: noatime,nodatacow,compress=lzo

In all tests I've been measuring time of execution. The following tests
were performed:
- find . -type d
- find . -type f
- cp -a
- rm -rf

Ext4 results:
| Type     | 2.6.39.4-3 | 3.2.7
| Dir cnt  | 17m 40sec  | 11m 20sec
| File cnt | 17m 36sec  | 11m 22sec
| Copy     | 1h 28m     | 1h 27m
| Remove   | 3m 43sec   | 3m 38sec

Btrfs results (without lzo compression):
| Type     | 2.6.39.4-3 | 3.2.7
| Dir cnt  | 2m 22sec   | 2m 21sec
| File cnt | 2m 26sec   | 2m 23sec
| Copy     | 36m 22sec  | 39m 35sec
| Remove   | 7m 51sec   | 10m 43sec

From the above one can see that the copy takes close to 1h less on btrfs.
I've done strace counting times of calls; the results are as follows
(from 3.2.7):

1) Ext4 (top elements only):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.01   13.257850           1  15082163           read
 23.40    5.440353           3   1687702           getdents
  6.15    1.430559           0   3672418           lstat
  3.80    0.883767           0  13106961           write
  2.32    0.539959           0   4794099           open
  1.69    0.393589           0    843695           mkdir
  1.28    0.296700           0   5637802           setxattr
  0.80    0.186539           0   7325195           stat

2) Btrfs:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.38    9.486210           1  15179751           read
 11.38    2.021662           1   1688328           getdents
 10.64    1.890234           0   4800317           open
  6.83    1.213723           0  13201590           write
  4.85    0.862731           0   5644314           setxattr
  3.50    0.621194           1    844008           mkdir
  2.75    0.489059           0   3675992         1 lstat
  1.71    0.303544           0   5644314           llistxattr
  1.50    0.265943           0   1978149           utimes
  1.02    0.180585           0   5644314    844008 getxattr

On btrfs getdents takes much less time, which proves that the bottleneck
in copy time on ext4 is this syscall. In 2.6.39.4 it shows even less time
for getdents:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.77   10.978816           1  15033132           read
 14.46    3.125996           1   4733589           open
  7.15    1.546311           0   5566988           setxattr
  5.89    1.273845           0   3626505           lstat
  5.81    1.255858           1   1667050           getdents
  5.66    1.224403           0  13083022           write
  3.40    0.735114           1    833371           mkdir
  1.96    0.424881           0   5566988           llistxattr

Why such a huge difference in the getdents timings?

-Jacek
Hi Chris,

the last one was borked :) Please check this one.

-Jacek

2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> Hi All,
>
> /* Sorry for sending an incomplete email, hit the wrong button :) I guess
> I can't use Gmail */
>
> Long story short: We've found that operations on a directory structure
> holding many dirs take ages on ext4.
> [...]
> Why such a huge difference in the getdents timings?
>
> -Jacek
On Wed, 29 Feb 2012, Chris Mason wrote:
> On Wed, Feb 29, 2012 at 02:31:03PM +0100, Jacek Luczak wrote:
> > Long story short: We've found that operations on a directory structure
> > holding many dirs take ages on ext4.
> > [...]
>
> In order for btrfs to be faster for cp -a, the files probably didn't
> change much since creation.  Btrfs maintains extra directory indexes
> that help in sequential backup scans, but this usually means slower
> delete performance.

Exactly, and IIRC ext4 has directory entries stored in hash order, which
does not really help sequential access.

> But, how exactly did you benchmark it?  If you compare a fresh
> mkfs.btrfs where you just copied all the data over with an ext4 FS that
> has been on the disk for a long time, it isn't quite fair to ext4.

I have the same question. Note that if the files on ext4 have been worked
with, it may very well be that the directory hash trees are not in very
good shape. You can attempt to optimize that with e2fsck (just run
fsck.ext4 -f <device>), but that may take quite some time and memory; it
is worth trying though.

Thanks!
-Lukas
On Wed, Feb 29, 2012 at 08:51:58AM -0500, Chris Mason wrote:
> On Wed, Feb 29, 2012 at 02:31:03PM +0100, Jacek Luczak wrote:
> > Ext4 results:
> > | Type     | 2.6.39.4-3 | 3.2.7
> > | Dir cnt  | 17m 40sec  | 11m 20sec
> > | File cnt | 17m 36sec  | 11m 22sec
> > | Copy     | 1h 28m     | 1h 27m
> > | Remove   | 3m 43sec
>
> Are the btrfs numbers missing? ;)

[ answered in a different reply, btrfs is faster in everything except
delete ]

The btrfs readdir uses an index that is much more likely to be sequential
on disk than ext.  This makes the readdir more sequential and it makes
the actual file IO more sequential because we're reading things in the
order they were created instead of (random) htree index order.

> In order for btrfs to be faster for cp -a, the files probably didn't
> change much since creation.  Btrfs maintains extra directory indexes
> that help in sequential backup scans, but this usually means slower
> delete performance.
>
> But, how exactly did you benchmark it?  If you compare a fresh
> mkfs.btrfs where you just copied all the data over with an ext4 FS that
> has been on the disk for a long time, it isn't quite fair to ext4.

But, the consistent benchmarking part is really important.  We shouldn't
put an aged ext4 up against a fresh mkfs.btrfs.  Did you do the ext4
comparisons on a fresh copy?

-chris
2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
>> Hi Chris,
>>
>> the last one was borked :) Please check this one.
>> [...]

I will try to answer the question from the broken email I've sent.

@Lukas, it was always a fresh FS on top of an LVM logical volume. I've
been cleaning caches/remounting to sync all data before (re)doing the
tests.

-Jacek

BTW: Sorry for the email mixture. I just can't get this gmail thing to
work (why force top posting :/). Please use this thread.
2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> I will try to answer the question from the broken email I've sent.
>
> @Lukas, it was always a fresh FS on top of an LVM logical volume. I've
> been cleaning caches/remounting to sync all data before (re)doing the
> tests.
> [...]

More from the observations:

1) A 10s dump of the process state during copy shows:
- Ext4: 526 probes done, 34 hits in R state, 492 hits in D state
- Btrfs (2.6.39.4): 218, 83, 135
- Btrfs (3.2.7): 238, 62, 174, 2 hits sleeping

2) dd write/read of a 55GB file to/from the volume:
- Ext4: write 127MB/s, read 107MB/s
- Btrfs: write 110MB/s, read 176MB/s

-Jacek
On Wed, Feb 29, 2012 at 03:07:45PM +0100, Jacek Luczak wrote:

[ btrfs faster than ext for find and cp -a ]

> 2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
>
> I will try to answer the question from the broken email I've sent.
>
> @Lukas, it was always a fresh FS on top of an LVM logical volume. I've
> been cleaning caches/remounting to sync all data before (re)doing the
> tests.

The next step is to get cp -a out of the picture, in this case you're
benchmarking both the read speed and the write speed (what are you
copying to btw?).

Using tar cf /dev/zero <some_dir> is one way to get a consistent picture
of the read speed.

You can confirm the theory that it is directory order causing problems
by using acp to read the data.

http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2

-chris
2012/2/29 Chris Mason <chris.mason@oracle.com>:
> On Wed, Feb 29, 2012 at 03:07:45PM +0100, Jacek Luczak wrote:
>
> [ btrfs faster than ext for find and cp -a ]
>
> The next step is to get cp -a out of the picture, in this case you're
> benchmarking both the read speed and the write speed (what are you
> copying to btw?).

It's a simple cp -a Jenkins{,.bak}, so a dir-to-dir copy on the same
volume.

> Using tar cf /dev/zero <some_dir> is one way to get a consistent picture
> of the read speed.

IMO the problem is not -- only -- in the read speed; the directory order
hurts here. There's a difference in the sequential tests that places
btrfs as the winner, but still this should not have such a huge influence
on getdents. I know a bit about the differences between ext4 and btrfs
directory handling and I would not expect a difference this big. On the
production system where the issue has been observed, with some real work
running in the background, the copy takes up to 4h.

For me btrfs looks perfect here; what could be worth checking is the
change in syscall timings between 2.6.39.4 and 3.2.7. Before, getdents
was not that high on the list, while now it jumps to second position,
but without a huge impact on the timings.

> You can confirm the theory that it is directory order causing problems
> by using acp to read the data.
>
> http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2

Will check this still today and report back.

-jacek
You might try sorting the entries returned by readdir by inode number
before you stat them.  This is a long-standing weakness in ext3/ext4,
and it has to do with how we added hashed tree indexes to directories in
(a) a backwards compatible way, that (b) was POSIX compliant with
respect to adding and removing directory entries concurrently with
reading all of the directory entries using readdir.

You might try compiling spd_readdir from the e2fsprogs source tree (in
the contrib directory):

http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209

… and then using that as an LD_PRELOAD, and see how that changes things.

The short version is that we can't easily do this in the kernel since
it's a problem that primarily shows up with very big directories, and
using non-swappable kernel memory to store all of the directory entries
and then sort them so they can be returned in inode number order just
isn't practical.  It is something which can be easily done in userspace,
though, and a number of programs (including mutt for its Maildir
support) do so, and it helps greatly for workloads where you are calling
readdir() followed by something that needs to access the inode (i.e.,
stat, unlink, etc.)

-- Ted

On Feb 29, 2012, at 8:52 AM, Jacek Luczak wrote:

> Hi All,
>
> Long story short: We've found that operations on a directory structure
> holding many dirs take ages on ext4.
> [...]
> Why such a huge difference in the getdents timings?
>
> -Jacek
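For illustration, here is a minimal standalone sketch of the sort-by-inode
approach Ted describes above (what spd_readdir effectively does behind an
LD_PRELOAD). This is not the actual spd_readdir.c code; the program is a
hypothetical example with simplified error handling:

/*
 * Sketch: read all directory entries, sort them by inode number, then
 * stat in that order so the inode table reads are roughly sequential on
 * disk instead of following htree hash order.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
	ino_t ino;
	char name[256];
};

static int by_ino(const void *a, const void *b)
{
	ino_t ia = ((const struct ent *)a)->ino;
	ino_t ib = ((const struct ent *)b)->ino;
	return (ia > ib) - (ia < ib);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *d = opendir(path);
	if (!d)
		return 1;

	struct ent *v = NULL;
	size_t n = 0, cap = 0;
	struct dirent *de;

	while ((de = readdir(d)) != NULL) {
		if (n == cap) {
			cap = cap ? cap * 2 : 1024;
			v = realloc(v, cap * sizeof(*v));
			if (!v)
				return 1;
		}
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
		n++;
	}

	/* Visit entries in inode order rather than readdir (hash) order. */
	qsort(v, n, sizeof(*v), by_ino);

	for (size_t i = 0; i < n; i++) {
		struct stat st;
		if (fstatat(dirfd(d), v[i].name, &st, AT_SYMLINK_NOFOLLOW) == 0)
			printf("%12lu  %s\n",
			       (unsigned long)st.st_ino, v[i].name);
	}

	closedir(d);
	free(v);
	return 0;
}

The same idea is what helps tools like mutt: buffer the whole directory in
userspace, where sorting a few million entries is cheap compared to the
random inode reads it avoids.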
2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/2/29 Chris Mason <chris.mason@oracle.com>:
>> You can confirm the theory that it is directory order causing problems
>> by using acp to read the data.
>>
>> http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2
>
> Will check this still today and report back.

While I was about to grab acp I noticed seekwatcher, which made my day :)

seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png

I will send acp results soon.

-Jacek
On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <difrost.kernel@gmail.com> wrote:
>
> While I was about to grab acp I noticed seekwatcher, which made my day :)
>
> seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>
> I will send acp results soon.
>
Would you please take reiserfs into account?

Thanks
-hd
2012/3/1 Hillf Danton <dhillf@gmail.com>:
> On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <difrost.kernel@gmail.com> wrote:
>>
>> While I was about to grab acp I noticed seekwatcher, which made my day :)
>>
>> seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
>> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
>> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
>> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>>
>> I will send acp results soon.
>>
> Would you please take reiserfs into account?

As of now, no (lack of time), but I'm pretty close to bringing XFS into
the game. Whenever I have more time and there isn't pressure to give the
host back to production, I will redo the same tests for reiserfs.

Now I'm focused on the userspace sorting results.

-Jacek
On Thu, Mar 01, 2012 at 03:03:53PM +0100, Jacek Luczak wrote:
> 2012/3/1 Hillf Danton <dhillf@gmail.com>:
> > On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <difrost.kernel@gmail.com> wrote:
> >>
> >> seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
> >> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
> >> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
> >> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png

Whoa, seekwatcher makes it pretty clear.

> >> I will send acp results soon.
> >>
> > Would you please take reiserfs into account?
>
> As of now, no (lack of time), but I'm pretty close to bringing XFS into
> the game. Whenever I have more time and there isn't pressure to give the
> host back to production, I will redo the same tests for reiserfs.
>
> Now I'm focused on the userspace sorting results.

reiserfs should have results very similar to ext4.  The directory
hashing used by reiserfs is going to result in a very random read
pattern.

XFS will probably beat btrfs in this test.  Their directory indexes
reflect on disk layout very well.

-chris
On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number
> before you stat them.  This is a long-standing weakness in ext3/ext4,
> and it has to do with how we added hashed tree indexes to directories in
> (a) a backwards compatible way, that (b) was POSIX compliant with
> respect to adding and removing directory entries concurrently with
> reading all of the directory entries using readdir.
>
> You might try compiling spd_readdir from the e2fsprogs source tree (in
> the contrib directory):
>
> http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209
>
> … and then using that as an LD_PRELOAD, and see how that changes things.
> [...]

For reading the files, the acp program I sent him tries to do something
similar.  I had forgotten about spd_readdir though, we should consider
hacking that into cp and tar.

One interesting note is that the page cache used to help here.  Picture
two tests:

A) time tar cf /dev/zero /home

and

   cp -a /home /new_dir_in_new_fs
   unmount / flush caches
B) time tar cf /dev/zero /new_dir_in_new_fs

On ext, the time for B used to be much faster than the time for A
because the files would get written back to disk in roughly htree order.
Based on Jacek's data, that isn't true anymore.

-chris
2012/3/1 Chris Mason <chris.mason@oracle.com>:
> On Thu, Mar 01, 2012 at 03:03:53PM +0100, Jacek Luczak wrote:
> > [...]
>
> Whoa, seekwatcher makes it pretty clear.

Yep, ext4 is close to my wife's closet.

> reiserfs should have results very similar to ext4.  The directory
> hashing used by reiserfs is going to result in a very random read
> pattern.
>
> XFS will probably beat btrfs in this test.  Their directory indexes
> reflect on disk layout very well.

True, but not that fast on small files.

Apart from the question I raised in the first mail, there's a point to
all those actions. We maintain hosts that are used for building software:
random access, lots of small files and dirs (always a checkout), heavy
parallel IO. We tested XFS vs ext4 a year ago and XFS was around 10%
slower on build times. We have not -- yet -- done the same on btrfs. Now
we're looking for a replacement for ext4 as we suffer from those issues
-- but we were not aware of that until we stepped into this issue.

If you would like me to do some specific tests around ext4 and btrfs,
let me know.

-Jacek
On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> 2012/3/1 Chris Mason <chris.mason@oracle.com>:
> > XFS will probably beat btrfs in this test.  Their directory indexes
> > reflect on disk layout very well.
>
> True, but not that fast on small files.
>
> Apart from the question I raised in the first mail, there's a point to
> all those actions. We maintain hosts that are used for building software:
> random access, lots of small files and dirs (always a checkout), heavy
> parallel IO. We tested XFS vs ext4 a year ago and XFS was around 10%
> slower on build times. We have not -- yet -- done the same on btrfs. Now
> we're looking for a replacement for ext4 as we suffer from those issues
> -- but we were not aware of that until we stepped into this issue.
>
> If you would like me to do some specific tests around ext4 and btrfs,
> let me know.

I'm always curious to see comparisons in real world workloads.  You
should definitely consider testing XFS again, the big three filesystems
are under pretty constant improvement.  For btrfs, please stick to 3.2
kernels and higher.

This seeky backup performance is somewhat built into ext4, but as Ted
said there are a few workarounds.

-chris
2012/3/1 Chris Mason <chris.mason@oracle.com>:
> On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> > If you would like me to do some specific tests around ext4 and btrfs,
> > let me know.
>
> I'm always curious to see comparisons in real world workloads.  You
> should definitely consider testing XFS again, the big three filesystems
> are under pretty constant improvement.  For btrfs, please stick to 3.2
> kernels and higher.

That's the plan, but I'm waiting for more of the brilliant work that
recently popped up in XFS. For btrfs, the changes introduced in 3.2 led
me to give it a try here. I don't have nice pictures and numbers in hand
now, but first tests showed close to 40% timing improvements between
2.6.39.4 and 3.2.7 -- keep up the great work, guys (and girls, if any)!

-Jacek
On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
>
> Yep, ext4 is close to my wife's closet.
>

Were all of the file systems freshly laid down, or was this an aged
ext4 file system?

Also you should beware that if you have a workload which is heavy
parallel I/O, with lots of random read/write accesses to small files, a
benchmark using tar might not be representative of what you will see in
production --- different file systems have different strengths and
weaknesses --- and the fact that ext3/ext4's readdir() returns inodes in
a non-optimal order for stat(2) or unlink(2) or file copy in the cold
cache case may not matter as much as you think in a build server.
(i.e., the directories that do need to be searched will probably be
serviced out of the dentry cache, etc.)

Regards,

					- Ted
2012/3/1 Ted Ts'o <tytso@mit.edu>:
> On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> >
> > Yep, ext4 is close to my wife's closet.
> >
>
> Were all of the file systems freshly laid down, or was this an aged
> ext4 file system?

Always fresh, recreated for each test -- that's why it takes quite a lot
of time, as I had to copy the test dir back into place. The env is kept
as consistent as possible across all tests, so the values should be
credible.

> Also you should beware that if you have a workload which is heavy
> parallel I/O, with lots of random read/write accesses to small files, a
> benchmark using tar might not be representative of what you will see in
> production --- different file systems have different strengths and
> weaknesses --- and the fact that ext3/ext4's readdir() returns inodes in
> a non-optimal order for stat(2) or unlink(2) or file copy in the cold
> cache case may not matter as much as you think in a build server.
> (i.e., the directories that do need to be searched will probably be
> serviced out of the dentry cache, etc.)

The set of tests was not chosen to find the best fs for build purposes.
For pure builds, ext4 is as of now the best, and most of the points
you've made above are valid here.

We've performed real tests in a clone of the production environment.
The results are not that surprising, as one can find the same in e.g.
Phoronix tests. We've compared XFS vs ext[34] and ext4 vs btrfs. Here,
if we take only software compilation into account, ext4 rocks. Btrfs is
only a few seconds slower (5 at most on average). The choice back then
was to use ext4 due to its more mature foundations and support in RHEL.

Why are we looking for a new one? Well, the build environment is not
only about software building. There are e.g. some strange tests running
in parallel, code analysis, etc. Users are doing damn odd things there
... they are using Java; you can imagine how bad a zombie Java process
can be. We've failed to isolate each use case and create a profiled
environment, thus we need to find something that makes sense for all use
cases. The previous tests showed that ext[34] rocks on compilation
timings, but not really all around. Also, one needs to remember that the
fs content changes often; ext3 was ageing within 4-6 months of use. XFS
on the other hand was great all around, just not on compilation timings.
Roughly 10% is not that much, but if a host is doing builds 24/7, then
after a few days we were well behind the ext[34] clone env.

Btrfs was only tested on compilation timings, not in general use. We've
created a simple test case for compilation. It's not quite the same as
what we have in the real env, but it is a good baseline (the kernel
build system is too perfect). Simply parallel kernel builds with a
randomly chosen allyesconfig or allmodconfig. Below are the seekwatcher
graphs of around 1h of the tests running. There were 10 builds (kernels
2.6.20-2.6.29) running with three parallel threads:
1) ext4: http://91.234.146.107/~difrost/seekwatcher/kt_ext4.png
2) btrfs: http://91.234.146.107/~difrost/seekwatcher/kt_btrfs.png
3) both: http://91.234.146.107/~difrost/seekwatcher/kt_btrfs_ext4.png

The above graphs prove that ext4 is ahead in this "competition". I will
try to set up a real build env there to see how those two compare.

-Jacek
2012/3/1 Chris Mason <chris.mason@oracle.com>:
> On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> > You might try sorting the entries returned by readdir by inode number
> > before you stat them.
> > [...]
>
> For reading the files, the acp program I sent him tries to do something
> similar.  I had forgotten about spd_readdir though, we should consider
> hacking that into cp and tar.
>
> One interesting note is that the page cache used to help here.  Picture
> two tests:
>
> A) time tar cf /dev/zero /home
>
> and
>
>    cp -a /home /new_dir_in_new_fs
>    unmount / flush caches
> B) time tar cf /dev/zero /new_dir_in_new_fs
>
> On ext, the time for B used to be much faster than the time for A
> because the files would get written back to disk in roughly htree order.
> Based on Jacek's data, that isn't true anymore.

I've put both to the test. The subject is acp and spd_readdir used with
tar, all on ext4:
1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png

The acp graph looks much better than spd_readdir, but the directory copy
with spd_readdir decreased to 52m 39sec (30 min less).

-Jacek
On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>
> I've put both to the test. The subject is acp and spd_readdir used with
> tar, all on ext4:
> 1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
> 2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
> 3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
>
> The acp graph looks much better than spd_readdir, but the directory copy
> with spd_readdir decreased to 52m 39sec (30 min less).

Do you have stats on how big these files are, and how fragmented they
are?  For acp and spd to give us this, I think something has gone wrong
at writeback time (creating individual fragmented files).

-chris
2012/3/2 Chris Mason <chris.mason@oracle.com>:
> On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
> >
> > The acp graph looks much better than spd_readdir, but the directory copy
> > with spd_readdir decreased to 52m 39sec (30 min less).
>
> Do you have stats on how big these files are, and how fragmented they
> are?  For acp and spd to give us this, I think something has gone wrong
> at writeback time (creating individual fragmented files).

How big? Which files?

-Jacek
On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
> 2012/3/2 Chris Mason <chris.mason@oracle.com>:
> > Do you have stats on how big these files are, and how fragmented they
> > are?  For acp and spd to give us this, I think something has gone wrong
> > at writeback time (creating individual fragmented files).
>
> How big? Which files?

All the files you're reading ;)

filefrag will tell you how many extents each file has, any file with
more than one extent is interesting.  (The ext4 crowd may have better
suggestions on measuring fragmentation).

Since you mention this is a compile farm, I'm guessing there are a bunch
of .o files created by parallel builds.  There are a lot of chances for
delalloc and the kernel writeback code to do the wrong thing here.

-chris
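For reference, a minimal sketch of what filefrag reports for a single
file, using the FIEMAP ioctl that filefrag itself is built on (assumes
Linux UAPI headers; this is an illustration, not filefrag's code). More
than one extent means the file is fragmented:

/*
 * Count the extents of one file via FIEMAP.  With fm_extent_count set
 * to zero the kernel only returns the extent count, no extent array is
 * needed.
 */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <linux/types.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC;		/* flush delalloc first */
	fm.fm_extent_count = 0;			/* just count, don't map */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FIEMAP");
		close(fd);
		return 1;
	}

	printf("%s: %u extent(s)\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}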
On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
>
> filefrag will tell you how many extents each file has; any file with
> more than one extent is interesting. (The ext4 crowd may have better
> suggestions on measuring fragmentation.)

You can get a *huge* amount of information (probably more than you'll
want to analyze) by doing this:

	e2fsck -nf -E fragcheck /dev/XXXX >& /tmp/fragcheck.out

I haven't had time to do this in a while, but a while back I used this
to debug the writeback code with an eye towards reducing fragmentation.
At the time I was trying to optimize the case of reducing fragmentation
in the easiest case possible, where you start with an empty file system
and then copy all of the data from another file system onto it using
rsync -avH.

It would be worthwhile to see what happens with files written by the
compiler and linker. Given that libelf tends to write .o files
non-sequentially, and without telling us how big the space is in
advance, I could well imagine that we're not doing the best job of
avoiding free space fragmentation, which eventually leads to extra
file system aging.

It would be interesting to have a project where someone added
fallocate() support to libelf, and then added some heuristics to ext4
so that if a file is fallocated to a precise size, or if the file is
fully written and closed before writeback begins, we use this to pack
the space used by the files more efficiently in the block allocator.
This is a place where I would not be surprised if XFS has better code
to avoid accelerated file system aging, and where we could do better
with ext4 with some development effort.

Of course, it might also be possible to hack around this by simply
using VPATH and dropping your build files in a separate place from
your source files, and periodically reformatting the file system where
your build tree lives. (As a side note, something that works well for
me is to use an SSD for my source files, and a separate 5400rpm HDD
for my build tree. That allows me to use a smaller and more affordable
SSD, and since the object files can be written asynchronously by the
writeback threads, while the compiler can't move forward until it gets
file data from the .c or .h file, it gets me the best price/performance
for a laptop build environment.)

BTW, I suspect we could make acp even more efficient by teaching it to
use the FIEMAP ioctl to map out the data blocks for all of the files in
the source file system, and then copy the files (or perhaps even parts
of files) in a read order which reduces seeking on the source drive.

- Ted
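To make the libelf idea concrete, "fallocate() support" just means the
writer telling the filesystem the final file size before it starts
writing, so the allocator can reserve contiguous space up front. A
hypothetical sketch of such a writer -- not actual libelf code, the
function name and structure are made up:

/* Hypothetical sketch of "fallocate() support in the writer": reserve
 * the final size first, then write the contents in whatever order is
 * convenient.  Not actual libelf code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_object(const char *path, const char *image, off_t final_size)
{
	int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* Reserve all blocks now; just warn if the filesystem doesn't
	 * support it (fallocate(2) is Linux-specific). */
	if (fallocate(fd, 0, 0, final_size) != 0)
		perror("fallocate (continuing anyway)");

	/* A real ELF writer would pwrite() headers and sections
	 * non-sequentially; here we just dump one buffer. */
	ssize_t ret = pwrite(fd, image, final_size, 0);
	close(fd);
	return ret == final_size ? 0 : -1;
}

int main(void)
{
	char buf[8192];
	memset(buf, 0x7f, sizeof(buf));
	return write_object("demo.o", buf, sizeof(buf)) ? 1 : 0;
}

As Jan Kara notes later in the thread, whether such hints actually beat
delayed allocation for small files is an open question.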
On Fri, Mar 02, 2012 at 02:32:15PM -0500, Ted Ts'o wrote:
> You can get a *huge* amount of information (probably more than you'll
> want to analyze) by doing this:
>
> 	e2fsck -nf -E fragcheck /dev/XXXX >& /tmp/fragcheck.out
>
> I haven't had time to do this in a while, but a while back I used this
> to debug the writeback code with an eye towards reducing fragmentation.
> At the time I was trying to optimize the case of reducing fragmentation
> in the easiest case possible, where you start with an empty file system
> and then copy all of the data from another file system onto it using
> rsync -avH.
>
> It would be worthwhile to see what happens with files written by the
> compiler and linker. Given that libelf tends to write .o files
> non-sequentially, and without telling us how big the space is in
> advance, I could well imagine that we're not doing the best job of
> avoiding free space fragmentation, which eventually leads to extra
> file system aging.

I just realized that I confused things. He's doing a read on the
results of a cp -a to a fresh FS, so there's no way the compiler/linker
are causing trouble.

> It would be interesting to have a project where someone added
> fallocate() support to libelf, and then added some heuristics to ext4
> so that if a file is fallocated to a precise size, or if the file is
> fully written and closed before writeback begins, we use this to pack
> the space used by the files more efficiently in the block allocator.
> This is a place where I would not be surprised if XFS has better code
> to avoid accelerated file system aging, and where we could do better
> with ext4 with some development effort.

The part I don't think any of us have solved is writing back the files
in a good order after we've fallocated the blocks. So this will
probably be great for reads and not so good for writes.

> Of course, it might also be possible to hack around this by simply
> using VPATH and dropping your build files in a separate place from
> your source files, and periodically reformatting the file system where
> your build tree lives. [...]

mkfs for defrag ;) It's the only way to be sure.

> BTW, I suspect we could make acp even more efficient by teaching it to
> use the FIEMAP ioctl to map out the data blocks for all of the files in
> the source file system, and then copy the files (or perhaps even parts
> of files) in a read order which reduces seeking on the source drive.

acp does have a -b mode where it fibmaps (I was either lazy or it is
older than fiemap, I forget) the first block in the file, and uses that
to sort. It does help if the file blocks aren't ordered well wrt their
inode numbers, but not if the files are fragmented.

It's also worth mentioning that acp doesn't actually cp. I never got
that far. It was supposed to be the perfect example of why everything
should be done via aio, but it just ended up demonstrating that
ordering by inode number and leveraging kernel/hardware readahead were
more important.

-chris
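For reference, that -b mode boils down to asking where the first block
of each file lives on disk and sorting the read order on that. A rough
illustration using the legacy FIBMAP ioctl (which generally needs root);
this is just the idea, not acp's actual code:

/* Print the physical block number of logical block 0 for each file,
 * so the output can be sorted to approximate on-disk read order.
 * FIBMAP generally requires root; illustration only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

static long first_block(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	int blk = 0;	/* in: logical block number, out: physical block number */
	long ret = ioctl(fd, FIBMAP, &blk) == 0 ? blk : -1;
	close(fd);
	return ret;
}

int main(int argc, char **argv)
{
	/* "physical-block  name"; pipe through sort -n to get a read
	 * order that roughly follows disk layout. */
	for (int i = 1; i < argc; i++)
		printf("%ld\t%s\n", first_block(argv[i]), argv[i]);
	return 0;
}

Sorting by first block helps when files are laid out contiguously but
out of inode order; it cannot help when the files themselves are
fragmented, which is exactly the distinction Chris draws above.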
2012/3/2 Chris Mason <chris.mason@oracle.com>:
> On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
>> How big? Which files?
>
> All the files you're reading ;)
>
> filefrag will tell you how many extents each file has; any file with
> more than one extent is interesting. (The ext4 crowd may have better
> suggestions on measuring fragmentation.)
>
> Since you mention this is a compile farm, I'm guessing there are a bunch
> of .o files created by parallel builds. There are a lot of chances for
> delalloc and the kernel writeback code to do the wrong thing here.

This took some time, but here it goes ...

* B size files - total: 1064462 (53.81%)
| Size range | File cnt | Fragmented | Fragmented % |
| [0,100)    | 288702   | 0 | 0 |
| [100,200)  | 165997   | 0 | 0 |
| [200,300)  | 45993    | 0 | 0 |
| [300,400)  | 64883    | 0 | 0 |
| [400,500)  | 39282    | 0 | 0 |
| [500,600)  | 61612    | 0 | 0 |
| [600,700)  | 208822   | 0 | 0 |
| [700,800)  | 64016    | 0 | 0 |
| [800,900)  | 58181    | 0 | 0 |
| [900,1024) | 66974    | 0 | 0 |

* K size files - total: 907375 (45.86%)
| Size range | File cnt | Fragmented | Fragmented % |
| [1,2)      | 285481   | 0 | 0 |
| [2,3)      | 148461   | 0 | 0 |
| [3,4)      | 102569   | 0 | 0 |
| [4,5)      | 45089    | 0 | 0 |
| [5,6)      | 44223    | 0 | 0 |
| [6,10)     | 94202    | 0 | 0 |
| [10,20)    | 83760    | 0 | 0 |
| [20,30)    | 42548    | 0 | 0 |
| [30,40)    | 15274    | 0 | 0 |
| [40,50)    | 4060     | 0 | 0 |
| [50,60)    | 3517     | 0 | 0 |
| [60,100)   | 7834     | 0 | 0 |
| [100,1024) | 30357    | 0 | 0 |

* M size files - total: 6312 (0.33%)
| Size range | File cnt | Fragmented | Fragmented % |
| [1,10)     | 5806     | 100 | 1.72  |
| [10,100)   | 482      | 200 | 41.49 |
| [100,1024) | 24       | 13  | 54.16 |

All files scanned: 1978149
Files fragmented: 313 (0.015%), of which 11 have 3+ extents
Total size of fragmented files: 7GB (~13% of dir size)

tar cf on the fragmented files:
1) time: 7sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
3) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
4) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

tar cf on the fragmented files disturbed with [40,50) K files (4373
files in total), the K files coming before the fragmented M files:
1) size: 7.2GB
2) time: 1m 14sec
3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
4) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
5) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

I will perform the same tests on btrfs tomorrow.

-Jacek
2012/3/3 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/3/2 Chris Mason <chris.mason@oracle.com>:
>> Since you mention this is a compile farm, I'm guessing there are a bunch
>> of .o files created by parallel builds. There are a lot of chances for
>> delalloc and the kernel writeback code to do the wrong thing here.
>
> [Most of the files are B and K size]
>
> All files scanned: 1978149
> Files fragmented: 313 (0.015%), of which 11 have 3+ extents
> Total size of fragmented files: 7GB (~13% of dir size)

BTRFS: none of the files are fragmented according to filefrag - all fit
into one extent.

> tar cf on the fragmented files:
> 1) time: 7sec
> 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
> 3) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
> 4) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

BTRFS: tar on the files that are fragmented on ext4
1) time: 6sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_btrfs.png

> tar cf on the fragmented files disturbed with [40,50) K files (4373
> files in total), the K files coming before the fragmented M files:
> 1) size: 7.2GB
> 2) time: 1m 14sec
> 3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
> 4) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
> 5) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

BTRFS: tar on the [40,50) K files plus the ext4-fragmented files
1) time: 56sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_btrfs.png

A new test I've included - randomly selected files, 240MB in total:
1) ext4 (time: 34sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
2) btrfs (time: 55sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

-Jacek
2012/3/4 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/3/3 Jacek Luczak <difrost.kernel@gmail.com>:
> [...]
> A new test I've included - randomly selected files, 240MB in total:
> 1) ext4 (time: 34sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
> 2) btrfs (time: 55sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

Yet another test. The original issue is in the directory data handling;
in my case a lot of directories are introduced by the extra .svn dirs.
Let's then see what tar on those dirs looks like.

Number of .svn directories: 61605

1) Ext4:
- tar time: 10m 53sec
- sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_ext4.png
- sw tar graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/svn_dir_spd_ext4.png

2) Btrfs:
- tar time: 4m 35sec
- sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs.png
- sw tar graph with ext4: http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs_ext4.png

IMO this is not a writeback issue (well, it could be, but then it would
mean writeback is broken in general), and it's not fragmentation.
Sorting files in readdir helps a bit but is still far behind btrfs.

Any ideas? Is this an issue, or are things just the way they are and
one needs to live with it?

-Jacek
On Fri 02-03-12 14:32:15, Ted Tso wrote:
> It would be interesting to have a project where someone added
> fallocate() support to libelf, and then added some heuristics to ext4
> so that if a file is fallocated to a precise size, or if the file is
> fully written and closed before writeback begins, we use this to pack
> the space used by the files more efficiently in the block allocator.
> This is a place where I would not be surprised if XFS has better code
> to avoid accelerated file system aging, and where we could do better
> with ext4 with some development effort.

AFAIK the XFS people actually prefer that applications let the
filesystem do its work using delayed allocation and do not interfere
with fallocate(2) calls. The problem they have with fallocate(2) is
that it forces you to allocate blocks, while with delayed allocation
you can make the decision about allocation later. So for small files
which completely fit into the page cache before they get pushed out by
writeback, better decisions can be made with delayed allocation. Just
dumping my memory from some other thread...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
On Mon, Mar 05, 2012 at 12:32:45PM +0100, Jacek Luczak wrote:
> [...]
> All files scanned: 1978149
> Files fragmented: 313 (0.015%), of which 11 have 3+ extents
> Total size of fragmented files: 7GB (~13% of dir size)

Ok, so I don't have a lot of great new ideas. My guess is that inode
order and disk order for the blocks aren't matching up. You can confirm
this with:

	acp -b some_dir

You can also try telling acp to use a bigger readahead window:

	acp -s 4096 -r 128 some_dir

You can tell acp to scan all the files in the directory tree first
(warning, this might use a good chunk of ram):

	acp -w some_dir

and you can combine all of these together.

None of the above will actually help in your workload, but it'll help
narrow down what is actually seeky on disk.

-chris
On 2/29/2012 11:44 PM, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number
> before you stat them. This is a long-standing weakness in ext3/ext4,
> and it has to do with how we added hashed tree indexes to directories
> in (a) a backwards compatible way, that (b) was POSIX compliant with
> respect to adding and removing directory entries concurrently with
> reading all of the directory entries using readdir.

When I ran into this a while back, I cobbled together a python script
to measure the correlation from name to inode, inode to first data
block, and name to first data block for all of the small files in a
large directory, and found that ext4 gives a very poor correlation due
to that directory hashing.

This is one of the reasons I prefer using dump instead of tar for
backups: it rips through my Maildir more than 10 times faster than tar,
since it reads the files in inode order.
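The correlation.py script itself is not included in the thread, so the
following is only a guess at the metric: for entries taken in name
order, the fraction of adjacent pairs whose inode numbers increase
(about 0.5 for a random mapping, close to 1.0 for a well-correlated
one), which is consistent with the numbers Lukas reports below. A C
sketch of that reading of the metric:

/* Guess at the "name to inode correlation" metric: walk the directory
 * in name order and report the fraction of adjacent regular-file pairs
 * whose inode numbers are increasing.  Hypothetical reconstruction,
 * not the original correlation.py. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	struct dirent **list;
	int n = scandir(dir, &list, NULL, alphasort);	/* name order */
	if (n < 0) {
		perror("scandir");
		return 1;
	}

	int dfd = open(dir, O_RDONLY | O_DIRECTORY);
	ino_t prev = 0;
	long pairs = 0, in_order = 0;

	for (int i = 0; i < n; i++) {
		struct stat st;
		if (fstatat(dfd, list[i]->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0 &&
		    S_ISREG(st.st_mode)) {
			if (prev) {
				pairs++;
				if (st.st_ino > prev)
					in_order++;
			}
			prev = st.st_ino;
		}
		free(list[i]);
	}
	free(list);
	close(dfd);

	printf("name to inode correlation: %.4f\n",
	       pairs ? (double)in_order / pairs : 0.0);
	return 0;
}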
On Wed, 29 Feb 2012, Jacek Luczak wrote:
> Long story short: We've found that operations on a directory structure
> holding many dirs takes ages on ext4.
>
> The Question: Why is there that huge difference between ext4 and btrfs?
> See below test results for real values.
>
> [...]
>
> Ext4 results:
> | Type     | 2.6.39.4-3 | 3.2.7
> | Dir cnt  | 17m 40sec  | 11m 20sec
> | File cnt | 17m 36sec  | 11m 22sec
> | Copy     | 1h 28m     | 1h 27m
> | Remove   | 3m 43sec   | 3m 38sec
>
> Btrfs results (without lzo compression):
> | Type     | 2.6.39.4-3 | 3.2.7
> | Dir cnt  | 2m 22sec   | 2m 21sec
> | File cnt | 2m 26sec   | 2m 23sec
> | Copy     | 36m 22sec  | 39m 35sec
> | Remove   | 7m 51sec   | 10m 43sec
>
> From the above one can see that copy takes close to 1h less on btrfs.
> I've done strace counting times of calls, results are as follows
> (from 3.2.7):
>
> 1) Ext4 (only top elements):
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  57.01   13.257850           1  15082163           read
>  23.40    5.440353           3   1687702           getdents
>   6.15    1.430559           0   3672418           lstat
>   3.80    0.883767           0  13106961           write
>   2.32    0.539959           0   4794099           open
>   1.69    0.393589           0    843695           mkdir
>   1.28    0.296700           0   5637802           setxattr
>   0.80    0.186539           0   7325195           stat
>
> 2) Btrfs:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.38    9.486210           1  15179751           read
>  11.38    2.021662           1   1688328           getdents
>  10.64    1.890234           0   4800317           open
>   6.83    1.213723           0  13201590           write
>   4.85    0.862731           0   5644314           setxattr
>   3.50    0.621194           1    844008           mkdir
>   2.75    0.489059           0   3675992         1 lstat
>   1.71    0.303544           0   5644314           llistxattr
>   1.50    0.265943           0   1978149           utimes
>   1.02    0.180585           0   5644314    844008 getxattr
>
> On btrfs getdents takes much less time, which proves the bottleneck in
> copy time on ext4 is this syscall. In 2.6.39.4 it shows even less time
> for getdents:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  50.77   10.978816           1  15033132           read
>  14.46    3.125996           1   4733589           open
>   7.15    1.546311           0   5566988           setxattr
>   5.89    1.273845           0   3626505           lstat
>   5.81    1.255858           1   1667050           getdents
>   5.66    1.224403           0  13083022           write
>   3.40    0.735114           1    833371           mkdir
>   1.96    0.424881           0   5566988           llistxattr
>
> Why such a huge difference in the getdents timings?
>
> -Jacek

Hi,

I have created a simple script which creates a bunch of files with
random names in a directory and then performs operations like list,
tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
4k files. The result is that ext4 pretty much dominates the create
times, tar times and find times. However, copy times are a whole
different story unfortunately - they suck badly.

Once we cross the mark of 320000 files in the directory (on my system),
ext4 becomes significantly worse in copy times. And that is where the
hash tree order in the directory entries really kicks in.

Here is a simple graph:

http://people.redhat.com/lczerner/files/copy_benchmark.pdf

Here is the data so you can play with it:

https://www.google.com/fusiontables/DataSource?snapid=S425803zyTE

and here is the txt file for convenience:

http://people.redhat.com/lczerner/files/copy_data.txt

I have also run the correlation.py from Phillip Susi on a directory
with 100000 4k files, and indeed the name to block correlation in ext4
is pretty much random :)

_ext4_
Name to inode correlation: 0.50002499975
Name to block correlation: 0.50002499975
Inode to block correlation: 0.9999900001

_xfs_
Name to inode correlation: 0.969660303397
Name to block correlation: 0.969660303397
Inode to block correlation: 1.0

So there definitely is a huge space for improvements in ext4.

Thanks!
-Lukas

Here is the script I have used to get the numbers above, just so you
can see which operations I have performed.

#!/bin/bash

dev=$1
mnt=$2
fs=$3
count=$4
size=$5

if [ -z $dev ]; then
	echo "Device was not specified!"
	exit 1
fi

if [ -z $mnt ]; then
	echo "Mount point was not specified!"
	exit 1
fi

if [ -z $fs ]; then
	echo "File system was not specified!"
	exit 1
fi

if [ -z $count ]; then
	count=10000
fi

if [ -z $size ]; then
	size=0
fi

export TIMEFORMAT="%3R"

umount $dev &> /dev/null
umount $mnt &> /dev/null

case $fs in
	"xfs") mkfs.xfs -f $dev &> /dev/null; mount $dev $mnt;;
	"ext3") mkfs.ext3 -F -E lazy_itable_init $dev &> /dev/null; mount $dev $mnt;;
	"ext4") mkfs.ext4 -F -E lazy_itable_init $dev &> /dev/null; mount -o noinit_itable $dev $mnt;;
	"btrfs") mkfs.btrfs $dev &> /dev/null; mount $dev $mnt;;
	*) echo "Unsupported file system";
	   exit 1;;
esac

testdir=${mnt}/$$
mkdir $testdir

_remount()
{
	sync
	#umount $mnt
	#mount $dev $mnt
	echo 3 > /proc/sys/vm/drop_caches
}

#echo "[+] Creating $count files"
_remount
create=$( (time ./dirtest $testdir $count $size) 2>&1)

#echo "[+] Listing files"
_remount
list=$( (time ls $testdir > /dev/null) 2>&1)

#echo "[+] tar the files"
_remount
tar=$( (time $(tar -cf - $testdir &> /dev/null)) 2>&1)

#echo "[+] find the files"
_remount
find=$( (time $(find $testdir -type f &> /dev/null)) 2>&1)

#echo "[+] Copying files"
_remount
copy=$( (time $(cp -a ${testdir} ${mnt}/copy)) 2>&1)

#echo "[+] Removing files"
_remount
remove=$( (time $(rm -rf $testdir)) 2>&1)

echo "$fs $count $create $list $tar $find $copy $remove"
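The dirtest helper the script invokes was not posted. A hypothetical
stand-in that simply creates $count files with random names of $size
bytes each could look like the following; the argument order is
inferred from the "./dirtest $testdir $count $size" call and everything
else is made up:

/* Hypothetical stand-in for ./dirtest: create <count> files with
 * random names, each <size> bytes, inside <dir>.  Not the original
 * helper used for the numbers above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s dir count [size]\n", argv[0]);
		return 1;
	}
	const char *dir = argv[1];
	long count = atol(argv[2]);
	long size = argc > 3 ? atol(argv[3]) : 0;

	char *buf = calloc(1, size ? size : 1);
	srandom(getpid());

	for (long i = 0; i < count; i++) {
		char path[4096];

		/* random name, suffixed with i to avoid collisions */
		snprintf(path, sizeof(path), "%s/%08lx%08lx-%ld",
			 dir, random(), random(), i);
		int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
		if (fd < 0) {
			perror(path);
			return 1;
		}
		if (size && write(fd, buf, size) != size) {
			perror("write");
			return 1;
		}
		close(fd);
	}
	free(buf);
	return 0;
}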
On Fri, Mar 09, 2012 at 12:29:29PM +0100, Lukas Czerner wrote:
> I have created a simple script which creates a bunch of files with
> random names in a directory and then performs operations like list,
> tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
> 4k files. The result is that ext4 pretty much dominates the create
> times, tar times and find times. However, copy times are a whole
> different story unfortunately - they suck badly.
>
> Once we cross the mark of 320000 files in the directory (on my system),
> ext4 becomes significantly worse in copy times. And that is where the
> hash tree order in the directory entries really kicks in.
>
> [...]
>
> So there definitely is a huge space for improvements in ext4.

Thanks Lukas, this is great data. There is definitely room for btrfs to
speed up in the other phases as well.

-chris
On 2012-03-09, at 3:29, Lukas Czerner <lczerner@redhat.com> wrote:
> Once we cross the mark of 320000 files in the directory (on my system),
> ext4 becomes significantly worse in copy times. And that is where the
> hash tree order in the directory entries really kicks in.
>
> [...]
>
> I have also run the correlation.py from Phillip Susi on a directory
> with 100000 4k files, and indeed the name to block correlation in ext4
> is pretty much random :)

Just reading this on the plane, so I can't find the exact reference
that I want, but a solution to this problem with htree was discussed a
few years ago between myself and Coly Li.

The basic idea is that for large directories the inode allocator starts
by selecting a range of (relatively) free inodes based on the current
directory size, and then piecewise maps the hash value for the filename
into this inode range and uses that as the goal inode. When the inode
range is (relatively) filled up (as determined by the average distance
between goal and allocated inode), a new (larger) inode range is
selected based on the new (larger) directory size and usage continues
as described.

This change is only an in-memory allocation policy and does not affect
the on-disk format, though it is expected to improve the hash->inode
mapping coherency significantly.

When the directory is small (below a thousand entries or so) the
allocations will be close to the parent and the ordering doesn't matter
significantly, because the inode table blocks will all be quickly read
or prefetched from disk. It wouldn't be harmful to use the mapping
algorithm in this case, but it likely won't show much improvement.

As the directory gets larger, the range of inodes will also get larger.
The number of inodes in the smaller range becomes less significant as
the range continues to grow. Once the inode range is hundreds of
thousands or larger, the mapping of the hash to the inodes will avoid a
lot of random IO.

When restarting from a new mount, the inode ranges can be found when
doing the initial name lookup in the leaf block by checking the
allocated inodes for existing dirents.

Unfortunately, the prototype that was developed diverged from this idea
and didn't really achieve the results I wanted.

Cheers, Andreas
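A back-of-the-envelope sketch of the arithmetic behind Andreas's
proposal (illustration only, not ext4 code; the structure names and the
32-bit hash width are assumptions): the name hash is mapped
proportionally into the inode range currently reserved for the
directory, and the result becomes the allocation goal, so that hash
order and inode order roughly coincide.

/* Sketch of the proposed policy's core arithmetic: map a 32-bit name
 * hash piecewise-linearly into the directory's reserved inode range
 * and use the result as the goal inode.  Names and hash width are
 * assumptions; this is not ext4 code. */
#include <stdint.h>
#include <stdio.h>

struct dir_inode_range {
	uint64_t start;		/* first inode in the reserved range */
	uint64_t len;		/* number of inodes in the range */
};

static uint64_t goal_inode(const struct dir_inode_range *r, uint32_t hash)
{
	return r->start + ((uint64_t)hash * r->len) / ((uint64_t)1 << 32);
}

int main(void)
{
	/* Toy example: a directory assigned inodes 1,000,000..1,999,999. */
	struct dir_inode_range r = { .start = 1000000, .len = 1000000 };
	uint32_t hashes[] = { 0x00000010, 0x7fffffff, 0xfffffff0 };

	for (int i = 0; i < 3; i++)
		printf("hash %08x -> goal inode %llu\n", hashes[i],
		       (unsigned long long)goal_inode(&r, hashes[i]));
	return 0;
}

Since readdir() returns names in hash order, stat()s of the resulting
inodes would then walk the inode table roughly sequentially instead of
randomly, which is the coherency improvement being claimed.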
Hey Jacek,

I'm curious about the parameters of the set of directories on your
production server. On an ext4 file system, assuming you've copied the
directories over, what is the result of this command pipeline when you
are cd'ed into the top of the directory hierarchy of interest (your svn
tree, as I recall)?

	find . -type d -ls | awk '{print $7}' | sort -n | uniq -c

I'm just curious what the distribution of directory sizes is in
practice for your use case...

Thanks,

- Ted
On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
> Just reading this on the plane, so I can't find the exact reference
> that I want, but a solution to this problem with htree was discussed a
> few years ago between myself and Coly Li.
>
> The basic idea is that for large directories the inode allocator starts
> by selecting a range of (relatively) free inodes based on the current
> directory size, and then piecewise maps the hash value for the filename
> into this inode range and uses that as the goal inode.

I've heard you sketch out this idea before, but it's not been clear to
me how well it would work in practice. The potential downside is that
it fragments the inode table, such that a single inode table block
might not have as many inodes stored in it as we otherwise would.
(Ideally, with 256 byte inodes, there would be 16 of them in each 4k
block.) This is something I'd definitely recommend modelling in
simulation to see how well it works first.

I'd observe that if we knew in advance how many files would be in a
particular directory (i.e., via a hint from userspace at mkdir time),
then it would be possible to optimize this case much more easily. In
fact, if we had perfect knowledge --- if the userspace process could
feed us the exact set of filenames that will be used in the directory,
plus the exact file sizes for each of the file names --- then it would
be possible to allocate inode numbers and starting block numbers such
that when the files are read in readdir() order, the inode numbers and
block numbers would line up perfectly into sequential reads.

Of course, the trade-off is that by optimizing for read order, the
write order will tend to get painful as well! So there's a tension
here; making the read part of the benchmark faster will make the
process of writing the directory hierarchy require more seeks --- and
this assumes that you know file names and sizes ahead of time.

(Unless, of course, you play the same game that Hans Reiser did when he
cooked his "tar" benchmark by constructing a kernel source tarball
where the file order in the tarball was perfectly ordered to guarantee
the desired result given Reiser4's hash! :-)

I suspect the best optimization for now is probably something like
this:

1) Since the vast majority of directories are less than (say) 256k
(this would be a tunable value), for directories which are less than
this threshold size, the entire directory is sucked in after the first
readdir() after an opendir() or rewinddir(). The directory contents are
then sorted by inode number (or loaded into an rbtree ordered by inode
number), and returned back to userspace in inode order via readdir().
The directory contents will be released on a closedir() or rewinddir().

2) For directories larger than this size, we don't want to hold that
much kernel memory during an opendir(), so fall back to ext4's current
scheme.

3) If we want to do better than that, we could create a new flag
O_NOTELLDIR, which can be set via fcntl. (This way programs can do
something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd, SETFL,
O_NOTELLDIR);".) For programs that know they don't need to use
telldir/seekdir, they can set this flag, which will allow the above
optimization to be used on large directories.

The other thing we could potentially do, if we really cared about this
issue in ext3/4, would be to define a whole new (incompatible)
directory indexing format which avoided this problem. It would be an
INCOMPAT feature, so it would be a while before it could get used in
practice, which is why I'm not entirely convinced it's worth it ---
especially since I suspect just doing (1) would solve the problem for
the vast majority of ext4's users.

- Ted
On 2012-03-09, at 9:48 PM, Ted Ts'o wrote:
> On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
>> The basic idea is that for large directories the inode allocator starts
>> by selecting a range of (relatively) free inodes based on the current
>> directory size, and then piecewise maps the hash value for the filename
>> into this inode range and uses that as the goal inode.
>
> I've heard you sketch out this idea before, but it's not been clear to
> me how well it would work in practice. The potential downside is that
> it fragments the inode table, such that a single inode table block
> might not have as many inodes stored in it as we otherwise would.
> (Ideally, with 256 byte inodes, there would be 16 of them in each 4k
> block.) This is something I'd definitely recommend modelling in
> simulation to see how well it works first.

Very true, I was thinking this as I wrote the email.

> I'd observe that if we knew in advance how many files would be in a
> particular directory (i.e., via a hint from userspace at mkdir time),
> then it would be possible to optimize this case much more easily. In
> fact, if we had perfect knowledge --- if the userspace process could
> feed us the exact set of filenames that will be used in the directory,
> plus the exact file sizes for each of the file names --- then it would
> be possible to allocate inode numbers and starting block numbers such
> that when the files are read in readdir() order, the inode numbers and
> block numbers would line up perfectly into sequential reads.

Except POSIX doesn't allow anything close to this at all. Something
like fallocate() for directories would at least give the kernel a hint
about how large the directory is, but fewer userspace tools have this
information than the file size (e.g. tar doesn't really store a
"directory", just a bunch of pathnames that happen to have largely the
same components).

In the cases where the size IS known in advance, or at least if the
directory is going to be large, an allocation policy like mine would
essentially "predict" where a uniformly distributed hash function will
allocate the inodes, for better reading order.

> Of course, the trade-off is that by optimizing for read order, the
> write order will tend to get painful as well! So there's a tension
> here; making the read part of the benchmark faster will make the
> process of writing the directory hierarchy require more seeks --- and
> this assumes that you know file names and sizes ahead of time.
>
> (Unless, of course, you play the same game that Hans Reiser did when he
> cooked his "tar" benchmark by constructing a kernel source tarball
> where the file order in the tarball was perfectly ordered to guarantee
> the desired result given Reiser4's hash! :-)

No, I definitely don't want to go down that path... Optimization for
the sake of benchmarks is broken.

> I suspect the best optimization for now is probably something like
> this:
>
> 1) Since the vast majority of directories are less than (say) 256k
> (this would be a tunable value),

Yes, this would be reasonable, and I suspect it could be a lot larger
than 256kB. Making it variable based on the amount of RAM is probably
reasonable.

Looking at http://www.pdsi-scidac.org/fsstats/ (which is listed as HPC,
but is a bit old and the filesystems aren't much larger than large SATA
drives today :-) having a 1MB cache would handle virtually every
directory in that survey (99.999%).

The unfortunate part is that this will help small directories, where
the problem is already largely hidden by readahead and the page cache,
but it doesn't help at all for huge directories that do not fit into
cache. That is where my approach would do well, but the difficulty is
in knowing at create time how large the directory will be.

> for directories which are less than
> this threshold size, the entire directory is sucked in after the first
> readdir() after an opendir() or rewinddir(). The directory contents are
> then sorted by inode number (or loaded into an rbtree ordered by inode
> number), and returned back to userspace in inode order via readdir().
> The directory contents will be released on a closedir() or rewinddir().

What would the telldir() cookie be in this case? The inode number, or
the file descriptor? What happens if the directory size crosses the
threshold while the cache is in use? I guess in that case the cache
could still be used (according to POSIX), because readdir() doesn't
need to report entries added after it started.

> 2) For directories larger than this size, we don't want to hold that
> much kernel memory during an opendir(), so fall back to ext4's current
> scheme.
>
> 3) If we want to do better than that, we could create a new flag
> O_NOTELLDIR, which can be set via fcntl. (This way programs can do
> something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd, SETFL,
> O_NOTELLDIR);".) For programs that know they don't need to use
> telldir/seekdir, they can set this flag, which will allow the above
> optimization to be used on large directories.

Probably a bit messy. Since telldir() is rarely used outside NFS
(AFAIK, though I recall something dumb happening in GNU libc at one
point that was doing a telldir() and seekdir() for every entry), we
could assume no telldir() users until it is actually used on the
filesystem, and that could set a hint in the superblock EXT2_FLAGS
section. The only userspace programs I could find that call telldir()
are smbd and nmbd (and related tools) and perl (not sure when or why).

> The other thing we could potentially do, if we really cared about this
> issue in ext3/4, would be to define a whole new (incompatible)
> directory indexing format which avoided this problem. It would be an
> INCOMPAT feature, so it would be a while before it could get used in
> practice, which is why I'm not entirely convinced it's worth it ---
> especially since I suspect just doing (1) would solve the problem for
> the vast majority of ext4's users.

Agreed. I don't know that we need to go to these extremes.

Cheers, Andreas
--
Andreas Dilger                       Whamcloud, Inc.
Principal Lustre Engineer            http://www.whamcloud.com/
On Sun, Mar 11, 2012 at 04:30:37AM -0600, Andreas Dilger wrote:
> > if the userspace process could
> > feed us the exact set of filenames that will be used in the directory,
> > plus the exact file sizes for each of the file names...
>
> Except POSIX doesn't allow anything close to this at all. Something
> like fallocate() for directories would at least give the kernel a hint
> about how large the directory is, but fewer userspace tools have this
> information than the file size (e.g. tar doesn't really store a
> "directory", just a bunch of pathnames that happen to have largely the
> same components).

Yes, this was purely a thought-experiment idea (although we could try
to extend the interface if it really netted enough gains). My point was
that even if we could do this, we really are trading off write
performance for read performance (since you would now be doing the
extra seeking on the tar extraction phase).

Ultimately, while I believe it makes sense to optimize the "readdir +
seek" case, I don't believe it makes sense to optimize the "copy out
the directory using tar" benchmark --- since I believe this is much
more of a benchmark workload than something that happens a huge amount
in real life, and even for people who do backups (which unfortunately,
as we know, is a too-small percentage of the ext4 user base), they
generally will prefer that the "real" use case for the file system be
fast rather than the backup phase. And in point of fact, as the file
system gets more fragmented, even if some file systems can offer
perfect name -> block correlation, over time I'll bet their correlation
breaks down, and so ext4's disadvantage over those file systems will
disappear in real life usages.

So I don't believe it makes that much sense to worry overly about the
"name -> block" correlation for freshly formatted file systems. It's a
nice-to-have, but I really think that's more of a benchmark game than
anything else.

> In the cases where the size IS known in advance, or at least if the
> directory is going to be large, an allocation policy like mine would
> essentially "predict" where a uniformly distributed hash function will
> allocate the inodes, for better reading order.

Well, yes; if we knew in advance how many files (roughly) would be
stored in a directory, there are some optimizations we could do. The
simplest of which is preallocating the directory blocks so they are
contiguous on disk. I'm not persuaded yet that the optimizations are
worth the effort of trying to define a new interface and then trying to
convince application programmers to use a hypothetical interface
extension *just* for this purpose, though.

> The unfortunate part is that this will help small directories, where
> the problem is already largely hidden by readahead and the page cache,
> but it doesn't help at all for huge directories that do not fit into
> cache. That is where my approach would do well, but the difficulty is
> in knowing at create time how large the directory will be.

Well, my goal in proposing this optimization is that it helps for the
"medium size" directories in the cold cache case. The ext4 user who
first kicked off this thread was using his file system for an SVN
server, as I recall. I could easily believe that he has thousands of
files; maybe even tens of thousands of files in a directory. But that
probably still fits in 256k, or at most 512k, worth of directory
blocks.

It definitely won't help for the really large directories, yes; but how
common are they? The most common case I can think of is squid caches,
but there you can configure a larger number of 1st and 2nd level
subdirectories to keep the actual directory size smaller.

> What would the telldir() cookie be in this case? The inode number, or
> the file descriptor? What happens if the directory size crosses the
> threshold while the cache is in use? I guess in that case the cache
> could still be used (according to POSIX), because readdir() doesn't
> need to report entries added after it started.

The telldir() cookie can simply be an ordinal -- so "7" means the 7th
directory entry in the rbtree. And all we need to do, which obeys POSIX
as you've noted, is to instantiate the rbtree during the first
readdir() after an opendir() or rewinddir(), and free the old rbtree
when we get the rewinddir(), reach the end of the directory stream, or
get the closedir().

> Probably a bit messy. Since telldir() is rarely used outside NFS
> (AFAIK, though I recall something dumb happening in GNU libc at one
> point that was doing a telldir() and seekdir() for every entry), we
> could assume no telldir() users until it is actually used on the
> filesystem, and that could set a hint in the superblock EXT2_FLAGS
> section. The only userspace programs I could find that call telldir()
> are smbd and nmbd (and related tools) and perl (not sure when or why).

By the time a process uses telldir(), we're midway through a readdir()
stream, and so by that point it's too late to switch back to a
seekdir-friendly order. So what do we do at that point? If we continue
returning directory entries where we read in each directory 256kb at a
time, sorting by inode number and then returning dirents in inode
number order, then any time there was a telldir() during a 256kb chunk,
we wouldn't be able to discard that rbtree, but rather we would have to
keep it in memory until the next rewinddir(). So a process which opens
a large directory and then calls telldir() a huge number of times could
end up pinning a large amount of kernel memory.

We could, of course, set some kind of resource limit on this, and if a
process behaves too badly by calling telldir in a way that pins too
much kernel memory, we could potentially kill the process. We could
then tell application programmers that if they really want to use
telldir() a huge amount, they need to tell us in advance via some kind
of process flag. That is, we could default to being telldir-unfriendly
instead of defaulting to telldir-friendly. If we really think it's only
Samba and Perl, and that we can convince them to give us this hint,
then we might get away with it.

I had been assuming, though, that we would have to default to
telldir-friendly, which is why I suggested the O_NOTELLDIR flag.

- Ted
On 3/9/2012 11:48 PM, Ted Ts'o wrote:
> I suspect the best optimization for now is probably something like
> this:
>
> 1) Since the vast majority of directories are less than (say) 256k
> (this would be a tunable value), for directories which are less than
> this threshold size, the entire directory is sucked in after the first
> readdir() after an opendir() or rewinddir(). The directory contents are
> then sorted by inode number (or loaded into an rbtree ordered by inode
> number), and returned back to userspace in inode order via readdir().
> The directory contents will be released on a closedir() or rewinddir().

Why not just separate the hash table from the conventional, mostly
inode-ordered directory entries? For instance, the first 200k of the
directory could be the normal entries, which would tend to be in inode
order (and e2fsck -D would reorder them), and the last 56k of the
directory would contain the hash table. Then readdir() just walks the
directory like normal, and namei() can check the hash table.
On Tue, Mar 13, 2012 at 03:05:59PM -0400, Phillip Susi wrote:
> Why not just separate the hash table from the conventional, mostly
> inode-ordered directory entries? For instance, the first 200k of the
> directory could be the normal entries, which would tend to be in inode
> order (and e2fsck -D would reorder them), and the last 56k of the
> directory would contain the hash table. Then readdir() just walks the
> directory like normal, and namei() can check the hash table.

Because that would be a format change.

What we have today is not a hash table; it's a hashed tree, where we
use a fixed-length key for the tree based on the hash of the file name.
Currently the leaf nodes of the tree are the directory blocks
themselves; that is, the lowest level of the index blocks tells you to
look at directory block N, where that block contains the directory
entries for those file names whose hashes are in a particular range
(say, between 0x2325777A and 0x2325801).

If we aren't going to change the ordering of the directory entries,
that means we would need to change things so the leaf nodes contain the
actual directory file names themselves, so that we know whether or not
we've hit the correct entry before we go to read in a specific
directory block (otherwise, you'd have problems dealing with hash
collisions). But in that case, instead of storing the pointer to the
directory entry, since the bulk of the size of a directory entry is the
filename itself, you might as well store the inode number in the tree
itself, and be done with it.

And in that case, since you are replicating the directory information
twice over, and it's going to be an incompatible format change anyway,
you might as well just store the second copy of the directory entries
in *another* btree, except this one indexed by inode number, and then
you use the second tree for readdir(), and you make the telldir/seekdir
cookie be the inode number. That way readdir() will always return
results in a stat()-optimized order, even in the face of directory
fragmentation and file system aging.

**This** is why telldir/seekdir is so evil; in order to do things
correctly in terms of the semantics of readdir() in the face of
telldir/seekdir and file names getting inserted and deleted into the
btree, and the possibility of tree splits/rotations, etc., and the fact
that the cookie is only 32 or 64 bits, you essentially either need to
do something stupid and have a linear directory a la ext2 and V7 Unix,
or you need to store the directory information twice over in redundant
b-trees. Or userspace needs to do the gosh-darned sorting by inode, or
we do some hack such as only sorting the inodes using non-swapping
kernel memory if the directory is smaller than some arbitrary size,
such as 256k or 512k.

- Ted
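For readers who have not looked at the htree code, the index entries
Ted describes are essentially (hash, block) pairs, and a name lookup is
a binary search over them. A simplified sketch follows; field names
loosely follow fs/ext4/namei.c, but this is an illustration, not the
literal kernel code:

/* Simplified shape of an htree index entry: it maps a hash value to
 * the directory block whose entries cover hashes starting at that
 * value.  A lookup hashes the name and binary-searches the entries;
 * readdir() walks the leaf blocks in hash order, which is what
 * decorrelates readdir order from inode order. */
#include <stdint.h>

struct dx_entry {
	uint32_t hash;		/* lowest hash covered by this entry */
	uint32_t block;		/* logical block within the directory file */
};

/* Find the block covering "hash": the last entry with entry.hash <= hash.
 * Entry 0 covers the lowest hash range. */
static uint32_t dx_find_block(const struct dx_entry *entries, int count,
			      uint32_t hash)
{
	int lo = 1, hi = count - 1;

	while (lo <= hi) {
		int mid = (lo + hi) / 2;
		if (entries[mid].hash <= hash)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return entries[lo - 1].block;
}

Because the on-disk leaf blocks hold the actual dirents, the index
stays small, but nothing in it records where the inodes live -- which
is the asymmetry the whole thread is circling around.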
On 3/13/2012 3:53 PM, Ted Ts'o wrote:
> Because that would be a format change.

I think a format change would be preferable to runtime sorting.

> What we have today is not a hash table; it's a hashed tree, where we use
> a fixed-length key for the tree based on the hash of the file name.
> Currently the leaf nodes of the tree are the directory blocks themselves;
> that is, the lowest level of the index blocks tells you to look at
> directory block N, where that directory block contains the directory
> entries for those file names which are in a particular hash range (say,
> between 0x2325777A and 0x2325801).

So the index nodes contain the hash ranges for the leaf block, but the
leaf block only contains the regular directory entries, not a hash for
each name? That would mean that adding or removing names would require
moving around the regular directory entries, wouldn't it?

> If we aren't going to change the on-disk ordering of the directory, that
> means we would need to change things so the leaf nodes contain the actual
> file names themselves, so that we know whether or not we've hit the
> correct entry before we go to read in a specific directory block
> (otherwise, you'd have problems dealing with hash collisions). But in
> that case, instead of storing a pointer to the directory entry, since the
> bulk of the size of a directory entry is the filename itself, you might
> as well store the inode number in the tree itself, and be done with it.

I would think that hash collisions are rare enough that reading a
directory block you end up not needing once in a blue moon would be
chalked up under "who cares". So just stick with hash/offset pairs to map
the hash to the normal directory entry.
On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
>
> I think a format change would be preferable to runtime sorting.

Are you volunteering to spearhead the design and coding of such a thing?
Run-time sorting is backwards compatible, and a heck of a lot easier to
code and test...

The reality is we'd probably want to implement run-time sorting *anyway*,
for the vast majority of people who don't want to convert to a new,
incompatible file system format. (Even if you can do the conversion using
e2fsck --- which is possible, but it would be even more code to write ---
system administrators tend to be very conservative about such things,
since they might need to boot an older kernel, or use a rescue CD that
doesn't have an up-to-date kernel or file system utilities, etc.)

> So the index nodes contain the hash ranges for the leaf block, but the
> leaf block only contains the regular directory entries, not a hash for
> each name? That would mean that adding or removing names would require
> moving around the regular directory entries, wouldn't it?

They aren't sorted in the leaf block, so we only need to move around
regular directory entries when we do a node split (and at the moment we
don't support shrinking directories), so we don't have to worry about the
reverse case.

> I would think that hash collisions are rare enough that reading a
> directory block you end up not needing once in a blue moon would be
> chalked up under "who cares". So just stick with hash/offset pairs to
> map the hash to the normal directory entry.

With a 64-bit hash, and if we were actually going to implement this as a
new incompatible feature, you're probably right in terms of accepting the
extra directory block search.

We would still have to implement the case where hash collisions *do*
exist, though, and make sure the right thing happens in that case. Even
if the chance of that happening is 1 in 2**32, with enough deployed
systems (i.e., every Android handset, etc.) it's going to happen in real
life.

						- Ted
What if we use the inode number as the hash value? Does it work?

Yongqiang.
On Wed, Mar 14, 2012 at 10:48:17AM +0800, Yongqiang Yang wrote:
> What if we use the inode number as the hash value? Does it work?

The whole point of using the tree structure is to accelerate
filename -> inode number lookups. The namei lookup doesn't have the inode
number; the whole point is to use the filename to look up the inode
number. So we can't use the inode number as the hash value, since that's
what we are trying to find out.

We could do this if we had two b-trees, one indexed by filename and one
indexed by inode number, which is what JFS (and I believe btrfs) does.

						- Ted
On Tue, 13 Mar 2012, Ted Ts'o wrote:

> On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
> >
> > I think a format change would be preferable to runtime sorting.
>
> Are you volunteering to spearhead the design and coding of such a thing?
> Run-time sorting is backwards compatible, and a heck of a lot easier to
> code and test...

Actually I am in the process of creating some projects for the local
universities, and this certainly sounds like something that a student
could do. I have already put it into the proposal, so we might have
someone looking into this in the near future. Since the problem has been
there since forever, I think it is not that critical to solve it right
away.

I kind of like the idea of having a separate btree with inode numbers for
directory reading, just because it does not affect allocation policy nor
the write performance, which is a good thing. Also, it has been done
before in btrfs and it works very well for them. The only downside (aside
from the format change) is the extra step when removing a directory
entry, but the positives outweigh the negatives.

Maybe this could even be done in a way which does not require a format
change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which tells
us that there is an inode number btree to use on directory reads. Then
the pointer to the root of that tree would be stored at the end of the
root block of the htree (of course, only if there is still some space for
it) and the rest is easy.

So if we mount the old file system with the new code, the flag will not
be there, but newly added directories could use it anyway. Also, e2fsck
could easily add the inumber tree to a directory if it is missing.

If we mount the new file system with the old code (which does not know
about this feature), everything stays the same and we never touch the
inumber tree at all. Of course there is a possibility that we would
rewrite the end of the htree root block, overwriting the pointer to the
inumber tree and effectively losing one block. We would not actually care
that we rewrote this information, because we do not actually need it, so
the only downside is the one block lost, which is not that bad I guess.

Now, suppose we have rewritten the pointer to the inumber tree and we're
going to mount the file system with the new code again. It would expect
the inumber tree to be there, but not blindly: some kind of magic
information would have to be stored in the inumber root pointer so we can
be sure that it really is what we're looking for. If it is not there,
well, then we do not care and walk the directory the old way, and we can
always create the inumber tree in a new directory. And again, e2fsck
should be able to fix that for us as well.

So that is just an idea -- I have not looked at the actual code to see if
it would really be possible, but it certainly seems so. Am I missing
something?

Anyway, if there is going to be a format change, I agree that having a
solution for those who are not comfortable with the format change is
equally important. But maybe there will be a better solution which does
not require the format change.

Thanks!
-Lukas
On Wed, Mar 14, 2012 at 4:12 PM, Lukas Czerner <lczerner@redhat.com> wrote:
> Maybe this could even be done in a way which does not require a format
> change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which tells
> us that there is an inode number btree to use on directory reads. Then
> the pointer to the root of that tree would be stored at the end of the
> root block of the htree (of course, only if there is still some space for
> it) and the rest is easy.

dx_root takes 32 bytes from the 60 bytes of i_data. The beginning of
dx_entries is controlled by info_length, that is,
dx_root->info + dx_root->info->info_length. It seems that we can put the
new tree root behind dx_root and add the length of the new tree root to
info_length. Then the old code can handle the new filesystem, and the new
code can handle the old filesystem because of the flag you described
above.

If no one attempts to do it, I can do it as a Google Summer of Code
project.

Yongqiang.

--
Best Wishes
Yongqiang Yang
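Purely as a sketch of the idea being discussed here (a magic-tagged
secondary root placed after the htree root info and covered by
info_length), with entirely made-up names and values -- nothing like this
exists in ext4 today:

#include <stdint.h>

/* Hypothetical layout only.  The thread's idea: append a second,
 * magic-tagged root after the htree root info and extend info_length
 * over it, so old kernels simply skip it, while new kernels verify the
 * magic (and an inode flag such as the proposed EXT4_INUM_INDEX_FL)
 * before trusting it. */
struct inum_root_info {
        uint32_t magic;           /* e.g. 0x494e554d ("INUM"), assumed value */
        uint32_t root_block;      /* logical block of the inumber tree root  */
        uint8_t  indirect_levels; /* depth of the secondary tree             */
        uint8_t  unused[3];
};

/* On disk these fields would be little-endian (__le32 etc.); plain stdint
 * types are used here only to keep the sketch self-contained. */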
On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> I kind of like the idea of having a separate btree with inode numbers for
> directory reading, just because it does not affect allocation policy nor
> the write performance, which is a good thing. Also, it has been done
> before in btrfs and it works very well for them. The only downside (aside
> from the format change) is the extra step when removing a directory
> entry, but the positives outweigh the negatives.

Well, there will be extra (journaled!) block reads and writes involved
when adding or removing directory entries. So the performance of
workloads that do a large number of directory adds and removes will have
to be impacted. By how much is not obvious, but it will be something
which is measurable.

> Maybe this could even be done in a way which does not require a format
> change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which tells
> us that there is an inode number btree to use on directory reads. Then
> the pointer to the root of that tree would be stored at the end of the
> root block of the htree (of course, only if there is still some space for
> it) and the rest is easy.

You can make it be an RO_COMPAT change instead of an INCOMPAT change, yes.

And if you want to do it as simply as possible, we could just recycle the
current htree approach for the second tree, and simply store its root in
another set of directory blocks. But by putting the index nodes in the
directory blocks, masquerading as deleted directory entries, it means that
readdir() has to cycle through them and ignore the index blocks.

The alternate approach is to use physical block numbers instead of logical
block numbers for the index blocks, and to store them separately from the
blocks containing actual directory entries. But if we do that for the
inumber tree, the next question that arises is whether we should do that
for the hash tree as well --- and then once you open that can of worms, it
gets a lot more complicated.

So the question that really arises here is how wide open we want to leave
the design space, and whether we are optimizing for the best possible
layout change ignoring the amount of implementation work it might require,
or whether we keep things as simple as possible from a code change
perspective. There are good arguments that can be made either way, and a
lot depends on the quality of the students you can recruit, the amount of
time they have, and how much review time it will take out of the core team
during the design and implementation phases.

Regards,

						- Ted
> We could do this if we had two b-trees, one indexed by filename and one
> indexed by inode number, which is what JFS (and I believe btrfs) does.

Typically the inode number of the destination inode isn't used to index
entries for a readdir tree because of (wait for it) hard links. You end
up right back where you started, with multiple entries per key.

A painful solution is to have the key in the readdir tree allocated by the
tree itself -- count key populations in subtrees per child pointer and use
that to find free keys. You still have the problem of correlating entry
keys with the location of inodes, but at least now you have a structure to
navigate to try and do that. You can also trivially re-use freed keys and
have densely packed readdir keys that won't break 32-bit f_pos so quickly.

btrfs just punts and has an incrementing 64-bit counter per directory that
determines a new entry's position in readdir. Accompanied by the
obligatory "will be smarter some day" comment :).

- z
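A toy model of the per-directory counter scheme Zach describes; the names,
types, and the reserved key values are illustrative assumptions, not
btrfs's actual structures:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only.  Each directory keeps a monotonically increasing
 * 64-bit counter, and a new entry is keyed in the readdir index by the
 * counter's value at creation time, so readdir returns entries in
 * creation order. */
struct toy_dir {
        uint64_t next_readdir_key;      /* per-directory sequence counter */
};

static uint64_t assign_readdir_key(struct toy_dir *dir)
{
        return dir->next_readdir_key++; /* also usable as the f_pos cookie */
}

int main(void)
{
        /* Assume keys 0 and 1 are reserved for "." and ".." (assumption). */
        struct toy_dir d = { .next_readdir_key = 2 };

        printf("first entry gets key %llu\n",
               (unsigned long long)assign_readdir_key(&d));
        printf("second entry gets key %llu\n",
               (unsigned long long)assign_readdir_key(&d));
        return 0;
}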
On 3/13/2012 5:33 PM, Ted Ts'o wrote:
> Are you volunteering to spearhead the design and coding of such a thing?
> Run-time sorting is backwards compatible, and a heck of a lot easier to
> code and test...

Do you really think it is that much easier? Even if it is easier, it is
still an ugly kludge. It would be much better to fix the underlying
problem rather than try to paper over it.

> The reality is we'd probably want to implement run-time sorting *anyway*,
> for the vast majority of people who don't want to convert to a new,
> incompatible file system format. (Even if you can do the conversion using
> e2fsck --- which is possible, but it would be even more code to write ---
> system administrators tend to be very conservative about such things,
> since they might need to boot an older kernel, or use a rescue CD that
> doesn't have an up-to-date kernel or file system utilities, etc.)

The same argument could have been made against the current htree
implementation when it was added. I think it carries just as little
weight now as it did then. People who care about the added performance
the new feature provides can use it; those who don't, won't. Eventually
it will become ubiquitous.

> We would still have to implement the case where hash collisions *do*
> exist, though, and make sure the right thing happens in that case. Even
> if the chance of that happening is 1 in 2**32, with enough deployed
> systems (i.e., every Android handset, etc.) it's going to happen in real
> life.

Sure it will happen, but if we read one extra block 1 in 4 billion times,
nobody is going to notice.
On Wed, 14 Mar 2012, Ted Ts'o wrote:

> On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> > I kind of like the idea of having a separate btree with inode numbers
> > for directory reading, just because it does not affect allocation
> > policy nor the write performance, which is a good thing. Also, it has
> > been done before in btrfs and it works very well for them. The only
> > downside (aside from the format change) is the extra step when removing
> > a directory entry, but the positives outweigh the negatives.
>
> Well, there will be extra (journaled!) block reads and writes involved
> when adding or removing directory entries. So the performance of
> workloads that do a large number of directory adds and removes will have
> to be impacted. By how much is not obvious, but it will be something
> which is measurable.

Yes, it would have to be measured under various workloads.

> > Maybe this could even be done in a way which does not require a format
> > change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which
> > tells us that there is an inode number btree to use on directory reads.
> > Then the pointer to the root of that tree would be stored at the end of
> > the root block of the htree (of course, only if there is still some
> > space for it) and the rest is easy.
>
> You can make it be an RO_COMPAT change instead of an INCOMPAT change,
> yes.

Does it have to be an RO_COMPAT change though? Since this would be both
forward and backward compatible.

> And if you want to do it as simply as possible, we could just recycle the
> current htree approach for the second tree, and simply store its root in
> another set of directory blocks. But by putting the index nodes in the
> directory blocks, masquerading as deleted directory entries, it means
> that readdir() has to cycle through them and ignore the index blocks.
>
> The alternate approach is to use physical block numbers instead of
> logical block numbers for the index blocks, and to store them separately
> from the blocks containing actual directory entries. But if we do that
> for the inumber tree, the next question that arises is whether we should
> do that for the hash tree as well --- and then once you open that can of
> worms, it gets a lot more complicated.
>
> So the question that really arises here is how wide open we want to leave
> the design space, and whether we are optimizing for the best possible
> layout change ignoring the amount of implementation work it might
> require, or whether we keep things as simple as possible from a code
> change perspective.

What do you expect to gain by storing the index nodes outside the
directory entries? Only to avoid reading the index nodes of both trees
when walking just one of them? If I understand it correctly, directory
blocks are laid out sequentially, so reading them in a cold cache state
would not pose much trouble, I think. Nor would iterating through all of
the index nodes. However, it would have to be measured first.

> There are good arguments that can be made either way, and a lot depends
> on the quality of the students you can recruit, the amount of time they
> have, and how much review time it will take out of the core team during
> the design and implementation phases.

I am always hoping for the best :). We have had good experience with the
students here, but you never know that in advance. They'll have the whole
year to get through this, which is plenty of time, but the question of
course is whether they'll use that time wisely :).
But anyway, I think it is definitely worth a try.

Thanks!
-Lukas
On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
>
> > We could do this if we had two b-trees, one indexed by filename and one
> > indexed by inode number, which is what JFS (and I believe btrfs) does.
>
> Typically the inode number of the destination inode isn't used to index
> entries for a readdir tree because of (wait for it) hard links. You end
> up right back where you started, with multiple entries per key.

Well, if you are using 32-bit (or even 48-bit) inode numbers and a 64-bit
telldir cookie, it's possible to make the right thing happen. But yes, if
you are using 32-bit inode numbers and a 32-bit telldir cookie, dealing
with what happens when you have multiple hard links to the same inode in
the same directory gets tricky.

> A painful solution is to have the key in the readdir tree allocated by
> the tree itself -- count key populations in subtrees per child pointer
> and use that to find free keys.

One thing that might work is to have an extra 16-bit field in the
directory entry that gives a signed offset to the inode number, such that
inode+offset is a unique value within the btree sorted by inode+offset.
Since this tree is only used for returning entries in an optimal (or as
close to optimal as can be arranged) order, we could get away with that.

						- Ted
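A rough illustration of the key Ted sketches; the field and function names
are made up for the example and are not an actual ext4 format:

#include <stdint.h>

/* Illustrative only.  The secondary tree would be ordered by
 * (inode number + small signed offset); the offset is chosen per entry so
 * that the sum stays unique even when several hard links in the same
 * directory point at the same inode. */
struct inum_tree_key {
        uint64_t ino;           /* inode number of the target entry      */
        int16_t  ino_offset;    /* disambiguates hard links, usually 0   */
};

/* The value the tree (and the telldir/seekdir cookie) would actually be
 * ordered on. */
static inline uint64_t inum_sort_key(const struct inum_tree_key *k)
{
        return k->ino + (int64_t)k->ino_offset;
}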
On Wed, Mar 14, 2012 at 10:28:20AM -0400, Phillip Susi wrote:
>
> Do you really think it is that much easier? Even if it is easier, it is
> still an ugly kludge. It would be much better to fix the underlying
> problem rather than try to paper over it.

I don't think the choice is obvious. A solution which solves the problem
90%-plus of the time for real-life workloads, without requiring any file
system format changes, is very appealing to me and, I think, to many
customers. That being said, for someone who is looking for perfection, I
can see how it would grate on someone's nerves.

> The same argument could have been made against the current htree
> implementation when it was added. I think it carries just as little
> weight now as it did then. People who care about the added performance
> the new feature provides can use it; those who don't, won't. Eventually
> it will become ubiquitous.

The original htree implementation tried very hard to be fully backwards
compatible for reading and writing. At the time that was considered very
important, and for some folks it was a major killer feature. The original
ext3 journalling support was designed in a way where you could back out
and mount such a file system as ext2, and that was considered important
back then, too.

The fact that we've been so successful with ext4's upgrade to features
that require RO_COMPAT or INCOMPAT flags does make me more willing to
consider those features, but the reality is that the deployment time for
such a new feature is measured in 2-3 years, minimum.

So I won't rule out making such a change, but I wouldn't treat the fact
that someone wants to sign up to implement a long-term feature as mutually
exclusive with a short-term solution that solves most of the problem and
could be rolled out much more quickly.

						- Ted
On Wed, Mar 14, 2012 at 03:34:13PM +0100, Lukas Czerner wrote:
> >
> > You can make it be an RO_COMPAT change instead of an INCOMPAT change,
> > yes.
>
> Does it have to be an RO_COMPAT change though? Since this would be both
> forward and backward compatible.

The challenge is how you notice when the file system has been mounted on
an older kernel, which then inserts a directory entry without updating the
secondary tree. The older kernel won't know about the new inode flag, so
it can't clear the flag when it modifies the directory.

We were able to get away with making the original htree feature read/write
compatible because the design for it was anticipated far in advance, and
because it was before the days of enterprise kernels that have to be
supported for seven years. So we added code into ext3 to clear the htree
flag whenever the directory was modified something like two years before
the htree code made its appearance, and back then we decided it was fair
to assume no one would be using a kernel that old, or be jumping back and
forth between an ancient kernel and a modern kernel with htree enabled.
Yes, that was playing a bit fast and loose, but early on in the kernel's
history we could do that. It's not something I would do today.

The bigger deal is that, as Zach pointed out, we can't just index it by
inode number because we have to worry about hard links. Which means we
need either an explicit counter field added to the directory entry, or
some kind of new offset field. That we can't just shoehorn in and retain
backwards compatibility. And once we break backwards compatibility, we
might as well look at the full design space and see what is the most
optimal.

						- Ted
On 03/14/2012 12:48 PM, Ted Ts'o wrote:
> On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
> >
> > > We could do this if we had two b-trees, one indexed by filename and
> > > one indexed by inode number, which is what JFS (and I believe btrfs)
> > > does.
> >
> > Typically the inode number of the destination inode isn't used to index
> > entries for a readdir tree because of (wait for it) hard links. You end
> > up right back where you started, with multiple entries per key.
>
> One thing that might work is to have an extra 16-bit field in the
> directory entry that gives a signed offset to the inode number, such that
> inode+offset is a unique value within the btree sorted by inode+offset.
> Since this tree is only used for returning entries in an optimal (or as
> close to optimal as can be arranged) order, we could get away with that.

Yeah, at least that's nice and simple.

Personally, I'm always nervous when we build in new limits like this.
Some joker somewhere always manages to push up against them. But such is
life. Maybe it's the right thing in the ext* design space. The fixed
number of possible inodes certainly makes it easier to pack more inode
offsets into the telldir cookie.

- z
On Wed, Mar 14, 2012 at 08:50:02AM -0400, Ted Ts'o wrote:
> On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> > I kind of like the idea of having a separate btree with inode numbers
> > for directory reading, just because it does not affect allocation
> > policy nor the write performance, which is a good thing. Also, it has
> > been done before in btrfs and it works very well for them. The only
> > downside (aside from the format change) is the extra step when removing
> > a directory entry, but the positives outweigh the negatives.
>
> Well, there will be extra (journaled!) block reads and writes involved
> when adding or removing directory entries. So the performance of
> workloads that do a large number of directory adds and removes will have
> to be impacted. By how much is not obvious, but it will be something
> which is measurable.

Most of the difference in btrfs rm times comes from our double indexing.
We store the directory name three times: once for each dir index and once
in the inode backref. But the inode backref doesn't generate seeks the
same way the extra directory index does, so it is probably only 10% of the
cost.

We also delete file data crcs during rm, so if you're benchmarking things
to see how much it will hurt, benchmark btrfs on zero-length files.

In terms of other costs, most users of readdir will eventually wander off
and stat or open the files in the dir. With your current indexes, the
stat reads the same directory blocks that readdir used. With the double
index, the stat reads the name index instead of the readdir index.

For the common case of normal directories, today's ext4 will have the
directory blocks in cache when stat is called. Btrfs has to do extra
reads to pull in the name index. Josef worked around this by prefilling a
dentry during readdir with details on where to find the inode.

Either way, I'd definitely recommend sitting down with the xfs directory
indexes and looking at how they manage locality without the double
indexing. It's faster than btrfs and without the penalty.

It's also worth mentioning that on an SSD, there is no benefit at all to
the extra index. So when the industry finally floods the market with
hybrid PCM shingled drives, this might all be moot.

-chris
2012/3/10 Ted Ts'o <tytso@mit.edu>:
> Hey Jacek,
>
> I'm curious about the parameters of the set of directories on your
> production server. On an ext4 file system, assuming you've copied the
> directories over, what are the results of this command pipeline when you
> are cd'ed into the top of the directory hierarchy of interest (your svn
> tree, as I recall)?
>
> find . -type d -ls | awk '{print $7}' | sort -n | uniq -c
>
> I'm just curious what the distribution of directory sizes is in practice
> for your use case...

Hi Ted,

below is from the latest upstream on ext4:

 843341   4096
    165  12288
     29  16384
    164  20480
     26  24576
     22  28672
     24  32768
    118  36864
     12  40960
     19  45056
      3  49152
     12  53248
     37  57344
     10  61440
      2  65536
     13  69632
      1 131072
      1 139264
      1 143360
      1 221184
      1 249856
      1 258048
      1 262144
      1 319488
      2 335872
      1 339968

-Jacek
2012/3/11 Ted Ts'o <tytso@mit.edu>:
>
> Well, my goal in proposing this optimization is that it helps for the
> "medium size" directories in the cold cache case. The ext4 user who
> first kicked off this thread was using his file system for an SVN
> server, as I recall. I could easily believe that he has thousands of
> files; maybe even tens of thousands of files in a directory. But that
> probably still fits in 256k, or at best 512k, worth of directory blocks.

That was not an SVN server. It was a build host holding checkouts of SVN
projects.

The many-files/dirs case is common for VCSes, and SVN is not the only one
that would be affected here. AFAIR git.kernel.org was also suffering from
getdents(). The same applies to commercial products that are heavily
stuffed with many files/dirs, e.g. ClearCase or Synergy. In the case of
the latter we have been running on XFS for a while now, and I must say
that this was a huge improvement, mostly in backup times (where a dump of
Informix and the whole filesystem is ordinarily compressed into a
tarball).

The medium size you are referring to would most probably fit into 256k,
and this could be enough for 90% of cases. Large production systems
running on ext4 need backups, so those would benefit the most here.

-Jacek
On Thu, Mar 15, 2012 at 11:42:24AM +0100, Jacek Luczak wrote:
>
> That was not an SVN server. It was a build host holding checkouts of SVN
> projects.
>
> The many-files/dirs case is common for VCSes, and SVN is not the only one
> that would be affected here.

Well, with SVN it's 2x or 3x the number of files in the checked-out source
code directory, right? So if a particular source tree has 2,000 files in
a source directory, then SVN might have at most 6,000 files, and if you
assume each directory entry is 64 bytes, we're still talking about 375k.
Do you have more files than that in a directory in practice with SVN? And
if so, why?

> AFAIR git.kernel.org was also suffering from getdents().

git.kernel.org was suffering from a different problem, which was that the
git.kernel.org administrators didn't feel like automatically doing a "git
gc" on all of the repositories, and a lot of people were just doing git
pushes and not bothering to gc their repositories. Since git.kernel.org
users don't have shell access any more, the git.kernel.org administrators
have to be doing automatic git gc's.

By default git is supposed to automatically do a gc when there are more
than 6700 loose object files (which are distributed across 256 1st-level
directories, so in practice a .git/objects/XX directory shouldn't have
more than 30 objects in it, with each directory object taking 48 bytes).
The problem, I believe, is that "git push" commands weren't checking the
gc.auto limit, and that's why git.kernel.org had in the past suffered from
large directories. This is arguably a git bug, though, and as I
mentioned, since we all don't have shell access to git.kernel.org, this
has to be handled automatically now....

> The same applies to commercial products that are heavily stuffed with
> many files/dirs, e.g. ClearCase or Synergy.

How many files in a directory do we commonly see with these systems? I'm
not familiar with them, and so I don't have a good feel for what typical
directory sizes tend to be.

> The medium size you are referring to would most probably fit into 256k,
> and this could be enough for 90% of cases. Large production systems
> running on ext4 need backups, so those would benefit the most here.

Yeah, 256k or 512k is probably the best. Alternatively, the backup
programs could simply be taught to sort the directory entries by inode
number, and if that's not enough, to grab the initial block numbers using
FIEMAP and then sort by block number.

Of course, all of this optimization may or may not actually give us as
much return as we think, given that the disk is probably seeking due to
other workloads happening in parallel anyway (another reason why I am
suspicious that timing the tar command may not be an accurate way of
measuring actual performance when you have other tasks accessing the file
system in parallel with the backup).

						- Ted
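For the backup-program side of this, a minimal sketch (assuming Linux's
FS_IOC_FIEMAP ioctl, with most error handling stripped) of how a scanner
might fetch a file's first physical extent, which it could then sort on in
addition to the inode number:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the physical byte offset of the file's first extent, or 0 if it
 * cannot be determined (empty or inline files, old kernels, etc.). */
static unsigned long long first_physical_offset(const char *path)
{
        char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent)]
                __attribute__((aligned(8)));
        struct fiemap *fm = (struct fiemap *)buf;
        unsigned long long phys = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return 0;
        memset(buf, 0, sizeof(buf));
        fm->fm_length = ~0ULL;          /* map the whole file          */
        fm->fm_extent_count = 1;        /* we only need extent #0      */
        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
                phys = fm->fm_extents[0].fe_physical;
        close(fd);
        return phys;
}

int main(int argc, char **argv)
{
        int i;

        /* A backup scanner could record this per entry and sort on
         * (first physical offset, inode) before reading file contents. */
        for (i = 1; i < argc; i++)
                printf("%llu\t%s\n", first_physical_offset(argv[i]), argv[i]);
        return 0;
}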