Hi All,

Long story short: We've found that operations on a directory structure
holding many dirs take ages on ext4.

The Question: Why is there such a huge difference between ext4 and btrfs?
See below test results for real values.

Background: I had to back up a Jenkins directory holding workspaces for a
few projects which were checked out from svn (which implies a lot of
extra .svn dirs). The copy takes a lot of time (at least more than I
expected) and the process was mostly in D state (disk sleep). I've dug in
more and done some extra tests to see if this is not a regression on the
block/fs side. To isolate the issue I've also performed the same tests on
btrfs.

Test environment configuration:
1) HW: HP ProLiant BL460 G6, 48 GB of memory, 2x 6-core Intel X5670 with
HT enabled, Smart Array P410i, RAID 1 on top of 2x 10K RPM SAS HDDs.
2) Kernels: All tests were done on the following kernels:
- 2.6.39.4-3 -- the build ID (3) is used here mostly for internal
tracking of config changes. In -3 we've introduced the "fix readahead
pipeline break caused by block plug" patch. Otherwise it's pure 2.6.39.4.
- 3.2.7 -- latest kernel at the time of testing (3.2.8 has been released
recently).
3) The subject of the tests, a directory holding:
- 54GB of data (measured on ext4)
- 1978149 files
- 844008 directories
4) Mount options:
- ext4 -- errors=remount-ro,noatime,data=writeback
- btrfs -- noatime,nodatacow and, for later investigation of the
compression effect: noatime,nodatacow,compress=lzo

In all tests I've been measuring time of execution. The following tests
were performed:
- find . -type d
- find . -type f
- cp -a
- rm -rf

Ext4 results:
| Type     | 2.6.39.4-3 | 3.2.7
| Dir cnt  | 17m 40sec  | 11m 20sec
| File cnt | 17m 36sec  | 11m 22sec
| Copy     | 1h 28m     | 1h 27m
| Remove   | 3m 43sec
On Wed, Feb 29, 2012 at 02:31:03PM +0100, Jacek Luczak wrote:
> Hi All,
>
> Long story short: We've found that operations on a directory structure
> holding many dirs take ages on ext4.
>
> The Question: Why is there such a huge difference between ext4 and btrfs?
> See below test results for real values.
> [...]
> 4) Mount options:
> - ext4 -- errors=remount-ro,noatime,data=writeback
> - btrfs -- noatime,nodatacow and, for later investigation of the
> compression effect: noatime,nodatacow,compress=lzo

For btrfs, nodatacow and compression don't really mix.  The compression
will just override it.  (Just FYI, not really related to these results).

> In all tests I've been measuring time of execution. The following tests
> were performed:
> - find . -type d
> - find . -type f
> - cp -a
> - rm -rf
>
> Ext4 results:
> | Type     | 2.6.39.4-3 | 3.2.7
> | Dir cnt  | 17m 40sec  | 11m 20sec
> | File cnt | 17m 36sec  | 11m 22sec
> | Copy     | 1h 28m     | 1h 27m
> | Remove   | 3m 43sec

Are the btrfs numbers missing? ;)

In order for btrfs to be faster for cp -a, the files probably didn't
change much since creation.  Btrfs maintains extra directory indexes
that help in sequential backup scans, but this usually means slower
delete performance.

But, how exactly did you benchmark it?  If you compare a fresh
mkfs.btrfs where you just copied all the data over with an ext4 FS that
has been on the disk for a long time, it isn't quite fair to ext4.

-chris
Hi All,

/* Sorry for sending an incomplete email, hit the wrong button :) I guess
I can't use Gmail */

Long story short: We've found that operations on a directory structure
holding many dirs take ages on ext4.

The Question: Why is there such a huge difference between ext4 and btrfs?
See below test results for real values.

Background: I had to back up a Jenkins directory holding workspaces for a
few projects which were checked out from svn (which implies a lot of
extra .svn dirs). The copy takes a lot of time (at least more than I
expected) and the process was mostly in D state (disk sleep). I've dug in
more and done some extra tests to see if this is not a regression on the
block/fs side. To isolate the issue I've also performed the same tests on
btrfs.

Test environment configuration:
1) HW: HP ProLiant BL460 G6, 48 GB of memory, 2x 6-core Intel X5670 with
HT enabled, Smart Array P410i, RAID 1 on top of 2x 10K RPM SAS HDDs.
2) Kernels: All tests were done on the following kernels:
- 2.6.39.4-3 -- the build ID (3) is used here mostly for internal
tracking of config changes. In -3 we've introduced the "fix readahead
pipeline break caused by block plug" patch. Otherwise it's pure 2.6.39.4.
- 3.2.7 -- latest kernel at the time of testing (3.2.8 has been released
recently).
3) The subject of the tests, a directory holding:
- 54GB of data (measured on ext4)
- 1978149 files
- 844008 directories
4) Mount options:
- ext4 -- errors=remount-ro,noatime,data=writeback
- btrfs -- noatime,nodatacow and, for later investigation of the
compression effect: noatime,nodatacow,compress=lzo

In all tests I've been measuring time of execution. The following tests
were performed:
- find . -type d
- find . -type f
- cp -a
- rm -rf

Ext4 results:
| Type     | 2.6.39.4-3 | 3.2.7
| Dir cnt  | 17m 40sec  | 11m 20sec
| File cnt | 17m 36sec  | 11m 22sec
| Copy     | 1h 28m     | 1h 27m
| Remove   | 3m 43sec   | 3m 38sec

Btrfs results (without lzo compression):
| Type     | 2.6.39.4-3 | 3.2.7
| Dir cnt  | 2m 22sec   | 2m 21sec
| File cnt | 2m 26sec   | 2m 23sec
| Copy     | 36m 22sec  | 39m 35sec
| Remove   | 7m 51sec   | 10m 43sec

From the above one can see that the copy takes close to 1h less on btrfs.
I've done strace counting times of calls; the results are as follows
(from 3.2.7):

1) Ext4 (top elements only):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.01   13.257850           1  15082163           read
 23.40    5.440353           3   1687702           getdents
  6.15    1.430559           0   3672418           lstat
  3.80    0.883767           0  13106961           write
  2.32    0.539959           0   4794099           open
  1.69    0.393589           0    843695           mkdir
  1.28    0.296700           0   5637802           setxattr
  0.80    0.186539           0   7325195           stat

2) Btrfs:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.38    9.486210           1  15179751           read
 11.38    2.021662           1   1688328           getdents
 10.64    1.890234           0   4800317           open
  6.83    1.213723           0  13201590           write
  4.85    0.862731           0   5644314           setxattr
  3.50    0.621194           1    844008           mkdir
  2.75    0.489059           0   3675992         1 lstat
  1.71    0.303544           0   5644314           llistxattr
  1.50    0.265943           0   1978149           utimes
  1.02    0.180585           0   5644314    844008 getxattr

On btrfs getdents takes much less time, which proves that the bottleneck
in copy time on ext4 is this syscall. In 2.6.39.4 it shows even less time
for getdents:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.77   10.978816           1  15033132           read
 14.46    3.125996           1   4733589           open
  7.15    1.546311           0   5566988           setxattr
  5.89    1.273845           0   3626505           lstat
  5.81    1.255858           1   1667050           getdents
  5.66    1.224403           0  13083022           write
  3.40    0.735114           1    833371           mkdir
  1.96    0.424881           0   5566988           llistxattr

Why such a huge difference in the getdents timings?

-Jacek
Hi Chris,

the last one was borked :) Please check this one.

-Jacek

2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> Hi All,
>
> /* Sorry for sending an incomplete email, hit the wrong button :) I guess
> I can't use Gmail */
>
> Long story short: We've found that operations on a directory structure
> holding many dirs take ages on ext4.
> [...]
> Why such a huge difference in the getdents timings?
>
> -Jacek
On Wed, 29 Feb 2012, Chris Mason wrote:
> On Wed, Feb 29, 2012 at 02:31:03PM +0100, Jacek Luczak wrote:
> > Long story short: We've found that operations on a directory structure
> > holding many dirs take ages on ext4.
> > [...]
>
> In order for btrfs to be faster for cp -a, the files probably didn't
> change much since creation.  Btrfs maintains extra directory indexes
> that help in sequential backup scans, but this usually means slower
> delete performance.

Exactly, and IIRC ext4 has directory entries stored in hash order, which
does not really help sequential access.

> But, how exactly did you benchmark it?  If you compare a fresh
> mkfs.btrfs where you just copied all the data over with an ext4 FS that
> has been on the disk for a long time, it isn't quite fair to ext4.

I have the same question. Note that if the files on ext4 have been worked
with, it may very well be that the directory hash trees are not in very
good shape. You can attempt to optimize that with e2fsck (just run
fsck.ext4 -f <device>), but that may take quite some time and memory; it
is worth trying though.

Thanks!
-Lukas
On Wed, Feb 29, 2012 at 08:51:58AM -0500, Chris Mason wrote:
> On Wed, Feb 29, 2012 at 02:31:03PM +0100, Jacek Luczak wrote:
> > Ext4 results:
> > | Type     | 2.6.39.4-3 | 3.2.7
> > | Dir cnt  | 17m 40sec  | 11m 20sec
> > | File cnt | 17m 36sec  | 11m 22sec
> > | Copy     | 1h 28m     | 1h 27m
> > | Remove   | 3m 43sec
>
> Are the btrfs numbers missing? ;)

[ answered in a different reply, btrfs is faster in everything except
delete ]

The btrfs readdir uses an index that is much more likely to be sequential
on disk than ext.  This makes the readdir more sequential and it makes
the actual file IO more sequential because we're reading things in the
order they were created instead of (random) htree index order.

> In order for btrfs to be faster for cp -a, the files probably didn't
> change much since creation.  Btrfs maintains extra directory indexes
> that help in sequential backup scans, but this usually means slower
> delete performance.
>
> But, how exactly did you benchmark it?  If you compare a fresh
> mkfs.btrfs where you just copied all the data over with an ext4 FS that
> has been on the disk for a long time, it isn't quite fair to ext4.

But, the consistent benchmarking part is really important.  We shouldn't
put an aged ext4 up against a fresh mkfs.btrfs.  Did you do the ext4
comparisons on a fresh copy?

-chris
2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
>> Hi Chris,
>>
>> the last one was borked :) Please check this one.
>> [...]

I will try to answer the question from the broken email I've sent.

@Lukas, it was always a fresh FS on top of an LVM logical volume. I've
been cleaning caches/remounting to sync all data before (re)doing the
tests.

-Jacek

BTW: Sorry for the email mixture. I just can't get this gmail thing to
work (why force top posting :/). Please use this thread.
2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> I will try to answer the question from the broken email I've sent.
>
> @Lukas, it was always a fresh FS on top of an LVM logical volume. I've
> been cleaning caches/remounting to sync all data before (re)doing the
> tests.
> [...]

More from the observations:

1) A 10s dump of the process state during copy shows:
- Ext4: 526 probes done, 34 hits in R state, 492 hits in D state
- Btrfs (2.6.39.4): 218, 83, 135
- Btrfs (3.2.7): 238, 62, 174, 2 hits sleeping

2) dd write/read of a 55GB file to/from the volume:
- Ext4: write 127MB/s, read 107MB/s
- Btrfs: write 110MB/s, read 176MB/s

-Jacek
On Wed, Feb 29, 2012 at 03:07:45PM +0100, Jacek Luczak wrote:

[ btrfs faster than ext for find and cp -a ]

> 2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
>
> I will try to answer the question from the broken email I've sent.
>
> @Lukas, it was always a fresh FS on top of an LVM logical volume. I've
> been cleaning caches/remounting to sync all data before (re)doing the
> tests.

The next step is to get cp -a out of the picture, in this case you're
benchmarking both the read speed and the write speed (what are you
copying to btw?).

Using tar cf /dev/zero <some_dir> is one way to get a consistent picture
of the read speed.

You can confirm the theory that it is directory order causing problems
by using acp to read the data.

http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2

-chris
2012/2/29 Chris Mason <chris.mason@oracle.com>:
> On Wed, Feb 29, 2012 at 03:07:45PM +0100, Jacek Luczak wrote:
>
> [ btrfs faster than ext for find and cp -a ]
>
> The next step is to get cp -a out of the picture, in this case you're
> benchmarking both the read speed and the write speed (what are you
> copying to btw?).

It's a simple cp -a Jenkins{,.bak}, so a dir-to-dir copy on the same
volume.

> Using tar cf /dev/zero <some_dir> is one way to get a consistent picture
> of the read speed.

IMO the problem is not -- only -- in the read speed; the directory order
hurts here. There's a difference in the sequential tests that places
btrfs as the winner, but still this should not have such a huge influence
on getdents. I know a bit about the differences between ext4 and btrfs
directory handling and I would not expect a difference this big. On the
production system where the issue has been observed, with some real work
running in the background, the copy takes up to 4h.

For me btrfs looks perfect here; what could be worth checking is the
change in syscall timings between 2.6.39.4 and 3.2.7. Before, getdents
was not that high on the list, while now it jumps to second position,
but without a huge impact on the timings.

> You can confirm the theory that it is directory order causing problems
> by using acp to read the data.
>
> http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2

Will check this still today and report back.

-jacek
You might try sorting the entries returned by readdir by inode number
before you stat them.  This is a long-standing weakness in ext3/ext4,
and it has to do with how we added hashed tree indexes to directories in
(a) a backwards compatible way, that (b) was POSIX compliant with
respect to adding and removing directory entries concurrently with
reading all of the directory entries using readdir.

You might try compiling spd_readdir from the e2fsprogs source tree (in
the contrib directory):

http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209

… and then using that as an LD_PRELOAD, and see how that changes things.

The short version is that we can't easily do this in the kernel since
it's a problem that primarily shows up with very big directories, and
using non-swappable kernel memory to store all of the directory entries
and then sort them so they can be returned in inode number order just
isn't practical.  It is something which can be easily done in userspace,
though, and a number of programs (including mutt for its Maildir
support) do so, and it helps greatly for workloads where you are calling
readdir() followed by something that needs to access the inode (i.e.,
stat, unlink, etc.)

-- Ted

On Feb 29, 2012, at 8:52 AM, Jacek Luczak wrote:

> Hi All,
>
> Long story short: We've found that operations on a directory structure
> holding many dirs take ages on ext4.
> [...]
> Why such a huge difference in the getdents timings?
>
> -Jacek
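For illustration, here is a minimal standalone sketch of the sort-by-inode
approach Ted describes above (what spd_readdir effectively does behind an
LD_PRELOAD). This is not the actual spd_readdir.c code; the program is a
hypothetical example with simplified error handling:

/*
 * Sketch: read all directory entries, sort them by inode number, then
 * stat in that order so the inode table reads are roughly sequential on
 * disk instead of following htree hash order.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
	ino_t ino;
	char name[256];
};

static int by_ino(const void *a, const void *b)
{
	ino_t ia = ((const struct ent *)a)->ino;
	ino_t ib = ((const struct ent *)b)->ino;
	return (ia > ib) - (ia < ib);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *d = opendir(path);
	if (!d)
		return 1;

	struct ent *v = NULL;
	size_t n = 0, cap = 0;
	struct dirent *de;

	while ((de = readdir(d)) != NULL) {
		if (n == cap) {
			cap = cap ? cap * 2 : 1024;
			v = realloc(v, cap * sizeof(*v));
			if (!v)
				return 1;
		}
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
		n++;
	}

	/* Visit entries in inode order rather than readdir (hash) order. */
	qsort(v, n, sizeof(*v), by_ino);

	for (size_t i = 0; i < n; i++) {
		struct stat st;
		if (fstatat(dirfd(d), v[i].name, &st, AT_SYMLINK_NOFOLLOW) == 0)
			printf("%12lu  %s\n",
			       (unsigned long)st.st_ino, v[i].name);
	}

	closedir(d);
	free(v);
	return 0;
}

The same idea is what helps tools like mutt: buffer the whole directory in
userspace, where sorting a few million entries is cheap compared to the
random inode reads it avoids.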
2012/2/29 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/2/29 Chris Mason <chris.mason@oracle.com>:
>> You can confirm the theory that it is directory order causing problems
>> by using acp to read the data.
>>
>> http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2
>
> Will check this still today and report back.

While I was about to grab acp I noticed seekwatcher, which made my day :)

seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png

I will send acp results soon.

-Jacek
On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <difrost.kernel@gmail.com> wrote:
>
> While I was about to grab acp I noticed seekwatcher, which made my day :)
>
> seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>
> I will send acp results soon.
>
Would you please take reiserfs into account?

Thanks
-hd
2012/3/1 Hillf Danton <dhillf@gmail.com>:
> On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <difrost.kernel@gmail.com> wrote:
>>
>> While I was about to grab acp I noticed seekwatcher, which made my day :)
>>
>> seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
>> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
>> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
>> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>>
>> I will send acp results soon.
>>
> Would you please take reiserfs into account?

As of now, no (lack of time), but I'm pretty close to bringing XFS into
the game. Whenever I have more time and there isn't pressure to give the
host back to production, I will redo the same tests for reiserfs.

Now I'm focused on the userspace sorting results.

-Jacek
On Thu, Mar 01, 2012 at 03:03:53PM +0100, Jacek Luczak wrote:
> 2012/3/1 Hillf Danton <dhillf@gmail.com>:
> > On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <difrost.kernel@gmail.com> wrote:
> >>
> >> seekwatcher runs of tar cf to eliminate writes (all done on 3.2.7):
> >> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
> >> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
> >> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png

Whoa, seekwatcher makes it pretty clear.

> >> I will send acp results soon.
> >>
> > Would you please take reiserfs into account?
>
> As of now, no (lack of time), but I'm pretty close to bringing XFS into
> the game. Whenever I have more time and there isn't pressure to give the
> host back to production, I will redo the same tests for reiserfs.
>
> Now I'm focused on the userspace sorting results.

reiserfs should have results very similar to ext4.  The directory
hashing used by reiserfs is going to result in a very random read
pattern.

XFS will probably beat btrfs in this test.  Their directory indexes
reflect on disk layout very well.

-chris
On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number
> before you stat them.  This is a long-standing weakness in ext3/ext4,
> and it has to do with how we added hashed tree indexes to directories in
> (a) a backwards compatible way, that (b) was POSIX compliant with
> respect to adding and removing directory entries concurrently with
> reading all of the directory entries using readdir.
>
> You might try compiling spd_readdir from the e2fsprogs source tree (in
> the contrib directory):
>
> http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209
>
> … and then using that as an LD_PRELOAD, and see how that changes things.
> [...]

For reading the files, the acp program I sent him tries to do something
similar.  I had forgotten about spd_readdir though, we should consider
hacking that into cp and tar.

One interesting note is that the page cache used to help here.  Picture
two tests:

A) time tar cf /dev/zero /home

and

   cp -a /home /new_dir_in_new_fs
   unmount / flush caches
B) time tar cf /dev/zero /new_dir_in_new_fs

On ext, the time for B used to be much faster than the time for A
because the files would get written back to disk in roughly htree order.
Based on Jacek's data, that isn't true anymore.

-chris
2012/3/1 Chris Mason <chris.mason@oracle.com>:
> On Thu, Mar 01, 2012 at 03:03:53PM +0100, Jacek Luczak wrote:
> > [...]
>
> Whoa, seekwatcher makes it pretty clear.

Yep, ext4 is close to my wife's closet.

> reiserfs should have results very similar to ext4.  The directory
> hashing used by reiserfs is going to result in a very random read
> pattern.
>
> XFS will probably beat btrfs in this test.  Their directory indexes
> reflect on disk layout very well.

True, but not that fast on small files.

Apart from the question I raised in the first mail, there's a point to
all those actions. We maintain hosts that are used for building software:
random access, lots of small files and dirs (always a checkout), heavy
parallel IO. We tested XFS vs ext4 a year ago and XFS was around 10%
slower on build times. We have not -- yet -- done the same on btrfs. Now
we're looking for a replacement for ext4 as we suffer from those issues
-- but we were not aware of that until we stepped into this issue.

If you would like me to do some specific tests around ext4 and btrfs,
let me know.

-Jacek
On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> 2012/3/1 Chris Mason <chris.mason@oracle.com>:
> > XFS will probably beat btrfs in this test.  Their directory indexes
> > reflect on disk layout very well.
>
> True, but not that fast on small files.
>
> Apart from the question I raised in the first mail, there's a point to
> all those actions. We maintain hosts that are used for building software:
> random access, lots of small files and dirs (always a checkout), heavy
> parallel IO. We tested XFS vs ext4 a year ago and XFS was around 10%
> slower on build times. We have not -- yet -- done the same on btrfs. Now
> we're looking for a replacement for ext4 as we suffer from those issues
> -- but we were not aware of that until we stepped into this issue.
>
> If you would like me to do some specific tests around ext4 and btrfs,
> let me know.

I'm always curious to see comparisons in real world workloads.  You
should definitely consider testing XFS again, the big three filesystems
are under pretty constant improvement.  For btrfs, please stick to 3.2
kernels and higher.

This seeky backup performance is somewhat built into ext4, but as Ted
said there are a few workarounds.

-chris
2012/3/1 Chris Mason <chris.mason@oracle.com>:
> On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> > If you would like me to do some specific tests around ext4 and btrfs,
> > let me know.
>
> I'm always curious to see comparisons in real world workloads.  You
> should definitely consider testing XFS again, the big three filesystems
> are under pretty constant improvement.  For btrfs, please stick to 3.2
> kernels and higher.

That's the plan, but I'm waiting for more of the brilliant work that
recently popped up in XFS. For btrfs, the changes introduced in 3.2 led
me to give it a try here. I don't have nice pictures and numbers in hand
now, but first tests showed close to 40% timing improvements between
2.6.39.4 and 3.2.7 -- keep up the great work, guys (and girls, if any)!

-Jacek
On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
>
> Yep, ext4 is close to my wife's closet.
>

Were all of the file systems freshly laid down, or was this an aged
ext4 file system?

Also you should beware that if you have a workload which is heavy
parallel I/O, with lots of random read/write accesses to small files, a
benchmark using tar might not be representative of what you will see in
production --- different file systems have different strengths and
weaknesses --- and the fact that ext3/ext4's readdir() returns inodes in
a non-optimal order for stat(2) or unlink(2) or file copy in the cold
cache case may not matter as much as you think in a build server.
(i.e., the directories that do need to be searched will probably be
serviced out of the dentry cache, etc.)

Regards,

					- Ted
2012/3/1 Ted Ts'o <tytso@mit.edu>:
> On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> >
> > Yep, ext4 is close to my wife's closet.
> >
>
> Were all of the file systems freshly laid down, or was this an aged
> ext4 file system?

Always fresh, recreated for each test -- that's why it takes quite a lot
of time, as I had to copy the test dir back into place. The env is kept
as consistent as possible across all tests, so the values should be
credible.

> Also you should beware that if you have a workload which is heavy
> parallel I/O, with lots of random read/write accesses to small files, a
> benchmark using tar might not be representative of what you will see in
> production --- different file systems have different strengths and
> weaknesses --- and the fact that ext3/ext4's readdir() returns inodes in
> a non-optimal order for stat(2) or unlink(2) or file copy in the cold
> cache case may not matter as much as you think in a build server.
> (i.e., the directories that do need to be searched will probably be
> serviced out of the dentry cache, etc.)

The set of tests was not chosen to find the best fs for build purposes.
For pure builds, ext4 is as of now the best, and most of the points
you've made above are valid here.

We've performed real tests in a clone of the production environment.
The results are not that surprising, as one can find the same in e.g.
Phoronix tests. We've compared XFS vs ext[34] and ext4 vs btrfs. Here,
if we take only software compilation into account, ext4 rocks. Btrfs is
only a few seconds slower (5 at most on average). The choice back then
was to use ext4 due to its more mature foundations and support in RHEL.

Why are we looking for a new one? Well, the build environment is not
only about software building. There are e.g. some strange tests running
in parallel, code analysis, etc. Users are doing damn odd things there
... they are using Java; you can imagine how bad a zombie Java process
can be. We've failed to isolate each use case and create a profiled
environment, thus we need to find something that makes sense for all use
cases. The previous tests showed that ext[34] rocks on compilation
timings, but not really all around. Also, one needs to remember that the
fs content changes often; ext3 was ageing within 4-6 months of use. XFS
on the other hand was great all around, just not on compilation timings.
Roughly 10% is not that much, but if a host is doing builds 24/7, then
after a few days we were well behind the ext[34] clone env.

Btrfs was only tested on compilation timings, not in general use. We've
created a simple test case for compilation. It's not quite the same as
what we have in the real env, but it is a good baseline (the kernel
build system is too perfect). Simply parallel kernel builds with a
randomly chosen allyesconfig or allmodconfig. Below are the seekwatcher
graphs of around 1h of the tests running. There were 10 builds (kernels
2.6.20-2.6.29) running with three parallel threads:
1) ext4: http://91.234.146.107/~difrost/seekwatcher/kt_ext4.png
2) btrfs: http://91.234.146.107/~difrost/seekwatcher/kt_btrfs.png
3) both: http://91.234.146.107/~difrost/seekwatcher/kt_btrfs_ext4.png

The above graphs prove that ext4 is ahead in this "competition". I will
try to set up a real build env there to see how those two compare.

-Jacek
2012/3/1 Chris Mason <chris.mason@oracle.com>:
> On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> > You might try sorting the entries returned by readdir by inode number
> > before you stat them.
> > [...]
>
> For reading the files, the acp program I sent him tries to do something
> similar.  I had forgotten about spd_readdir though, we should consider
> hacking that into cp and tar.
>
> One interesting note is that the page cache used to help here.  Picture
> two tests:
>
> A) time tar cf /dev/zero /home
>
> and
>
>    cp -a /home /new_dir_in_new_fs
>    unmount / flush caches
> B) time tar cf /dev/zero /new_dir_in_new_fs
>
> On ext, the time for B used to be much faster than the time for A
> because the files would get written back to disk in roughly htree order.
> Based on Jacek's data, that isn't true anymore.

I've put both to the test. The subject is acp and spd_readdir used with
tar, all on ext4:
1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png

The acp graph looks much better than spd_readdir, but the directory copy
with spd_readdir decreased to 52m 39sec (30 min less).

-Jacek
On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>
> I've put both to the test. The subject is acp and spd_readdir used with
> tar, all on ext4:
> 1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
> 2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
> 3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
>
> The acp graph looks much better than spd_readdir, but the directory copy
> with spd_readdir decreased to 52m 39sec (30 min less).

Do you have stats on how big these files are, and how fragmented they
are?  For acp and spd to give us this, I think something has gone wrong
at writeback time (creating individual fragmented files).

-chris
2012/3/2 Chris Mason <chris.mason@oracle.com>:
> On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
> >
> > The acp graph looks much better than spd_readdir, but the directory copy
> > with spd_readdir decreased to 52m 39sec (30 min less).
>
> Do you have stats on how big these files are, and how fragmented they
> are?  For acp and spd to give us this, I think something has gone wrong
> at writeback time (creating individual fragmented files).

How big? Which files?

-Jacek
On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
> 2012/3/2 Chris Mason <chris.mason@oracle.com>:
> > Do you have stats on how big these files are, and how fragmented they
> > are?  For acp and spd to give us this, I think something has gone wrong
> > at writeback time (creating individual fragmented files).
>
> How big? Which files?

All the files you're reading ;)

filefrag will tell you how many extents each file has, any file with
more than one extent is interesting.  (The ext4 crowd may have better
suggestions on measuring fragmentation).

Since you mention this is a compile farm, I'm guessing there are a bunch
of .o files created by parallel builds.  There are a lot of chances for
delalloc and the kernel writeback code to do the wrong thing here.

-chris
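For reference, a minimal sketch of what filefrag reports for a single
file, using the FIEMAP ioctl that filefrag itself is built on (assumes
Linux UAPI headers; this is an illustration, not filefrag's code). More
than one extent means the file is fragmented:

/*
 * Count the extents of one file via FIEMAP.  With fm_extent_count set
 * to zero the kernel only returns the extent count, no extent array is
 * needed.
 */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <linux/types.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC;		/* flush delalloc first */
	fm.fm_extent_count = 0;			/* just count, don't map */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FIEMAP");
		close(fd);
		return 1;
	}

	printf("%s: %u extent(s)\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}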
On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
>
> filefrag will tell you how many extents each file has; any file with
> more than one extent is interesting. (The ext4 crowd may have better
> suggestions on measuring fragmentation.)

You can get a *huge* amount of information (probably more than you'll
want to analyze) by doing this:

	e2fsck -nf -E fragcheck /dev/XXXX >& /tmp/fragcheck.out

I haven't had time to do this in a while, but a while back I used this
to debug the writeback code with an eye towards reducing fragmentation.
At the time I was trying to optimize the case of reducing fragmentation
in the easiest case possible, where you start with an empty file system
and then copy all of the data from another file system onto it using
rsync -avH.

It would be worthwhile to see what happens with files written by the
compiler and linker. Given that libelf tends to write .o files
non-sequentially, and without telling us how big the space is in
advance, I could well imagine that we're not doing the best job of
avoiding free space fragmentation, which eventually leads to extra
file system aging.

It would be interesting to have a project where someone added
fallocate() support to libelf, and then added some heuristics to ext4
so that if a file is fallocated to a precise size, or if the file is
fully written and closed before writeback begins, we use this to pack
the space used by the files more efficiently in the block allocator.
This is a place where I would not be surprised if XFS has better code
to avoid accelerated file system aging, and where we could do better
with ext4 with some development effort.

Of course, it might also be possible to hack around this by simply
using VPATH and dropping your build files in a separate place from
your source files, and periodically reformatting the file system where
your build tree lives. (As a side note, something that works well for
me is to use an SSD for my source files, and a separate 5400rpm HDD
for my build tree. That allows me to use a smaller and more affordable
SSD, and since the object files can be written asynchronously by the
writeback threads, while the compiler can't move forward until it gets
file data from the .c or .h file, it gets me the best price/performance
for a laptop build environment.)

BTW, I suspect we could make acp even more efficient by teaching it to
use the FIEMAP ioctl to map out the data blocks for all of the files in
the source file system, and then copy the files (or perhaps even parts
of files) in a read order which reduces seeking on the source drive.

- Ted
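To make the libelf idea concrete, "fallocate() support" just means the
writer telling the filesystem the final file size before it starts
writing, so the allocator can reserve contiguous space up front. A
hypothetical sketch of such a writer -- not actual libelf code, the
function name and structure are made up:

/* Hypothetical sketch of "fallocate() support in the writer": reserve
 * the final size first, then write the contents in whatever order is
 * convenient.  Not actual libelf code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_object(const char *path, const char *image, off_t final_size)
{
	int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* Reserve all blocks now; just warn if the filesystem doesn't
	 * support it (fallocate(2) is Linux-specific). */
	if (fallocate(fd, 0, 0, final_size) != 0)
		perror("fallocate (continuing anyway)");

	/* A real ELF writer would pwrite() headers and sections
	 * non-sequentially; here we just dump one buffer. */
	ssize_t ret = pwrite(fd, image, final_size, 0);
	close(fd);
	return ret == final_size ? 0 : -1;
}

int main(void)
{
	char buf[8192];
	memset(buf, 0x7f, sizeof(buf));
	return write_object("demo.o", buf, sizeof(buf)) ? 1 : 0;
}

As Jan Kara notes later in the thread, whether such hints actually beat
delayed allocation for small files is an open question.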
On Fri, Mar 02, 2012 at 02:32:15PM -0500, Ted Ts'o wrote:
> You can get a *huge* amount of information (probably more than you'll
> want to analyze) by doing this:
>
> 	e2fsck -nf -E fragcheck /dev/XXXX >& /tmp/fragcheck.out
>
> I haven't had time to do this in a while, but a while back I used this
> to debug the writeback code with an eye towards reducing fragmentation.
> At the time I was trying to optimize the case of reducing fragmentation
> in the easiest case possible, where you start with an empty file system
> and then copy all of the data from another file system onto it using
> rsync -avH.
>
> It would be worthwhile to see what happens with files written by the
> compiler and linker. Given that libelf tends to write .o files
> non-sequentially, and without telling us how big the space is in
> advance, I could well imagine that we're not doing the best job of
> avoiding free space fragmentation, which eventually leads to extra
> file system aging.

I just realized that I confused things. He's doing a read on the
results of a cp -a to a fresh FS, so there's no way the compiler/linker
are causing trouble.

> It would be interesting to have a project where someone added
> fallocate() support to libelf, and then added some heuristics to ext4
> so that if a file is fallocated to a precise size, or if the file is
> fully written and closed before writeback begins, we use this to pack
> the space used by the files more efficiently in the block allocator.
> This is a place where I would not be surprised if XFS has better code
> to avoid accelerated file system aging, and where we could do better
> with ext4 with some development effort.

The part I don't think any of us have solved is writing back the files
in a good order after we've fallocated the blocks. So this will
probably be great for reads and not so good for writes.

> Of course, it might also be possible to hack around this by simply
> using VPATH and dropping your build files in a separate place from
> your source files, and periodically reformatting the file system where
> your build tree lives. [...]

mkfs for defrag ;) It's the only way to be sure.

> BTW, I suspect we could make acp even more efficient by teaching it to
> use the FIEMAP ioctl to map out the data blocks for all of the files in
> the source file system, and then copy the files (or perhaps even parts
> of files) in a read order which reduces seeking on the source drive.

acp does have a -b mode where it fibmaps (I was either lazy or it is
older than fiemap, I forget) the first block in the file, and uses that
to sort. It does help if the file blocks aren't ordered well wrt their
inode numbers, but not if the files are fragmented.

It's also worth mentioning that acp doesn't actually cp. I never got
that far. It was supposed to be the perfect example of why everything
should be done via aio, but it just ended up demonstrating that
ordering by inode number and leveraging kernel/hardware readahead were
more important.

-chris
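For reference, that -b mode boils down to asking where the first block
of each file lives on disk and sorting the read order on that. A rough
illustration using the legacy FIBMAP ioctl (which generally needs root);
this is just the idea, not acp's actual code:

/* Print the physical block number of logical block 0 for each file,
 * so the output can be sorted to approximate on-disk read order.
 * FIBMAP generally requires root; illustration only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

static long first_block(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	int blk = 0;	/* in: logical block number, out: physical block number */
	long ret = ioctl(fd, FIBMAP, &blk) == 0 ? blk : -1;
	close(fd);
	return ret;
}

int main(int argc, char **argv)
{
	/* "physical-block  name"; pipe through sort -n to get a read
	 * order that roughly follows disk layout. */
	for (int i = 1; i < argc; i++)
		printf("%ld\t%s\n", first_block(argv[i]), argv[i]);
	return 0;
}

Sorting by first block helps when files are laid out contiguously but
out of inode order; it cannot help when the files themselves are
fragmented, which is exactly the distinction Chris draws above.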
2012/3/2 Chris Mason <chris.mason@oracle.com>:
> On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
>> How big? Which files?
>
> All the files you're reading ;)
>
> filefrag will tell you how many extents each file has; any file with
> more than one extent is interesting. (The ext4 crowd may have better
> suggestions on measuring fragmentation.)
>
> Since you mention this is a compile farm, I'm guessing there are a bunch
> of .o files created by parallel builds. There are a lot of chances for
> delalloc and the kernel writeback code to do the wrong thing here.

This took some time, but here it goes ...

* B size files - total: 1064462 (53.81%)
| Size range | File cnt | Fragmented | Fragmented % |
| [0,100)    | 288702   | 0 | 0 |
| [100,200)  | 165997   | 0 | 0 |
| [200,300)  | 45993    | 0 | 0 |
| [300,400)  | 64883    | 0 | 0 |
| [400,500)  | 39282    | 0 | 0 |
| [500,600)  | 61612    | 0 | 0 |
| [600,700)  | 208822   | 0 | 0 |
| [700,800)  | 64016    | 0 | 0 |
| [800,900)  | 58181    | 0 | 0 |
| [900,1024) | 66974    | 0 | 0 |

* K size files - total: 907375 (45.86%)
| Size range | File cnt | Fragmented | Fragmented % |
| [1,2)      | 285481   | 0 | 0 |
| [2,3)      | 148461   | 0 | 0 |
| [3,4)      | 102569   | 0 | 0 |
| [4,5)      | 45089    | 0 | 0 |
| [5,6)      | 44223    | 0 | 0 |
| [6,10)     | 94202    | 0 | 0 |
| [10,20)    | 83760    | 0 | 0 |
| [20,30)    | 42548    | 0 | 0 |
| [30,40)    | 15274    | 0 | 0 |
| [40,50)    | 4060     | 0 | 0 |
| [50,60)    | 3517     | 0 | 0 |
| [60,100)   | 7834     | 0 | 0 |
| [100,1024) | 30357    | 0 | 0 |

* M size files - total: 6312 (0.33%)
| Size range | File cnt | Fragmented | Fragmented % |
| [1,10)     | 5806     | 100 | 1.72  |
| [10,100)   | 482      | 200 | 41.49 |
| [100,1024) | 24       | 13  | 54.16 |

All files scanned: 1978149
Files fragmented: 313 (0.015%), of which 11 have 3+ extents
Total size of fragmented files: 7GB (~13% of dir size)

tar cf on the fragmented files:
1) time: 7sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
3) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
4) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

tar cf on the fragmented files disturbed with [40,50) K files (4373
files in total), the K files coming before the fragmented M files:
1) size: 7.2GB
2) time: 1m 14sec
3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
4) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
5) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

I will perform the same tests on btrfs tomorrow.

-Jacek
2012/3/3 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/3/2 Chris Mason <chris.mason@oracle.com>:
>> Since you mention this is a compile farm, I'm guessing there are a bunch
>> of .o files created by parallel builds. There are a lot of chances for
>> delalloc and the kernel writeback code to do the wrong thing here.
>
> [Most of the files are B and K size]
>
> All files scanned: 1978149
> Files fragmented: 313 (0.015%), of which 11 have 3+ extents
> Total size of fragmented files: 7GB (~13% of dir size)

BTRFS: none of the files are fragmented according to filefrag - all fit
into one extent.

> tar cf on the fragmented files:
> 1) time: 7sec
> 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
> 3) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
> 4) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

BTRFS: tar on the files that are fragmented on ext4
1) time: 6sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_btrfs.png

> tar cf on the fragmented files disturbed with [40,50) K files (4373
> files in total), the K files coming before the fragmented M files:
> 1) size: 7.2GB
> 2) time: 1m 14sec
> 3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
> 4) sw graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
> 5) both on one: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

BTRFS: tar on the [40,50) K files plus the ext4-fragmented files
1) time: 56sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_btrfs.png

A new test I've included - randomly selected files, 240MB in total:
1) ext4 (time: 34sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
2) btrfs (time: 55sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

-Jacek
2012/3/4 Jacek Luczak <difrost.kernel@gmail.com>:
> 2012/3/3 Jacek Luczak <difrost.kernel@gmail.com>:
> [...]
> A new test I've included - randomly selected files, 240MB in total:
> 1) ext4 (time: 34sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
> 2) btrfs (time: 55sec) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

Yet another test. The original issue is in the directory data handling;
in my case a lot of directories are introduced by the extra .svn dirs.
Let's then see what tar on those dirs looks like.

Number of .svn directories: 61605

1) Ext4:
- tar time: 10m 53sec
- sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_ext4.png
- sw tar graph with spd_readdir: http://91.234.146.107/~difrost/seekwatcher/svn_dir_spd_ext4.png

2) Btrfs:
- tar time: 4m 35sec
- sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs.png
- sw tar graph with ext4: http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs_ext4.png

IMO this is not a writeback issue (well, it could be, but then it would
mean writeback is broken in general), and it's not fragmentation.
Sorting files in readdir helps a bit but is still far behind btrfs.

Any ideas? Is this an issue, or are things just the way they are and
one needs to live with it?

-Jacek
On Fri 02-03-12 14:32:15, Ted Tso wrote:
> It would be interesting to have a project where someone added
> fallocate() support to libelf, and then added some heuristics to ext4
> so that if a file is fallocated to a precise size, or if the file is
> fully written and closed before writeback begins, we use this to pack
> the space used by the files more efficiently in the block allocator.
> This is a place where I would not be surprised if XFS has better code
> to avoid accelerated file system aging, and where we could do better
> with ext4 with some development effort.

AFAIK the XFS people actually prefer that applications let the
filesystem do its work using delayed allocation and do not interfere
with fallocate(2) calls. The problem they have with fallocate(2) is
that it forces you to allocate blocks, while with delayed allocation
you can make the decision about allocation later. So for small files
which completely fit into the page cache before they get pushed out by
writeback, better decisions can be made with delayed allocation. Just
dumping my memory from some other thread...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
On Mon, Mar 05, 2012 at 12:32:45PM +0100, Jacek Luczak wrote:
> [...]
> All files scanned: 1978149
> Files fragmented: 313 (0.015%), of which 11 have 3+ extents
> Total size of fragmented files: 7GB (~13% of dir size)

Ok, so I don't have a lot of great new ideas. My guess is that inode
order and disk order for the blocks aren't matching up. You can confirm
this with:

	acp -b some_dir

You can also try telling acp to use a bigger readahead window:

	acp -s 4096 -r 128 some_dir

You can tell acp to scan all the files in the directory tree first
(warning, this might use a good chunk of ram):

	acp -w some_dir

and you can combine all of these together.

None of the above will actually help in your workload, but it'll help
narrow down what is actually seeky on disk.

-chris
On 2/29/2012 11:44 PM, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number
> before you stat them. This is a long-standing weakness in ext3/ext4,
> and it has to do with how we added hashed tree indexes to directories
> in (a) a backwards compatible way, that (b) was POSIX compliant with
> respect to adding and removing directory entries concurrently with
> reading all of the directory entries using readdir.

When I ran into this a while back, I cobbled together a python script
to measure the correlation from name to inode, inode to first data
block, and name to first data block for all of the small files in a
large directory, and found that ext4 gives a very poor correlation due
to that directory hashing.

This is one of the reasons I prefer using dump instead of tar for
backups: it rips through my Maildir more than 10 times faster than tar,
since it reads the files in inode order.
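The correlation.py script itself is not included in the thread, so the
following is only a guess at the metric: for entries taken in name
order, the fraction of adjacent pairs whose inode numbers increase
(about 0.5 for a random mapping, close to 1.0 for a well-correlated
one), which is consistent with the numbers Lukas reports below. A C
sketch of that reading of the metric:

/* Guess at the "name to inode correlation" metric: walk the directory
 * in name order and report the fraction of adjacent regular-file pairs
 * whose inode numbers are increasing.  Hypothetical reconstruction,
 * not the original correlation.py. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	struct dirent **list;
	int n = scandir(dir, &list, NULL, alphasort);	/* name order */
	if (n < 0) {
		perror("scandir");
		return 1;
	}

	int dfd = open(dir, O_RDONLY | O_DIRECTORY);
	ino_t prev = 0;
	long pairs = 0, in_order = 0;

	for (int i = 0; i < n; i++) {
		struct stat st;
		if (fstatat(dfd, list[i]->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0 &&
		    S_ISREG(st.st_mode)) {
			if (prev) {
				pairs++;
				if (st.st_ino > prev)
					in_order++;
			}
			prev = st.st_ino;
		}
		free(list[i]);
	}
	free(list);
	close(dfd);

	printf("name to inode correlation: %.4f\n",
	       pairs ? (double)in_order / pairs : 0.0);
	return 0;
}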
On Wed, 29 Feb 2012, Jacek Luczak wrote:
> Long story short: We've found that operations on a directory structure
> holding many dirs takes ages on ext4.
>
> The Question: Why is there that huge difference between ext4 and btrfs?
> See below test results for real values.
>
> [...]
>
> Ext4 results:
> | Type     | 2.6.39.4-3 | 3.2.7
> | Dir cnt  | 17m 40sec  | 11m 20sec
> | File cnt | 17m 36sec  | 11m 22sec
> | Copy     | 1h 28m     | 1h 27m
> | Remove   | 3m 43sec   | 3m 38sec
>
> Btrfs results (without lzo compression):
> | Type     | 2.6.39.4-3 | 3.2.7
> | Dir cnt  | 2m 22sec   | 2m 21sec
> | File cnt | 2m 26sec   | 2m 23sec
> | Copy     | 36m 22sec  | 39m 35sec
> | Remove   | 7m 51sec   | 10m 43sec
>
> From the above one can see that copy takes close to 1h less on btrfs.
> I've done strace counting times of calls, results are as follows
> (from 3.2.7):
>
> 1) Ext4 (only top elements):
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  57.01   13.257850           1  15082163           read
>  23.40    5.440353           3   1687702           getdents
>   6.15    1.430559           0   3672418           lstat
>   3.80    0.883767           0  13106961           write
>   2.32    0.539959           0   4794099           open
>   1.69    0.393589           0    843695           mkdir
>   1.28    0.296700           0   5637802           setxattr
>   0.80    0.186539           0   7325195           stat
>
> 2) Btrfs:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.38    9.486210           1  15179751           read
>  11.38    2.021662           1   1688328           getdents
>  10.64    1.890234           0   4800317           open
>   6.83    1.213723           0  13201590           write
>   4.85    0.862731           0   5644314           setxattr
>   3.50    0.621194           1    844008           mkdir
>   2.75    0.489059           0   3675992         1 lstat
>   1.71    0.303544           0   5644314           llistxattr
>   1.50    0.265943           0   1978149           utimes
>   1.02    0.180585           0   5644314    844008 getxattr
>
> On btrfs getdents takes much less time, which proves the bottleneck in
> copy time on ext4 is this syscall. In 2.6.39.4 it shows even less time
> for getdents:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  50.77   10.978816           1  15033132           read
>  14.46    3.125996           1   4733589           open
>   7.15    1.546311           0   5566988           setxattr
>   5.89    1.273845           0   3626505           lstat
>   5.81    1.255858           1   1667050           getdents
>   5.66    1.224403           0  13083022           write
>   3.40    0.735114           1    833371           mkdir
>   1.96    0.424881           0   5566988           llistxattr
>
> Why such a huge difference in the getdents timings?
>
> -Jacek

Hi,

I have created a simple script which creates a bunch of files with
random names in a directory and then performs operations like list,
tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
4k files. The result is that ext4 pretty much dominates the create
times, tar times and find times. However, copy times are a whole
different story unfortunately - they suck badly.

Once we cross the mark of 320000 files in the directory (on my system),
ext4 becomes significantly worse in copy times. And that is where the
hash tree order in the directory entries really kicks in.

Here is a simple graph:

http://people.redhat.com/lczerner/files/copy_benchmark.pdf

Here is the data so you can play with it:

https://www.google.com/fusiontables/DataSource?snapid=S425803zyTE

and here is the txt file for convenience:

http://people.redhat.com/lczerner/files/copy_data.txt

I have also run the correlation.py from Phillip Susi on a directory
with 100000 4k files, and indeed the name to block correlation in ext4
is pretty much random :)

_ext4_
Name to inode correlation: 0.50002499975
Name to block correlation: 0.50002499975
Inode to block correlation: 0.9999900001

_xfs_
Name to inode correlation: 0.969660303397
Name to block correlation: 0.969660303397
Inode to block correlation: 1.0

So there definitely is a huge space for improvements in ext4.

Thanks!
-Lukas

Here is the script I have used to get the numbers above, just so you
can see which operations I have performed.

#!/bin/bash

dev=$1
mnt=$2
fs=$3
count=$4
size=$5

if [ -z $dev ]; then
	echo "Device was not specified!"
	exit 1
fi

if [ -z $mnt ]; then
	echo "Mount point was not specified!"
	exit 1
fi

if [ -z $fs ]; then
	echo "File system was not specified!"
	exit 1
fi

if [ -z $count ]; then
	count=10000
fi

if [ -z $size ]; then
	size=0
fi

export TIMEFORMAT="%3R"

umount $dev &> /dev/null
umount $mnt &> /dev/null

case $fs in
	"xfs") mkfs.xfs -f $dev &> /dev/null; mount $dev $mnt;;
	"ext3") mkfs.ext3 -F -E lazy_itable_init $dev &> /dev/null; mount $dev $mnt;;
	"ext4") mkfs.ext4 -F -E lazy_itable_init $dev &> /dev/null; mount -o noinit_itable $dev $mnt;;
	"btrfs") mkfs.btrfs $dev &> /dev/null; mount $dev $mnt;;
	*) echo "Unsupported file system";
	   exit 1;;
esac

testdir=${mnt}/$$
mkdir $testdir

_remount()
{
	sync
	#umount $mnt
	#mount $dev $mnt
	echo 3 > /proc/sys/vm/drop_caches
}

#echo "[+] Creating $count files"
_remount
create=$( (time ./dirtest $testdir $count $size) 2>&1)

#echo "[+] Listing files"
_remount
list=$( (time ls $testdir > /dev/null) 2>&1)

#echo "[+] tar the files"
_remount
tar=$( (time $(tar -cf - $testdir &> /dev/null)) 2>&1)

#echo "[+] find the files"
_remount
find=$( (time $(find $testdir -type f &> /dev/null)) 2>&1)

#echo "[+] Copying files"
_remount
copy=$( (time $(cp -a ${testdir} ${mnt}/copy)) 2>&1)

#echo "[+] Removing files"
_remount
remove=$( (time $(rm -rf $testdir)) 2>&1)

echo "$fs $count $create $list $tar $find $copy $remove"
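The dirtest helper the script invokes was not posted. A hypothetical
stand-in that simply creates $count files with random names of $size
bytes each could look like the following; the argument order is
inferred from the "./dirtest $testdir $count $size" call and everything
else is made up:

/* Hypothetical stand-in for ./dirtest: create <count> files with
 * random names, each <size> bytes, inside <dir>.  Not the original
 * helper used for the numbers above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s dir count [size]\n", argv[0]);
		return 1;
	}
	const char *dir = argv[1];
	long count = atol(argv[2]);
	long size = argc > 3 ? atol(argv[3]) : 0;

	char *buf = calloc(1, size ? size : 1);
	srandom(getpid());

	for (long i = 0; i < count; i++) {
		char path[4096];

		/* random name, suffixed with i to avoid collisions */
		snprintf(path, sizeof(path), "%s/%08lx%08lx-%ld",
			 dir, random(), random(), i);
		int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
		if (fd < 0) {
			perror(path);
			return 1;
		}
		if (size && write(fd, buf, size) != size) {
			perror("write");
			return 1;
		}
		close(fd);
	}
	free(buf);
	return 0;
}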
On Fri, Mar 09, 2012 at 12:29:29PM +0100, Lukas Czerner wrote:
> I have created a simple script which creates a bunch of files with
> random names in a directory and then performs operations like list,
> tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
> 4k files. The result is that ext4 pretty much dominates the create
> times, tar times and find times. However, copy times are a whole
> different story unfortunately - they suck badly.
>
> Once we cross the mark of 320000 files in the directory (on my system),
> ext4 becomes significantly worse in copy times. And that is where the
> hash tree order in the directory entries really kicks in.
>
> [...]
>
> So there definitely is a huge space for improvements in ext4.

Thanks Lukas, this is great data. There is definitely room for btrfs to
speed up in the other phases as well.

-chris
On 2012-03-09, at 3:29, Lukas Czerner <lczerner@redhat.com> wrote:
> Once we cross the mark of 320000 files in the directory (on my system),
> ext4 becomes significantly worse in copy times. And that is where the
> hash tree order in the directory entries really kicks in.
>
> [...]
>
> I have also run the correlation.py from Phillip Susi on a directory
> with 100000 4k files, and indeed the name to block correlation in ext4
> is pretty much random :)

Just reading this on the plane, so I can't find the exact reference
that I want, but a solution to this problem with htree was discussed a
few years ago between myself and Coly Li.

The basic idea is that for large directories the inode allocator starts
by selecting a range of (relatively) free inodes based on the current
directory size, and then piecewise maps the hash value for the filename
into this inode range and uses that as the goal inode. When the inode
range is (relatively) filled up (as determined by the average distance
between goal and allocated inode), a new (larger) inode range is
selected based on the new (larger) directory size and usage continues
as described.

This change is only an in-memory allocation policy and does not affect
the on-disk format, though it is expected to improve the hash->inode
mapping coherency significantly.

When the directory is small (below a thousand entries or so) the
allocations will be close to the parent and the ordering doesn't matter
significantly, because the inode table blocks will all be quickly read
or prefetched from disk. It wouldn't be harmful to use the mapping
algorithm in this case, but it likely won't show much improvement.

As the directory gets larger, the range of inodes will also get larger.
The number of inodes in the smaller range becomes less significant as
the range continues to grow. Once the inode range is hundreds of
thousands or larger, the mapping of the hash to the inodes will avoid a
lot of random IO.

When restarting from a new mount, the inode ranges can be found when
doing the initial name lookup in the leaf block by checking the
allocated inodes for existing dirents.

Unfortunately, the prototype that was developed diverged from this idea
and didn't really achieve the results I wanted.

Cheers, Andreas
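A back-of-the-envelope sketch of the arithmetic behind Andreas's
proposal (illustration only, not ext4 code; the structure names and the
32-bit hash width are assumptions): the name hash is mapped
proportionally into the inode range currently reserved for the
directory, and the result becomes the allocation goal, so that hash
order and inode order roughly coincide.

/* Sketch of the proposed policy's core arithmetic: map a 32-bit name
 * hash piecewise-linearly into the directory's reserved inode range
 * and use the result as the goal inode.  Names and hash width are
 * assumptions; this is not ext4 code. */
#include <stdint.h>
#include <stdio.h>

struct dir_inode_range {
	uint64_t start;		/* first inode in the reserved range */
	uint64_t len;		/* number of inodes in the range */
};

static uint64_t goal_inode(const struct dir_inode_range *r, uint32_t hash)
{
	return r->start + ((uint64_t)hash * r->len) / ((uint64_t)1 << 32);
}

int main(void)
{
	/* Toy example: a directory assigned inodes 1,000,000..1,999,999. */
	struct dir_inode_range r = { .start = 1000000, .len = 1000000 };
	uint32_t hashes[] = { 0x00000010, 0x7fffffff, 0xfffffff0 };

	for (int i = 0; i < 3; i++)
		printf("hash %08x -> goal inode %llu\n", hashes[i],
		       (unsigned long long)goal_inode(&r, hashes[i]));
	return 0;
}

Since readdir() returns names in hash order, stat()s of the resulting
inodes would then walk the inode table roughly sequentially instead of
randomly, which is the coherency improvement being claimed.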
Hey Jacek,

I'm curious about the parameters of the set of directories on your
production server. On an ext4 file system, assuming you've copied the
directories over, what is the result of this command pipeline when you
are cd'ed into the top of the directory hierarchy of interest (your svn
tree, as I recall)?

	find . -type d -ls | awk '{print $7}' | sort -n | uniq -c

I'm just curious what the distribution of directory sizes is in
practice for your use case...

Thanks,

- Ted
On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
> Just reading this on the plane, so I can't find the exact reference
> that I want, but a solution to this problem with htree was discussed a
> few years ago between myself and Coly Li.
>
> The basic idea is that for large directories the inode allocator starts
> by selecting a range of (relatively) free inodes based on the current
> directory size, and then piecewise maps the hash value for the filename
> into this inode range and uses that as the goal inode.

I've heard you sketch out this idea before, but it's not been clear to
me how well it would work in practice. The potential downside is that
it fragments the inode table, such that a single inode table block
might not have as many inodes stored in it as we otherwise would.
(Ideally, with 256 byte inodes, there would be 16 of them in each 4k
block.) This is something I'd definitely recommend modelling in
simulation to see how well it works first.

I'd observe that if we knew in advance how many files would be in a
particular directory (i.e., via a hint from userspace at mkdir time),
then it would be possible to optimize this case much more easily. In
fact, if we had perfect knowledge --- if the userspace process could
feed us the exact set of filenames that will be used in the directory,
plus the exact file sizes for each of the file names --- then it would
be possible to allocate inode numbers and starting block numbers such
that when the files are read in readdir() order, the inode numbers and
block numbers would line up perfectly into sequential reads.

Of course, the trade-off is that by optimizing for read order, the
write order will tend to get painful as well! So there's a tension
here; making the read part of the benchmark faster will make the
process of writing the directory hierarchy require more seeks --- and
this assumes that you know file names and sizes ahead of time.

(Unless, of course, you play the same game that Hans Reiser did when he
cooked his "tar" benchmark by constructing a kernel source tarball
where the file order in the tarball was perfectly ordered to guarantee
the desired result given Reiser4's hash! :-)

I suspect the best optimization for now is probably something like
this:

1) Since the vast majority of directories are less than (say) 256k
(this would be a tunable value), for directories which are less than
this threshold size, the entire directory is sucked in after the first
readdir() after an opendir() or rewinddir(). The directory contents are
then sorted by inode number (or loaded into an rbtree ordered by inode
number), and returned back to userspace in inode order via readdir().
The directory contents will be released on a closedir() or rewinddir().

2) For directories larger than this size, we don't want to hold that
much kernel memory during an opendir(), so fall back to ext4's current
scheme.

3) If we want to do better than that, we could create a new flag
O_NOTELLDIR, which can be set via fcntl. (This way programs can do
something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd, SETFL,
O_NOTELLDIR);".) For programs that know they don't need to use
telldir/seekdir, they can set this flag, which will allow the above
optimization to be used on large directories.

The other thing we could potentially do, if we really cared about this
issue in ext3/4, would be to define a whole new (incompatible)
directory indexing format which avoided this problem. It would be an
INCOMPAT feature, so it would be a while before it could get used in
practice, which is why I'm not entirely convinced it's worth it ---
especially since I suspect just doing (1) would solve the problem for
the vast majority of ext4's users.

- Ted
On 2012-03-09, at 9:48 PM, Ted Ts'o wrote:
> On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
>> The basic idea is that for large directories the inode allocator starts
>> by selecting a range of (relatively) free inodes based on the current
>> directory size, and then piecewise maps the hash value for the filename
>> into this inode range and uses that as the goal inode.
>
> I've heard you sketch out this idea before, but it's not been clear to
> me how well it would work in practice. The potential downside is that
> it fragments the inode table, such that a single inode table block
> might not have as many inodes stored in it as we otherwise would.
> (Ideally, with 256 byte inodes, there would be 16 of them in each 4k
> block.) This is something I'd definitely recommend modelling in
> simulation to see how well it works first.

Very true, I was thinking this as I wrote the email.

> I'd observe that if we knew in advance how many files would be in a
> particular directory (i.e., via a hint from userspace at mkdir time),
> then it would be possible to optimize this case much more easily. In
> fact, if we had perfect knowledge --- if the userspace process could
> feed us the exact set of filenames that will be used in the directory,
> plus the exact file sizes for each of the file names --- then it would
> be possible to allocate inode numbers and starting block numbers such
> that when the files are read in readdir() order, the inode numbers and
> block numbers would line up perfectly into sequential reads.

Except POSIX doesn't allow anything close to this at all. Something
like fallocate() for directories would at least give the kernel a hint
about how large the directory is, but fewer userspace tools have this
information than the file size (e.g. tar doesn't really store a
"directory", just a bunch of pathnames that happen to have largely the
same components).

In the cases where the size IS known in advance, or at least if the
directory is going to be large, an allocation policy like mine would
essentially "predict" where a uniformly distributed hash function will
allocate the inodes, for better reading order.

> Of course, the trade-off is that by optimizing for read order, the
> write order will tend to get painful as well! So there's a tension
> here; making the read part of the benchmark faster will make the
> process of writing the directory hierarchy require more seeks --- and
> this assumes that you know file names and sizes ahead of time.
>
> (Unless, of course, you play the same game that Hans Reiser did when he
> cooked his "tar" benchmark by constructing a kernel source tarball
> where the file order in the tarball was perfectly ordered to guarantee
> the desired result given Reiser4's hash! :-)

No, I definitely don't want to go down that path... Optimization for
the sake of benchmarks is broken.

> I suspect the best optimization for now is probably something like
> this:
>
> 1) Since the vast majority of directories are less than (say) 256k
> (this would be a tunable value),

Yes, this would be reasonable, and I suspect it could be a lot larger
than 256kB. Making it variable based on the amount of RAM is probably
reasonable.

Looking at http://www.pdsi-scidac.org/fsstats/ (which is listed as HPC,
but is a bit old and the filesystems aren't much larger than large SATA
drives today :-) having a 1MB cache would handle virtually every
directory in that survey (99.999%).

The unfortunate part is that this will help small directories, where
the problem is already largely hidden by readahead and the page cache,
but it doesn't help at all for huge directories that do not fit into
cache. That is where my approach would do well, but the difficulty is
in knowing at create time how large the directory will be.

> for directories which are less than
> this threshold size, the entire directory is sucked in after the first
> readdir() after an opendir() or rewinddir(). The directory contents are
> then sorted by inode number (or loaded into an rbtree ordered by inode
> number), and returned back to userspace in inode order via readdir().
> The directory contents will be released on a closedir() or rewinddir().

What would the telldir() cookie be in this case? The inode number, or
the file descriptor? What happens if the directory size crosses the
threshold while the cache is in use? I guess in that case the cache
could still be used (according to POSIX), because readdir() doesn't
need to report entries added after it started.

> 2) For directories larger than this size, we don't want to hold that
> much kernel memory during an opendir(), so fall back to ext4's current
> scheme.
>
> 3) If we want to do better than that, we could create a new flag
> O_NOTELLDIR, which can be set via fcntl. (This way programs can do
> something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd, SETFL,
> O_NOTELLDIR);".) For programs that know they don't need to use
> telldir/seekdir, they can set this flag, which will allow the above
> optimization to be used on large directories.

Probably a bit messy. Since telldir() is rarely used outside NFS
(AFAIK, though I recall something dumb happening in GNU libc at one
point that was doing a telldir() and seekdir() for every entry), we
could assume no telldir() users until it is actually used on the
filesystem, and that could set a hint in the superblock EXT2_FLAGS
section. The only userspace programs I could find that call telldir()
are smbd and nmbd (and related tools) and perl (not sure when or why).

> The other thing we could potentially do, if we really cared about this
> issue in ext3/4, would be to define a whole new (incompatible)
> directory indexing format which avoided this problem. It would be an
> INCOMPAT feature, so it would be a while before it could get used in
> practice, which is why I'm not entirely convinced it's worth it ---
> especially since I suspect just doing (1) would solve the problem for
> the vast majority of ext4's users.

Agreed. I don't know that we need to go to these extremes.

Cheers, Andreas
--
Andreas Dilger                       Whamcloud, Inc.
Principal Lustre Engineer            http://www.whamcloud.com/
On Sun, Mar 11, 2012 at 04:30:37AM -0600, Andreas Dilger wrote:
> > if the userspace process could
> > feed us the exact set of filenames that will be used in the directory,
> > plus the exact file sizes for each of the file names...
>
> Except POSIX doesn't allow anything close to this at all. Something
> like fallocate() for directories would at least give the kernel a hint
> about how large the directory is, but fewer userspace tools have this
> information than the file size (e.g. tar doesn't really store a
> "directory", just a bunch of pathnames that happen to have largely the
> same components).

Yes, this was purely a thought-experiment idea (although we could try
to extend the interface if it really netted enough gains). My point was
that even if we could do this, we really are trading off write
performance for read performance (since you would now be doing the
extra seeking on the tar extraction phase).

Ultimately, while I believe it makes sense to optimize the "readdir +
seek" case, I don't believe it makes sense to optimize the "copy out
the directory using tar" benchmark --- since I believe this is much
more of a benchmark workload than something that happens a huge amount
in real life, and even for people who do backups (which unfortunately,
as we know, is a too-small percentage of the ext4 user base), they
generally will prefer that the "real" use case for the file system be
fast rather than the backup phase. And in point of fact, as the file
system gets more fragmented, even if some file systems can offer
perfect name -> block correlation, over time I'll bet their correlation
breaks down, and so ext4's disadvantage over those file systems will
disappear in real life usages.

So I don't believe it makes that much sense to worry overly about the
"name -> block" correlation for freshly formatted file systems. It's a
nice-to-have, but I really think that's more of a benchmark game than
anything else.

> In the cases where the size IS known in advance, or at least if the
> directory is going to be large, an allocation policy like mine would
> essentially "predict" where a uniformly distributed hash function will
> allocate the inodes, for better reading order.

Well, yes; if we knew in advance how many files (roughly) would be
stored in a directory, there are some optimizations we could do. The
simplest of which is preallocating the directory blocks so they are
contiguous on disk. I'm not persuaded yet that the optimizations are
worth the effort of trying to define a new interface and then trying to
convince application programmers to use a hypothetical interface
extension *just* for this purpose, though.

> The unfortunate part is that this will help small directories, where
> the problem is already largely hidden by readahead and the page cache,
> but it doesn't help at all for huge directories that do not fit into
> cache. That is where my approach would do well, but the difficulty is
> in knowing at create time how large the directory will be.

Well, my goal in proposing this optimization is that it helps for the
"medium size" directories in the cold cache case. The ext4 user who
first kicked off this thread was using his file system for an SVN
server, as I recall. I could easily believe that he has thousands of
files; maybe even tens of thousands of files in a directory. But that
probably still fits in 256k, or at most 512k, worth of directory
blocks.

It definitely won't help for the really large directories, yes; but how
common are they? The most common case I can think of is squid caches,
but there you can configure a larger number of 1st and 2nd level
subdirectories to keep the actual directory size smaller.

> What would the telldir() cookie be in this case? The inode number, or
> the file descriptor? What happens if the directory size crosses the
> threshold while the cache is in use? I guess in that case the cache
> could still be used (according to POSIX), because readdir() doesn't
> need to report entries added after it started.

The telldir() cookie can simply be an ordinal -- so "7" means the 7th
directory entry in the rbtree. And all we need to do, which obeys POSIX
as you've noted, is to instantiate the rbtree during the first
readdir() after an opendir() or rewinddir(), and free the old rbtree
when we get the rewinddir(), reach the end of the directory stream, or
get the closedir().

> Probably a bit messy. Since telldir() is rarely used outside NFS
> (AFAIK, though I recall something dumb happening in GNU libc at one
> point that was doing a telldir() and seekdir() for every entry), we
> could assume no telldir() users until it is actually used on the
> filesystem, and that could set a hint in the superblock EXT2_FLAGS
> section. The only userspace programs I could find that call telldir()
> are smbd and nmbd (and related tools) and perl (not sure when or why).

By the time a process uses telldir(), we're midway through a readdir()
stream, and so by that point it's too late to switch back to a
seekdir-friendly order. So what do we do at that point? If we continue
returning directory entries where we read in each directory 256kb at a
time, sorting by inode number and then returning dirents in inode
number order, then any time there was a telldir() during a 256kb chunk,
we wouldn't be able to discard that rbtree, but rather we would have to
keep it in memory until the next rewinddir(). So a process which opens
a large directory and then calls telldir() a huge number of times could
end up pinning a large amount of kernel memory.

We could, of course, set some kind of resource limit on this, and if a
process behaves too badly by calling telldir in a way that pins too
much kernel memory, we could potentially kill the process. We could
then tell application programmers that if they really want to use
telldir() a huge amount, they need to tell us in advance via some kind
of process flag. That is, we could default to being telldir-unfriendly
instead of defaulting to telldir-friendly. If we really think it's only
Samba and Perl, and that we can convince them to give us this hint,
then we might get away with it.

I had been assuming, though, that we would have to default to
telldir-friendly, which is why I suggested the O_NOTELLDIR flag.

- Ted
On 3/9/2012 11:48 PM, Ted Ts'o wrote:
> I suspect the best optimization for now is probably something like
> this:
>
> 1) Since the vast majority of directories are less than (say) 256k
> (this would be a tunable value), for directories which are less than
> this threshold size, the entire directory is sucked in after the first
> readdir() after an opendir() or rewinddir(). The directory contents are
> then sorted by inode number (or loaded into an rbtree ordered by inode
> number), and returned back to userspace in inode order via readdir().
> The directory contents will be released on a closedir() or rewinddir().

Why not just separate the hash table from the conventional, mostly
inode-ordered directory entries? For instance, the first 200k of the
directory could be the normal entries, which would tend to be in inode
order (and e2fsck -D would reorder them), and the last 56k of the
directory would contain the hash table. Then readdir() just walks the
directory like normal, and namei() can check the hash table.
On Tue, Mar 13, 2012 at 03:05:59PM -0400, Phillip Susi wrote:
> Why not just separate the hash table from the conventional, mostly
> inode-ordered directory entries? For instance, the first 200k of the
> directory could be the normal entries, which would tend to be in inode
> order (and e2fsck -D would reorder them), and the last 56k of the
> directory would contain the hash table. Then readdir() just walks the
> directory like normal, and namei() can check the hash table.

Because that would be a format change.

What we have today is not a hash table; it's a hashed tree, where we
use a fixed-length key for the tree based on the hash of the file name.
Currently the leaf nodes of the tree are the directory blocks
themselves; that is, the lowest level of the index blocks tells you to
look at directory block N, where that block contains the directory
entries for those file names whose hashes are in a particular range
(say, between 0x2325777A and 0x2325801).

If we aren't going to change the ordering of the directory entries,
that means we would need to change things so the leaf nodes contain the
actual directory file names themselves, so that we know whether or not
we've hit the correct entry before we go to read in a specific
directory block (otherwise, you'd have problems dealing with hash
collisions). But in that case, instead of storing the pointer to the
directory entry, since the bulk of the size of a directory entry is the
filename itself, you might as well store the inode number in the tree
itself, and be done with it.

And in that case, since you are replicating the directory information
twice over, and it's going to be an incompatible format change anyway,
you might as well just store the second copy of the directory entries
in *another* btree, except this one indexed by inode number, and then
you use the second tree for readdir(), and you make the telldir/seekdir
cookie be the inode number. That way readdir() will always return
results in a stat()-optimized order, even in the face of directory
fragmentation and file system aging.

**This** is why telldir/seekdir is so evil; in order to do things
correctly in terms of the semantics of readdir() in the face of
telldir/seekdir and file names getting inserted and deleted into the
btree, and the possibility of tree splits/rotations, etc., and the fact
that the cookie is only 32 or 64 bits, you essentially either need to
do something stupid and have a linear directory a la ext2 and V7 Unix,
or you need to store the directory information twice over in redundant
b-trees. Or userspace needs to do the gosh-darned sorting by inode, or
we do some hack such as only sorting the inodes using non-swapping
kernel memory if the directory is smaller than some arbitrary size,
such as 256k or 512k.

- Ted
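For readers who have not looked at the htree code, the index entries
Ted describes are essentially (hash, block) pairs, and a name lookup is
a binary search over them. A simplified sketch follows; field names
loosely follow fs/ext4/namei.c, but this is an illustration, not the
literal kernel code:

/* Simplified shape of an htree index entry: it maps a hash value to
 * the directory block whose entries cover hashes starting at that
 * value.  A lookup hashes the name and binary-searches the entries;
 * readdir() walks the leaf blocks in hash order, which is what
 * decorrelates readdir order from inode order. */
#include <stdint.h>

struct dx_entry {
	uint32_t hash;		/* lowest hash covered by this entry */
	uint32_t block;		/* logical block within the directory file */
};

/* Find the block covering "hash": the last entry with entry.hash <= hash.
 * Entry 0 covers the lowest hash range. */
static uint32_t dx_find_block(const struct dx_entry *entries, int count,
			      uint32_t hash)
{
	int lo = 1, hi = count - 1;

	while (lo <= hi) {
		int mid = (lo + hi) / 2;
		if (entries[mid].hash <= hash)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return entries[lo - 1].block;
}

Because the on-disk leaf blocks hold the actual dirents, the index
stays small, but nothing in it records where the inodes live -- which
is the asymmetry the whole thread is circling around.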
On 3/13/2012 3:53 PM, Ted Ts'o wrote:
> Because that would be a format change.

I think a format change would be preferable to runtime sorting.

> What we have today is not a hash table; it's a hashed tree, where we use
> a fixed-length key for the tree based on the hash of the file name.
> Currently the leaf nodes of the tree are the directory blocks themselves;
> that is, the lowest level of the index blocks tells you to look at
> directory block N, where that directory block contains the directory
> entries for those file names which are in a particular hash range (say,
> between 0x2325777A and 0x2325801).

So the index nodes contain the hash ranges for the leaf block, but the
leaf block only contains the regular directory entries, not a hash for
each name? That would mean that adding or removing names would require
moving around the regular directory entries, wouldn't it?

> If we aren't going to change the on-disk ordering of the directory, that
> means we would need to change things so the leaf nodes contain the actual
> file names themselves, so that we know whether or not we've hit the
> correct entry before we go to read in a specific directory block
> (otherwise, you'd have problems dealing with hash collisions). But in
> that case, instead of storing a pointer to the directory entry, since the
> bulk of the size of a directory entry is the filename itself, you might
> as well store the inode number in the tree itself, and be done with it.

I would think that hash collisions are rare enough that reading a
directory block you end up not needing once in a blue moon would be
chalked up under "who cares". So just stick with hash/offset pairs to map
the hash to the normal directory entry.
On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
>
> I think a format change would be preferable to runtime sorting.

Are you volunteering to spearhead the design and coding of such a thing?
Run-time sorting is backwards compatible, and a heck of a lot easier to
code and test...

The reality is we'd probably want to implement run-time sorting *anyway*,
for the vast majority of people who don't want to convert to a new,
incompatible file system format. (Even if you can do the conversion using
e2fsck --- which is possible, but it would be even more code to write ---
system administrators tend to be very conservative about such things,
since they might need to boot an older kernel, or use a rescue CD that
doesn't have an up-to-date kernel or file system utilities, etc.)

> So the index nodes contain the hash ranges for the leaf block, but the
> leaf block only contains the regular directory entries, not a hash for
> each name? That would mean that adding or removing names would require
> moving around the regular directory entries, wouldn't it?

They aren't sorted in the leaf block, so we only need to move around
regular directory entries when we do a node split (and at the moment we
don't support shrinking directories), so we don't have to worry about the
reverse case.

> I would think that hash collisions are rare enough that reading a
> directory block you end up not needing once in a blue moon would be
> chalked up under "who cares". So just stick with hash/offset pairs to
> map the hash to the normal directory entry.

With a 64-bit hash, and if we were actually going to implement this as a
new incompatible feature, you're probably right in terms of accepting the
extra directory block search.

We would still have to implement the case where hash collisions *do*
exist, though, and make sure the right thing happens in that case. Even
if the chance of that happening is 1 in 2**32, with enough deployed
systems (i.e., every Android handset, etc.) it's going to happen in real
life.

						- Ted
What if we use the inode number as the hash value? Does it work?

Yongqiang.
On Wed, Mar 14, 2012 at 10:48:17AM +0800, Yongqiang Yang wrote:
> What if we use the inode number as the hash value? Does it work?

The whole point of using the tree structure is to accelerate
filename -> inode number lookups. The namei lookup doesn't have the inode
number; the whole point is to use the filename to look up the inode
number. So we can't use the inode number as the hash value, since that's
what we are trying to find out.

We could do this if we had two b-trees, one indexed by filename and one
indexed by inode number, which is what JFS (and I believe btrfs) does.

						- Ted
On Tue, 13 Mar 2012, Ted Ts'o wrote:

> On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
> >
> > I think a format change would be preferable to runtime sorting.
>
> Are you volunteering to spearhead the design and coding of such a thing?
> Run-time sorting is backwards compatible, and a heck of a lot easier to
> code and test...

Actually I am in the process of creating some projects for the local
universities, and this certainly sounds like something that a student
could do. I have already put it into the proposal, so we might have
someone looking into this in the near future. Since the problem has been
there since forever, I think it is not that critical to solve it right
away.

I kind of like the idea of having a separate btree with inode numbers for
directory reading, just because it does not affect allocation policy nor
the write performance, which is a good thing. Also, it has been done
before in btrfs and it works very well for them. The only downside (aside
from the format change) is the extra step when removing a directory
entry, but the positives outweigh the negatives.

Maybe this could even be done in a way which does not require a format
change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which tells
us that there is an inode number btree to use on directory reads. Then
the pointer to the root of that tree would be stored at the end of the
root block of the htree (of course, only if there is still some space for
it) and the rest is easy.

So if we mount the old file system with the new code, the flag will not
be there, but newly added directories could use it anyway. Also, e2fsck
could easily add the inumber tree to a directory if it is missing.

If we mount the new file system with the old code (which does not know
about this feature), everything stays the same and we never touch the
inumber tree at all. Of course there is a possibility that we would
rewrite the end of the htree root block, overwriting the pointer to the
inumber tree and effectively losing one block. We would not actually care
that we rewrote this information, because we do not actually need it, so
the only downside is the one block lost, which is not that bad I guess.

Now, suppose we have rewritten the pointer to the inumber tree and we're
going to mount the file system with the new code again. It would expect
the inumber tree to be there, but not blindly: some kind of magic
information would have to be stored in the inumber root pointer so we can
be sure that it really is what we're looking for. If it is not there,
well, then we do not care and walk the directory the old way, and we can
always create the inumber tree in a new directory. And again, e2fsck
should be able to fix that for us as well.

So that is just an idea -- I have not looked at the actual code to see if
it would really be possible, but it certainly seems so. Am I missing
something?

Anyway, if there is going to be a format change, I agree that having a
solution for those who are not comfortable with the format change is
equally important. But maybe there will be a better solution which does
not require the format change.

Thanks!
-Lukas
On Wed, Mar 14, 2012 at 4:12 PM, Lukas Czerner <lczerner@redhat.com> wrote:
> Maybe this could even be done in a way which does not require a format
> change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which tells
> us that there is an inode number btree to use on directory reads. Then
> the pointer to the root of that tree would be stored at the end of the
> root block of the htree (of course, only if there is still some space for
> it) and the rest is easy.

dx_root takes 32 bytes from the 60 bytes of i_data. The beginning of
dx_entries is controlled by info_length, that is,
dx_root->info + dx_root->info->info_length. It seems that we can put the
new tree root behind dx_root and add the length of the new tree root to
info_length. Then the old code can handle the new filesystem, and the new
code can handle the old filesystem because of the flag you described
above.

If no one attempts to do it, I can do it as a Google Summer of Code
project.

Yongqiang.

--
Best Wishes
Yongqiang Yang
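Purely as a sketch of the idea being discussed here (a magic-tagged
secondary root placed after the htree root info and covered by
info_length), with entirely made-up names and values -- nothing like this
exists in ext4 today:

#include <stdint.h>

/* Hypothetical layout only.  The thread's idea: append a second,
 * magic-tagged root after the htree root info and extend info_length
 * over it, so old kernels simply skip it, while new kernels verify the
 * magic (and an inode flag such as the proposed EXT4_INUM_INDEX_FL)
 * before trusting it. */
struct inum_root_info {
        uint32_t magic;           /* e.g. 0x494e554d ("INUM"), assumed value */
        uint32_t root_block;      /* logical block of the inumber tree root  */
        uint8_t  indirect_levels; /* depth of the secondary tree             */
        uint8_t  unused[3];
};

/* On disk these fields would be little-endian (__le32 etc.); plain stdint
 * types are used here only to keep the sketch self-contained. */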
On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> I kind of like the idea of having a separate btree with inode numbers for
> directory reading, just because it does not affect allocation policy nor
> the write performance, which is a good thing. Also, it has been done
> before in btrfs and it works very well for them. The only downside (aside
> from the format change) is the extra step when removing a directory
> entry, but the positives outweigh the negatives.

Well, there will be extra (journaled!) block reads and writes involved
when adding or removing directory entries. So the performance of
workloads that do a large number of directory adds and removes will have
to be impacted. By how much is not obvious, but it will be something
which is measurable.

> Maybe this could even be done in a way which does not require a format
> change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which tells
> us that there is an inode number btree to use on directory reads. Then
> the pointer to the root of that tree would be stored at the end of the
> root block of the htree (of course, only if there is still some space for
> it) and the rest is easy.

You can make it be an RO_COMPAT change instead of an INCOMPAT change, yes.

And if you want to do it as simply as possible, we could just recycle the
current htree approach for the second tree, and simply store its root in
another set of directory blocks. But by putting the index nodes in the
directory blocks, masquerading as deleted directory entries, it means that
readdir() has to cycle through them and ignore the index blocks.

The alternate approach is to use physical block numbers instead of logical
block numbers for the index blocks, and to store them separately from the
blocks containing actual directory entries. But if we do that for the
inumber tree, the next question that arises is whether we should do that
for the hash tree as well --- and then once you open that can of worms, it
gets a lot more complicated.

So the question that really arises here is how wide open we want to leave
the design space, and whether we are optimizing for the best possible
layout change ignoring the amount of implementation work it might require,
or whether we keep things as simple as possible from a code change
perspective. There are good arguments that can be made either way, and a
lot depends on the quality of the students you can recruit, the amount of
time they have, and how much review time it will take out of the core team
during the design and implementation phases.

Regards,

						- Ted
> We could do this if we had two b-trees, one indexed by filename and one
> indexed by inode number, which is what JFS (and I believe btrfs) does.

Typically the inode number of the destination inode isn't used to index
entries for a readdir tree because of (wait for it) hard links. You end
up right back where you started, with multiple entries per key.

A painful solution is to have the key in the readdir tree allocated by the
tree itself -- count key populations in subtrees per child pointer and use
that to find free keys. You still have the problem of correlating entry
keys with the location of inodes, but at least now you have a structure to
navigate to try and do that. You can also trivially re-use freed keys and
have densely packed readdir keys that won't break 32-bit f_pos so quickly.

btrfs just punts and has an incrementing 64-bit counter per directory that
determines a new entry's position in readdir. Accompanied by the
obligatory "will be smarter some day" comment :).

- z
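A toy model of the per-directory counter scheme Zach describes; the names,
types, and the reserved key values are illustrative assumptions, not
btrfs's actual structures:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only.  Each directory keeps a monotonically increasing
 * 64-bit counter, and a new entry is keyed in the readdir index by the
 * counter's value at creation time, so readdir returns entries in
 * creation order. */
struct toy_dir {
        uint64_t next_readdir_key;      /* per-directory sequence counter */
};

static uint64_t assign_readdir_key(struct toy_dir *dir)
{
        return dir->next_readdir_key++; /* also usable as the f_pos cookie */
}

int main(void)
{
        /* Assume keys 0 and 1 are reserved for "." and ".." (assumption). */
        struct toy_dir d = { .next_readdir_key = 2 };

        printf("first entry gets key %llu\n",
               (unsigned long long)assign_readdir_key(&d));
        printf("second entry gets key %llu\n",
               (unsigned long long)assign_readdir_key(&d));
        return 0;
}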
On 3/13/2012 5:33 PM, Ted Ts'o wrote:
> Are you volunteering to spearhead the design and coding of such a thing?
> Run-time sorting is backwards compatible, and a heck of a lot easier to
> code and test...

Do you really think it is that much easier? Even if it is easier, it is
still an ugly kludge. It would be much better to fix the underlying
problem rather than try to paper over it.

> The reality is we'd probably want to implement run-time sorting *anyway*,
> for the vast majority of people who don't want to convert to a new,
> incompatible file system format. (Even if you can do the conversion using
> e2fsck --- which is possible, but it would be even more code to write ---
> system administrators tend to be very conservative about such things,
> since they might need to boot an older kernel, or use a rescue CD that
> doesn't have an up-to-date kernel or file system utilities, etc.)

The same argument could have been made against the current htree
implementation when it was added. I think it carries just as little
weight now as it did then. People who care about the added performance
the new feature provides can use it; those who don't, won't. Eventually
it will become ubiquitous.

> We would still have to implement the case where hash collisions *do*
> exist, though, and make sure the right thing happens in that case. Even
> if the chance of that happening is 1 in 2**32, with enough deployed
> systems (i.e., every Android handset, etc.) it's going to happen in real
> life.

Sure it will happen, but if we read one extra block 1 in 4 billion times,
nobody is going to notice.
On Wed, 14 Mar 2012, Ted Ts'o wrote:

> On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> > I kind of like the idea of having a separate btree with inode numbers
> > for directory reading, just because it does not affect allocation
> > policy nor the write performance, which is a good thing. Also, it has
> > been done before in btrfs and it works very well for them. The only
> > downside (aside from the format change) is the extra step when removing
> > a directory entry, but the positives outweigh the negatives.
>
> Well, there will be extra (journaled!) block reads and writes involved
> when adding or removing directory entries. So the performance of
> workloads that do a large number of directory adds and removes will have
> to be impacted. By how much is not obvious, but it will be something
> which is measurable.

Yes, it would have to be measured under various workloads.

> > Maybe this could even be done in a way which does not require a format
> > change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which
> > tells us that there is an inode number btree to use on directory reads.
> > Then the pointer to the root of that tree would be stored at the end of
> > the root block of the htree (of course, only if there is still some
> > space for it) and the rest is easy.
>
> You can make it be an RO_COMPAT change instead of an INCOMPAT change,
> yes.

Does it have to be an RO_COMPAT change though? Since this would be both
forward and backward compatible.

> And if you want to do it as simply as possible, we could just recycle the
> current htree approach for the second tree, and simply store its root in
> another set of directory blocks. But by putting the index nodes in the
> directory blocks, masquerading as deleted directory entries, it means
> that readdir() has to cycle through them and ignore the index blocks.
>
> The alternate approach is to use physical block numbers instead of
> logical block numbers for the index blocks, and to store them separately
> from the blocks containing actual directory entries. But if we do that
> for the inumber tree, the next question that arises is whether we should
> do that for the hash tree as well --- and then once you open that can of
> worms, it gets a lot more complicated.
>
> So the question that really arises here is how wide open we want to leave
> the design space, and whether we are optimizing for the best possible
> layout change ignoring the amount of implementation work it might
> require, or whether we keep things as simple as possible from a code
> change perspective.

What do you expect to gain by storing the index nodes outside the
directory entries? Only to avoid reading the index nodes of both trees
when walking just one of them? If I understand it correctly, directory
blocks are laid out sequentially, so reading them in a cold cache state
would not pose much trouble, I think. Nor would iterating through all of
the index nodes. However, it would have to be measured first.

> There are good arguments that can be made either way, and a lot depends
> on the quality of the students you can recruit, the amount of time they
> have, and how much review time it will take out of the core team during
> the design and implementation phases.

I am always hoping for the best :). We have had good experience with the
students here, but you never know that in advance. They'll have the whole
year to get through this, which is plenty of time, but the question of
course is whether they'll use that time wisely :).
But anyway, I think it is definitely worth a try.

Thanks!
-Lukas
On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
>
> > We could do this if we had two b-trees, one indexed by filename and one
> > indexed by inode number, which is what JFS (and I believe btrfs) does.
>
> Typically the inode number of the destination inode isn't used to index
> entries for a readdir tree because of (wait for it) hard links. You end
> up right back where you started, with multiple entries per key.

Well, if you are using 32-bit (or even 48-bit) inode numbers and a 64-bit
telldir cookie, it's possible to make the right thing happen. But yes, if
you are using 32-bit inode numbers and a 32-bit telldir cookie, dealing
with what happens when you have multiple hard links to the same inode in
the same directory gets tricky.

> A painful solution is to have the key in the readdir tree allocated by
> the tree itself -- count key populations in subtrees per child pointer
> and use that to find free keys.

One thing that might work is to have an extra 16-bit field in the
directory entry that gives a signed offset to the inode number, such that
inode+offset is a unique value within the btree sorted by inode+offset.
Since this tree is only used for returning entries in an optimal (or as
close to optimal as can be arranged) order, we could get away with that.

						- Ted
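A rough illustration of the key Ted sketches; the field and function names
are made up for the example and are not an actual ext4 format:

#include <stdint.h>

/* Illustrative only.  The secondary tree would be ordered by
 * (inode number + small signed offset); the offset is chosen per entry so
 * that the sum stays unique even when several hard links in the same
 * directory point at the same inode. */
struct inum_tree_key {
        uint64_t ino;           /* inode number of the target entry      */
        int16_t  ino_offset;    /* disambiguates hard links, usually 0   */
};

/* The value the tree (and the telldir/seekdir cookie) would actually be
 * ordered on. */
static inline uint64_t inum_sort_key(const struct inum_tree_key *k)
{
        return k->ino + (int64_t)k->ino_offset;
}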
On Wed, Mar 14, 2012 at 10:28:20AM -0400, Phillip Susi wrote:
>
> Do you really think it is that much easier? Even if it is easier, it is
> still an ugly kludge. It would be much better to fix the underlying
> problem rather than try to paper over it.

I don't think the choice is obvious. A solution which solves the problem
90%-plus of the time for real-life workloads, without requiring any file
system format changes, is very appealing to me and, I think, to many
customers. That being said, for someone who is looking for perfection, I
can see how it would grate on someone's nerves.

> The same argument could have been made against the current htree
> implementation when it was added. I think it carries just as little
> weight now as it did then. People who care about the added performance
> the new feature provides can use it; those who don't, won't. Eventually
> it will become ubiquitous.

The original htree implementation tried very hard to be fully backwards
compatible for reading and writing. At the time that was considered very
important, and for some folks it was a major killer feature. The original
ext3 journalling support was designed in a way where you could back out
and mount such a file system as ext2, and that was considered important
back then, too.

The fact that we've been so successful with ext4's upgrade to features
that require RO_COMPAT or INCOMPAT flags does make me more willing to
consider those features, but the reality is that the deployment time for
such a new feature is measured in 2-3 years, minimum.

So I won't rule out making such a change, but I wouldn't treat the fact
that someone wants to sign up to implement a long-term feature as mutually
exclusive with a short-term solution that solves most of the problem and
could be rolled out much more quickly.

						- Ted
On Wed, Mar 14, 2012 at 03:34:13PM +0100, Lukas Czerner wrote:
> >
> > You can make it be an RO_COMPAT change instead of an INCOMPAT change,
> > yes.
>
> Does it have to be an RO_COMPAT change though? Since this would be both
> forward and backward compatible.

The challenge is how you notice when the file system has been mounted on
an older kernel, which then inserts a directory entry without updating the
secondary tree. The older kernel won't know about the new inode flag, so
it can't clear the flag when it modifies the directory.

We were able to get away with making the original htree feature read/write
compatible because the design for it was anticipated far in advance, and
because it was before the days of enterprise kernels that have to be
supported for seven years. So we added code into ext3 to clear the htree
flag whenever the directory was modified something like two years before
the htree code made its appearance, and back then we decided it was fair
to assume no one would be using a kernel that old, or be jumping back and
forth between an ancient kernel and a modern kernel with htree enabled.
Yes, that was playing a bit fast and loose, but early on in the kernel's
history we could do that. It's not something I would do today.

The bigger deal is that, as Zach pointed out, we can't just index it by
inode number because we have to worry about hard links. Which means we
need either an explicit counter field added to the directory entry, or
some kind of new offset field. That we can't just shoehorn in and retain
backwards compatibility. And once we break backwards compatibility, we
might as well look at the full design space and see what is the most
optimal.

						- Ted
On 03/14/2012 12:48 PM, Ted Ts'o wrote:
> On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
> >
> > > We could do this if we had two b-trees, one indexed by filename and
> > > one indexed by inode number, which is what JFS (and I believe btrfs)
> > > does.
> >
> > Typically the inode number of the destination inode isn't used to index
> > entries for a readdir tree because of (wait for it) hard links. You end
> > up right back where you started, with multiple entries per key.
>
> One thing that might work is to have an extra 16-bit field in the
> directory entry that gives a signed offset to the inode number, such that
> inode+offset is a unique value within the btree sorted by inode+offset.
> Since this tree is only used for returning entries in an optimal (or as
> close to optimal as can be arranged) order, we could get away with that.

Yeah, at least that's nice and simple.

Personally, I'm always nervous when we build in new limits like this.
Some joker somewhere always manages to push up against them. But such is
life. Maybe it's the right thing in the ext* design space. The fixed
number of possible inodes certainly makes it easier to pack more inode
offsets into the telldir cookie.

- z
On Wed, Mar 14, 2012 at 08:50:02AM -0400, Ted Ts'o wrote:
> On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> > I kind of like the idea of having a separate btree with inode numbers
> > for directory reading, just because it does not affect allocation
> > policy nor the write performance, which is a good thing. Also, it has
> > been done before in btrfs and it works very well for them. The only
> > downside (aside from the format change) is the extra step when removing
> > a directory entry, but the positives outweigh the negatives.
>
> Well, there will be extra (journaled!) block reads and writes involved
> when adding or removing directory entries. So the performance of
> workloads that do a large number of directory adds and removes will have
> to be impacted. By how much is not obvious, but it will be something
> which is measurable.

Most of the difference in btrfs rm times comes from our double indexing.
We store the directory name three times: once for each dir index and once
in the inode backref. But the inode backref doesn't generate seeks the
same way the extra directory index does, so it is probably only 10% of the
cost.

We also delete file data crcs during rm, so if you're benchmarking things
to see how much it will hurt, benchmark btrfs on zero-length files.

In terms of other costs, most users of readdir will eventually wander off
and stat or open the files in the dir. With your current indexes, the
stat reads the same directory blocks that readdir used. With the double
index, the stat reads the name index instead of the readdir index.

For the common case of normal directories, today's ext4 will have the
directory blocks in cache when stat is called. Btrfs has to do extra
reads to pull in the name index. Josef worked around this by prefilling a
dentry during readdir with details on where to find the inode.

Either way, I'd definitely recommend sitting down with the xfs directory
indexes and looking at how they manage locality without the double
indexing. It's faster than btrfs and without the penalty.

It's also worth mentioning that on an SSD, there is no benefit at all to
the extra index. So when the industry finally floods the market with
hybrid PCM shingled drives, this might all be moot.

-chris
2012/3/10 Ted Ts'o <tytso@mit.edu>:
> Hey Jacek,
>
> I'm curious about the parameters of the set of directories on your
> production server. On an ext4 file system, assuming you've copied the
> directories over, what are the results of this command pipeline when you
> are cd'ed into the top of the directory hierarchy of interest (your svn
> tree, as I recall)?
>
> find . -type d -ls | awk '{print $7}' | sort -n | uniq -c
>
> I'm just curious what the distribution of directory sizes is in practice
> for your use case...

Hi Ted,

below is from the latest upstream on ext4:

 843341   4096
    165  12288
     29  16384
    164  20480
     26  24576
     22  28672
     24  32768
    118  36864
     12  40960
     19  45056
      3  49152
     12  53248
     37  57344
     10  61440
      2  65536
     13  69632
      1 131072
      1 139264
      1 143360
      1 221184
      1 249856
      1 258048
      1 262144
      1 319488
      2 335872
      1 339968

-Jacek
2012/3/11 Ted Ts'o <tytso@mit.edu>:
>
> Well, my goal in proposing this optimization is that it helps for the
> "medium size" directories in the cold cache case. The ext4 user who
> first kicked off this thread was using his file system for an SVN
> server, as I recall. I could easily believe that he has thousands of
> files; maybe even tens of thousands of files in a directory. But that
> probably still fits in 256k, or at best 512k, worth of directory blocks.

That was not an SVN server. It was a build host holding checkouts of SVN
projects.

The many-files/dirs case is common for VCSes, and SVN is not the only one
that would be affected here. AFAIR git.kernel.org was also suffering from
getdents(). The same applies to commercial products that are heavily
stuffed with many files/dirs, e.g. ClearCase or Synergy. In the case of
the latter we have been running on XFS for a while now, and I must say
that this was a huge improvement, mostly in backup times (where a dump of
Informix and the whole filesystem is ordinarily compressed into a
tarball).

The medium size you are referring to would most probably fit into 256k,
and this could be enough for 90% of cases. Large production systems
running on ext4 need backups, so those would benefit the most here.

-Jacek
On Thu, Mar 15, 2012 at 11:42:24AM +0100, Jacek Luczak wrote:
>
> That was not an SVN server. It was a build host holding checkouts of SVN
> projects.
>
> The many-files/dirs case is common for VCSes, and SVN is not the only one
> that would be affected here.

Well, with SVN it's 2x or 3x the number of files in the checked-out source
code directory, right? So if a particular source tree has 2,000 files in
a source directory, then SVN might have at most 6,000 files, and if you
assume each directory entry is 64 bytes, we're still talking about 375k.
Do you have more files than that in a directory in practice with SVN? And
if so, why?

> AFAIR git.kernel.org was also suffering from getdents().

git.kernel.org was suffering from a different problem, which was that the
git.kernel.org administrators didn't feel like automatically doing a "git
gc" on all of the repositories, and a lot of people were just doing git
pushes and not bothering to gc their repositories. Since git.kernel.org
users don't have shell access any more, the git.kernel.org administrators
have to be doing automatic git gc's.

By default git is supposed to automatically do a gc when there are more
than 6700 loose object files (which are distributed across 256 1st-level
directories, so in practice a .git/objects/XX directory shouldn't have
more than 30 objects in it, with each directory object taking 48 bytes).
The problem, I believe, is that "git push" commands weren't checking the
gc.auto limit, and that's why git.kernel.org had in the past suffered from
large directories. This is arguably a git bug, though, and as I
mentioned, since we all don't have shell access to git.kernel.org, this
has to be handled automatically now....

> The same applies to commercial products that are heavily stuffed with
> many files/dirs, e.g. ClearCase or Synergy.

How many files in a directory do we commonly see with these systems? I'm
not familiar with them, and so I don't have a good feel for what typical
directory sizes tend to be.

> The medium size you are referring to would most probably fit into 256k,
> and this could be enough for 90% of cases. Large production systems
> running on ext4 need backups, so those would benefit the most here.

Yeah, 256k or 512k is probably the best. Alternatively, the backup
programs could simply be taught to sort the directory entries by inode
number, and if that's not enough, to grab the initial block numbers using
FIEMAP and then sort by block number.

Of course, all of this optimization may or may not actually give us as
much return as we think, given that the disk is probably seeking due to
other workloads happening in parallel anyway (another reason why I am
suspicious that timing the tar command may not be an accurate way of
measuring actual performance when you have other tasks accessing the file
system in parallel with the backup).

						- Ted
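For the backup-program side of this, a minimal sketch (assuming Linux's
FS_IOC_FIEMAP ioctl, with most error handling stripped) of how a scanner
might fetch a file's first physical extent, which it could then sort on in
addition to the inode number:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the physical byte offset of the file's first extent, or 0 if it
 * cannot be determined (empty or inline files, old kernels, etc.). */
static unsigned long long first_physical_offset(const char *path)
{
        char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent)]
                __attribute__((aligned(8)));
        struct fiemap *fm = (struct fiemap *)buf;
        unsigned long long phys = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return 0;
        memset(buf, 0, sizeof(buf));
        fm->fm_length = ~0ULL;          /* map the whole file          */
        fm->fm_extent_count = 1;        /* we only need extent #0      */
        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
                phys = fm->fm_extents[0].fe_physical;
        close(fd);
        return phys;
}

int main(int argc, char **argv)
{
        int i;

        /* A backup scanner could record this per entry and sort on
         * (first physical offset, inode) before reading file contents. */
        for (i = 1; i < argc; i++)
                printf("%llu\t%s\n", first_physical_offset(argv[i]), argv[i]);
        return 0;
}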