Hi,

I think that the disk allocation size of each file should increase
monotonically while the file is being written.
But it sometimes returns to 0. Is that correct?

The result of the test at 2.6.37-rc4 is shown below.
(see inode no. 291)

# df -T /test14
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/sdd14   btrfs     4162560      8736   3709440   1% /test14
# dd if=/dev/zero of=/test14/dir/as001.26603 bs=1M count=100
# dd if=/dev/zero of=/test14/dir/as002.26603 bs=1M count=200
# dd if=/dev/zero of=/test14/dir/sy001.26603 bs=1M count=300 oflag=direct
# dd if=/dev/zero of=/test14/dir/as003.26603 bs=1M count=400
# ls -lis /test14/dir
total 406528
   288      0 -rw-r--r-- 1 root root 104857600 Dec  7 15:07 as001.26603
   289      0 -rw-r--r-- 1 root root 209715200 Dec  7 15:07 as002.26603
-> 291  99328 -rw-r--r-- 1 root root 419430400 Dec  7 15:08 as003.26603
   290 307200 -rw-r--r-- 1 root root 314572800 Dec  7 15:08 sy001.26603
# sleep 3
# ls -lis /test14/dir
total 406528
   288      0 -rw-r--r-- 1 root root 104857600 Dec  7 15:07 as001.26603
   289      0 -rw-r--r-- 1 root root 209715200 Dec  7 15:07 as002.26603
-> 291  99328 -rw-r--r-- 1 root root 419430400 Dec  7 15:08 as003.26603
   290 307200 -rw-r--r-- 1 root root 314572800 Dec  7 15:08 sy001.26603
# sleep 3
# ls -lis /test14/dir
total 307200
   288      0 -rw-r--r-- 1 root root 104857600 Dec  7 15:07 as001.26603
   289      0 -rw-r--r-- 1 root root 209715200 Dec  7 15:07 as002.26603
-> 291      0 -rw-r--r-- 1 root root 419430400 Dec  7 15:08 as003.26603
   290 307200 -rw-r--r-- 1 root root 314572800 Dec  7 15:08 sy001.26603
# sleep 3
# ls -lis /test14/dir
total 409600
   288 102400 -rw-r--r-- 1 root root 104857600 Dec  7 15:07 as001.26603
   289      0 -rw-r--r-- 1 root root 209715200 Dec  7 15:07 as002.26603
-> 291      0 -rw-r--r-- 1 root root 419430400 Dec  7 15:08 as003.26603
   290 307200 -rw-r--r-- 1 root root 314572800 Dec  7 15:08 sy001.26603
# sync
# ls -lis /test14/dir
total 1024000
   288 102400 -rw-r--r-- 1 root root 104857600 Dec  7 15:07 as001.26603
   289 204800 -rw-r--r-- 1 root root 209715200 Dec  7 15:07 as002.26603
-> 291 409600 -rw-r--r-- 1 root root 419430400 Dec  7 15:08 as003.26603
   290 307200 -rw-r--r-- 1 root root 314572800 Dec  7 15:08 sy001.26603

The trace result of btrfs_getattr() is shown below.

Dec  7 15:08:03 luna kernel: ino:291 blocks:198656 i_blocks:0 i_bytes:0 delalloc_bytes:101711872
Dec  7 15:08:06 luna kernel: ino:291 blocks:198656 i_blocks:0 i_bytes:0 delalloc_bytes:101711872
Dec  7 15:08:09 luna kernel: ino:291 blocks:0 i_blocks:0 i_bytes:0 delalloc_bytes:0
Dec  7 15:08:12 luna kernel: ino:291 blocks:0 i_blocks:0 i_bytes:0 delalloc_bytes:0
Dec  7 15:08:18 luna kernel: ino:291 blocks:819200 i_blocks:819200 i_bytes:0 delalloc_bytes:0

Regards,
Itoh

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
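The observation above comes from running ls -s by hand every few seconds. A minimal sketch of the same polling from user space is shown here, assuming only the standard os.stat() call; the helper names sample_blocks and write_zeros are made up for illustration, and the script does not reproduce the btrfs behaviour unless the target path is on a btrfs mount with a kernel that has this accounting window.

```python
import os
import tempfile
import time


def sample_blocks(path, samples=5, interval=0.0):
    """Return a list of (st_size, st_blocks) pairs read via stat()."""
    readings = []
    for _ in range(samples):
        st = os.stat(path)
        # st_blocks is counted in 512-byte units.
        readings.append((st.st_size, st.st_blocks))
        time.sleep(interval)
    return readings


def write_zeros(path, mib):
    """Buffered write of `mib` MiB of zeros (no O_DIRECT, no fsync)."""
    chunk = b"\0" * (1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(chunk)


if __name__ == "__main__":
    # A temporary directory is used here only so the script runs anywhere;
    # to look for the window, point `target` at a file on a btrfs mount.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "as003.test")
        write_zeros(target, 1)
        for size, blocks in sample_blocks(target, samples=3, interval=0.1):
            print(f"size={size} st_blocks={blocks}")
```

In the original test the interval between readings was 3 seconds and the file was 400 MiB, so larger values of mib and interval are closer to what was reported.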
Tsutomu Itoh wrote:
> Hi,
>
> I think that the disk allocation size of each file should increase
> monotonically while the file is being written.
> But it sometimes returns to 0. Is that correct?

The number of blocks is computed as:

    stat->blocks = (inode_get_bytes(inode) +
                    BTRFS_I(inode)->delalloc_bytes) >> 9;

So I think that after delalloc_bytes is decremented and before
inode_add_bytes() is called, you may see a value of 0.
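The expression quoted above can be checked against the btrfs_getattr() trace from the original report. The following is a small arithmetic sketch, not kernel code; the function names are illustrative, and the division by two assumes the ls -s column is in 1K units, as the listings in the report suggest.

```python
SECTOR_SHIFT = 9  # the quoted expression shifts by 9, i.e. 512-byte units


def stat_blocks(inode_bytes, delalloc_bytes):
    """Mirror of the quoted expression: (inode bytes + delalloc bytes) >> 9."""
    return (inode_bytes + delalloc_bytes) >> SECTOR_SHIFT


def ls_s_units(blocks_512):
    """Convert 512-byte blocks to 1K units."""
    return blocks_512 // 2


# 15:08:03 in the trace: nothing in the inode yet, 101711872 delalloc bytes.
early = stat_blocks(0, 101711872)
# 15:08:09 in the trace: delalloc already dropped, inode bytes not yet added.
window = stat_blocks(0, 0)
# 15:08:18 in the trace: all 419430400 bytes recorded in the inode.
final = stat_blocks(419430400, 0)

print(early, ls_s_units(early))
print(window, ls_s_units(window))
print(final, ls_s_units(final))
```

The three results correspond to the blocks: values 198656, 0 and 819200 in the trace, and to the 99328, 0 and 409600 shown by ls -s for inode 291.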
Chris Mason
2010-Dec-07 18:44 UTC
Re: The value displayed by 'ls -s' command is strange.
Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
> Hi,
>
> I think that the disk allocation size of each file should increase
> monotonically while the file is being written.
> But it sometimes returns to 0. Is that correct?

Well, there's a window during the processing of delayed allocation where
we don't have the bytes recorded as delalloc and we don't have the bytes
recorded in the inode yet. That's why they are showing up as zero.

We don't call inode_add_bytes() until after we insert the extent, but we
drop the delalloc byte count on the file before the IO is done.

Fixing it will be a little tricky because all the extent accounting
assumes the inode_add_bytes happens at extent insertion time.

-chris
On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@oracle.com> wrote:
> Well, there's a window during the processing of delayed allocation where
> we don't have the bytes recorded as delalloc and we don't have the bytes
> recorded in the inode yet. That's why they are showing up as zero.
>
> We don't call inode_add_bytes() until after we insert the extent, but we
> drop the delalloc byte count on the file before the IO is done.
>
> Fixing it will be a little tricky because all the extent accounting
> assumes the inode_add_bytes happens at extent insertion time.

How does opening the inode with O_APPEND during this window know where
to write the bytes? If it's a pointer/cursor to the EOF then that
size could be used during the window. Is that right?
Chris Mason
2010-Dec-07 19:29 UTC
Re: The value displayed by 'ls -s' command is strange.
Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
> How does opening the inode with O_APPEND during this window know where
> to write the bytes? If it's a pointer/cursor to the EOF then that
> size could be used during the window. Is that right?

This counter records the number of blocks allocated to the file, and
reading it with ls -l or stat is somewhat racy by nature. Most of the
time it's fine; btrfs just has a really big window where the results from
ls -l seem wrong.

But the counter really means nothing to the btrfs internals. When we
do file operations we go based on the extent pointers we find in the
tree and i_size (i_size is strictly maintained).

The incorrect results are confusing but they don't hurt the metadata
itself.

-chris
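The distinction drawn above, that appends follow the file size and not the block counter, can be seen from user space on any filesystem. This is a rough sketch using only standard os calls; the helper name append_after_truncate is made up for illustration, and it says nothing about btrfs internals. It sets a file's size without writing data, so the size and the block count can disagree (on filesystems that support sparse files), and then appends with O_APPEND.

```python
import os
import tempfile


def append_after_truncate(path, logical_size, payload):
    """Set a file's size without writing data, then append with O_APPEND."""
    with open(path, "wb") as f:
        f.truncate(logical_size)  # changes the size only; no data is written
    before = os.stat(path)
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        os.write(fd, payload)     # O_APPEND positions the write at end of file
    finally:
        os.close(fd)
    after = os.stat(path)
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        before, after = append_after_truncate(
            os.path.join(d, "sparse.bin"), 1024 * 1024, b"tail")
        print("before:", before.st_size, before.st_blocks)
        print("after: ", after.st_size, after.st_blocks)
```

Whatever st_blocks reports before the append, the payload lands at the offset given by the file size, which matches the point that the block counter plays no part in placing the data.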
On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@oracle.com> wrote:
> This counter records the number of blocks allocated to the file, and
> reading it with ls -l or stat is somewhat racy by nature. Most of the
> time it's fine; btrfs just has a really big window where the results from
> ls -l seem wrong.

I see. Is it using per-cpu vars or something similar?

> But the counter really means nothing to the btrfs internals. When we
> do file operations we go based on the extent pointers we find in the
> tree and i_size (i_size is strictly maintained).

Would it be too heavy of an operation to have stat walk the btrfs tree
to get its data?

> The incorrect results are confusing but they don't hurt the metadata
> itself.
Chris Mason
2010-Dec-07 20:15 UTC
Re: The value displayed by 'ls -s' command is strange.
Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
> I see. Is it using per-cpu vars or something similar?

Our stat function returns the block count in the inode plus the number
of bytes we have accounted as delayed allocation.

As we do writes to the file, the delayed allocation count goes up and
then eventually we decide we need to do some IO.

Before we do the IO, we have to decide where on the disk to write the
extents. Once that is decided, we decrement the count of delayed
allocation bytes.

This is when stat starts returning the wrong answer.

Then we do the IO, and when the IO is done we actually insert the file
extents into the file metadata. This is when stat starts returning the
right answer again.

The whole setup sounds strange, but this is how btrfs implements the
semantics from data=ordered. We don't update the file to point to
the new blocks until after the IO is done, so we never have to wait on
the data IO before we can do a transaction commit. It avoids all kinds
of latencies with fsync and other problems.

One easy solution is to just add another counter in the in-memory inode
for the number of bytes in flight that aren't accounted for in other
places. But I'd rather not make the inode any bigger, so I'll have to
think if we can solve this another way.

> Would it be too heavy of an operation to have stat walk the btrfs tree
> to get its data?

I'm afraid so, stat is fairly performance critical.

-chris
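The sequence described above can be sketched as a toy model. This is not kernel code: the class ToyInode, its methods, and the field in_flight_bytes are hypothetical names, with in_flight_bytes standing in for the extra in-memory counter mentioned as one possible fix. The model only tracks the three counters through a single buffered write.

```python
from dataclasses import dataclass


@dataclass
class ToyInode:
    """Toy model of the counters discussed in the thread (illustration only)."""
    inode_bytes: int = 0      # bytes recorded in the inode after extent insertion
    delalloc_bytes: int = 0   # bytes accounted as delayed allocation
    in_flight_bytes: int = 0  # hypothetical extra counter for bytes under IO

    def buffered_write(self, n):
        # Writes to the file raise the delayed allocation count.
        self.delalloc_bytes += n

    def start_io(self, n):
        # Disk location decided: the delalloc count is dropped before the IO.
        self.delalloc_bytes -= n
        self.in_flight_bytes += n

    def finish_io(self, n):
        # IO done: extents inserted, bytes finally recorded in the inode.
        self.in_flight_bytes -= n
        self.inode_bytes += n

    def stat_blocks(self, count_in_flight=False):
        total = self.inode_bytes + self.delalloc_bytes
        if count_in_flight:
            total += self.in_flight_bytes
        return total >> 9  # 512-byte units


def trace(n, count_in_flight=False):
    """Return stat_blocks after the write, after IO start, and after IO end."""
    ino = ToyInode()
    readings = []
    for step in (ino.buffered_write, ino.start_io, ino.finish_io):
        step(n)
        readings.append(ino.stat_blocks(count_in_flight))
    return readings


if __name__ == "__main__":
    print(trace(400 * 1024 * 1024))
    print(trace(400 * 1024 * 1024, count_in_flight=True))
```

Without the extra counter the middle reading drops to zero, which is the window seen for inode 291; with it, the reading stays constant across all three steps. The cost noted in the message, a larger in-memory inode, is exactly the extra field in this model.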
On Tue, Dec 7, 2010 at 12:15 PM, Chris Mason <chris.mason@oracle.com> wrote:

Ok, so to make sure I fully understand, I'm going to write some
pseudocode based on your description.

> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.

    stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes

> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
>
> Before we do the IO, we have to decide where on the disk to write the
> extents.

    inode_a2 = inode_a1

inode_a1 and inode_a2 are the same inode, but inode_a2 has a different
list of extents and is not written yet (in the case of appending, most
of the extents will be the same in the two extent lists, but inode_a2
will have more extents for the newly appended data).

> Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.

    inode_a2.bytes += inode_a1_delayed_allocation_bytes
    inode_a1_delayed_allocation_bytes -= inode_a1_delayed_allocation_bytes
    stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes

Is it possible to have stat read from inode_a2 during this window? So
it would be instead:

    stat = inode_a2.bytes

> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata. This is when stat starts returning the
> right answer again.

    /* implicit when write completes */
    inode_a1 = inode_a2
    kfree(inode_a2)
    stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes
Tsutomu Itoh
2010-Dec-07 23:53 UTC
Re: The value displayed by 'ls -s' command is strange.
(2010/12/07 18:25), Li Zefan wrote:
> The # of blocks is:
>
> stat->blocks = (inode_get_bytes(inode) +
>                 BTRFS_I(inode)->delalloc_bytes) >> 9;
>
> So I think after sub(delalloc_bytes) and before inode_add_bytes(), you may
> see 0 value.

Yes, I think so too.
But that state seems to last too long to be explained by update timing
alone...
Tsutomu Itoh
2010-Dec-08 00:15 UTC
Re: The value displayed by 'ls -s' command is strange.
(2010/12/08 5:15), Chris Mason wrote:
> Before we do the IO, we have to decide where on the disk to write the
> extents. Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.
>
> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata. This is when stat starts returning the
> right answer again.

I understand. However, I worry that users will be confused because the
incorrect state lasts so long.
On Wed, 08 Dec 2010 08:53:55 +0900, Tsutomu Itoh wrote:
>>> I think that the disk allocation size of each file becomes a monotone increase
>>> when the file is made.
>>> But, it sometimes return to 0. Is it correct?
>>>
>>
>> The # of blocks is:
>>
>>   stat->blocks = (inode_get_bytes(inode) +
>>                   BTRFS_I(inode)->delalloc_bytes) >> 9;
>>
>> So I think after sub(delalloc_bytes) and before inode_add_bytes(), you may
>> see 0 value.
>
> Yes, I also think so.
> But, I think that such a state is too long for only the update timing...

Several months ago, someone posted a patch to get the allocated size of a
compressed file:

http://marc.info/?l=linux-btrfs&m=128109745012238&w=2

This patch may help you implement what you need.

Regards
Miao
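As a worked check of the block-count formula quoted in this thread against the btrfs_getattr() trace and the ls output, the arithmetic can be written out in C. This is a sketch, not kernel code: the `sketch_*` helper names are hypothetical. It assumes that the generic inode keeps `i_blocks` in 512-byte units with `i_bytes` as the remainder, and that the `ls -s` column in the test output is in 1 KiB units, which is what the numbers suggest.

```c
#include <stdint.h>

/* Bytes held by the inode: i_blocks counts 512-byte units and i_bytes
 * holds the remainder below 512 (assumed layout, see lead-in). */
static uint64_t sketch_inode_bytes(uint64_t i_blocks, uint64_t i_bytes)
{
    return (i_blocks << 9) + i_bytes;
}

/* The formula from the thread: inode bytes plus delalloc bytes,
 * converted to 512-byte blocks. */
static uint64_t sketch_stat_blocks(uint64_t inode_bytes, uint64_t delalloc_bytes)
{
    return (inode_bytes + delalloc_bytes) >> 9;
}

/* Convert 512-byte blocks to the 1 KiB units that ls -s appears to print. */
static uint64_t sketch_ls_kib(uint64_t stat_blocks)
{
    return stat_blocks / 2;
}
```

With the trace values for inode 291: delalloc_bytes:101711872 with an empty inode gives 101711872 >> 9 = 198656 blocks, which ls shows as 99328. Once both counters are 0 the result is 0. After sync, i_blocks:819200 gives 819200 blocks, shown as 409600, matching the 419430400-byte file size.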