Jeff Davis
2007-Feb-26 03:28 UTC
[zfs-discuss] Efficiency when reading the same file blocks
If you have N processes reading the same file sequentially (where file size is much greater than physical memory) from the same starting position, should I expect all N processes to finish in the same time as a single process would?

In other words, if you have one process that reads blocks from a file, is it "free" (meaning no additional total I/O cost) to have another process read the same blocks from the same file at the same time?
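For concreteness, here is a minimal sketch of the workload being asked about (the file name and reader count are made-up parameters; time it externally with time(1) or ptime(1)):

    /* nreaders.c - fork N processes that each read the same file
     * sequentially from the start.  Illustrative sketch only.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFSZ (128 * 1024)

    static void read_whole_file(const char *path)
    {
        char *buf = malloc(BUFSZ);
        int fd = open(path, O_RDONLY);
        if (fd < 0 || buf == NULL) {
            perror(path);
            exit(1);
        }
        /* Sequential scan from offset 0 to EOF. */
        for (;;) {
            ssize_t n = read(fd, buf, BUFSZ);
            if (n <= 0)
                break;
        }
        close(fd);
        free(buf);
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/tank/bigfile";
        int nproc = (argc > 2) ? atoi(argv[2]) : 4;

        for (int i = 0; i < nproc; i++) {
            if (fork() == 0) {          /* child: one sequential reader */
                read_whole_file(path);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)          /* parent: wait for all readers */
            ;
        return 0;
    }

Ideally the wall-clock time and the total disk traffic stay roughly flat as the reader count grows.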
Neil Perrin
2007-Feb-26 03:40 UTC
[zfs-discuss] Efficiency when reading the same file blocks
Jeff Davis wrote on 02/25/07 20:28:
> If you have N processes reading the same file sequentially (where file
> size is much greater than physical memory) from the same starting
> position, should I expect all N processes to finish in the same time
> as a single process would?

Yes, I would expect them to finish in the same time. There should be no additional reads because the data will already be in the ZFS cache (the ARC). Given your question, are you about to come back with a case where you are not seeing this?

Neil.
Jeff Davis
2007-Feb-26 17:05 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
> Given your question, are you about to come back with a case where you
> are not seeing this?

Actually, the case where I saw the bad behavior was in Linux using the CFQ I/O scheduler. When reading the same file sequentially, adding processes drastically reduced total disk throughput (single-disk machine). Using the Linux anticipatory scheduler worked just fine: no additional I/O costs for more processes.

That got me worried about the project I'm working on, and I wanted to understand ZFS's caching behavior better to prove to myself that the problem wouldn't happen under ZFS. Clearly the block will be in cache on the second read, but what I'd like to know is if ZFS will ask the disk to do a long, efficient sequential read of the disk, or whether it will somehow not recognize that the read is sequential because the requests are coming from different processes?
Frank Cusack
2007-Feb-26 17:30 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
On February 26, 2007 9:05:21 AM -0800 Jeff Davis <jeff95350 at yahoo.com> wrote:
> That got me worried about the project I'm working on, and I wanted to
> understand ZFS's caching behavior better to prove to myself that the
> problem wouldn't happen under ZFS. Clearly the block will be in cache on
> the second read, but what I'd like to know is if ZFS will ask the disk to
> do a long, efficient sequential read of the disk, or whether it will
> somehow not recognize that the read is sequential because the requests
> are coming from different processes?

ISTM zfs would be process-independent wrt that kind of decision. I have no clue about it, but I couldn't imagine otherwise.

But you have to be aware that logically sequential reads do not necessarily translate into physically sequential reads with zfs. zfs translates all writes into sequential writes, i.e. writes to disk are time-ordered, not data-ordered (like log-structured filesystems), so whether or not the bits on disk are physically sequential depends on HOW the file was written.

If Thomas is reading, I wonder how this affects mt-tar? Well, not tar itself, but a subsequent read of the data tar has written.

-frank
Bart Smaalders
2007-Feb-26 17:36 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
Jeff Davis wrote:
>> Given your question, are you about to come back with a case where you
>> are not seeing this?
>
> Actually, the case where I saw the bad behavior was in Linux using the CFQ I/O scheduler. When reading the same file sequentially, adding processes drastically reduced total disk throughput (single-disk machine). Using the Linux anticipatory scheduler worked just fine: no additional I/O costs for more processes.
>
> That got me worried about the project I'm working on, and I wanted to understand ZFS's caching behavior better to prove to myself that the problem wouldn't happen under ZFS. Clearly the block will be in cache on the second read, but what I'd like to know is if ZFS will ask the disk to do a long, efficient sequential read of the disk, or whether it will somehow not recognize that the read is sequential because the requests are coming from different processes?

ZFS has a pretty clever I/O scheduler; it will handle multiple readers of the same file, readers of different files, etc.; in each case prefetch is done correctly. It also handles programs that skip blocks...

You can see this pretty simply: for small configs (where a single CPU can saturate all the drives) the net throughput of the drives doesn't vary significantly whether one is reading a single file or reading 10 files in parallel.

- Bart

--
Bart Smaalders              Solaris Kernel Performance
barts at cyber.eng.sun.com   http://blogs.sun.com/barts
Roch Bourbonnais
2007-Feb-26 18:17 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
On 26 Feb 2007, at 18:30, Frank Cusack wrote:
> On February 26, 2007 9:05:21 AM -0800 Jeff Davis <jeff95350 at yahoo.com> wrote:
>> That got me worried about the project I'm working on, and I wanted to
>> understand ZFS's caching behavior better to prove to myself that the
>> problem wouldn't happen under ZFS. Clearly the block will be in cache on
>> the second read, but what I'd like to know is if ZFS will ask the disk to
>> do a long, efficient sequential read of the disk, or whether it will
>> somehow not recognize that the read is sequential because the requests
>> are coming from different processes?
>
> ISTM zfs would be process-independent wrt that kind of decision. I have
> no clue about it, but I couldn't imagine otherwise.
>
> But you have to be aware that logically sequential reads do not
> necessarily translate into physically sequential reads with zfs. zfs
> translates all writes into sequential writes, i.e. writes to disk are
> time-ordered, not data-ordered (like log-structured filesystems), so
> whether or not the bits on disk are physically sequential depends on
> HOW the file was written.
>
> If Thomas is reading, I wonder how this affects mt-tar? Well, not
> tar itself, but a subsequent read of the data tar has written.

In each transaction, all blocks that belong to one file are handled together before moving on to the next file. So I think an mt-hot tar should not have too much of an adverse effect on the on-disk layout.
Wade.Stuart at fallon.com
2007-Feb-26 18:27 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
zfs-discuss-bounces at opensolaris.org wrote on 02/26/2007 11:36:18 AM:
> Jeff Davis wrote:
>>> Given your question, are you about to come back with a case where you
>>> are not seeing this?
>>
>> Actually, the case where I saw the bad behavior was in Linux using the CFQ I/O scheduler. When reading the same file sequentially, adding processes drastically reduced total disk throughput (single-disk machine). Using the Linux anticipatory scheduler worked just fine: no additional I/O costs for more processes.
>>
>> That got me worried about the project I'm working on, and I wanted to understand ZFS's caching behavior better to prove to myself that the problem wouldn't happen under ZFS. Clearly the block will be in cache on the second read, but what I'd like to know is if ZFS will ask the disk to do a long, efficient sequential read of the disk, or whether it will somehow not recognize that the read is sequential because the requests are coming from different processes?

This gets complicated quickly. Here is how I understand it:

"N processes reading the same file sequentially (where file size is much greater than physical memory)" to me inherently means that the cache cannot hold the whole file for all of the processes. If you have N processes at different stages of the read (first block, middle block, last block) on a file like that, there will be a sliding, fragmented window of cached/prefetched data (with complex dynamics because of the I/O scheduler and the ARC). When N procs share the ARC that way, you will not get (near) 100% cache hits, and depending on how large the file is compared to the available ARC, and on other filesystem load, you may get very poor cache hit rates.

If all of the procs start reading the file at the same relative time and read at about the same rate, your cache hit rate will be higher, as that gives the ARC/prefetch/I/O scheduler the best chance of covering the reads. The closer the file size is to the available ARC memory, the closer the procs are grouped together on reads of the same blocks, and the fewer such groups of readers you have, the better your cache hit rate will be.

In any case, don't expect the I/O cost of a single reader, and don't expect N times that either; it will be somewhere in the middle depending on the specifics of the workload.

-Wade
Jeff Davis
2007-Feb-27 17:59 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
> Given your question, are you about to come back with a case where you
> are not seeing this?

As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.

ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).

I did these tests with each process being a "dd if=bigfile of=/dev/null" started at the same time, and I measured I/O rate with "zpool iostat mypool 2" and "iostat -Md 2".
Frank Hofmann
2007-Feb-27 18:24 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
On Tue, 27 Feb 2007, Jeff Davis wrote:
>> Given your question, are you about to come back with a case where you
>> are not seeing this?
>
> As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.

UFS readahead isn't MT-aware - it starts trashing when multiple threads perform reads of the same blocks. UFS readahead only works if it's a single thread per file, as the readahead state, i_nextr, is per-inode (not per-thread) state. Multiple concurrent readers trash it for each other, as there's only one per file.

> ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).

ZFS caches multiple readahead states - see the leading comment in usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.

FrankH.

> I did these tests with each process being a "dd if=bigfile of=/dev/null" started at the same time, and I measured I/O rate with "zpool iostat mypool 2" and "iostat -Md 2".
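To see why a single per-file marker breaks down, here is a toy model (not the actual UFS code; the struct and field names are invented) of two readers scanning the same blocks in lockstep against one shared "next expected block" field:

    /* readahead_toy.c - toy model of a per-inode "next expected block"
     * marker, in the spirit of UFS i_nextr; not real UFS code.
     */
    #include <stdio.h>

    struct toy_inode {
        long next_expected;     /* one readahead marker for the whole file */
    };

    /* Returns 1 if this access looks sequential (readahead would trigger). */
    static int access_block(struct toy_inode *ip, long blk)
    {
        int sequential = (blk == ip->next_expected);
        ip->next_expected = blk + 1;
        return sequential;
    }

    int main(void)
    {
        struct toy_inode ino = { 0 };
        long nblocks = 1000;
        long seq_hits = 0, total = 0;

        /* Two readers scanning the same file, interleaved block by block. */
        for (long blk = 0; blk < nblocks; blk++) {
            seq_hits += access_block(&ino, blk);   /* reader A */
            total++;
            seq_hits += access_block(&ino, blk);   /* reader B re-reads blk */
            total++;
        }
        printf("%ld of %ld accesses looked sequential\n", seq_hits, total);
        /* With one reader this would be ~100%; with two interleaved
         * readers half the accesses look random, so readahead keeps
         * getting cancelled. */
        return 0;
    }

Keeping more than one readahead state per file, as ZFS does, avoids exactly this collision.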
Jeff Davis
2007-Feb-27 18:40 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
> But you have to be aware that logically sequential reads do not
> necessarily translate into physically sequential reads with zfs.

I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:

http://blogs.sun.com/bonwick/entry/zfs_block_allocation
http://blogs.sun.com/realneel/entry/zfs_and_databases
Frank Cusack
2007-Feb-27 18:42 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
On February 27, 2007 10:40:57 AM -0800 Jeff Davis <jeff95350 at yahoo.com> wrote:
>> But you have to be aware that logically sequential reads do not
>> necessarily translate into physically sequential reads with zfs.
>
> I understand that the COW design can fragment files. I'm still trying to
> understand how that would affect a database. It seems like that may be
> bad for performance on single disks due to the seeking, but I would
> expect that to be less significant when you have many spindles.

I'd expect that as well.

-frank
Roch - PAE
2007-Feb-28 11:17 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Jeff Davis writes:
>> But you have to be aware that logically sequential reads do not
>> necessarily translate into physically sequential reads with zfs.
>
> I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:
>
> http://blogs.sun.com/bonwick/entry/zfs_block_allocation
> http://blogs.sun.com/realneel/entry/zfs_and_databases

Here is my take on this:

DB updates (writes) are mostly governed by the synchronous write code path, which for ZFS means ZIL performance. It's already quite good in that it aggregates multiple updates into few I/Os, and some further improvements are in the works. COW, in general, greatly simplifies the write code path.

DB reads in a transactional workload are mostly random. If the DB is not cacheable, the performance will be that of a head seek no matter what FS is used (since we can't guess in advance where to seek, the COW nature neither helps nor hinders performance).

DB reads in a decision-support workload can benefit from good prefetching (since here we actually do know where the next seeks will be).

-r
Roch - PAE
2007-Feb-28 17:26 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
Frank Hofmann writes:
> On Tue, 27 Feb 2007, Jeff Davis wrote:
>>> Given your question, are you about to come back with a case where you
>>> are not seeing this?
>>
>> As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.
>
> UFS readahead isn't MT-aware - it starts trashing when multiple threads
> perform reads of the same blocks. UFS readahead only works if it's a
> single thread per file, as the readahead state, i_nextr, is per-inode
> (not per-thread) state. Multiple concurrent readers trash it for each
> other, as there's only one per file.

To qualify 'trashing': it means UFS loses track of the access pattern, treats the workload as random, and so does not do any readahead.

>> ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).
>
> ZFS caches multiple readahead states - see the leading comment in
> usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.

The vdev_cache is the low-level, device-level prefetch (on an I/O for 8K, read 64K of whatever happens to be under the disk head). dmu_zfetch.c is where the logical prefetching occurs.

-r
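As a rough sketch of that device-level behaviour (simplified and hypothetical, not the real vdev_cache code; the 64K region size is taken from the description above, everything else is invented): a small read is rounded down to an aligned 64K region, the whole region is read and cached, and nearby small reads are then served from memory.

    /* devcache_sketch.c - simplified model of a device-level read cache
     * that inflates small reads to aligned 64K regions.  Illustrative
     * only; not the real ZFS vdev_cache.
     */
    #include <stdio.h>
    #include <string.h>

    #define REGION (64 * 1024)

    struct dev_cache {
        long cached_base;          /* device offset of cached region, -1 if empty */
        char data[REGION];
    };

    /* Stand-in for an actual device read of one aligned 64K region. */
    static void device_read(long base, char *buf)
    {
        memset(buf, 0, REGION);     /* pretend we read from the device */
        printf("device read: offset %ld, length %d\n", base, REGION);
    }

    /* Serve a small read, going to the device only when the region is not cached. */
    static void cached_read(struct dev_cache *dc, long off, long len, char *out)
    {
        long base = (off / REGION) * REGION;        /* align down to 64K */
        if (dc->cached_base != base) {
            device_read(base, dc->data);            /* one big read ... */
            dc->cached_base = base;
        }
        memcpy(out, dc->data + (off - base), len);  /* ... serves many small ones */
    }

    int main(void)
    {
        struct dev_cache dc = { .cached_base = -1 };
        char buf[8 * 1024];

        /* Three 8K reads within the same 64K region: only one device read. */
        cached_read(&dc, 0,     sizeof(buf), buf);
        cached_read(&dc, 8192,  sizeof(buf), buf);
        cached_read(&dc, 16384, sizeof(buf), buf);
        return 0;
    }

The logical, access-pattern-aware prefetching in dmu_zfetch.c sits above this layer.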
Erblichs
2007-Feb-28 23:43 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
ZFS Group,

My two cents...

Currently, in my experience, it is a waste of time to try to guarantee the "exact" location of disk blocks with any FS. A simple exception is bad blocks, where a neighboring block will suffice.

Second, current disk controllers have logic that translates block addresses, so you can't be sure outside of the firmware where a disk block actually is. Yes, I have written code in this area before.

Third, some FSs do a read-modify-write, where the write is NOT overwriting the original location of the read. Why? For a couple of reasons. One is that the original read may have come from a fragment. Some do it for FS consistency: the write may become a partial write in some circumstances (e.g. a crash), and writing to a second block location preserves FS consistency and the ability to recover the original contents. No overwrite. Another reason is that sometimes we are filling a hole within a FS object, from a base address out to a new offset; the ability to concatenate lets us reduce the number of future seeks and small reads/writes, at the cost of a slightly longer transfer time for the larger disk block.

Thus, the tradeoff is that we accept that we waste some FS space, we may not fully optimize the location of the disk block, and individual reads and writes of these larger blocks take longer, but... we seek less, the per-byte overhead is less, we can order our writes so that we again seek less, our writes can be delayed (assuming that we might write multiple times and then commit on close) to minimize the number of actual write operations, we can prioritize our reads over our writes to decrease read latency, etc.

Bottom line: performance may suffer if we do a lot of random small read-modify-writes within FS objects that use a very large disk block. Since the actual change to the file is small, each small write outside of a delayed-write window will consume at least one disk block. Moreover, some writes are to FS objects that are write-through, and thus each small write will consume a new disk block.

Mitchell Erblich
-----------------

Roch - PAE wrote:
> Jeff Davis writes:
>>> But you have to be aware that logically sequential reads do not
>>> necessarily translate into physically sequential reads with zfs.
>>
>> I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:
>>
>> http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>> http://blogs.sun.com/realneel/entry/zfs_and_databases
>
> Here is my take on this:
>
> DB updates (writes) are mostly governed by the synchronous write code
> path, which for ZFS means ZIL performance. It's already quite good in
> that it aggregates multiple updates into few I/Os, and some further
> improvements are in the works. COW, in general, greatly simplifies the
> write code path.
>
> DB reads in a transactional workload are mostly random. If the DB is
> not cacheable, the performance will be that of a head seek no matter
> what FS is used (since we can't guess in advance where to seek, the
> COW nature neither helps nor hinders performance).
>
> DB reads in a decision-support workload can benefit from good
> prefetching (since here we actually do know where the next seeks will
> be).
>
> -r
Toby Thain
2007-Mar-01 00:04 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
On 28-Feb-07, at 6:43 PM, Erblichs wrote:
> ZFS Group,
>
> My two cents...
>
> Currently, in my experience, it is a waste of time to try to
> guarantee the "exact" location of disk blocks with any FS.

? Sounds like you're confusing logical location with physical location throughout this post.

I'm sure Roch meant logical location.

--T

> A simple exception is bad blocks, where a neighboring block
> will suffice.
>
> Second, current disk controllers have logic that translates block
> addresses, so you can't be sure outside of the firmware where a
> disk block actually is. Yes, I have written code in this area before.
>
> Third, some FSs do a read-modify-write, where the write is NOT
> overwriting the original location of the read.
...
Erblichs
2007-Mar-01 00:40 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Toby Thain,

No, "physical location" was for the exact location, and "logical" was for the rest of my info. But what I might not have made clear was the use of fragments. There are two types of fragments: one is the partial use of a logical disk block, and the other, which I was also trying to refer to, is the moving of modified sections of a file. The first was well used in the Joy FFS implementation, back when a FS and drive tended to have a high per-byte cost and were fairly small.

Now, let's make this perfectly clear. If a FS object is large and written "somewhat" in sequence as a stream of bytes, and random logical or physical blocks of it are later modified, the new FS object will be laid out less sequentially, and that CAN decrease read performance. Sorry, I tend to care less about write performance, because writes tend to be async, without threads blocking while waiting for their operation to complete. This will happen MOST as the FS fills and less optimal locations have to be found for the COW blocks. The same problem happens with memory on OSs that support multiple page sizes, where a well-used system may not be able to allocate large pages due to fragmentation. Yes, this is an overloaded term... :)

Thus, FS performance may suffer even if there are just a lot of 1-byte changes to frequently accessed FS objects. If this occurs, either keep a larger FS, clean out the FS more frequently, or back up, clean up, and then restore to get newly sequential FS objects.

Mitchell Erblich
-----------------

Toby Thain wrote:
> On 28-Feb-07, at 6:43 PM, Erblichs wrote:
>> ZFS Group,
>>
>> My two cents...
>>
>> Currently, in my experience, it is a waste of time to try to
>> guarantee the "exact" location of disk blocks with any FS.
>
> ? Sounds like you're confusing logical location with physical
> location throughout this post.
>
> I'm sure Roch meant logical location.
>
> --T
>
>> A simple exception is bad blocks, where a neighboring block
>> will suffice.
>>
>> Second, current disk controllers have logic that translates block
>> addresses, so you can't be sure outside of the firmware where a
>> disk block actually is. Yes, I have written code in this area before.
>>
>> Third, some FSs do a read-modify-write, where the write is NOT
>> overwriting the original location of the read.
> ...