Jeff Davis
2007-Feb-26 03:28 UTC
[zfs-discuss] Efficiency when reading the same file blocks
If you have N processes reading the same file sequentially (where file size is much greater than physical memory) from the same starting position, should I expect all N processes to finish in the same time as a single process would?

In other words, if you have one process that reads blocks from a file, is it "free" (meaning no additional total I/O cost) to have another process read the same blocks from the same file at the same time?
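For concreteness, here is a minimal sketch of the workload being asked about (the file name and reader count are made-up parameters; time it externally with time(1) or ptime(1)):

    /* nreaders.c - fork N processes that each read the same file
     * sequentially from the start.  Illustrative sketch only.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFSZ (128 * 1024)

    static void read_whole_file(const char *path)
    {
        char *buf = malloc(BUFSZ);
        int fd = open(path, O_RDONLY);
        if (fd < 0 || buf == NULL) {
            perror(path);
            exit(1);
        }
        /* Sequential scan from offset 0 to EOF. */
        for (;;) {
            ssize_t n = read(fd, buf, BUFSZ);
            if (n <= 0)
                break;
        }
        close(fd);
        free(buf);
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/tank/bigfile";
        int nproc = (argc > 2) ? atoi(argv[2]) : 4;

        for (int i = 0; i < nproc; i++) {
            if (fork() == 0) {          /* child: one sequential reader */
                read_whole_file(path);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)          /* parent: wait for all readers */
            ;
        return 0;
    }

Ideally the wall-clock time and the total disk traffic stay roughly flat as the reader count grows.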
Neil Perrin
2007-Feb-26 03:40 UTC
[zfs-discuss] Efficiency when reading the same file blocks
Jeff Davis wrote on 02/25/07 20:28:
> If you have N processes reading the same file sequentially (where file
> size is much greater than physical memory) from the same starting
> position, should I expect all N processes to finish in the same time
> as a single process would?

Yes, I would expect them to finish in the same time. There should be no additional reads because the data will already be in the ZFS cache (the ARC). Given your question, are you about to come back with a case where you are not seeing this?

Neil.
Jeff Davis
2007-Feb-26 17:05 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
> Given your question, are you about to come back with a case where you
> are not seeing this?

Actually, the case where I saw the bad behavior was in Linux using the CFQ I/O scheduler. When reading the same file sequentially, adding processes drastically reduced total disk throughput (single-disk machine). Using the Linux anticipatory scheduler worked just fine: no additional I/O costs for more processes.

That got me worried about the project I'm working on, and I wanted to understand ZFS's caching behavior better to prove to myself that the problem wouldn't happen under ZFS. Clearly the block will be in cache on the second read, but what I'd like to know is if ZFS will ask the disk to do a long, efficient sequential read of the disk, or whether it will somehow not recognize that the read is sequential because the requests are coming from different processes?
Frank Cusack
2007-Feb-26 17:30 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
On February 26, 2007 9:05:21 AM -0800 Jeff Davis <jeff95350 at yahoo.com> wrote:
> That got me worried about the project I'm working on, and I wanted to
> understand ZFS's caching behavior better to prove to myself that the
> problem wouldn't happen under ZFS. Clearly the block will be in cache on
> the second read, but what I'd like to know is if ZFS will ask the disk to
> do a long, efficient sequential read of the disk, or whether it will
> somehow not recognize that the read is sequential because the requests
> are coming from different processes?

ISTM zfs would be process-independent wrt that kind of decision. I have no clue about it, but I couldn't imagine otherwise.

But you have to be aware that logically sequential reads do not necessarily translate into physically sequential reads with zfs. zfs translates all writes into sequential writes, i.e. writes to disk are time-ordered, not data-ordered (like log-structured filesystems), so whether or not the bits on disk are physically sequential depends on HOW the file was written.

If Thomas is reading, I wonder how this affects mt-tar? Well, not tar itself, but a subsequent read of the data tar has written.

-frank
Bart Smaalders
2007-Feb-26 17:36 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
Jeff Davis wrote:
>> Given your question, are you about to come back with a case where you
>> are not seeing this?
>
> Actually, the case where I saw the bad behavior was in Linux using the CFQ I/O scheduler. When reading the same file sequentially, adding processes drastically reduced total disk throughput (single-disk machine). Using the Linux anticipatory scheduler worked just fine: no additional I/O costs for more processes.
>
> That got me worried about the project I'm working on, and I wanted to understand ZFS's caching behavior better to prove to myself that the problem wouldn't happen under ZFS. Clearly the block will be in cache on the second read, but what I'd like to know is if ZFS will ask the disk to do a long, efficient sequential read of the disk, or whether it will somehow not recognize that the read is sequential because the requests are coming from different processes?

ZFS has a pretty clever I/O scheduler; it will handle multiple readers of the same file, readers of different files, etc.; in each case prefetch is done correctly. It also handles programs that skip blocks...

You can see this pretty simply: for small configs (where a single CPU can saturate all the drives) the net throughput of the drives doesn't vary significantly whether one is reading a single file or reading 10 files in parallel.

- Bart

--
Bart Smaalders              Solaris Kernel Performance
barts at cyber.eng.sun.com   http://blogs.sun.com/barts
Roch Bourbonnais
2007-Feb-26 18:17 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
On 26 Feb 2007, at 18:30, Frank Cusack wrote:
> On February 26, 2007 9:05:21 AM -0800 Jeff Davis <jeff95350 at yahoo.com> wrote:
>> That got me worried about the project I'm working on, and I wanted to
>> understand ZFS's caching behavior better to prove to myself that the
>> problem wouldn't happen under ZFS. Clearly the block will be in cache on
>> the second read, but what I'd like to know is if ZFS will ask the disk to
>> do a long, efficient sequential read of the disk, or whether it will
>> somehow not recognize that the read is sequential because the requests
>> are coming from different processes?
>
> ISTM zfs would be process-independent wrt that kind of decision. I have
> no clue about it, but I couldn't imagine otherwise.
>
> But you have to be aware that logically sequential reads do not
> necessarily translate into physically sequential reads with zfs. zfs
> translates all writes into sequential writes, i.e. writes to disk are
> time-ordered, not data-ordered (like log-structured filesystems), so
> whether or not the bits on disk are physically sequential depends on
> HOW the file was written.
>
> If Thomas is reading, I wonder how this affects mt-tar? Well, not
> tar itself, but a subsequent read of the data tar has written.

In each transaction, all blocks that belong to one file are handled together before moving on to the next file. So I think an mt-hot tar should not have too much of an adverse effect on the on-disk layout.
Wade.Stuart at fallon.com
2007-Feb-26 18:27 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
zfs-discuss-bounces at opensolaris.org wrote on 02/26/2007 11:36:18 AM:
> Jeff Davis wrote:
>>> Given your question, are you about to come back with a case where you
>>> are not seeing this?
>>
>> Actually, the case where I saw the bad behavior was in Linux using the CFQ I/O scheduler. When reading the same file sequentially, adding processes drastically reduced total disk throughput (single-disk machine). Using the Linux anticipatory scheduler worked just fine: no additional I/O costs for more processes.
>>
>> That got me worried about the project I'm working on, and I wanted to understand ZFS's caching behavior better to prove to myself that the problem wouldn't happen under ZFS. Clearly the block will be in cache on the second read, but what I'd like to know is if ZFS will ask the disk to do a long, efficient sequential read of the disk, or whether it will somehow not recognize that the read is sequential because the requests are coming from different processes?

This gets complicated quickly. Here is how I understand it:

"N processes reading the same file sequentially (where file size is much greater than physical memory)" to me inherently means that the cache cannot hold the whole file for all of the processes. If you have N processes at different stages of the read (first block, middle block, last block) on a file like that, there will be a sliding, fragmented window of cached/prefetched data (with complex dynamics because of the I/O scheduler and the ARC). When N procs share the ARC that way, you will not get (near) 100% cache hits, and depending on how large the file is compared to the available ARC, and on other filesystem load, you may get very poor cache hit rates.

If all of the procs start reading the file at the same relative time and read at about the same rate, your cache hit rate will be higher, as that gives the ARC/prefetch/I/O scheduler the best chance of covering the reads. The closer the file size is to the available ARC memory, the closer the procs are grouped together on reads of the same blocks, and the fewer such groups of readers you have, the better your cache hit rate will be.

In any case, don't expect the I/O cost of a single reader, and don't expect N times that either; it will be somewhere in the middle depending on the specifics of the workload.

-Wade
Jeff Davis
2007-Feb-27 17:59 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
> Given your question, are you about to come back with a case where you
> are not seeing this?

As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.

ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).

I did these tests with each process being a "dd if=bigfile of=/dev/null" started at the same time, and I measured I/O rate with "zpool iostat mypool 2" and "iostat -Md 2".
Frank Hofmann
2007-Feb-27 18:24 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
On Tue, 27 Feb 2007, Jeff Davis wrote:
>> Given your question, are you about to come back with a case where you
>> are not seeing this?
>
> As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.

UFS readahead isn't MT-aware - it starts trashing when multiple threads perform reads of the same blocks. UFS readahead only works if it's a single thread per file, as the readahead state, i_nextr, is per-inode (not per-thread) state. Multiple concurrent readers trash it for each other, as there's only one per file.

> ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).

ZFS caches multiple readahead states - see the leading comment in usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.

FrankH.

> I did these tests with each process being a "dd if=bigfile of=/dev/null" started at the same time, and I measured I/O rate with "zpool iostat mypool 2" and "iostat -Md 2".
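To see why a single per-file marker breaks down, here is a toy model (not the actual UFS code; the struct and field names are invented) of two readers scanning the same blocks in lockstep against one shared "next expected block" field:

    /* readahead_toy.c - toy model of a per-inode "next expected block"
     * marker, in the spirit of UFS i_nextr; not real UFS code.
     */
    #include <stdio.h>

    struct toy_inode {
        long next_expected;     /* one readahead marker for the whole file */
    };

    /* Returns 1 if this access looks sequential (readahead would trigger). */
    static int access_block(struct toy_inode *ip, long blk)
    {
        int sequential = (blk == ip->next_expected);
        ip->next_expected = blk + 1;
        return sequential;
    }

    int main(void)
    {
        struct toy_inode ino = { 0 };
        long nblocks = 1000;
        long seq_hits = 0, total = 0;

        /* Two readers scanning the same file, interleaved block by block. */
        for (long blk = 0; blk < nblocks; blk++) {
            seq_hits += access_block(&ino, blk);   /* reader A */
            total++;
            seq_hits += access_block(&ino, blk);   /* reader B re-reads blk */
            total++;
        }
        printf("%ld of %ld accesses looked sequential\n", seq_hits, total);
        /* With one reader this would be ~100%; with two interleaved
         * readers half the accesses look random, so readahead keeps
         * getting cancelled. */
        return 0;
    }

Keeping more than one readahead state per file, as ZFS does, avoids exactly this collision.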
Jeff Davis
2007-Feb-27 18:40 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
> But you have to be aware that logically sequential reads do not
> necessarily translate into physically sequential reads with zfs.

I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:

http://blogs.sun.com/bonwick/entry/zfs_block_allocation
http://blogs.sun.com/realneel/entry/zfs_and_databases
Frank Cusack
2007-Feb-27 18:42 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
On February 27, 2007 10:40:57 AM -0800 Jeff Davis <jeff95350 at yahoo.com> wrote:
>> But you have to be aware that logically sequential reads do not
>> necessarily translate into physically sequential reads with zfs.
>
> I understand that the COW design can fragment files. I'm still trying to
> understand how that would affect a database. It seems like that may be
> bad for performance on single disks due to the seeking, but I would
> expect that to be less significant when you have many spindles.

I'd expect that as well.

-frank
Roch - PAE
2007-Feb-28 11:17 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Jeff Davis writes:
>> But you have to be aware that logically sequential reads do not
>> necessarily translate into physically sequential reads with zfs.
>
> I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:
>
> http://blogs.sun.com/bonwick/entry/zfs_block_allocation
> http://blogs.sun.com/realneel/entry/zfs_and_databases

Here is my take on this:

DB updates (writes) are mostly governed by the synchronous write code path, which for ZFS means ZIL performance. It's already quite good in that it aggregates multiple updates into few I/Os, and some further improvements are in the works. COW, in general, greatly simplifies the write code path.

DB reads in a transactional workload are mostly random. If the DB is not cacheable, the performance will be that of a head seek no matter what FS is used (since we can't guess in advance where to seek, the COW nature neither helps nor hinders performance).

DB reads in a decision-support workload can benefit from good prefetching (since here we actually do know where the next seeks will be).

-r
Roch - PAE
2007-Feb-28 17:26 UTC
[zfs-discuss] Re: Efficiency when reading the same file blocks
Frank Hofmann writes:
> On Tue, 27 Feb 2007, Jeff Davis wrote:
>>> Given your question, are you about to come back with a case where you
>>> are not seeing this?
>>
>> As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.
>
> UFS readahead isn't MT-aware - it starts trashing when multiple threads
> perform reads of the same blocks. UFS readahead only works if it's a
> single thread per file, as the readahead state, i_nextr, is per-inode
> (not per-thread) state. Multiple concurrent readers trash it for each
> other, as there's only one per file.

To qualify 'trashing': it means UFS loses track of the access pattern, treats the workload as random, and so does not do any readahead.

>> ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).
>
> ZFS caches multiple readahead states - see the leading comment in
> usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.

The vdev_cache is the low-level, device-level prefetch (on an I/O for 8K, read 64K of whatever happens to be under the disk head). dmu_zfetch.c is where the logical prefetching occurs.

-r
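As a rough sketch of that device-level behaviour (simplified and hypothetical, not the real vdev_cache code; the 64K region size is taken from the description above, everything else is invented): a small read is rounded down to an aligned 64K region, the whole region is read and cached, and nearby small reads are then served from memory.

    /* devcache_sketch.c - simplified model of a device-level read cache
     * that inflates small reads to aligned 64K regions.  Illustrative
     * only; not the real ZFS vdev_cache.
     */
    #include <stdio.h>
    #include <string.h>

    #define REGION (64 * 1024)

    struct dev_cache {
        long cached_base;          /* device offset of cached region, -1 if empty */
        char data[REGION];
    };

    /* Stand-in for an actual device read of one aligned 64K region. */
    static void device_read(long base, char *buf)
    {
        memset(buf, 0, REGION);     /* pretend we read from the device */
        printf("device read: offset %ld, length %d\n", base, REGION);
    }

    /* Serve a small read, going to the device only when the region is not cached. */
    static void cached_read(struct dev_cache *dc, long off, long len, char *out)
    {
        long base = (off / REGION) * REGION;        /* align down to 64K */
        if (dc->cached_base != base) {
            device_read(base, dc->data);            /* one big read ... */
            dc->cached_base = base;
        }
        memcpy(out, dc->data + (off - base), len);  /* ... serves many small ones */
    }

    int main(void)
    {
        struct dev_cache dc = { .cached_base = -1 };
        char buf[8 * 1024];

        /* Three 8K reads within the same 64K region: only one device read. */
        cached_read(&dc, 0,     sizeof(buf), buf);
        cached_read(&dc, 8192,  sizeof(buf), buf);
        cached_read(&dc, 16384, sizeof(buf), buf);
        return 0;
    }

The logical, access-pattern-aware prefetching in dmu_zfetch.c sits above this layer.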
Erblichs
2007-Feb-28 23:43 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
ZFS Group,

My two cents...

Currently, in my experience, it is a waste of time to try to guarantee the "exact" location of disk blocks with any FS. A simple exception is bad blocks, where a neighboring block will suffice.

Second, current disk controllers have logic that translates block addresses, so you can't be sure outside of the firmware where a disk block actually is. Yes, I have written code in this area before.

Third, some FSs do a read-modify-write, where the write is NOT overwriting the original location of the read. Why? For a couple of reasons. One is that the original read may have come from a fragment. Some do it for FS consistency: the write may become a partial write in some circumstances (e.g. a crash), and writing to a second block location preserves FS consistency and the ability to recover the original contents. No overwrite. Another reason is that sometimes we are filling a hole within a FS object, from a base address out to a new offset; the ability to concatenate lets us reduce the number of future seeks and small reads/writes, at the cost of a slightly longer transfer time for the larger disk block.

Thus, the tradeoff is that we accept that we waste some FS space, we may not fully optimize the location of the disk block, and individual reads and writes of these larger blocks take longer, but... we seek less, the per-byte overhead is less, we can order our writes so that we again seek less, our writes can be delayed (assuming that we might write multiple times and then commit on close) to minimize the number of actual write operations, we can prioritize our reads over our writes to decrease read latency, etc.

Bottom line: performance may suffer if we do a lot of random small read-modify-writes within FS objects that use a very large disk block. Since the actual change to the file is small, each small write outside of a delayed-write window will consume at least one disk block. Moreover, some writes are to FS objects that are write-through, and thus each small write will consume a new disk block.

Mitchell Erblich
-----------------

Roch - PAE wrote:
> Jeff Davis writes:
>>> But you have to be aware that logically sequential reads do not
>>> necessarily translate into physically sequential reads with zfs.
>>
>> I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:
>>
>> http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>> http://blogs.sun.com/realneel/entry/zfs_and_databases
>
> Here is my take on this:
>
> DB updates (writes) are mostly governed by the synchronous write code
> path, which for ZFS means ZIL performance. It's already quite good in
> that it aggregates multiple updates into few I/Os, and some further
> improvements are in the works. COW, in general, greatly simplifies the
> write code path.
>
> DB reads in a transactional workload are mostly random. If the DB is
> not cacheable, the performance will be that of a head seek no matter
> what FS is used (since we can't guess in advance where to seek, the
> COW nature neither helps nor hinders performance).
>
> DB reads in a decision-support workload can benefit from good
> prefetching (since here we actually do know where the next seeks will
> be).
>
> -r
Toby Thain
2007-Mar-01 00:04 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
On 28-Feb-07, at 6:43 PM, Erblichs wrote:
> ZFS Group,
>
> My two cents...
>
> Currently, in my experience, it is a waste of time to try to
> guarantee the "exact" location of disk blocks with any FS.

? Sounds like you're confusing logical location with physical location throughout this post.

I'm sure Roch meant logical location.

--T

> A simple exception is bad blocks, where a neighboring block
> will suffice.
>
> Second, current disk controllers have logic that translates block
> addresses, so you can't be sure outside of the firmware where a
> disk block actually is. Yes, I have written code in this area before.
>
> Third, some FSs do a read-modify-write, where the write is NOT
> overwriting the original location of the read.
...
Erblichs
2007-Mar-01 00:40 UTC
[zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Toby Thain,

No, "physical location" was for the exact location, and "logical" was for the rest of my info. But what I might not have made clear was the use of fragments. There are two types of fragments: one is the partial use of a logical disk block, and the other, which I was also trying to refer to, is the moving of modified sections of a file. The first was well used in the Joy FFS implementation, back when a FS and drive tended to have a high per-byte cost and were fairly small.

Now, let's make this perfectly clear. If a FS object is large and written "somewhat" in sequence as a stream of bytes, and random logical or physical blocks of it are later modified, the new FS object will be laid out less sequentially, and that CAN decrease read performance. Sorry, I tend to care less about write performance, because writes tend to be async, without threads blocking while waiting for their operation to complete. This will happen MOST as the FS fills and less optimal locations have to be found for the COW blocks. The same problem happens with memory on OSs that support multiple page sizes, where a well-used system may not be able to allocate large pages due to fragmentation. Yes, this is an overloaded term... :)

Thus, FS performance may suffer even if there are just a lot of 1-byte changes to frequently accessed FS objects. If this occurs, either keep a larger FS, clean out the FS more frequently, or back up, clean up, and then restore to get newly sequential FS objects.

Mitchell Erblich
-----------------

Toby Thain wrote:
> On 28-Feb-07, at 6:43 PM, Erblichs wrote:
>> ZFS Group,
>>
>> My two cents...
>>
>> Currently, in my experience, it is a waste of time to try to
>> guarantee the "exact" location of disk blocks with any FS.
>
> ? Sounds like you're confusing logical location with physical
> location throughout this post.
>
> I'm sure Roch meant logical location.
>
> --T
>
>> A simple exception is bad blocks, where a neighboring block
>> will suffice.
>>
>> Second, current disk controllers have logic that translates block
>> addresses, so you can't be sure outside of the firmware where a
>> disk block actually is. Yes, I have written code in this area before.
>>
>> Third, some FSs do a read-modify-write, where the write is NOT
>> overwriting the original location of the read.
> ...