dd if=/dev/urandom of=largefile.txt bs=1G count=8

cp largefile.txt ./test/1.txt &
cp largefile.txt ./test/2.txt &

That's it: the system is now totally unusable after launching the two 8G copies. Until these copies finish, no other application is able to launch completely. Checking prstat shows them to be in the sleep state.

Questions:
<> I'm guessing this is because ZFS doesn't use CFQ, and one process is allowed to queue up all its I/O reads ahead of other processes?
<> Is there a concept of priority among I/O reads? I only ask because if root launches some GUI application, it doesn't start up until both copies are done. So there is no concept of priority? Needless to say, this does not happen on Linux 2.6...
Henrik
http://sparcv9.blogspot.com

On 9 jan 2010, at 04.49, bank kus <kus.bank at gmail.com> wrote:

> dd if=/dev/urandom of=largefile.txt bs=1G count=8
>
> cp largefile.txt ./test/1.txt &
> cp largefile.txt ./test/2.txt &
>
> Thats it now the system is totally unusable after launching the two
> 8G copies. Until these copies finish no other application is able to
> launch completely. Checking prstat shows them to be in the sleep
> state.
>
> Question:
> <> I m guessing this because ZFS doesnt use CFQ and that one process
> is allowed to queue up all its I/O reads ahead of other processes?

What is CFQ, a scheduler? If you are running OpenSolaris, then you do not have CFQ.

> <> Is there a concept of priority among I/O reads? I only ask
> because if root were to launch some GUI application they dont start
> up until both copies are done. So there is no concept of priority?
> Needless to say this does not exist on Linux 2.60...

Probably not, but ZFS only runs in userspace on Linux with FUSE, so it will be quite different.
> Probably not, but ZFS only runs in userspace on Linux with fuse so it
> will be quite different.

I wasn't clear in my description: I'm referring to ext4 on Linux. In fact, on a system with low RAM even the dd command makes the system horribly unresponsive.

IMHO, not having fair-share or timeslicing between different processes issuing reads is frankly unacceptable, given that a lame user can bring the system to a halt with 3 large file copies. Are there ZFS settings or Project Resource Control settings one can use to limit abuse from individual processes?
On Sat, 9 Jan 2010, bank kus wrote:

>> Probably not, but ZFS only runs in userspace on Linux with fuse so it
>> will be quite different.
>
> I wasnt clear in my description, I m referring to ext4 on Linux. In
> fact on a system with low RAM even the dd command makes the system
> horribly unresponsive.
>
> IMHO not having fairshare or timeslicing between different processes
> issuing reads is frankly unacceptable given a lame user can bring
> the system to a halt with 3 large file copies. Are there ZFS
> settings or Project Resource Control settings one can use to limit
> abuse from individual processes?

I am confused. Are you talking about ZFS under OpenSolaris, or are you talking about ZFS under Linux via FUSE?

Do you have compression or deduplication enabled on the zfs filesystem?

What sort of system are you using?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> I am confused. Are you talking about ZFS under OpenSolaris, or are
> you talking about ZFS under Linux via Fuse?

????

> Do you have compression or deduplication enabled on the zfs
> filesystem?

Compression, no. I'm guessing 2009.06 doesn't have dedup.

> What sort of system are you using?

OSOL 2009.06 on an Intel i7 920. The repro steps are at the top of this thread.
>> I wasnt clear in my description, I m referring to ext4 on Linux. In
>> fact on a system with low RAM even the dd command makes the system
>> horribly unresponsive.
>>
>> IMHO not having fairshare or timeslicing between different processes
>> issuing reads is frankly unacceptable given a lame user can bring
>> the system to a halt with 3 large file copies. Are there ZFS
>> settings or Project Resource Control settings one can use to limit
>> abuse from individual processes?
>
> I am confused. Are you talking about ZFS under OpenSolaris, or are
> you talking about ZFS under Linux via Fuse?
>
> Do you have compression or deduplication enabled on
> the zfs filesystem?
>
> What sort of system are you using?

I was able to reproduce the problem running current (mercurial) opensolaris bits, with the "dd" command:

dd if=/dev/urandom of=largefile.txt bs=1048576k count=8

dedup is off, compression is on. The system is a 32-bit laptop with 2GB of memory and a single-core cpu. The system was unusable/unresponsive for about 5 minutes before I was able to interrupt the dd process.
On Jan 9, 2010, at 2:02 PM, bank kus wrote:

>> Probably not, but ZFS only runs in userspace on Linux with fuse so it
>> will be quite different.
>
> I wasnt clear in my description, I m referring to ext4 on Linux. In fact on a system
> with low RAM even the dd command makes the system horribly unresponsive.
>
> IMHO not having fairshare or timeslicing between different processes issuing reads
> is frankly unacceptable given a lame user can bring the system to a halt with 3 large
> file copies. Are there ZFS settings or Project Resource Control settings one can use
> to limit abuse from individual processes?

Are you sure this problem is related to ZFS? I have no problem with multiple threads reading and writing to my pools; it's still responsive. If, however, I put urandom with dd into the mix, I get much more latency.

Does, for example, $(dd if=/dev/urandom of=/dev/null bs=1048576k count=8) give you the same problem? Or what if you use the file you already created from urandom as input to dd?

Regards

Henrik
http://sparcv9.blogspot.com
Hi Henrik

I have 16GB RAM on my system; on a lower-RAM system dd does cause problems, as I mentioned above. My __guess__ is that dd is probably sitting in some in-memory cache, since du -sh doesn't show the full file size until I do a sync.

At this point I'm less looking for QA-type repro questions and/or speculations, and more looking for ZFS design expectations.

What is the expected behaviour: if one thread queues 100 reads and another thread comes along later with 50 reads, are those 50 reads __guaranteed__ to fall behind the first 100, or is timeslicing/fair-sharing done between the two streams?

Btw, this problem is pretty serious: with 3 users on the system, one of them initiating a large copy grinds the other 2 to a halt. Linux doesn't have this problem, and this is almost a switch-O/S moment for us, unfortunately :-(

Regards
banks
Btw, FWIW, if I redo the dd + 2 cp experiment on /tmp the result is far more disastrous. The GUI stops moving, Caps Lock stops responding for long intervals; no clue why.
Hi, it seems you might have some kind of hardware issue there; I have no way of reproducing this.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of bank kus
Sent: 10. tammikuuta 2010 7:21
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] I/O Read starvation

Btw FWIW if I redo the dd + 2 cp experiment on /tmp the result is far more disastrous. The GUI stops moving caps lock stops responding for large intervals no clue why.
What version of Solaris / OpenSolaris are you using? Older versions use mmap(2) for reads in cp(1). Sadly, mmap(2) does not jive well with ZFS. To be sure, you could check how your cp(1) is implemented using truss(1) (i.e. does it do mmap/write or read/write?)

<aside>
I find it interesting that ZFS's mmap(2) deficiencies are now dictating the implementation of utilities which may benefit from mmap(2) on other filesystems. And whilst some might argue that mmap(2) is dead for file I/O, I think it's interesting to note that Linux appears to have a relatively efficient mmap(2) implementation. Sadly, this means that some commercial apps which are mmap(2) heavy currently perform much better on Linux than on Solaris, especially with ZFS. However, I doubt that Linux uses mmap(2) for reads in cp(1).
</aside>

You could also try using dd(1) instead of cp(1). However, it seems to me that you are using bs=1G count=8 as a lazy way to generate 8GB (because you don't want to do the math on smaller blocksizes?). Did you know that you are asking dd(1) to do 1GB read(2) and write(2) system calls using a 1GB buffer? This will cause further pressure on the memory system.

In performance terms, you'll probably find that block sizes beyond 128K add little benefit. So I'd suggest something like:

dd if=/dev/urandom of=largefile.txt bs=128k count=65536

dd if=largefile.txt of=./test/1.txt bs=128k &
dd if=largefile.txt of=./test/2.txt bs=128k &

Phil
http://harmanholistix.com

bank kus wrote:
> dd if=/dev/urandom of=largefile.txt bs=1G count=8
>
> cp largefile.txt ./test/1.txt &
> cp largefile.txt ./test/2.txt &
>
> Thats it now the system is totally unusable after launching the two 8G copies. Until
> these copies finish no other application is able to launch completely. Checking prstat
> shows them to be in the sleep state.
>
> Question:
> <> I m guessing this because ZFS doesnt use CFQ and that one process is allowed to
> queue up all its I/O reads ahead of other processes?
>
> <> Is there a concept of priority among I/O reads? I only ask because if root were to
> launch some GUI application they dont start up until both copies are done. So there is
> no concept of priority? Needless to say this does not exist on Linux 2.60...
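A minimal sketch of the truss(1) check Phil describes, using the same file names as the repro earlier in the thread; the output file name cp.truss is just a placeholder:

truss -o cp.truss cp largefile.txt ./test/1.txt
# ld.so mmaps shared libraries at startup, so ignore the early mmap calls;
# what matters is whether the copy loop after largefile.txt is opened is a
# long run of read()/write() pairs (read-based cp) or mmap()/write() pairs
# (mmap-based cp).
less cp.truss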
Hi Phil

You make some interesting points here:

-> yes, bs=1G was a lazy thing

-> the GNU cp I'm using does __not__ appear to use mmap;
   open64 open64 read write close close is the relevant sequence

-> replacing cp with dd (128K * 64K) does not help; no new apps can be launched until the copies complete.

Regards
banks
Hello again,

On Jan 10, 2010, at 5:39 AM, bank kus wrote:

> Hi Henrik
> I have 16GB Ram on my system on a lesser RAM system dd does cause problems as I
> mentioned above. My __guess__ dd is probably sitting in some in memory cache since
> du -sh doesnt show the full file size until I do a sync.
>
> At this point I m less looking for QA type repro questions and/or speculations rather
> looking for ZFS design expectations.
>
> What is the expected behaviour, if one thread queues 100 reads and another thread
> comes later with 50 reads are these 50 reads __guaranteed__ to fall behind the first
> 100 or is timeslice/fairshre done between two streams?
>
> Btw this problem is pretty serious with 3 users using the system one of them
> initiating a large copy grinds the other 2 to a halt. Linux doesnt have this problem
> and this is almost a switch O/S moment for us unfortunately :-(

Have you reproduced the problem without using /dev/urandom? I can only get this behavior when using dd from urandom, not when using files with cp, and not even files with dd. This could then be related to the random driver spending kernel time in high-priority threads.

So while I agree that this is not optimal, there is a huge difference in how bad it is: if it's urandom-generated, there is no problem with copying files. Since you also found that it's not related to ZFS (also tmpfs, and perhaps only urandom?), we are on the wrong list. Please isolate the problem: if we can put aside any filesystem, then we are on the wrong list; I've added perf-discuss also.

Regards

Henrik
http://sparcv9.blogspot.com
On Sun, 10 Jan 2010, Phil Harman wrote:

> In performance terms, you'll probably find that block sizes beyond 128K add
> little benefit. So I'd suggest something like:
>
> dd if=/dev/urandom of=largefile.txt bs=128k count=65536
>
> dd if=largefile.txt of=./test/1.txt bs=128k &
> dd if=largefile.txt of=./test/2.txt bs=128k &

As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or GNU) does not produce the expected file size when using /dev/urandom as input:

% /bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 65M Jan 10 09:32 largefile.txt
% /opt/sfw/bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
68157440 bytes (68 MB) copied, 1.9741 seconds, 34.5 MB/s
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 65M Jan 10 09:33 largefile.txt
% df -h .
Filesystem                 Size  Used Avail Use% Mounted on
Sun_2540/zfstest/defaults  1.2T   66M  1.2T   1% /Sun_2540/zfstest/defaults

However:

% dd if=/dev/urandom of=largefile.txt bs=1024 count=8388608
8388608+0 records in
8388608+0 records out
8589934592 bytes (8.6 GB) copied, 255.06 seconds, 33.7 MB/s
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 8.0G Jan 10 09:40 largefile.txt
% dd if=/dev/urandom of=largefile.txt bs=8192 count=1048576
0+1048576 records in
0+1048576 records out
1090519040 bytes (1.1 GB) copied, 31.8846 seconds, 34.2 MB/s

It seems that on my system dd + /dev/urandom is willing to read 1K blocks from /dev/urandom, but with even 8K blocks the actual blocksize is getting truncated down (without warning), producing much less data than requested. Testing with /dev/zero produces different results:

% dd if=/dev/zero of=largefile.txt bs=8192 count=1048576
1048576+0 records in
1048576+0 records out
8589934592 bytes (8.6 GB) copied, 20.7434 seconds, 414 MB/s

WTF?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Place a sync call after dd?
Hello Bob,

On Jan 10, 2010, at 4:54 PM, Bob Friesenhahn wrote:

> On Sun, 10 Jan 2010, Phil Harman wrote:
>> In performance terms, you'll probably find that block sizes beyond 128K add little
>> benefit. So I'd suggest something like:
>>
>> dd if=/dev/urandom of=largefile.txt bs=128k count=65536
>>
>> dd if=largefile.txt of=./test/1.txt bs=128k &
>> dd if=largefile.txt of=./test/2.txt bs=128k &
>
> As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or
> GNU) does not produce the expected file size when using /dev/urandom as input:

Do you feel this is related to the filesystem? Is there any difference between putting the data in a file on ZFS or just throwing it away? $(dd if=/dev/urandom of=/dev/null bs=1048576k count=16) gives me a quite unresponsive system too.

Henrik
http://sparcv9.blogspot.com
On Sun, 10 Jan 2010, Henrik Johansson wrote:

>> As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or
>> GNU) does not produce the expected file size when using /dev/urandom as input:
>
> Do you feel this is related to the filesystem, is there any difference between putting
> the data in a file on ZFS or just throwing it away?

My guess is that it is due to the implementation of /dev/urandom. It seems to be blocked-up at 1024 bytes and 'dd' is just using that block size. It is interesting that OpenSolaris is different, and this seems like a bug in Solaris 10. It seems like a new bug to me.

The /dev/random and /dev/urandom devices are rather special since reading from them consumes a precious resource -- entropy. Entropy is created based on other activities of the system, which are expected to be random. Using up all the available entropy could dramatically slow down software which uses /dev/random, such as ssh or ssl. The /dev/random device will completely block when the system runs out of entropy.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
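A rough way to see the difference Bob describes, purely as an illustration; how quickly the second command stalls depends entirely on how much entropy the machine has gathered:

# /dev/urandom never blocks; a large read just burns kernel CPU time:
dd if=/dev/urandom of=/dev/null bs=1k count=1024
# /dev/random blocks once the entropy pool is drained, so even a small
# read like this may stall on an idle machine:
dd if=/dev/random of=/dev/null bs=1k count=64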
On Sun, Jan 10, 2010 at 09:54:56AM -0600, Bob Friesenhahn wrote:

> WTF?

urandom is a character device and is returning short reads (note the 0+n vs n+0 counts). dd is not padding these out to the full blocksize (conv=sync) or making multiple reads to fill blocks (conv=fullblock).

Evidently the urandom device changed behaviour along the way with regard to producing/buffering additional requested data, possibly as a result of a changed source implementation that stretches better/faster.

No bug here, just bad assumptions.

--
Dan.
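Two possible workarounds, sketched with the same file names used earlier in the thread; the first assumes a GNU dd new enough to support iflag=fullblock (check your dd --help), so treat it as an assumption rather than a given:

# GNU dd: retry short reads until each 128K block is actually full
/opt/sfw/bin/dd if=/dev/urandom of=largefile.txt bs=128k count=65536 iflag=fullblock

# Portable fallback: keep bs at or below what urandom returns per read,
# as Bob's bs=1024 run already demonstrated
dd if=/dev/urandom of=largefile.txt bs=1024 count=8388608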
On Jan 8, 2010, at 7:49 PM, bank kus wrote:

> dd if=/dev/urandom of=largefile.txt bs=1G count=8
>
> cp largefile.txt ./test/1.txt &
> cp largefile.txt ./test/2.txt &
>
> Thats it now the system is totally unusable after launching the two 8G copies. Until
> these copies finish no other application is able to launch completely. Checking prstat
> shows them to be in the sleep state.

What disk drivers are you using? IDE?
 -- richard

> Question:
> <> I m guessing this because ZFS doesnt use CFQ and that one process is allowed to
> queue up all its I/O reads ahead of other processes?
>
> <> Is there a concept of priority among I/O reads? I only ask because if root were to
> launch some GUI application they dont start up until both copies are done. So there is
> no concept of priority? Needless to say this does not exist on Linux 2.60...
Hi Banks,

Some basic stats might shed some light, e.g. vmstat 5, mpstat 5, iostat -xnz 5, prstat -Lmc 5 ... all running from just before you start the tests until things are "normal" again.

Memory starvation is certainly a possibility. The ARC can be greedy and slow to release memory under pressure.

Phil

Sent from my iPhone

On 10 Jan 2010, at 13:29, bank kus <kus.bank at gmail.com> wrote:

> Hi Phil
> You make some interesting points here:
>
> -> yes bs=1G was a lazy thing
>
> -> the GNU cp I m using does __not__ appears to use mmap
>    open64 open64 read write close close is the relevant sequence
>
> -> replacing cp with dd 128K * 64K does not help no new apps can be
>    launched until the copies complete.
>
> Regards
> banks
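One way to run all of the collectors Phil lists in a single pass; a sketch only, with made-up output file names, and the kill-by-PID cleanup is just one way to arrange it:

# Start the collectors before the test ...
vmstat 5      > vmstat.out & VM=$!
mpstat 5      > mpstat.out & MP=$!
iostat -xnz 5 > iostat.out & IO=$!
prstat -Lmc 5 > prstat.out & PR=$!

# ... run the dd/cp reproduction here and wait until the system is responsive again ...

# ... then stop the collectors:
kill $VM $MP $IO $PR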
vmstat does show something interesting. The free memory shrinks while doing the first dd (generating the 8G file) from around 10G to 1.5G-ish. The copy operations thereafter don't consume much, and it stays at 1.2G after all operations have completed. (Btw, at the point of system sluggishness there is 1.5G of free RAM, so that shouldn't explain the problem.)

However, I noticed something weird: long after the file operations are done, the free memory doesn't seem to grow back (below). Essentially "ZFS File Data" claims to use 76% of memory long after the file has been written. How does one reclaim it? Is ZFS File Data a pool that, once grown to a size, doesn't shrink back even though its current contents might not be used by any process?

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     234696               916    7%
ZFS File Data             2384657              9315   76%
Anon                       145915               569    5%
Exec and libs                4250                16    0%
Page cache                  28582               111    1%
Free (cachelist)            53147               207    2%
Free (freelist)            290158              1133    9%

Total                     3141405             12271
Physical                  3141404             12271
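For reference, that summary comes from the kernel debugger; a non-interactive way to capture the same output (run as root) is:

echo ::memstat | mdb -k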
On Mon, 11 Jan 2010, bank kus wrote:

> However I noticed something weird, long after the file operations
> are done the free memory doesnt seem to grow back (below).
> Essentially ZFS File Data claims to use 76% of memory long after the
> file has been written. How does one reclaim it back. Is ZFS File
> Data a pool that once grown to a size doesnt shrink back even though
> its current contents might not be used by any process?

It is normal for the ZFS ARC to retain data as long as there is no other memory pressure. This should not cause a problem other than a small delay when starting an application which does need a lot of memory, since the ARC will give memory back to the kernel.

For better interactive use, you can place a cap on the maximum ARC size via an entry in /etc/system:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE

For example, you could set it to half your (8GB) memory so that 4GB is immediately available for other uses.

* Set maximum ZFS ARC size to 4GB
* http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
*
set zfs:zfs_arc_max = 0x100000000

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
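As a side note, an /etc/system change only takes effect after a reboot, and the live ARC size and cap can be checked against the arcstats kstats; a quick sketch:

# "size" is the current ARC footprint, "c_max" is the configured ceiling:
kstat -p zfs:0:arcstats | egrep 'arcstats:(size|c_max)'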
> For example, you could set it to half your (8GB) memory so that 4GB is
> immediately available for other uses.
>
> * Set maximum ZFS ARC size to 4GB

Capping the max sounds like a good idea, thanks.

banks
Hello,

On Jan 11, 2010, at 6:53 PM, bank kus wrote:

>> For example, you could set it to half your (8GB) memory so that 4GB is
>> immediately available for other uses.
>>
>> * Set maximum ZFS ARC size to 4GB
>
> capping max sounds like a good idea.

Are we still trying to solve the starvation problem? I filed a bug on the non-ZFS-related urandom stall problem yesterday, primarily since it can do nasty things from inside a resource-capped zone:

CR 6915579 solaris-cryp/random Large read from /dev/urandom can stall system

Regards

Henrik
http://sparcv9.blogspot.com
> Are we still trying to solve the starvation problem?

I would argue the disk I/O model is fundamentally broken on Solaris if there is no fair I/O scheduling between multiple read sources; until that is fixed, individual I_am_systemstalled_while_doing_xyz problems will keep cropping up. Started a new thread focussing on just this problem:

http://opensolaris.org/jive/thread.jspa?threadID=121479&tstart=0
On Mon, 11 Jan 2010, bank kus wrote:

>> Are we still trying to solve the starvation problem?
>
> I would argue the disk I/O model is fundamentally broken on Solaris
> if there is no fair I/O scheduling between multiple read sources
> until that is fixed individual I_am_systemstalled_while_doing_xyz
> problems will crop up. Started a new thread focussing on just this
> problem.

While I will readily agree that zfs has an I/O read starvation problem (which has been discussed here many times before), I doubt that it is due to the reasons you are thinking of.

A true fair I/O scheduling model would severely hinder overall throughput, in the same way that true real-time task scheduling cripples throughput. ZFS is very much based on its ARC model. ZFS is designed for maximum throughput with minimum disk accesses in server systems. Most reads and writes are to and from its ARC. Systems with sufficient memory hardly ever do a read from disk, so you will only see writes occurring in 'zpool iostat'.

The most common complaint is read stalls while zfs writes its transaction group, but zfs may write this data up to 30 seconds after the application requested the write, and the application might not even be running any more.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Jan 11, 2010, at 2:23 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Mon, 11 Jan 2010, bank kus wrote:
>
>>> Are we still trying to solve the starvation problem?
>>
>> I would argue the disk I/O model is fundamentally broken on Solaris
>> if there is no fair I/O scheduling between multiple read sources
>> until that is fixed individual I_am_systemstalled_while_doing_xyz
>> problems will crop up. Started a new thread focussing on just this
>> problem.
>
> While I will readily agree that zfs has a I/O read starvation
> problem (which has been discussed here many times before), I doubt
> that it is due to the reasons you are thinking.
>
> A true fair I/O scheduling model would severely hinder overall
> throughput in the same way that true real-time task scheduling
> cripples throughput. ZFS is very much based on its ARC model. ZFS
> is designed for maximum throughput with minimum disk accesses in
> server systems. Most reads and writes are to and from its ARC.
> Systems with sufficient memory hardly ever do a read from disk and
> so you will only see writes occuring in 'zpool iostat'.
>
> The most common complaint is read stalls while zfs writes its
> transaction group, but zfs may write this data up to 30 seconds
> after the application requested the write, and the application might
> not even be running any more.

Maybe what is needed is an I/O scheduler like Linux's 'deadline' scheduler, whose only purpose is to reduce the effect of writers starving readers while providing some form of guaranteed latency.

-Ross
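For comparison, on the Linux side the scheduler Ross mentions is selected per block device through sysfs; a sketch, where "sda" is just an example device name:

# Show the available elevators (the active one is shown in brackets) ...
cat /sys/block/sda/queue/scheduler
# ... and switch that device to the deadline elevator (as root):
echo deadline > /sys/block/sda/queue/scheduler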