dd if=/dev/urandom of=largefile.txt bs=1G count=8

cp largefile.txt ./test/1.txt &
cp largefile.txt ./test/2.txt &

That's it: the system is now totally unusable after launching the two 8G copies. Until these copies finish, no other application is able to launch completely. Checking prstat shows them to be in the sleep state.

Questions:
<> I'm guessing this is because ZFS doesn't use CFQ, and one process is allowed to queue up all its I/O reads ahead of other processes?
<> Is there a concept of priority among I/O reads? I only ask because if root launches some GUI application, it doesn't start up until both copies are done. So there is no concept of priority? Needless to say, this does not happen on Linux 2.6...
Henrik
http://sparcv9.blogspot.com

On 9 jan 2010, at 04.49, bank kus <kus.bank at gmail.com> wrote:

> dd if=/dev/urandom of=largefile.txt bs=1G count=8
>
> cp largefile.txt ./test/1.txt &
> cp largefile.txt ./test/2.txt &
>
> Thats it now the system is totally unusable after launching the two
> 8G copies. Until these copies finish no other application is able to
> launch completely. Checking prstat shows them to be in the sleep
> state.
>
> Question:
> <> I m guessing this because ZFS doesnt use CFQ and that one process
> is allowed to queue up all its I/O reads ahead of other processes?

What is CFQ, a scheduler? If you are running OpenSolaris, then you do not have CFQ.

> <> Is there a concept of priority among I/O reads? I only ask
> because if root were to launch some GUI application they dont start
> up until both copies are done. So there is no concept of priority?
> Needless to say this does not exist on Linux 2.60...

Probably not, but ZFS only runs in userspace on Linux with FUSE, so it will be quite different.
> Probably not, but ZFS only runs in userspace on Linux with fuse so it
> will be quite different.

I wasn't clear in my description: I'm referring to ext4 on Linux. In fact, on a system with low RAM even the dd command makes the system horribly unresponsive.

IMHO, not having fair-share or timeslicing between different processes issuing reads is frankly unacceptable, given that a lame user can bring the system to a halt with 3 large file copies. Are there ZFS settings or Project Resource Control settings one can use to limit abuse from individual processes?
On Sat, 9 Jan 2010, bank kus wrote:

>> Probably not, but ZFS only runs in userspace on Linux with fuse so it
>> will be quite different.
>
> I wasnt clear in my description, I m referring to ext4 on Linux. In
> fact on a system with low RAM even the dd command makes the system
> horribly unresponsive.
>
> IMHO not having fairshare or timeslicing between different processes
> issuing reads is frankly unacceptable given a lame user can bring
> the system to a halt with 3 large file copies. Are there ZFS
> settings or Project Resource Control settings one can use to limit
> abuse from individual processes?

I am confused. Are you talking about ZFS under OpenSolaris, or are you talking about ZFS under Linux via FUSE?

Do you have compression or deduplication enabled on the zfs filesystem?

What sort of system are you using?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> I am confused. Are you talking about ZFS under OpenSolaris, or are
> you talking about ZFS under Linux via Fuse?

????

> Do you have compression or deduplication enabled on the zfs
> filesystem?

Compression, no. I'm guessing 2009.06 doesn't have dedup.

> What sort of system are you using?

OSOL 2009.06 on an Intel i7 920. The repro steps are at the top of this thread.
>> I wasnt clear in my description, I m referring to ext4 on Linux. In
>> fact on a system with low RAM even the dd command makes the system
>> horribly unresponsive.
>>
>> IMHO not having fairshare or timeslicing between different processes
>> issuing reads is frankly unacceptable given a lame user can bring
>> the system to a halt with 3 large file copies. Are there ZFS
>> settings or Project Resource Control settings one can use to limit
>> abuse from individual processes?
>
> I am confused. Are you talking about ZFS under OpenSolaris, or are
> you talking about ZFS under Linux via Fuse?
>
> Do you have compression or deduplication enabled on
> the zfs filesystem?
>
> What sort of system are you using?

I was able to reproduce the problem running current (mercurial) opensolaris bits, with the "dd" command:

dd if=/dev/urandom of=largefile.txt bs=1048576k count=8

dedup is off, compression is on. The system is a 32-bit laptop with 2GB of memory and a single-core cpu. The system was unusable/unresponsive for about 5 minutes before I was able to interrupt the dd process.
On Jan 9, 2010, at 2:02 PM, bank kus wrote:

>> Probably not, but ZFS only runs in userspace on Linux with fuse so it
>> will be quite different.
>
> I wasnt clear in my description, I m referring to ext4 on Linux. In fact on a system
> with low RAM even the dd command makes the system horribly unresponsive.
>
> IMHO not having fairshare or timeslicing between different processes issuing reads
> is frankly unacceptable given a lame user can bring the system to a halt with 3 large
> file copies. Are there ZFS settings or Project Resource Control settings one can use
> to limit abuse from individual processes?

Are you sure this problem is related to ZFS? I have no problem with multiple threads reading and writing to my pools; it's still responsive. If, however, I put urandom with dd into the mix, I get much more latency.

Does, for example, $(dd if=/dev/urandom of=/dev/null bs=1048576k count=8) give you the same problem? Or what if you use the file you already created from urandom as input to dd?

Regards

Henrik
http://sparcv9.blogspot.com
Hi Henrik

I have 16GB RAM on my system; on a lower-RAM system dd does cause problems, as I mentioned above. My __guess__ is that dd is probably sitting in some in-memory cache, since du -sh doesn't show the full file size until I do a sync.

At this point I'm less looking for QA-type repro questions and/or speculations, and more looking for ZFS design expectations.

What is the expected behaviour: if one thread queues 100 reads and another thread comes along later with 50 reads, are those 50 reads __guaranteed__ to fall behind the first 100, or is timeslicing/fair-sharing done between the two streams?

Btw, this problem is pretty serious: with 3 users on the system, one of them initiating a large copy grinds the other 2 to a halt. Linux doesn't have this problem, and this is almost a switch-O/S moment for us, unfortunately :-(

Regards
banks
Btw, FWIW, if I redo the dd + 2 cp experiment on /tmp the result is far more disastrous. The GUI stops moving, Caps Lock stops responding for long intervals; no clue why.
Hi, it seems you might have some kind of hardware issue there; I have no way of reproducing this.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of bank kus
Sent: 10. tammikuuta 2010 7:21
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] I/O Read starvation

Btw FWIW if I redo the dd + 2 cp experiment on /tmp the result is far more disastrous. The GUI stops moving caps lock stops responding for large intervals no clue why.
What version of Solaris / OpenSolaris are you using? Older versions use mmap(2) for reads in cp(1). Sadly, mmap(2) does not jive well with ZFS. To be sure, you could check how your cp(1) is implemented using truss(1) (i.e. does it do mmap/write or read/write?)

<aside>
I find it interesting that ZFS's mmap(2) deficiencies are now dictating the implementation of utilities which may benefit from mmap(2) on other filesystems. And whilst some might argue that mmap(2) is dead for file I/O, I think it's interesting to note that Linux appears to have a relatively efficient mmap(2) implementation. Sadly, this means that some commercial apps which are mmap(2) heavy currently perform much better on Linux than on Solaris, especially with ZFS. However, I doubt that Linux uses mmap(2) for reads in cp(1).
</aside>

You could also try using dd(1) instead of cp(1). However, it seems to me that you are using bs=1G count=8 as a lazy way to generate 8GB (because you don't want to do the math on smaller blocksizes?). Did you know that you are asking dd(1) to do 1GB read(2) and write(2) system calls using a 1GB buffer? This will cause further pressure on the memory system.

In performance terms, you'll probably find that block sizes beyond 128K add little benefit. So I'd suggest something like:

dd if=/dev/urandom of=largefile.txt bs=128k count=65536

dd if=largefile.txt of=./test/1.txt bs=128k &
dd if=largefile.txt of=./test/2.txt bs=128k &

Phil
http://harmanholistix.com

bank kus wrote:
> dd if=/dev/urandom of=largefile.txt bs=1G count=8
>
> cp largefile.txt ./test/1.txt &
> cp largefile.txt ./test/2.txt &
>
> Thats it now the system is totally unusable after launching the two 8G copies. Until
> these copies finish no other application is able to launch completely. Checking prstat
> shows them to be in the sleep state.
>
> Question:
> <> I m guessing this because ZFS doesnt use CFQ and that one process is allowed to
> queue up all its I/O reads ahead of other processes?
>
> <> Is there a concept of priority among I/O reads? I only ask because if root were to
> launch some GUI application they dont start up until both copies are done. So there is
> no concept of priority? Needless to say this does not exist on Linux 2.60...
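A minimal sketch of the truss(1) check Phil describes, using the same file names as the repro earlier in the thread; the output file name cp.truss is just a placeholder:

truss -o cp.truss cp largefile.txt ./test/1.txt
# ld.so mmaps shared libraries at startup, so ignore the early mmap calls;
# what matters is whether the copy loop after largefile.txt is opened is a
# long run of read()/write() pairs (read-based cp) or mmap()/write() pairs
# (mmap-based cp).
less cp.truss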
Hi Phil

You make some interesting points here:

-> yes, bs=1G was a lazy thing

-> the GNU cp I'm using does __not__ appear to use mmap;
   open64 open64 read write close close is the relevant sequence

-> replacing cp with dd (128K * 64K) does not help; no new apps can be launched until the copies complete.

Regards
banks
Hello again,

On Jan 10, 2010, at 5:39 AM, bank kus wrote:

> Hi Henrik
> I have 16GB Ram on my system on a lesser RAM system dd does cause problems as I
> mentioned above. My __guess__ dd is probably sitting in some in memory cache since
> du -sh doesnt show the full file size until I do a sync.
>
> At this point I m less looking for QA type repro questions and/or speculations rather
> looking for ZFS design expectations.
>
> What is the expected behaviour, if one thread queues 100 reads and another thread
> comes later with 50 reads are these 50 reads __guaranteed__ to fall behind the first
> 100 or is timeslice/fairshre done between two streams?
>
> Btw this problem is pretty serious with 3 users using the system one of them
> initiating a large copy grinds the other 2 to a halt. Linux doesnt have this problem
> and this is almost a switch O/S moment for us unfortunately :-(

Have you reproduced the problem without using /dev/urandom? I can only get this behavior when using dd from urandom, not when using files with cp, and not even files with dd. This could then be related to the random driver spending kernel time in high-priority threads.

So while I agree that this is not optimal, there is a huge difference in how bad it is: if it's urandom-generated, there is no problem with copying files. Since you also found that it's not related to ZFS (also tmpfs, and perhaps only urandom?), we are on the wrong list. Please isolate the problem: if we can put aside any filesystem, then we are on the wrong list; I've added perf-discuss also.

Regards

Henrik
http://sparcv9.blogspot.com
On Sun, 10 Jan 2010, Phil Harman wrote:

> In performance terms, you'll probably find that block sizes beyond 128K add
> little benefit. So I'd suggest something like:
>
> dd if=/dev/urandom of=largefile.txt bs=128k count=65536
>
> dd if=largefile.txt of=./test/1.txt bs=128k &
> dd if=largefile.txt of=./test/2.txt bs=128k &

As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or GNU) does not produce the expected file size when using /dev/urandom as input:

% /bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 65M Jan 10 09:32 largefile.txt
% /opt/sfw/bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
68157440 bytes (68 MB) copied, 1.9741 seconds, 34.5 MB/s
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 65M Jan 10 09:33 largefile.txt
% df -h .
Filesystem                 Size  Used Avail Use% Mounted on
Sun_2540/zfstest/defaults  1.2T   66M  1.2T   1% /Sun_2540/zfstest/defaults

However:

% dd if=/dev/urandom of=largefile.txt bs=1024 count=8388608
8388608+0 records in
8388608+0 records out
8589934592 bytes (8.6 GB) copied, 255.06 seconds, 33.7 MB/s
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 8.0G Jan 10 09:40 largefile.txt
% dd if=/dev/urandom of=largefile.txt bs=8192 count=1048576
0+1048576 records in
0+1048576 records out
1090519040 bytes (1.1 GB) copied, 31.8846 seconds, 34.2 MB/s

It seems that on my system dd + /dev/urandom is willing to read 1K blocks from /dev/urandom, but with even 8K blocks the actual blocksize is getting truncated down (without warning), producing much less data than requested. Testing with /dev/zero produces different results:

% dd if=/dev/zero of=largefile.txt bs=8192 count=1048576
1048576+0 records in
1048576+0 records out
8589934592 bytes (8.6 GB) copied, 20.7434 seconds, 414 MB/s

WTF?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Place a sync call after dd?
Hello Bob,

On Jan 10, 2010, at 4:54 PM, Bob Friesenhahn wrote:

> On Sun, 10 Jan 2010, Phil Harman wrote:
>> In performance terms, you'll probably find that block sizes beyond 128K add little
>> benefit. So I'd suggest something like:
>>
>> dd if=/dev/urandom of=largefile.txt bs=128k count=65536
>>
>> dd if=largefile.txt of=./test/1.txt bs=128k &
>> dd if=largefile.txt of=./test/2.txt bs=128k &
>
> As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or
> GNU) does not produce the expected file size when using /dev/urandom as input:

Do you feel this is related to the filesystem? Is there any difference between putting the data in a file on ZFS or just throwing it away? $(dd if=/dev/urandom of=/dev/null bs=1048576k count=16) gives me a quite unresponsive system too.

Henrik
http://sparcv9.blogspot.com
On Sun, 10 Jan 2010, Henrik Johansson wrote:

>> As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or
>> GNU) does not produce the expected file size when using /dev/urandom as input:
>
> Do you feel this is related to the filesystem, is there any difference between putting
> the data in a file on ZFS or just throwing it away?

My guess is that it is due to the implementation of /dev/urandom. It seems to be blocked-up at 1024 bytes and 'dd' is just using that block size. It is interesting that OpenSolaris is different, and this seems like a bug in Solaris 10. It seems like a new bug to me.

The /dev/random and /dev/urandom devices are rather special since reading from them consumes a precious resource -- entropy. Entropy is created based on other activities of the system, which are expected to be random. Using up all the available entropy could dramatically slow down software which uses /dev/random, such as ssh or ssl. The /dev/random device will completely block when the system runs out of entropy.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
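A rough way to see the difference Bob describes, purely as an illustration; how quickly the second command stalls depends entirely on how much entropy the machine has gathered:

# /dev/urandom never blocks; a large read just burns kernel CPU time:
dd if=/dev/urandom of=/dev/null bs=1k count=1024
# /dev/random blocks once the entropy pool is drained, so even a small
# read like this may stall on an idle machine:
dd if=/dev/random of=/dev/null bs=1k count=64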
On Sun, Jan 10, 2010 at 09:54:56AM -0600, Bob Friesenhahn wrote:

> WTF?

urandom is a character device and is returning short reads (note the 0+n vs n+0 counts). dd is not padding these out to the full blocksize (conv=sync) or making multiple reads to fill blocks (conv=fullblock).

Evidently the urandom device changed behaviour along the way with regard to producing/buffering additional requested data, possibly as a result of a changed source implementation that stretches better/faster.

No bug here, just bad assumptions.

--
Dan.
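Two possible workarounds, sketched with the same file names used earlier in the thread; the first assumes a GNU dd new enough to support iflag=fullblock (check your dd --help), so treat it as an assumption rather than a given:

# GNU dd: retry short reads until each 128K block is actually full
/opt/sfw/bin/dd if=/dev/urandom of=largefile.txt bs=128k count=65536 iflag=fullblock

# Portable fallback: keep bs at or below what urandom returns per read,
# as Bob's bs=1024 run already demonstrated
dd if=/dev/urandom of=largefile.txt bs=1024 count=8388608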
On Jan 8, 2010, at 7:49 PM, bank kus wrote:

> dd if=/dev/urandom of=largefile.txt bs=1G count=8
>
> cp largefile.txt ./test/1.txt &
> cp largefile.txt ./test/2.txt &
>
> Thats it now the system is totally unusable after launching the two 8G copies. Until
> these copies finish no other application is able to launch completely. Checking prstat
> shows them to be in the sleep state.

What disk drivers are you using? IDE?
 -- richard

> Question:
> <> I m guessing this because ZFS doesnt use CFQ and that one process is allowed to
> queue up all its I/O reads ahead of other processes?
>
> <> Is there a concept of priority among I/O reads? I only ask because if root were to
> launch some GUI application they dont start up until both copies are done. So there is
> no concept of priority? Needless to say this does not exist on Linux 2.60...
Hi Banks,

Some basic stats might shed some light, e.g. vmstat 5, mpstat 5, iostat -xnz 5, prstat -Lmc 5 ... all running from just before you start the tests until things are "normal" again.

Memory starvation is certainly a possibility. The ARC can be greedy and slow to release memory under pressure.

Phil

Sent from my iPhone

On 10 Jan 2010, at 13:29, bank kus <kus.bank at gmail.com> wrote:

> Hi Phil
> You make some interesting points here:
>
> -> yes bs=1G was a lazy thing
>
> -> the GNU cp I m using does __not__ appears to use mmap
>    open64 open64 read write close close is the relevant sequence
>
> -> replacing cp with dd 128K * 64K does not help no new apps can be
>    launched until the copies complete.
>
> Regards
> banks
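One way to run all of the collectors Phil lists in a single pass; a sketch only, with made-up output file names, and the kill-by-PID cleanup is just one way to arrange it:

# Start the collectors before the test ...
vmstat 5      > vmstat.out & VM=$!
mpstat 5      > mpstat.out & MP=$!
iostat -xnz 5 > iostat.out & IO=$!
prstat -Lmc 5 > prstat.out & PR=$!

# ... run the dd/cp reproduction here and wait until the system is responsive again ...

# ... then stop the collectors:
kill $VM $MP $IO $PR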
vmstat does show something interesting. The free memory shrinks while doing the first dd (generating the 8G file) from around 10G to 1.5G-ish. The copy operations thereafter don't consume much, and it stays at 1.2G after all operations have completed. (Btw, at the point of system sluggishness there is 1.5G of free RAM, so that shouldn't explain the problem.)

However, I noticed something weird: long after the file operations are done, the free memory doesn't seem to grow back (below). Essentially "ZFS File Data" claims to use 76% of memory long after the file has been written. How does one reclaim it? Is ZFS File Data a pool that, once grown to a size, doesn't shrink back even though its current contents might not be used by any process?

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     234696               916    7%
ZFS File Data             2384657              9315   76%
Anon                       145915               569    5%
Exec and libs                4250                16    0%
Page cache                  28582               111    1%
Free (cachelist)            53147               207    2%
Free (freelist)            290158              1133    9%

Total                     3141405             12271
Physical                  3141404             12271
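For reference, that summary comes from the kernel debugger; a non-interactive way to capture the same output (run as root) is:

echo ::memstat | mdb -k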
On Mon, 11 Jan 2010, bank kus wrote:

> However I noticed something weird, long after the file operations
> are done the free memory doesnt seem to grow back (below).
> Essentially ZFS File Data claims to use 76% of memory long after the
> file has been written. How does one reclaim it back. Is ZFS File
> Data a pool that once grown to a size doesnt shrink back even though
> its current contents might not be used by any process?

It is normal for the ZFS ARC to retain data as long as there is no other memory pressure. This should not cause a problem other than a small delay when starting an application which does need a lot of memory, since the ARC will give memory back to the kernel.

For better interactive use, you can place a cap on the maximum ARC size via an entry in /etc/system:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE

For example, you could set it to half your (8GB) memory so that 4GB is immediately available for other uses.

* Set maximum ZFS ARC size to 4GB
* http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
*
set zfs:zfs_arc_max = 0x100000000

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
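As a side note, an /etc/system change only takes effect after a reboot, and the live ARC size and cap can be checked against the arcstats kstats; a quick sketch:

# "size" is the current ARC footprint, "c_max" is the configured ceiling:
kstat -p zfs:0:arcstats | egrep 'arcstats:(size|c_max)'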
> For example, you could set it to half your (8GB) memory so that 4GB is
> immediately available for other uses.
>
> * Set maximum ZFS ARC size to 4GB

Capping the max sounds like a good idea, thanks.

banks
Hello,

On Jan 11, 2010, at 6:53 PM, bank kus wrote:

>> For example, you could set it to half your (8GB) memory so that 4GB is
>> immediately available for other uses.
>>
>> * Set maximum ZFS ARC size to 4GB
>
> capping max sounds like a good idea.

Are we still trying to solve the starvation problem? I filed a bug on the non-ZFS-related urandom stall problem yesterday, primarily since it can do nasty things from inside a resource-capped zone:

CR 6915579 solaris-cryp/random Large read from /dev/urandom can stall system

Regards

Henrik
http://sparcv9.blogspot.com
> Are we still trying to solve the starvation problem?

I would argue the disk I/O model is fundamentally broken on Solaris if there is no fair I/O scheduling between multiple read sources; until that is fixed, individual I_am_systemstalled_while_doing_xyz problems will keep cropping up. Started a new thread focussing on just this problem:

http://opensolaris.org/jive/thread.jspa?threadID=121479&tstart=0
On Mon, 11 Jan 2010, bank kus wrote:

>> Are we still trying to solve the starvation problem?
>
> I would argue the disk I/O model is fundamentally broken on Solaris
> if there is no fair I/O scheduling between multiple read sources
> until that is fixed individual I_am_systemstalled_while_doing_xyz
> problems will crop up. Started a new thread focussing on just this
> problem.

While I will readily agree that zfs has an I/O read starvation problem (which has been discussed here many times before), I doubt that it is due to the reasons you are thinking of.

A true fair I/O scheduling model would severely hinder overall throughput, in the same way that true real-time task scheduling cripples throughput. ZFS is very much based on its ARC model. ZFS is designed for maximum throughput with minimum disk accesses in server systems. Most reads and writes are to and from its ARC. Systems with sufficient memory hardly ever do a read from disk, so you will only see writes occurring in 'zpool iostat'.

The most common complaint is read stalls while zfs writes its transaction group, but zfs may write this data up to 30 seconds after the application requested the write, and the application might not even be running any more.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Jan 11, 2010, at 2:23 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Mon, 11 Jan 2010, bank kus wrote:
>
>>> Are we still trying to solve the starvation problem?
>>
>> I would argue the disk I/O model is fundamentally broken on Solaris
>> if there is no fair I/O scheduling between multiple read sources
>> until that is fixed individual I_am_systemstalled_while_doing_xyz
>> problems will crop up. Started a new thread focussing on just this
>> problem.
>
> While I will readily agree that zfs has a I/O read starvation
> problem (which has been discussed here many times before), I doubt
> that it is due to the reasons you are thinking.
>
> A true fair I/O scheduling model would severely hinder overall
> throughput in the same way that true real-time task scheduling
> cripples throughput. ZFS is very much based on its ARC model. ZFS
> is designed for maximum throughput with minimum disk accesses in
> server systems. Most reads and writes are to and from its ARC.
> Systems with sufficient memory hardly ever do a read from disk and
> so you will only see writes occuring in 'zpool iostat'.
>
> The most common complaint is read stalls while zfs writes its
> transaction group, but zfs may write this data up to 30 seconds
> after the application requested the write, and the application might
> not even be running any more.

Maybe what is needed is an I/O scheduler like Linux's 'deadline' scheduler, whose only purpose is to reduce the effect of writers starving readers while providing some form of guaranteed latency.

-Ross
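For comparison, on the Linux side the scheduler Ross mentions is selected per block device through sysfs; a sketch, where "sda" is just an example device name:

# Show the available elevators (the active one is shown in brackets) ...
cat /sys/block/sda/queue/scheduler
# ... and switch that device to the deadline elevator (as root):
echo deadline > /sys/block/sda/queue/scheduler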