I've started porting a video streaming application to OpenSolaris on ZFS, and am hitting some pretty weird performance issues. The thing I'm trying to do is run 77 concurrent video capture processes (roughly 430 Mbit/s in total), all writing into separate files on a 12 TB J4200 storage array. The disks in the array are arranged into a single RAID-0 ZFS volume (though I've tried different RAID levels, none helped). CPU performance is not an issue (barely hitting 35% utilization on a single quad-core X2250). I/O bottlenecks can also be ruled out, since the storage array's sequential write performance is around 600 MB/s.

The problem is the bursty behavior of ZFS writes. All the capture processes do, in essence, is poll() on a socket and then read() and write() any available data from it to a file. The poll() call is done with a timeout of 250 ms, the expectation being that if no data arrives within 0.25 seconds, the input is dead and recording stops (I tried increasing this value, but the problem still arises, although not as frequently). When ZFS decides that it wants to commit a transaction group to disk (every 30 seconds), the system stalls for a short amount of time, and depending on the number of capture processes currently running, the poll() call (which usually blocks for 1-2 ms) takes on the order of hundreds of ms, sometimes even longer.

I figured that I might be able to resolve this by lowering the txg timeout to something like 1-2 seconds (I need ZFS to write as soon as data arrives, since it will likely never be overwritten), but I couldn't find any tunable parameter for it anywhere on the net. On FreeBSD, I think this can be done via the vfs.zfs.txg_timeout sysctl. A glimpse into the source at http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/txg.c on line 40 made me worry that somebody may have hard-coded this value into the kernel, in which case I'd be pretty much screwed on OpenSolaris.

Any help would be greatly appreciated.

Regards,
--
Saso
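(For concreteness, here is a minimal sketch of the capture loop described above. It is illustrative only; the function name, buffer size and error handling are assumptions, not the actual application code.)

    #include <poll.h>
    #include <unistd.h>

    #define IDLE_TIMEOUT_MS 250             /* input considered dead after 0.25 s */
    #define BUFSZ           (128 * 1024)

    /* One capture process: copy whatever arrives on the socket into the file. */
    static int
    capture_loop(int sock_fd, int file_fd)
    {
            char buf[BUFSZ];
            struct pollfd pfd;

            pfd.fd = sock_fd;
            pfd.events = POLLIN;

            for (;;) {
                    int n = poll(&pfd, 1, IDLE_TIMEOUT_MS);
                    if (n == 0)
                            return (0);     /* timeout: input is dead, stop recording */
                    if (n < 0)
                            return (-1);    /* poll() error */
                    ssize_t got = read(sock_fd, buf, sizeof (buf));
                    if (got <= 0)
                            return (got == 0 ? 0 : -1);
                    if (write(file_fd, buf, (size_t)got) != got)
                            return (-1);    /* short write treated as an error here */
            }
    }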
Richard Elling
2009-Dec-25 20:50 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 25, 2009, at 9:57 AM, Saso Kiselkov wrote:

> The problem is the bursty behavior of ZFS writes. All the capture
> processes do, in essence, is poll() on a socket and then read() and
> write() any available data from it to a file.

There have been some changes recently, including one in b130 that might apply to this workload. What version of the OS are you running? If not b130, try b130.
-- richard
Hi there,

Try:

    zfs set logbias=throughput <yourdataset>

Good luck,
LK
-- This message posted from opensolaris.org
Hi,

I'm not sure what "b130" means; I'm fairly new to OpenSolaris. How do I find out? As for the OS version, it is OpenSolaris 2009.06.

Regards,
--
Saso

Richard Elling wrote:
> There have been some changes recently, including one in b130 that
> might apply to this workload. What version of the OS are you running?
> If not b130, try b130.
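(For what it's worth, the build number can usually be read from the kernel version string; on a stock 2009.06 install, uname presumably reports something like snv_111b, i.e. build 111b.)

    $ uname -v
    snv_111b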
Hi,

I tried it and I got the following error message:

# zfs set logbias=throughput content
cannot set property for 'content': invalid property 'logbias'

Is it because I'm running some older version which does not have this feature? (2009.06)

Regards,
--
Saso

Leonid Kogan wrote:
> Hi there,
> Try:
> zfs set logbias=throughput <yourdataset>
On Fri, Dec 25, 2009 at 11:57 AM, Saso Kiselkov <skiselkov at gmail.com> wrote:

> The disks in the array are arranged into a single RAID-0 ZFS volume
> [...] I/O bottlenecks can also be ruled out, since the storage array's
> sequential write performance is around 600MB/s.

Hang on... if you've got 77 concurrent threads going, I don't see how that's a "sequential" I/O load. To the backend storage it's going to look like the equivalent of random I/O. I'd also be surprised to see 12 1TB disks supporting 600MB/sec throughput and would be interested in hearing where you got those numbers from.

Is your video capture doing 430MB or 430Mbit?

--
--Tim
On Fri, Dec 25, 2009 at 7:47 PM, Tim Cook <tim at cook.ms> wrote:
>
> Hang on... if you've got 77 concurrent threads going, I don't see how
> that's a "sequential" I/O load. To the backend storage it's going to
> look like the equivalent of random I/O. I'd also be surprised to see
> 12 1TB disks supporting 600MB/sec throughput and would be interested
> in hearing where you got those numbers from.
>
> Is your video capture doing 430MB or 430Mbit?

Think he said 430Mbit/sec, which, if these are security cameras, would be a good-sized installation (30+ cameras). We have a similar system, albeit running on Windows. Writing about 400Mbit/sec using just six 1TB SATA drives is entirely possible, and it is working quite well on our system without any frame loss or much latency.

The write lag is noticeable with ZFS, however, because of the way transaction group writes behave. If you have a big write that needs to land on disk, it seems all other I/O, CPU and "niceness" is thrown out the window in favor of getting all that data on disk.
I was on a watch list for a ZFS I/O scheduler bug with my paid Solaris support; I'll try to find that bug number, but I believe some improvements were made in builds 129 and 130.

--
Brent Jones
brent at servuhome.net
On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones <brent at servuhome.net> wrote:
>
> Think he said 430Mbit/sec, which, if these are security cameras, would
> be a good-sized installation (30+ cameras). We have a similar system,
> albeit running on Windows. Writing about 400Mbit/sec using just six
> 1TB SATA drives is entirely possible, and it is working quite well on
> our system without any frame loss or much latency.

Once again, Mb or MB? They're two completely different numbers. As for getting 400Mbit out of six SATA drives, that's not really impressive at all. If you're saying you got 400MB, that's a different story entirely, and while possible with sequential I/O and a proper raid setup, it isn't happening with random.

--
--Tim
Try b130.
http://genunix.org/

Cheers,
LK

On 12/26/2009 12:59 AM, Saso Kiselkov wrote:
> I tried it and I got the following error message:
>
> # zfs set logbias=throughput content
> cannot set property for 'content': invalid property 'logbias'
>
> Is it because I'm running some older version which does not have this
> feature? (2009.06)
The application I'm working on is a kind of large-scale network-PVR system for our IPTV services. It records all running TV channels in an X-hour carousel (typically 24 or 48 hours), retaining only those bits which users have marked as being interesting to them. The current setup I'm doing development on is a small 12TB array; future deployment is planned on several 96TB X4540 machines.

I agree that I kind of misused the term `sequential' - it really is 77 concurrent sequential writes. However, as I explained, I/O is not the bottleneck here, as the array is capable of writes around 600MBytes/s, and the write load I'm putting on it is around 55MBytes/s (430Mbit/s). The problem is, as Brent explained, that as soon as the OS decides it wants to write the transaction group to disk, it totally ignores all other time-critical activity in the system and focuses on just that, causing an input poll() stall on all network sockets. What I'd need to do is force it to commit transactions to disk more often, so as to even the load out over a longer period of time and bring the CPU usage spikes down to a more manageable and predictable level.

Regards,
--
Saso

Tim Cook wrote:
> Once again, Mb or MB? They're two completely different numbers. As for
> getting 400Mbit out of six SATA drives, that's not really impressive
> at all. If you're saying you got 400MB, that's a different story
> entirely, and while possible with sequential I/O and a proper raid
> setup, it isn't happening with random.
Would an upgrade to the development repository of 2010.02 do the same? I'd like to avoid having to do a complete reinstall, since I've got quite a bit of custom software in the system already in various places, and recompiling and fine-tuning would take me another 1-2 days.

Regards,
--
Saso

Leonid Kogan wrote:
> Try b130.
> http://genunix.org/
On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook <tim at cook.ms> wrote:
>
> Once again, Mb or MB? They're two completely different numbers. As for
> getting 400Mbit out of six SATA drives, that's not really impressive
> at all. If you're saying you got 400MB, that's a different story
> entirely, and while possible with sequential I/O and a proper raid
> setup, it isn't happening with random.

Mb, megabit. 400 megabit is not terribly high; a single SATA drive could write that 24/7 without breaking a sweat. Which is why he is reporting his issue.

Sequential or random, any modern system should be able to perform that task without causing disruption to other processes running on the system (if Windows can, Solaris/ZFS most definitely should be able to).

I have a similar workload on my X4540s, streaming backups from multiple systems at a time. These are very high-end machines: dual quad-core Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

The "write stalls" have been a significant problem since ZFS came out, and they haven't really been addressed in an acceptable fashion yet, though work has been done to improve it.

I'm still trying to find the case number I have open with Sunsolve or whatever; it was for exactly this issue, and I believe the fix was to add dozens more "classes" to the scheduler, to allow more fair disk I/O and overall "niceness" on the system when ZFS commits a transaction group.

--
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> The "write stalls" have been a significant problem since ZFS came out,
> and they haven't really been addressed in an acceptable fashion yet,
> though work has been done to improve it.
>
> I'm still trying to find the case number I have open with Sunsolve or
> whatever; it was for exactly this issue, and I believe the fix was to
> add dozens more "classes" to the scheduler, to allow more fair disk
> I/O and overall "niceness" on the system when ZFS commits a
> transaction group.

Wow, if there were a production-release solution to the problem, that would be great! Reading the mailing list, I had almost given up hope that I'd be able to work around this issue without upgrading to the latest bleeding-edge development version.

Regards,
--
Saso
Fajar A. Nugraha
2009-Dec-26 10:30 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sat, Dec 26, 2009 at 4:10 PM, Saso Kiselkov <skiselkov at gmail.com> wrote:
>> I'm still trying to find the case number I have open with Sunsolve or
>> whatever; it was for exactly this issue, and I believe the fix was to
>> add dozens more "classes" to the scheduler, to allow more fair disk
>> I/O and overall "niceness" on the system when ZFS commits a
>> transaction group.
>
> Wow, if there were a production-release solution to the problem, that
> would be great!

Have you checked this thread?
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg28704.html

> Reading the mailing list, I had almost given up hope that I'd be able
> to work around this issue without upgrading to the latest
> bleeding-edge development version.

Isn't OpenSolaris already bleeding edge?

--
Fajar
Thank you, the post you mentioned helped me move a bit forward. I tried putting:

zfs:zfs_txg_timeout = 1

in /etc/system, and now I'm getting a much more even write load (a burst every 5 seconds), which no longer causes any significant poll() stalling. So far I have failed to find the timer in the ZFS source code which causes the 5-second timeout instead of what I asked for (1 second).

Another thing that's left on my mind is why I'm still getting a very slight burst every 60 seconds (causing a poll() delay of around 20-30ms, instead of the usual 0-2ms). It's not that big a problem; it's just that I'm curious as to where it's coming from. I assume some 60-second timer is firing, but I don't know where.

Regards,
--
Saso

Fajar A. Nugraha wrote:
> Have you checked this thread?
> http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg28704.html
>
> Isn't OpenSolaris already bleeding edge?
Bob Friesenhahn
2009-Dec-26 15:53 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Fri, 25 Dec 2009, Saso Kiselkov wrote:

> sometimes even longer. I figured that I might be able to resolve this
> by lowering the txg timeout to something like 1-2 seconds (I need ZFS
> to write as soon as data arrives, since it will likely never be
> overwritten), but I couldn't find any tunable parameter for it
> anywhere on the net.

While there are some useful tunable parameters, another approach is to consider requesting a synchronous write using fdatasync(3RT) or fsync(3C) immediately after the final write() request in one of your poll() time quanta. This will cause the data to be written immediately. System behavior will then seem totally different. Unfortunately, it will also be less efficient.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
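(A minimal sketch of that suggestion, assuming the illustrative capture loop sketched earlier in the thread; the function name is made up.)

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Call this after the last write() of a poll() quantum to push the
     * dirty data to stable storage right away instead of waiting for
     * the next txg sync.  Smoother writes, but less efficient overall.
     */
    static int
    flush_quantum(int file_fd)
    {
            if (fdatasync(file_fd) != 0) {  /* fsync(file_fd) would also do */
                    perror("fdatasync");
                    return (-1);
            }
            return (0);
    }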
On 12/26/09 09:53, Brent Jones wrote:
> The "write stalls" have been a significant problem since ZFS came out,
> and they haven't really been addressed in an acceptable fashion yet,
> though work has been done to improve it.
>
> I'm still trying to find the case number I have open with Sunsolve or
> whatever; it was for exactly this issue, and I believe the fix was to
> add dozens more "classes" to the scheduler, to allow more fair disk
> I/O and overall "niceness" on the system when ZFS commits a
> transaction group.

That would be the new System Duty Cycle Scheduling Class that was putback in build 129:

Author: Jonathan Adams <Jonathan.Adams at Sun.COM>
Repository: /export/onnv-gate
Total changesets: 1
Changeset: 87f3734e64df
Comments:
6881015 ZFS write activity prevents other threads from running in a timely manner
6899867 mstate_thread_onproc_time() doesn't account for runnable time correctly
PSARC/2009/615 System Duty Cycle Scheduling Class and ZFS IO Observability

See http://arc.opensolaris.org/caselog/PSARC/2009/615/ for more information.

If you're using the "dev" repository, you can pkg image-update to get this new functionality.

Cheers,
Menno
--
Menno Lageman - Sun Microsystems - http://blogs.sun.com/menno
Richard Elling
2009-Dec-26 16:36 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 26, 2009, at 1:10 AM, Saso Kiselkov wrote:

> Brent Jones wrote:
>> The "write stalls" have been a significant problem since ZFS came
>> out, and they haven't really been addressed in an acceptable fashion
>> yet, though work has been done to improve it.

PSARC case 2009/615: System Duty Cycle Scheduling Class and ZFS IO Observability was integrated into b129. This creates a scheduling class for ZFS IO and automatically places the zio threads into that class. This is not really an earth-shattering change; Solaris has had a very flexible scheduler for almost 20 years now. Another example is that on a desktop, the application which has mouse focus runs in the interactive scheduling class. This is completely transparent to most folks and there is no tweaking required.

Also fixed in b129 is BUG/RFE 6881015: ZFS write activity prevents other threads from running in a timely manner, which is related to the above.

>> I'm still trying to find the case number I have open with Sunsolve or
>> whatever; it was for exactly this issue, and I believe the fix was to
>> add dozens more "classes" to the scheduler, to allow more fair disk
>> I/O and overall "niceness" on the system when ZFS commits a
>> transaction group.
>
> Wow, if there were a production-release solution to the problem, that
> would be great! Reading the mailing list, I had almost given up hope
> that I'd be able to work around this issue without upgrading to the
> latest bleeding-edge development version.

Changes have to occur someplace first.
In the OpenSolaris world, the changes occur first in the dev train and are then backported to Solaris 10 (sometimes, not always). You should try the latest build first -- be sure to follow the release notes. Then, if the problem persists, you might consider tuning zfs_txg_timeout, which can be done on a live system.
-- richard
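(As a sanity check after upgrading, the list of configured scheduling classes can be inspected with priocntl; the assumption here is that on b129 and later the new SDC (system duty-cycle) class shows up in that list.)

    # priocntl -l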
Thanks for the advice. I did an in-place upgrade to the latest development b130 release, and it seems that the change in scheduling classes for the kernel writer threads worked (without even having to fiddle around with logbias) - now I'm just getting small delays every 60 seconds (on the order of 20-30ms). I'm not sure these have anything to do with ZFS, though... they happen outside of the write bursts.

Thank you all for the valuable advice!

Regards,
--
Saso

Richard Elling wrote:
> You should try the latest build first -- be sure to follow the release
> notes. Then, if the problem persists, you might consider tuning
> zfs_txg_timeout, which can be done on a live system.
On 12/26/2009 10:41 AM, Saso Kiselkov wrote:
> Would an upgrade to the development repository of 2010.02 do the same?
> I'd like to avoid having to do a complete reinstall, since I've got
> quite a bit of custom software in the system already in various
> places, and recompiling and fine-tuning would take me another 1-2
> days.

AFAIK yes.

LK
Robert Milkowski
2009-Dec-27 14:22 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On 26/12/2009 12:22, Saso Kiselkov wrote:
> Thank you, the post you mentioned helped me move a bit forward. I
> tried putting:
>
> zfs:zfs_txg_timeout = 1

BTW: you can tune it on a live system without needing to reboot.

milek at r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout:
zfs_txg_timeout:                30
milek at r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
zfs_txg_timeout:                0x1e            =       0x1
milek at r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout:
zfs_txg_timeout:                1
milek at r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
zfs_txg_timeout:                0x1             =       0x1e
milek at r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout:
zfs_txg_timeout:                30
milek at r600:~#

--
Robert Milkowski
http://milek.blogspot.com
Thanks for the mdb syntax - I wasn't sure how to set it using mdb at runtime, which is why I used /etc/system. I was quite intrigued to find out that the Solaris kernel was in fact designed to be tuned at runtime using a generic debugging mechanism, rather than, like other traditional kernels, through a defined kernel settings interface (sysctl comes to mind).

Anyway, upgrading to b130 helped my issue, and I hope that by the time we start selling this product, OpenSolaris 2010.02 will be out, so that I can tell people to just grab the latest stable OpenSolaris release, rather than having to go to a development branch or tune kernel parameters to even get the software working as it should.

Regards,
--
Saso

Robert Milkowski wrote:
> BTW: you can tune it on a live system without needing to reboot.
>
> milek at r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
> zfs_txg_timeout:                0x1e            =       0x1
Roch Bourbonnais
2009-Dec-27 19:38 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 26, 2009, at 04:47, Tim Cook wrote:

> Hang on... if you've got 77 concurrent threads going, I don't see how
> that's a "sequential" I/O load. To the backend storage it's going to
> look like the equivalent of random I/O.

I see this posted once in a while and I'm not sure where it comes from. Sequential workloads are important inasmuch as the FS/VM can detect them and issue large requests to disk (followed by cache hits) instead of multiple small ones. The detection in ZFS is done at the file level, so the fact that one has N concurrent streams going is not relevant.

On writes, ZFS and the copy-on-write model make the sequential/random distinction not very meaningful: all writes target free blocks.

-r
On Sun, Dec 27, 2009 at 1:38 PM, Roch Bourbonnais <Roch.Bourbonnais at sun.com> wrote:
>
> I see this posted once in a while and I'm not sure where it comes
> from. Sequential workloads are important inasmuch as the FS/VM can
> detect them and issue large requests to disk (followed by cache hits)
> instead of multiple small ones. The detection in ZFS is done at the
> file level, so the fact that one has N concurrent streams going is not
> relevant.
>
> On writes, ZFS and the copy-on-write model make the sequential/random
> distinction not very meaningful: all writes target free blocks.

That is ONLY true when there's significant free space available/a fresh pool. Once those files have been deleted and the blocks put back into the free pool, they're no longer "sequential" on disk; they're all over the disk. So it makes a VERY big difference. I'm not sure why you'd be shocked someone would bring this up.

--
--Tim
Bob Friesenhahn
2009-Dec-28 00:43 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sun, 27 Dec 2009, Tim Cook wrote:
>
> That is ONLY true when there's significant free space available/a
> fresh pool. Once those files have been deleted and the blocks put back
> into the free pool, they're no longer "sequential" on disk; they're
> all over the disk. So it makes a VERY big difference. I'm not sure why
> you'd be shocked someone would bring this up.

While I don't know what zfs actually does, I do know that it performs large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks from those allocations. If the zfs designers are wise, then they will use knowledge of sequential access to ensure that all of the 128K blocks from a metaslab allocation are pre-assigned for use by that file, and they will try to choose metaslabs which are followed by free metaslabs, or close to other free metaslabs. This approach would tend to limit the sequential-access damage caused by COW and free-block fragmentation on a "dirty" disk.

This sort of planning is not terribly different from detecting sequential read I/O and scheduling data reads in advance of application requirements. If you can intelligently pre-fetch data blocks, then you can certainly intelligently pre-allocate data blocks.

Today I did an interesting (to me) test where I ran two copies of iozone at once on huge (up to 64GB) files. The results were somewhat amazing to me, because the reported data rates from iozone did not drop very much (e.g. a single-process write rate of 359MB/second dropped to 298MB/second with two processes). This clearly showed that zfs does quite a lot of smart things when writing files and that it is optimized for several/many writers rather than just one.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sun, Dec 27, 2009 at 6:43 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>
> While I don't know what zfs actually does, I do know that it performs
> large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks
> from those allocations. If the zfs designers are wise, then they will
> use knowledge of sequential access to ensure that all of the 128K
> blocks from a metaslab allocation are pre-assigned for use by that
> file, and they will try to choose metaslabs which are followed by free
> metaslabs, or close to other free metaslabs. This approach would tend
> to limit the sequential-access damage caused by COW and free-block
> fragmentation on a "dirty" disk.

How is that going to prevent blocks being spread all over the disk when you've got files several GB in size being written concurrently and deleted at random? And then throw in a mix of small files as well, and kiss that goodbye.

> This sort of planning is not terribly different from detecting
> sequential read I/O and scheduling data reads in advance of
> application requirements. If you can intelligently pre-fetch data
> blocks, then you can certainly intelligently pre-allocate data blocks.

Pre-allocating data blocks is also not going to cure head seeks and the latency they induce on slow 7200/5400RPM drives.

> Today I did an interesting (to me) test where I ran two copies of
> iozone at once on huge (up to 64GB) files. The results were somewhat
> amazing to me, because the reported data rates from iozone did not
> drop very much (e.g. a single-process write rate of 359MB/second
> dropped to 298MB/second with two processes). This clearly showed that
> zfs does quite a lot of smart things when writing files and that it is
> optimized for several/many writers rather than just one.

On a new, empty pool, or a pool that's been filled completely and emptied several times? It's not amazing to me on a new pool. I would be surprised to see you accomplish this feat repeatedly after filling and emptying the drives. It's a drawback of every implementation of copy-on-write I've ever seen. By its very nature, I have no idea how you would avoid it.

--
--Tim
Bob Friesenhahn
2009-Dec-28 02:40 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sun, 27 Dec 2009, Tim Cook wrote:
> How is that going to prevent blocks being spread all over the disk when
> you've got files several GB in size being written concurrently and deleted
> at random? And then throw in a mix of small files as well, and kiss that
> goodbye.

There would certainly be blocks spread all over the disk, but a (possible) seek every 1MB of data is not too bad (not considering metadata seeks). If the pool is allowed to get very full, then optimizations based on pre-allocated space stop working.

> Pre-allocating data blocks is also not going to cure head seek and the
> latency it induces on slow 7200/5400RPM drives.

But if the next seek to a data block is on a different drive, that drive can be seeking for the next block while the current block is already being read.

> On a new, empty pool, or a pool that's been filled completely and emptied
> several times? It's not amazing to me on a new pool. I would be surprised
> to see you accomplish this feat repeatedly after filling and emptying the
> drives. It's a drawback of every implementation of copy-on-write I've ever
> seen. By its very nature, I have no idea how you would avoid it.

This is a 2-year-old pool which has typically been filled (to about 80%) and "emptied" (reduced to 25%) many times. However, when it is "emptied", all of the new files get removed, since the extra space is used for testing. I have only seen this pool get faster over time.

For example, when the pool was first created, iozone measured a single-thread large-file (64GB) write rate of only 148MB/second, but now it is up to 380MB/second with the same hardware. The performance improvement is due to improvements to Solaris 10 software and array (STK2540) firmware.

Original vs current:

              KB  reclen   write  rewrite    read  reread
        67108864     256  148995   165041  463519  453896
        67108864     256  380286   377397  551060  550414

Here is an ancient blog entry where Jeff Bonwick discusses ZFS block allocation:

http://blogs.sun.com/bonwick/entry/zfs_block_allocation

and a somewhat newer one where Jeff describes space maps:

http://blogs.sun.com/bonwick/entry/space_maps

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
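For reference, a two-writer run of the kind described above can be approximated with something like the following (file size and record length chosen to match the table; the pool path and file names are placeholders):

  # iozone -s 64g -r 256k -i 0 -i 1 -f /pool/iozone.1 &
  # iozone -s 64g -r 256k -i 0 -i 1 -f /pool/iozone.2 &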
On Sun, Dec 27, 2009 at 8:40 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> There would certainly be blocks spread all over the disk, but a (possible)
> seek every 1MB of data is not too bad (not considering metadata seeks). If
> the pool is allowed to get very full, then optimizations based on
> pre-allocated space stop working.

I guess it depends entirely on the space map :)

> But if the next seek to a data block is on a different drive, that drive
> can be seeking for the next block while the current block is already being
> read.

Well of course. The argument of "if you just throw more disks at the problem" will be valid in almost all situations. Expecting to get the same performance out of drives when they're full and used as when they're empty and new is, in my experience, crazy. My point from the start was that you will see a significant performance decrease as time passes and fragmentation sets in.

> This is a 2-year-old pool which has typically been filled (to about 80%)
> and "emptied" (reduced to 25%) many times. [...] For example, when the
> pool was first created, iozone measured a single-thread large-file (64GB)
> write rate of only 148MB/second, but now it is up to 380MB/second with the
> same hardware. The performance improvement is due to improvements to
> Solaris 10 software and array (STK2540) firmware.

C'mon, saying "all I did was change code and firmware" isn't a valid comparison at all. Ignoring that, I'm still referring to multiple streams which create random I/O to the backend disk.

--Tim
I progressed with testing a bit further and found that I was hitting another scheduling bottleneck - the network. While the write burst was running and ZFS was committing data to disk, the server was dropping incoming UDP packets ("netstat -s | grep udpInOverflows" grew by about 1000-2000 packets during every write burst).

To work around that I had to boost the scheduling priority of the recorder processes to the real-time class, and I also had to lower zfs_txg_timeout=1 (there was still minor packet drop after just doing priocntl on the processes) to even out the CPU load.

Any ideas on why ZFS should completely thrash the network layer and make it drop incoming packets?

Regards,
--
Saso

Robert Milkowski wrote:
> On 26/12/2009 12:22, Saso Kiselkov wrote:
>> Thank you, the post you mentioned helped me move a bit forward. I tried
>> putting:
>>
>> zfs:zfs_txg_timeout = 1
>
> btw: you can tune it on a live system without a need to do reboots.
>
> milek at r600:~# echo zfs_txg_timeout/D | mdb -k
> zfs_txg_timeout:
> zfs_txg_timeout:30
> milek at r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
> zfs_txg_timeout:0x1e = 0x1
> milek at r600:~# echo zfs_txg_timeout/D | mdb -k
> zfs_txg_timeout:
> zfs_txg_timeout:1
> milek at r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
> zfs_txg_timeout:0x1 = 0x1e
> milek at r600:~# echo zfs_txg_timeout/D | mdb -k
> zfs_txg_timeout:
> zfs_txg_timeout:30
> milek at r600:~#
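For reference, a minimal sketch of the priority boost described above; the process name "capture" is only a placeholder for whatever the recorder binaries are called, and moving processes into the RT class requires root (or the proc_priocntl privilege):

  # priocntl -s -c RT -i pid `pgrep capture`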
Hi, try to add a flow for the traffic you want to get prioritized. I noticed that opensolaris tends to drop network connectivity without priority flows defined; I believe this is a feature presented by crossbow itself. flowadm is your friend, that is.

I found this particularly annoying if you monitor servers with icmp-ping and high load causes checks to fail, therefore triggering unnecessary alarms.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Saso Kiselkov
Sent: 28. joulukuuta 2009 15:25
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] ZFS write bursts cause short app stalls

I progressed with testing a bit further and found that I was hitting another scheduling bottleneck - the network. While the write burst was running and ZFS was committing data to disk, the server was dropping incoming UDP packets ("netstat -s | grep udpInOverflows" grew by about 1000-2000 packets during every write burst). [...]

Any ideas on why ZFS should completely thrash the network layer and make it drop incoming packets?

Regards,
--
Saso
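A sketch of the kind of flow Markus suggests, using the link and flow names that appear later in this thread and the multicast range as the local address; treat the exact attributes as an example rather than a prescription:

  # flowadm add-flow -l e1000g1 -a local_ip=224.0.0.0/4 iptv
  # flowadm set-flowprop -p priority=high iptv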
Thank you for the advice. After trying flowadm the situation improved somewhat, but I'm still getting occasional packet overflows (10-100 packets about every 10-15 minutes). This is somewhat unnerving, because I don't know how to track it down. I need all IP multicast input traffic on e1000g1 to get the highest possible priority.

Here are the flowadm settings I use:

# flowadm show-flow iptv
FLOW        LINK        IPADDR            PROTO  LPORT   RPORT   DSFLD
iptv        e1000g1     LCL:224.0.0.0/4   --     --      --      --

# flowadm show-flowprop iptv
FLOW        PROPERTY    VALUE           DEFAULT        POSSIBLE
iptv        maxbw       --              --             ?
iptv        priority    high            --             high

I also tuned udp_max_buf to 256MB. All recording processes are boosted to the RT priority class and zfs_txg_timeout=1 to force the system to commit data to disk in smaller and more manageable chunks. Is there any further tuning you could recommend?

Regards,
--
Saso

Markus Kovero wrote:
> Hi, try to add a flow for the traffic you want to get prioritized. I
> noticed that opensolaris tends to drop network connectivity without
> priority flows defined; I believe this is a feature presented by crossbow
> itself. flowadm is your friend, that is.
> [...]
>
> Yours
> Markus Kovero
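For completeness, udp_max_buf on this vintage of OpenSolaris is normally raised with ndd; the value below corresponds to the 256MB mentioned above and is shown as an example rather than a recommendation:

  # ndd -set /dev/udp udp_max_buf 268435456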
Bob Friesenhahn
2009-Dec-28 16:07 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Sun, 27 Dec 2009, Tim Cook wrote:
>
> C'mon, saying "all I did was change code and firmware" isn't a valid
> comparison at all. Ignoring that, I'm still referring to multiple
> streams which create random I/O to the backend disk.

I do agree with you that this is a problematic scenario. The issue is with how fast the data arrives. If the data is written quickly, then quite a lot of data will be written in each transaction group and zfs can usefully optimize that transaction group. If the data trickles in, then it is a difficult problem for any general-purpose filesystem to solve.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Roch Bourbonnais
2009-Dec-28 17:19 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Dec 28, 2009, at 00:59, Tim Cook wrote:
>
> On Sun, Dec 27, 2009 at 1:38 PM, Roch Bourbonnais <Roch.Bourbonnais at sun.com> wrote:
>
>> On Dec 26, 2009, at 04:47, Tim Cook wrote:
>>
>>> On Fri, Dec 25, 2009 at 11:57 AM, Saso Kiselkov <skiselkov at gmail.com> wrote:
>>>> [original problem statement snipped]
>>>
>>> Hang on... if you've got 77 concurrent threads going, I don't see
>>> how that's a "sequential" I/O load. To the backend storage it's
>>> going to look like the equivalent of random I/O.
>>
>> I see this posted once in a while and I'm not sure where that comes
>> from. Sequential workloads are important inasmuch as the FS/VM can
>> detect and issue large requests to disk (followed by cache hits)
>> instead of multiple small ones. The detection for ZFS is done at
>> the file level, and so the fact that one has N concurrent streams
>> going is not relevant.
>> On writes, ZFS and the Copy-On-Write model make the sequential/random
>> distinction not very defining. All writes target free blocks.
>>
>> -r
>
> That is ONLY true when there's significant free space available/a
> fresh pool. Once those files have been deleted and the blocks put
> back into the free pool, they're no longer "sequential" on disk,
> they're all over the disk. So it makes a VERY big difference. I'm
> not sure why you'd be shocked someone would bring this up.
So on writes, the performance is defined more by the availability of sequential blocks than by the application's write access pattern or its concurrency. On reads, multiple concurrent sequential streams are sequential to the filesystem independent of the number of streams, leading to some optimisation at that level. The on-disk I/O pattern is governed by the layout, and again the concurrency of streams does not come into play when trying to understand the performance.

So IMO, having N files with a sequential access pattern does not imply that performance will be governed by a random I/O response from disks.

-r
Robert Milkowski
2009-Dec-29 11:03 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
I included networking-discuss@

On 28/12/2009 15:50, Saso Kiselkov wrote:
> Thank you for the advice. After trying flowadm the situation improved
> somewhat, but I'm still getting occasional packet overflows (10-100
> packets about every 10-15 minutes). This is somewhat unnerving, because
> I don't know how to track it down.
> [...]
> I also tuned udp_max_buf to 256MB. All recording processes are boosted
> to the RT priority class and zfs_txg_timeout=1 to force the system to
> commit data to disk in smaller and more manageable chunks. Is there any
> further tuning you could recommend?
>
> Regards,
> --
> Saso
I tried removing the flow and subjectively packet loss occurs a bit less often, but it is still happening. Right now I'm trying to figure out if it's due to the load on the server or not - I've left only about 15 concurrent recording instances running, producing < 8% load on the system. If the packet loss still occurs, I guess I'll have to disregard the loss measurements as irrelevant, since at such a load the server should not be dropping packets at all... I guess.

Regards,
--
Saso

Robert Milkowski wrote:
> I included networking-discuss@
>
> On 28/12/2009 15:50, Saso Kiselkov wrote:
>> Thank you for the advice. After trying flowadm the situation improved
>> somewhat, but I'm still getting occasional packet overflows (10-100
>> packets about every 10-15 minutes). This is somewhat unnerving, because
>> I don't know how to track it down.
>> [...]
Ok, I figured out that apparently I was the idiot in this story, again. I forgot to set SO_RCVBUF higher on my network sockets, so that's why I was dropping input packets.

The zfs_txg_timeout=1 setting is still necessary (or else dropping occurs when committing data to disk), but by increasing the network input buffer sizes it seems I was able to cut input packet loss to zero.

Thanks for all the valuable advice!

Regards,
--
Saso

Saso Kiselkov wrote:
> I tried removing the flow and subjectively packet loss occurs a bit less
> often, but it is still happening. Right now I'm trying to figure out if
> it's due to the load on the server or not - I've left only about 15
> concurrent recording instances running, producing < 8% load on the
> system. If the packet loss still occurs, I guess I'll have to disregard
> the loss measurements as irrelevant, since at such a load the server
> should not be dropping packets at all... I guess.
>
> Regards,
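For anyone following along, the fix Saso describes boils down to something like the following on each receiving UDP socket; the 4MB figure is purely illustrative (the effective ceiling is governed by udp_max_buf, mentioned earlier in the thread):

    #include <sys/socket.h>
    #include <stdio.h>

    /* Enlarge the receive buffer on one capture socket so that input
     * bursts survive short stalls in the writer.  Returns 0 on success. */
    static int
    grow_rcvbuf(int sock_fd)
    {
            int rcvbuf = 4 * 1024 * 1024;   /* example size only */

            if (setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF,
                &rcvbuf, sizeof (rcvbuf)) != 0) {
                    perror("setsockopt(SO_RCVBUF)");
                    return (-1);
            }
            return (0);
    }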
Thanks for this thread! I was just coming here to discuss this very same problem. I'm running 2009.06 on a Q6600 with 8GB of RAM. I have a Windows system writing multiple OTA HD video streams via CIFS to the 2009.06 system running Samba.

I then have multiple clients reading back other HD video streams. The write client never skips a beat, but the read clients have constant problems getting data when the "burst" writes occur.

I am now going to try the txg_timeout change and see if that helps. It would be nice if these tunables were settable on a per-pool basis though.
--
This message posted from opensolaris.org
Be sure to also update to the latest dev b130 release, as that also helps - it uses a smoother scheduling class for the zfs threads. If the upgrade breaks anything, you can always just boot back into the old boot environment from before the upgrade.

Regards,
--
Saso

Bill Werner wrote:
> Thanks for this thread! I was just coming here to discuss this very same
> problem. I'm running 2009.06 on a Q6600 with 8GB of RAM. I have a Windows
> system writing multiple OTA HD video streams via CIFS to the 2009.06
> system running Samba.
>
> I then have multiple clients reading back other HD video streams. The
> write client never skips a beat, but the read clients have constant
> problems getting data when the "burst" writes occur.
>
> I am now going to try the txg_timeout change and see if that helps. It
> would be nice if these tunables were settable on a per-pool basis though.
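As a rough sketch of the upgrade path being suggested (the dev repository URL and commands are as commonly used at the time; adjust for your own setup):

  # pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  # pkg image-update
  (reboot into the new boot environment; if it misbehaves, run
   "beadm activate <old-BE>" and reboot to fall back)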
Tim Cook writes:
> On Sun, Dec 27, 2009 at 6:43 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>> While I don't know what zfs actually does, I do know that it performs
>> large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks out of
>> those allocations. [...]
>
> How is that going to prevent blocks being spread all over the disk when
> you've got files several GB in size being written concurrently and deleted
> at random? And then throw in a mix of small files as well, and kiss that
> goodbye.

Big files being deleted create big chunks of space for reuse. That is a great way to clean up the layout. Within a metaslab, ZFS uses cursors to bunch small objects closer together.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c#501

>> This sort of planning is not terribly different from detecting sequential
>> read I/O and scheduling data reads in advance of application
>> requirements. If you can intelligently pre-fetch data blocks, then you
>> can certainly intelligently pre-allocate data blocks.
>
> Pre-allocating data blocks is also not going to cure head seek and the
> latency it induces on slow 7200/5400RPM drives.
>
>> Today I did an interesting (to me) test where I ran two copies of iozone
>> at once on huge (up to 64GB) files. [...]
>
> On a new, empty pool, or a pool that's been filled completely and emptied
> several times? It's not amazing to me on a new pool. I would be surprised
> to see you accomplish this feat repeatedly after filling and emptying the
> drives. It's a drawback of every implementation of copy-on-write I've
> ever seen. By its very nature, I have no idea how you would avoid it.

If you empty the drives you're back to all free space:

http://blogs.sun.com/bonwick/entry/space_maps

If you leave yourself a nice cushion of free space, and if your profile of object sizes does not radically change over time, I think people should be fine when it comes to free-space fragmentation issues. That said, slab and block selection is still on our radar for improvements.

-r
I've encountered a new problem on the opposite end of my app - the write() calls to disk sometimes block for a terribly long time (5-10 seconds) when I start deleting stuff on the filesystem where my recorder processes are writing. Looking at iostat I can see that the disk load is strongly uneven - with a lowered zfs_txg_timeout=1 I get normal writes every second, but when I start deleting stuff (e.g. "rm -r *"), huge load spikes appear from time to time, even to the level of blocking all processes writing to the filesystem and filling up the network input buffer and starting to drop packets.

Is there a way that I can increase the write I/O priority, or increase the write buffer in ZFS so that write()s won't block?

Regards,
--
Saso

Saso Kiselkov wrote:
> Ok, I figured out that apparently I was the idiot in this story, again.
> I forgot to set SO_RCVBUF higher on my network sockets, so that's why I
> was dropping input packets.
>
> The zfs_txg_timeout=1 setting is still necessary (or else dropping occurs
> when committing data to disk), but by increasing the network input buffer
> sizes it seems I was able to cut input packet loss to zero.
>
> Thanks for all the valuable advice!
>
> Regards,
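One way to put numbers on stalls like these is a DTrace quantization of write(2) latency for the affected processes; "recorder" below is only a placeholder for the actual executable name:

  # dtrace -n '
      syscall::write:entry /execname == "recorder"/ { self->ts = timestamp; }
      syscall::write:return /self->ts/ {
              @["write latency (ns)"] = quantize(timestamp - self->ts);
              self->ts = 0;
      }'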
Bob Friesenhahn
2010-Jan-06 18:44 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Wed, 6 Jan 2010, Saso Kiselkov wrote:
> I've encountered a new problem on the opposite end of my app - the
> write() calls to disk sometimes block for a terribly long time (5-10
> seconds) when I start deleting stuff on the filesystem where my recorder
> processes are writing. [...]
>
> Is there a way that I can increase the write I/O priority, or increase
> the write buffer in ZFS so that write()s won't block?

Deleting stuff results in many small writes to the pool in order to free up blocks and update metadata. It is one of the most challenging tasks that any filesystem will do.

It seems that the most recent development OpenSolaris has added use of a new scheduling class in order to limit the impact of such "load spikes". I am eagerly looking forward to being able to use this.

It is difficult for your application to do much if the network device driver fails to work, but your application can do some of its own buffering and use multithreading so that even a long delay can be handled. Use of the asynchronous write APIs may also help. Writes should be blocked up to the size of the zfs block (e.g. 128K), and also aligned to the zfs block if possible.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
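A minimal sketch of the asynchronous-write idea Bob mentions, using POSIX aio with 128K records; buffer management and completion handling are omitted, and this is only an illustration, not the recorder's actual code:

    #include <aio.h>
    #include <stdio.h>
    #include <string.h>

    /* Submit one 128K record asynchronously so the network loop never
     * blocks on the filesystem.  The caller must keep record_buf and *cb
     * alive until the write completes. */
    static int
    submit_record(int out_fd, void *record_buf, off_t file_offset,
        struct aiocb *cb)
    {
            memset(cb, 0, sizeof (*cb));
            cb->aio_fildes = out_fd;
            cb->aio_buf = record_buf;
            cb->aio_nbytes = 128 * 1024;    /* matches the default zfs recordsize */
            cb->aio_offset = file_offset;
            if (aio_write(cb) != 0) {
                    perror("aio_write");
                    return (-1);
            }
            /* later: poll aio_error(cb); once it is no longer EINPROGRESS,
             * reap the result with aio_return(cb) and recycle the buffer */
            return (0);
    }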
I'm aware of the theory and realize that deleting stuff requires writes. I'm also running on the latest b130 and write stuff to disk in large 128k chunks. The thing I was wondering about is whether there is a mechanism that might lower the I/O scheduling priority of a given process (e.g. lower the priority of the rm command) in a manner similar to CPU scheduling priority. Another solution would be to increase the max size of the ZFS write buffer, so that writes would not block.

What I'd specifically like to avoid doing is buffering writes in the recorder process. Besides being complicated to do (the process periodically closes and reopens several output files at specific moments in time, and keeping them in sync is a bit hairy), I need the written data to appear in the filesystem very soon after being received from the network. The logic behind this is that this is streaming media data which a user can immediately start playing back while it's being recorded. It's crucial that the user be able to follow the real-time recording with at most a 1-2 second delay (in fact, at the moment I can get down to 1 second behind live TV). If I buffer writes for up to 10 seconds in user-space, other playback processes can fail due to running out of data.

Regards,
--
Saso

Bob Friesenhahn wrote:
> Deleting stuff results in many small writes to the pool in order to free
> up blocks and update metadata. It is one of the most challenging tasks
> that any filesystem will do.
>
> It seems that the most recent development OpenSolaris has added use of a
> new scheduling class in order to limit the impact of such "load spikes".
> I am eagerly looking forward to being able to use this.
>
> It is difficult for your application to do much if the network device
> driver fails to work, but your application can do some of its own
> buffering and use multithreading so that even a long delay can be
> handled. Use of the asynchronous write APIs may also help. Writes
> should be blocked up to the size of the zfs block (e.g. 128K), and also
> aligned to the zfs block if possible.
Bob Friesenhahn
2010-Jan-06 22:21 UTC
[zfs-discuss] ZFS write bursts cause short app stalls
On Wed, 6 Jan 2010, Saso Kiselkov wrote:
> I'm aware of the theory and realize that deleting stuff requires writes.
> I'm also running on the latest b130 and write stuff to disk in large
> 128k chunks. The thing I was wondering about is whether there is a
> mechanism that might lower the I/O scheduling priority of a given
> process (e.g. lower the priority of the rm command) in a manner similar
> to CPU scheduling priority. Another solution would be to increase the
> max size of the ZFS write buffer, so that writes would not block.

Disks only have so many IOPS available and commands like 'rm -rf' use quite a lot of them. The 'rm' command actually does hardly anything since unlink() is just one system call which could cause a flurry of activity. A simple solution is to ban use of 'rm -rf' and use a substitute which intentionally works slowly. The source code for 'rm' is available, so you could even modify OpenSolaris 'rm' so that it does a bit of a sleep after removing each file.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
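A crude version of the throttled delete Bob suggests could look like the following sketch (the path is only an example, and the one-second pause per file is a knob to tune):

  # find /tank/recordings/expired -type f | while read f; do rm "$f"; sleep 1; done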
Buffering the writes in the OS would work for me as well - I've got RAM to spare. Slowing down rm is perhaps one way to go, but definitely not a real solution. On rare occasions I could still get lockups, leading to screwed-up recordings, and if there's one thing people don't like about IPTV, it's packet loss. Eliminating even the possibility of packet loss completely would be the best way to go, I think.

Regards,
--
Saso

Bob Friesenhahn wrote:
> Disks only have so many IOPS available and commands like 'rm -rf' use
> quite a lot of them. The 'rm' command actually does hardly anything
> since unlink() is just one system call which could cause a flurry of
> activity. A simple solution is to ban use of 'rm -rf' and use a
> substitute which intentionally works slowly. The source code for 'rm'
> is available, so you could even modify OpenSolaris 'rm' so that it does
> a bit of a sleep after removing each file.
On Wed, Jan 6, 2010 at 2:40 PM, Saso Kiselkov <skiselkov at gmail.com> wrote:
> Buffering the writes in the OS would work for me as well - I've got RAM
> to spare. Slowing down rm is perhaps one way to go, but definitely not a
> real solution. On rare occasions I could still get lockups, leading to
> screwed-up recordings, and if there's one thing people don't like about
> IPTV, it's packet loss. Eliminating even the possibility of packet loss
> completely would be the best way to go, I think.
>
> Regards,
> --
> Saso

I shouldn't dare suggest this, but what about disabling the ZIL? Since this sounds like transient data to begin with, any risks would be pretty low, I'd imagine.

--
Brent Jones
brent at servuhome.net
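For the record, on builds of that era the ZIL was typically disabled via the zil_disable tunable (quoted here from memory of the pre-b140 tunables, so treat it as an assumption); note also that the ZIL only affects synchronous writes, which may be why it makes no difference for plain write() traffic like this:

  * in /etc/system:        set zfs:zil_disable = 1
  * or on a live system:   # echo zil_disable/W0t1 | mdb -kw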
Just tried it, and it didn't help :-(.

Regards,
--
Saso

Brent Jones wrote:
> I shouldn't dare suggest this, but what about disabling the ZIL? Since
> this sounds like transient data to begin with, any risks would be
> pretty low, I'd imagine.