Hi,

I did some tests on a Sun Fire X4540 with an external J4500 array (connected via two HBA ports), i.e. there are 96 disks in total, configured as seven 12-disk raidz2 vdevs (plus system, spares, and unused disks), providing a ~63 TB pool with fletcher4 checksums. The system was recently equipped with a Sun Flash Accelerator F20 with 4 FMod modules to be used as log devices (ZIL). I was using the latest snv_134 software release.

Here are some first performance numbers for the extraction of an uncompressed 50 MB tarball on a Linux (CentOS 5.4 x86_64) NFS client which mounted the test filesystem (no compression or dedup) via NFSv3 (rsize=wsize=32k,sync,tcp,hard):

standard ZIL:         7m40s   (ZFS default)
1x SSD ZIL:           4m07s   (Flash Accelerator F20)
2x SSD ZIL:           2m42s   (Flash Accelerator F20)
2x SSD mirrored ZIL:  3m59s   (Flash Accelerator F20)
3x SSD ZIL:           2m47s   (Flash Accelerator F20)
4x SSD ZIL:           2m57s   (Flash Accelerator F20)
disabled ZIL:         0m15s
(local extraction     0m0.269s)

I was not so much interested in the absolute numbers but rather in the relative performance differences between the standard ZIL, the SSD ZIL, and the disabled ZIL cases.

Any opinions on the results? I wish the SSD ZIL performance was closer to the disabled ZIL case than it is right now.

At the moment I tend to use two F20 FMods for the log and the other two FMods as L2ARC cache devices (although the system has lots of system memory, i.e. the L2ARC is not really necessary). But the speedup of disabling the ZIL altogether is appealing (and would probably be acceptable in this environment).
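For reference, the client-side test was essentially of this form (server name, mount point and tarball path below are placeholders, not the exact ones used):

    # on the Linux NFS client
    mount -t nfs -o vers=3,rsize=32768,wsize=32768,sync,tcp,hard server:/tank/test /mnt/test
    cd /mnt/test
    time tar xf /var/tmp/test-50MB.tar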
Hey Karsten,

Very interesting data. Your test is inherently single-threaded so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than single-threaded latency.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
On 3/30/2010 2:44 PM, Adam Leventhal wrote:
> Very interesting data. Your test is inherently single-threaded so I'm not surprised
> that the benefits aren't more impressive -- the flash modules on the F20 card are
> optimized more for concurrent IOPS than single-threaded latency.

Yes, it would be interesting to see the average numbers for 10 or more clients (or jobs on one client) all performing that same test.

-Kyle
Hi Karsten, Adam, List,

Adam Leventhal wrote:
> Very interesting data. Your test is inherently single-threaded so I'm not surprised
> that the benefits aren't more impressive -- the flash modules on the F20 card are
> optimized more for concurrent IOPS than single-threaded latency.

Well, I actually wanted to do a bit more bottleneck searching, but let me weigh in with some measurements of our own :)

We're on a single X4540 with quad-core CPUs, so we're on the older HyperTransport bus. We connected it up to two X2200s running CentOS 5, each on its own 1Gb link. Switched write caching off with the following addition to the /kernel/drv/sd.conf file (Karsten: if you didn't do this already, you _really_ want to :) ):

# http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
# Add whitespace to make the vendor ID (VID) 8 ... and Product ID (PID) 16 characters long...
sd-config-list = "ATA     MARVELL SD88SA02", "cache-nonvolatile";
cache-nonvolatile=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

As a test we've found that untarring an Eclipse source tarball is a good use case, so we use that, called from a shell script that creates a directory, pushes into it and does the unpacking, 40 times on each machine.

Now for the interesting bit:

When we use one vmod, both machines are finished in about 6min45; zilstat maxes out at about 4200 IOPS.
Using four vmods it takes about 6min55; zilstat maxes out at 2200 IOPS.

In both cases, probing the HyperTransport bus seems to show no bottleneck there (although I'd like to see the bidirectional flow, but I know we can't :) ). The network stays comfortably under 400 Mbit/s, and that's peak load when using 1 vmod.

Looking at the I/O-connection architecture, it figures that in this setup we traverse the different HT busses quite a lot. So we've also placed an Intel dual 1Gb NIC in another PCIe slot, so that ZIL traffic should only have to use 1 HT bus (not counting offloading intelligence). That helped a bit, but not much:

Around 6min35 using one vmod and 6min45 using four vmods.

It made looking at the HT DTrace more telling though, since the outgoing HT bus to the F20 (and the e1000s) is now, expectedly, a better indication of the ZIL traffic.

We didn't do the 40 x 2 untar test whilst not using an SSD device. As an indication: unpacking a single tarball then takes about 1min30.

In case it means anything, a single tarball unpack for no_zil, 1vmod, 1vmod_Intel, 4vmods, 4vmods_Intel measures around (decimals only used as an indication!): 4s, 12s, 11.2s, 12.5s, 11.6s.

Taking this all into account, I still don't see what's holding it up. Interestingly enough, the client-side times are close within about 10 secs, but zilstat shows something different. Hypothesis: zilstat shows only one vmod and we're capped in a layer above the ZIL? Can't rule out networking just yet, but my gut tells me we're not network bound here. That leaves the ZFS ZPL/VFS layer?

I'm very open to suggestions on how to proceed... :)

With kind regards,

Jeroen
--
Jeroen Roodhart
ICT Consultant
University of Amsterdam
j.r.roodhart uva.nl          Informatiseringscentrum
Technical support/ATG
--
See http://www.science.uva.nl/~jeroen for openPGP public key
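(In case anyone wants to repeat this: the VID/PID strings for sd-config-list can be read from the device inquiry data, roughly like this; the grep pattern is just an example for these particular modules:)

    # list vendor/product strings as the sd driver sees them
    iostat -En | grep -i marvell
    #   Vendor: ATA      Product: MARVELL SD88SA02 ...

    # or interactively: format -> select the FMod device -> inquiry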
On Mar 30, 2010, at 2:50 PM, Jeroen Roodhart wrote:

> Switched write caching off with the following addition to the /kernel/drv/sd.conf file
> (Karsten: if you didn't do this already, you _really_ want to :) ):
>
> sd-config-list = "ATA     MARVELL SD88SA02", "cache-nonvolatile";
> cache-nonvolatile=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

If you are going to trick the system into thinking a volatile cache is nonvolatile, you might as well disable the ZIL -- the data corruption potential is the same.

> Taking this all into account, I still don't see what's holding it up. Interestingly
> enough, the client-side times are close within about 10 secs, but zilstat shows
> something different. Hypothesis: zilstat shows only one vmod and we're capped in a
> layer above the ZIL? Can't rule out networking just yet, but my gut tells me we're
> not network bound here. That leaves the ZFS ZPL/VFS layer?

The difference between writing to the ZIL and not writing to the ZIL is perhaps thousands of CPU cycles. For a latency-sensitive workload this will be noticed.
 -- richard
ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
> If you are going to trick the system into thinking a volatile cache is nonvolatile, you
> might as well disable the ZIL -- the data corruption potential is the same.

I'm sorry? I believe the F20 has a supercap or the like? The advice on:

http://wikis.sun.com/display/Performance/Tuning+ZFS+for+the+F5100#TuningZFSfortheF5100-ZFSF5100

is to disable write caching altogether. We opted not to do _that_ though... :)

Are you sure that disabling the write cache on the F20 is a bad thing to do?

With kind regards,

Jeroen
On Mar 30, 2010, at 3:32 PM, Jeroen Roodhart wrote:
>> If you are going to trick the system into thinking a volatile cache is nonvolatile, you
>> might as well disable the ZIL -- the data corruption potential is the same.
>
> I'm sorry? I believe the F20 has a supercap or the like? The advice on:

You are correct, I misread the Marvell (as in F20) and X4540 (as in not X4500) combination.

> http://wikis.sun.com/display/Performance/Tuning+ZFS+for+the+F5100#TuningZFSfortheF5100-ZFSF5100
>
> is to disable write caching altogether. We opted not to do _that_ though... :)

Good idea. That recommendation is flawed for the general case and only applies when all devices have nonvolatile caches.

> Are you sure that disabling the write cache on the F20 is a bad thing to do?

I agree that it is a reasonable choice. For this case, what is the average latency to the F20?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Richard Elling wrote:
> On Mar 30, 2010, at 3:32 PM, Jeroen Roodhart wrote:
>> Are you sure that disabling the write cache on the F20 is a bad thing to do?
>
> I agree that it is a reasonable choice.

For those following along at home, I'm pretty sure that the terminology being used is confusing at best, and just plain wrong at worst.

The write cache is _not_ being disabled. The write cache is being marked as non-volatile. By marking the write cache as non-volatile, one is telling ZFS to not issue cache flush commands.

BTW, why is a Sun/Oracle branded product not properly respecting the NV bit in the cache flush command? This seems remarkably broken, and leads to the amazingly bad advice given on the wiki referenced above.

-- Carson
> But the speedup of disabling the ZIL altogether is appealing (and would
> probably be acceptable in this environment).

Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed cpu, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)
> I was not so much interested in the absolute numbers but rather in the relative
> performance differences between the standard ZIL, the SSD ZIL and the disabled
> ZIL cases.

Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.
On Tue, 30 Mar 2010, Edward Ned Harvey wrote:
>> But the speedup of disabling the ZIL altogether is appealing (and would
>> probably be acceptable in this environment).
>
> Just to make sure you know ... if you disable the ZIL altogether, and you
> have a power interruption, failed cpu, or kernel halt, then you're likely to
> have a corrupt unusable zpool, or at least data corruption. If that is
> indeed acceptable to you, go nuts. ;-)

I believe that the above is wrong information as long as the devices involved do flush their caches when requested to. Zfs still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their cache. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost.

If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> Again, we can't get a straight answer on this one......
> (or at least not 1 straight answer...)
>
> Since the ZIL logs are committed atomically they are either committed
> in FULL, or NOT at all (by way of rollback of incomplete ZIL applies at
> zpool mount time / or transaction rollbacks if things go exceptionally
> bad), the only LOST data would be what hasn't been transferred from ZIL
> to the primary pool......
>
> But the pool should be "sane".

If this is true ... Suppose you shut down a system, remove the ZIL device, and power back on again. What will happen? I'm informed that with current versions of Solaris, you simply can't remove a ZIL device once it's added to a pool. (That's changed in recent versions of OpenSolaris.) ... but in any system where removing the ZIL isn't allowed, what happens if the ZIL is removed? I have to assume something which isn't quite sane happens.
On Tue, 30 Mar 2010, Edward Ned Harvey wrote:
> If this is true ... Suppose you shut down a system, remove the ZIL device,
> and power back on again. What will happen? I'm informed that with current
> versions of Solaris, you simply can't remove a ZIL device once it's added to
> a pool. (That's changed in recent versions of OpenSolaris.) ... but in any
> system where removing the ZIL isn't allowed, what happens if the ZIL is
> removed?

If the ZIL device goes away then zfs might refuse to use the pool without user affirmation (due to potential loss of uncommitted transactions), but if the dedicated ZIL device is gone, zfs will use disks in the main pool for the ZIL.

This has been clarified before on the list by top zfs developers.

Bob
> If the ZIL device goes away then zfs might refuse to use the pool
> without user affirmation (due to potential loss of uncommitted
> transactions), but if the dedicated ZIL device is gone, zfs will use
> disks in the main pool for the ZIL.
>
> This has been clarified before on the list by top zfs developers.

Here's a snippet from man zpool (latest version available today in Solaris):

     zpool remove pool device ...

         Removes the specified device from the pool. This command
         currently only supports removing hot spares and cache
         devices. Devices that are part of a mirrored configura-
         tion can be removed using the zpool detach command.
         Non-redundant and raidz devices cannot be removed from a
         pool.

So you think it would be ok to shut down, physically remove the log device, and then power back on again, and force import the pool? So although there may be no "live" way to remove a log device from a pool, it might still be possible if you offline the pool to ensure writes are all completed before removing the device?

If it were really just that simple ... if zfs only needed to stop writing to the log device and ensure the cache were flushed, and then you could safely remove the log device ... doesn't it seem silly that there was ever a time when that wasn't implemented? Like ... today. (Still not implemented in Solaris, only OpenSolaris.)

I know I am not going to put the health of my pool on the line, assuming this line of thought.
> if you disable the ZIL altogether, and you have a power interruption, failed cpu,
> or kernel halt, then you're likely to have a corrupt unusable zpool

the pool will always be fine, no matter what.

> or at least data corruption.

yea, it's a good bet that data sent to your file or zvol will not be there when the box comes back, even though your program had finished seconds before the crash.

Rob
Allow me to clarify a little further why I care about this so much. I have a Solaris file server with all the company jewels on it. I had a pair of Intel X25 SSD mirrored log devices. One of them failed. The replacement device came with a newer version of firmware on it. Now, instead of appearing as 29.802 GB, it appears as 29.801 GB. I cannot zpool attach. New device is too small.

So apparently I'm the first guy this happened to. Oracle is caught totally off guard. They're pulling their inventory of X25s from dispatch warehouses, and inventorying all the firmware versions, and trying to figure it all out. Meanwhile, I'm still degraded. Or at least, I think I am.

Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until they can locate a drive with the correct firmware). All the support people I have on the phone are just as scared as I am. "Well, we could upgrade the firmware of your existing drive, but that'll reduce it by 0.001 GB, and that might just create a time bomb to destroy your pool at a later date." So we don't do it.

Nobody has suggested that I simply shut down and remove my unmirrored SSD, and power back on.
On 03/30/10 20:00, Bob Friesenhahn wrote:
> I believe that the above is wrong information as long as the devices
> involved do flush their caches when requested to. Zfs still writes
> data in order (at the TXG level) and advances to the next transaction
> group when the devices written to affirm that they have flushed their
> cache. Without the ZIL, data claimed to be synchronously written
> since the previous transaction group may be entirely lost.
>
> If the devices don't flush their caches appropriately, the ZIL is
> irrelevant to pool corruption.

Yes, Bob is correct - that is exactly how it works.

Neil.
> I believe that the above is wrong information as long as the devices
> involved do flush their caches when requested to. [...]
>
> If the devices don't flush their caches appropriately, the ZIL is
> irrelevant to pool corruption.

I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA, data corruption.

But not pool corruption, and not filesystem corruption.
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
> goes bad, you lose your whole pool. Or at least suffer data corruption.

Hmmm, I thought that in that case ZFS reverts to the "regular on disks" ZIL?

With kind regards,

Jeroen
> The write cache is _not_ being disabled. The write cache is being marked
> as non-volatile.

Of course you're right :) Please filter my postings with a "sed 's/write cache/write cache flush/g'" ;)

> BTW, why is a Sun/Oracle branded product not properly respecting the NV
> bit in the cache flush command? This seems remarkably broken, and leads
> to the amazingly bad advice given on the wiki referenced above.

I suspect it has something to do with "emulating disk semantics" over PCIe. Anyway, this did get us stumped in the beginning; performance wasn't better than when using an OCZ Vertex Turbo ;)

By the way, the URL to the reference is part of the official F20 product documentation (that's how we found it in the first place)...

With kind regards,

Jeroen
> I stand corrected. You don't lose your pool. You don't have a corrupted
> filesystem. But you lose whatever writes were not yet completed [...]
>
> But not pool corruption, and not filesystem corruption.

Yeah, that's a big difference! :) Of course we could not live with pool or fs corruption. However, we can live with the fact that the NFS-written data is not all on disk in case of a server crash, even though the NFS client could otherwise rely on the write guarantee of the NFS protocol. I.e. we do not use it for db transactions or something like that.
Hi Adam,

> Very interesting data. Your test is inherently single-threaded so I'm not
> surprised that the benefits aren't more impressive -- the flash modules
> on the F20 card are optimized more for concurrent IOPS than
> single-threaded latency.

Thanks for your reply. I'll probably test the multiple-writer case, too.

But frankly, at the moment I care the most about the single-threaded case, because if we put e.g. user homes on this server I think they would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the performance down to acceptable runtimes. But IMHO 2m42s is still too slow and disabling the ZIL seems to be the only option.

Knowing that 100s of users could do this in parallel with good performance is nice, but it does not improve the situation for the single user who only cares about his own tar run. If there's anything else we can do/try to improve the single-threaded case I'm all ears.
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss <k.weiss at science-computing.de> wrote:
> Knowing that 100s of users could do this in parallel with good performance
> is nice, but it does not improve the situation for the single user who only
> cares about his own tar run. If there's anything else we can do/try to improve
> the single-threaded case I'm all ears.

Use something other than Open/Solaris with ZFS as an NFS server? :)

I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

You'd be better off getting a NetApp.

--
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> I don't think you'll find the performance you paid for with ZFS and
> Solaris at this time. I've been trying for more than a year, and
> watching dozens, if not hundreds, of threads. Getting halfway decent
> performance from NFS and ZFS is impossible unless you disable the ZIL.
>
> You'd be better off getting a NetApp.

A few days ago I posted to nfs-discuss with a proposal to add some mount/share options to change the semantics of an NFS-mounted filesystem so that they parallel those of a local filesystem. The main point is that data gets flushed to stable storage only if the client explicitly requests so via fsync or O_DSYNC, not implicitly with every close(). That would give you the performance you are seeking without sacrificing data integrity for applications that need it.

I get the impression that I'm not the only one who could be interested in that ;)

-Arne
> Nobody knows any way for me to remove my unmirrored
> log device. Nobody knows any way for me to add a mirror to it (until

Since snv_125 you can remove log devices. See
http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

I've used this all the time during my testing and was able to remove both mirrored and unmirrored log devices without any problems (and without reboot). I'm using snv_134.
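For what it's worth, the operations involved look roughly like this (pool and device names are only examples, not my actual configuration):

    # add a single log device, or a mirrored log
    zpool add tank log c1t2d0
    zpool add tank log mirror c1t2d0 c1t3d0

    # remove a (non-mirrored) log device again -- works since snv_125
    zpool remove tank c1t2d0

    # detach one side of a mirrored log
    zpool detach tank c1t3d0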
> Use something other than Open/Solaris with ZFS as an NFS server? :)
>
> I don't think you'll find the performance you paid for with ZFS and
> Solaris at this time. I've been trying for more than a year, and
> watching dozens, if not hundreds, of threads. Getting halfway decent
> performance from NFS and ZFS is impossible unless you disable the ZIL.

Well, for lots of environments disabling the ZIL is perfectly acceptable. And frankly, the reason you get better performance out of the box on Linux as an NFS server is that it actually behaves as if the ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here, or any other OS which behaves in the same manner. Actually it makes it better, as even with the ZIL disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits of ZFS.

What would be useful, though, is to be able to easily disable the ZIL per dataset instead of an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later.

> You'd be better off getting a NetApp.

Well, spend some extra money on a really fast NVRAM solution for the ZIL and you will get a much faster ZFS environment than NetApp and still spend much less money. Not to mention all the extra flexibility compared to NetApp.

--
Robert Milkowski
http://milek.blogspot.com
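For reference, the OS-wide switch I mean is the zil_disable tunable; roughly (with the usual caveats, and note that filesystems may need to be remounted for a live change to take effect):

    # in /etc/system (takes effect at next boot)
    set zfs:zil_disable = 1

    # or on a live system via mdb
    echo zil_disable/W0t1 | mdb -kw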
> I stand corrected. You don't lose your pool. You don't have a corrupted
> filesystem. But you lose whatever writes were not yet completed, so if
> those writes happen to be things like database transactions, you could have
> corrupted databases or files, or missing files if you were creating them at
> the time, and stuff like that. AKA, data corruption.
>
> But not pool corruption, and not filesystem corruption.

Which is the expected behavior when you break NFS requirements, as Linux does out of the box. Disabling the ZIL on an NFS server makes it no worse than the standard Linux behaviour - now you get decent performance at the cost of some data possibly getting corrupted from an NFS client's point of view.

But then there are environments where this is perfectly acceptable, as you are not running critical databases there but rather user home directories, and ZFS will currently flush a transaction group after at most 30s, so a user won't be able to lose more than roughly the last 30s of writes if the NFS server were to suddenly lose power.

To clarify - if the ZIL is disabled it makes no difference at all to pool/filesystem-level consistency.

--
Robert Milkowski
http://milek.blogspot.com
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
> goes bad, you lose your whole pool. Or at least suffer data corruption.

This is not true. If the ZIL device were to die while the pool is imported, ZFS would start using a ZIL within the pool and continue to operate.

On the other hand, if your server were to suddenly lose power, and then when you power it up later on ZFS detects that the ZIL is broken/gone, it will require a sysadmin intervention to force the pool import and yes, possibly lose some data. But how is that different from any other solution where your log is put on a separate device? Well, it is actually different. With ZFS you can still guarantee it to be consistent on disk, while others generally can't, and often you will have to run fsck to even mount a filesystem read/write...

--
Robert Milkowski
http://milek.blogspot.com
> What would be useful, though, is to be able to easily disable the ZIL per
> dataset instead of an OS-wide switch. This feature has already been coded
> and tested and awaits a formal process to be completed in order to get
> integrated. Should be rather sooner than later.

I agree.

> Well, spend some extra money on a really fast NVRAM solution for the ZIL and
> you will get a much faster ZFS environment than NetApp and still spend much
> less money. Not to mention all the extra flexibility compared to NetApp.

Do you have a concrete recommendation we could use in the X4540 instead of the F20?
Hi Jeroen, Adam!

> link. Switched write caching off with the following
> addition to the /kernel/drv/sd.conf file (Karsten: if
> you didn't do this already, you _really_ want to :)

Okay, I bite! :)

format -> inquiry on the F20 FMod disks returns:

# Vendor:   ATA
# Product:  MARVELL SD88SA02

So I put this in /kernel/drv/sd.conf and rebooted:

# KAW, 2010-03-31
# Set F20 FMod devices to non-volatile mode
# See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
sd-config-list = "ATA     MARVELL SD88SA02", "nvcache1";
nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

Now the tarball extraction test with active ZIL finishes in ~0m32s! I've tested with a mirrored SSD log and two separate SSD log devices. The runtime is nearly the same. Compared to the 2m42s before the /kernel/drv/sd.conf modification this is a huge improvement. The performance with active ZIL would be acceptable now.

But is this mode of operation *really* safe?

FWIW, zilstat during the test shows this:

   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
         0          0           0          0          0           0      0      0       0       0
   1039072    1039072     1039072    3772416    3772416     3772416    610    299     311       0
   1522496    1522496     1522496    5402624    5402624     5402624    874    429     445       0
   2292952    2292952     2292952    6746112    6746112     6746112    931    215     716       0
   2321272    2321272     2321272    6774784    6774784     6774784    931    208     723       0
   2303472    2303472     2303472    6549504    6549504     6549504    897    195     702       0
   2222632    2222632     2222632    6733824    6733824     6733824    935    226     709       0
   2198328    2198328     2198328    6668288    6668288     6668288    926    224     702       0
   2170000    2170000     2170000    6373376    6373376     6373376    878    200     678       0
   2185416    2185416     2185416    6352896    6352896     6352896    874    197     677       0
   2218040    2218040     2218040    6516736    6516736     6516736    897    203     694       0
   2436984    2436984     2436984    6549504    6549504     6549504    885    171     714       0

I.e. ~900 ops/s.
> Use something other than Open/Solaris with ZFS as an NFS server? :)
> [...]
> You'd be better off getting a NetApp.

Hah hah. I have a Sun X4275 server exporting NFS. We have clients on all 4 of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or filesystem.

I suggest you either enable the WriteBack cache on your HBA, or add SSDs for ZIL. Performance is 5-10x higher this way than using "naked" disks. But of course, not as high as it is with a disabled ZIL.
Hi Karsten,

> But is this mode of operation *really* safe?

As far as I can tell it is.

- The F20 uses some form of power backup that should provide power to the interface card long enough to get the cache onto solid state in case of a power failure.
- Recollecting from earlier threads here: in case the card fails (but not the host), there should be enough data residing in memory for ZFS to safely switch to the regular on-disk ZIL.
- According to my contacts at Sun, the F20 is a viable replacement solution for the X25-E.
- Switching write caching off seems to be officially recommended on the Sun performance wiki (translated to "more sane defaults").

If I'm wrong here I'd like to know too, 'cause this is probably the way we're taking it into production. :)

With kind regards,

Jeroen
> > Nobody knows any way for me to remove my unmirrored
> > log device. Nobody knows any way for me to add a mirror to it (until
>
> Since snv_125 you can remove log devices. See
> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
>
> I've used this all the time during my testing and was able to remove both
> mirrored and unmirrored log devices without any problems (and without
> reboot). I'm using snv_134.

Aware. OpenSolaris can remove log devices. Solaris cannot. Yet. But if you want your server in production, you can get a support contract for Solaris. OpenSolaris cannot.
Hi Richard,

> For this case, what is the average latency to the F20?

I'm not giving the average since I only performed a single run here (still need to get autopilot set up :) ). However, here is a graph of iostat IOPS/svc_t sampled in 10-second intervals during a run of untarring an Eclipse tarball 40 times from two hosts. I'm using 1 vmod here.

http://www.science.uva.nl/~jeroen/zil_1slog_e1000_iostat_iops_svc_t_10sec_interval.pdf

Maximum svc_t is around 2.7ms averaged over 10s.

Still wondering why this won't scale out though. We don't seem to be CPU bound, unless ZFS limits itself to max 30% CPU time?

With kind regards,

Jeroen
> > Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
> > goes bad, you lose your whole pool. Or at least suffer data corruption.
>
> Hmmm, I thought that in that case ZFS reverts to the "regular on disks" ZIL?

I see the source for some confusion. On the ZFS Best Practices page:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

It says:

    Failure of the log device may cause the storage pool to be inaccessible if you are
    running the Solaris Nevada release prior to build 96 and a release prior to the
    Solaris 10 10/09 release.

It also says:

    If a separate log device is not mirrored and the device that contains the log fails,
    storing log blocks reverts to the storage pool. ...

At the time when I built my system (Oct 2009) this is what it said:

    At present, until [http://bugs.opensolaris.org/view_bug.do?bug_id=6707530 CR 6707530]
    is integrated, failure of the log device may cause the storage pool to be
    inaccessible. Protecting the log device by mirroring will allow you to access the
    storage pool even if a log device has failed.
On Wed, Mar 31, 2010 at 6:31 AM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
> Aware. OpenSolaris can remove log devices. Solaris cannot. Yet. But if
> you want your server in production, you can get a support contract for
> Solaris. OpenSolaris cannot.

According to who?

http://www.opensolaris.com/learn/features/availability/

"Full production level support

Both Standard and Premium support offerings are available for deployment of Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following configurations:"

--Tim
On Wed, 31 Mar 2010, Tim Cook wrote:
> http://www.opensolaris.com/learn/features/availability/
>
> "Full production level support
>
> Both Standard and Premium support offerings are available for deployment of
> Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following configurations:"

This formal OpenSolaris release is too ancient to do him any good. In fact, zfs-wise, it lags the Solaris 10 releases.

If there is ever another OpenSolaris formal release, then the situation will be different.

Bob
On Wed, 31 Mar 2010, Karsten Weiss wrote:
> But frankly, at the moment I care the most about the single-threaded case,
> because if we put e.g. user homes on this server I think they would be
> severely disappointed if they had to wait 2m42s just to extract a rather
> small 50 MB tarball. The default 7m40s without SSD log was unacceptable,
> and we were hoping that the F20 would make a big difference and bring the
> performance down to acceptable runtimes. But IMHO 2m42s is still too slow
> and disabling the ZIL seems to be the only option.

Is extracting 50 MB tarballs something that your users do quite a lot of? Would your users be concerned if there was a possibility that after extracting a 50 MB tarball files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

The Sun Flash Accelerator F20 was not strictly designed as a zfs log device. It was originally designed to be a database accelerator. It was repurposed for zfs slog use because it works. It is a bit wimpy for bulk data. If you need fast support for bulk writes, perhaps you need something like STEC's very expensive ZEUS SSD drive.

Bob
On Tue, March 30, 2010 22:40, Edward Ned Harvey wrote:
> Here's a snippet from man zpool (latest version available today in Solaris):
>
>      zpool remove pool device ...
>
>          Removes the specified device from the pool. This command
>          currently only supports removing hot spares and cache
>          devices. Devices that are part of a mirrored configura-
>          tion can be removed using the zpool detach command.
>          Non-redundant and raidz devices cannot be removed from a
>          pool.
>
> So you think it would be ok to shut down, physically remove the log device,
> and then power back on again, and force import the pool? So although

A "cache device" is for the L2ARC; a "log device" is for the ZIL. Log devices are removable as of snv_125 (mentioned in another e-mail).

If you want log removal in Solaris proper, and you have a support account, call up and ask that CR 6574286 be fixed:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
On Wed, Mar 31, 2010 at 9:47 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> This formal OpenSolaris release is too ancient to do him any good. In
> fact, zfs-wise, it lags the Solaris 10 releases.
>
> If there is ever another OpenSolaris formal release, then the situation
> will be different.

C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course it'd be way more fun if someone from Sun threw caution to the wind and told us what the hold-up is *cough*oracle*cough*.

--Tim
> I had a pair of Intel X25 SSD mirrored log devices. One of them failed. The
> replacement device came with a newer version of firmware on it. [...]
>
> So apparently I'm the first guy this happened to. Oracle is caught totally
> off guard.

This isn't the only problem that SnOracle has had with the X25s. We managed to reproduce a problem with the SSDs as ZIL on an X4250. An I/O error of some sort caused a retryable write error ... which brought throughput to 0 as if a PCI bus reset had occurred. Here's a sample of our output... you might want to check and see if you're getting similar errors.

Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,25f8 at 4/pci111d,801c at 0/pci111d,801c at 4/pci1000,3150 at 0 (mpt1):
Jan 10 21:36:52 tips-fs1.tamu.edu   Log info 31126000 received for target 15.
Jan 10 21:36:52 tips-fs1.tamu.edu   scsi_status=0, ioc_status=804b, scsi_state=c
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,25f8 at 4/pci111d,801c at 0/pci111d,801c at 4/pci1000,3150 at 0 (mpt1):
Jan 10 21:36:52 tips-fs1.tamu.edu   Log info 31126000 received for target 15.
Jan 10 21:36:52 tips-fs1.tamu.edu   scsi_status=0, ioc_status=804b, scsi_state=c
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,25f8 at 4/pci111d,801c at 0/pci111d,801c at 4/pci1000,3150 at 0/sd at f,0 (sd28):
Jan 10 21:36:52 tips-fs1.tamu.edu   Error for Command: write    Error Level: Retryable
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Requested Block: 8448    Error Block: 8448
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Vendor: ATA    Serial Number: CVEM902401BA
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Sense Key: Unit Attention
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

We were lucky to catch the problem before we went live. There was an exceptionally large number of I/O errors. Sun has not gotten back to me with a resolution for this problem yet, but they were able to reproduce the issue.

-K

Karl Katzke
Systems Analyst II
TAMU / DRGS
> Would your users be concerned if there was a possibility that
> after extracting a 50 MB tarball files are incomplete, whole
> subdirectories are missing, or file permissions are incorrect?

Correction: "Would your users be concerned if there was a possibility that after extracting a 50 MB tarball *and having a server crash* then files could be corrupted as described above."

If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power down or reboot.

The advice I would give is: do zfs auto-snapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

Obviously, if you cannot accept 5-10 minutes of data loss, such as credit card transactions, this would not be acceptable. You'd need to keep your ZIL enabled. Also, if you have an svn server on the ZFS server, and you have svn clients on other systems ... you should never allow your clients to advance beyond the current rev of the server. So again, you'd have to keep the ZIL enabled on the server.

It all depends on your workload. For some, the disabled ZIL is worth the risk.
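A rough sketch of what I mean (dataset and snapshot names are made up; the zfs-auto-snapshot SMF service can handle the scheduling too):

    # take a snapshot every few minutes, e.g. from a small script run by cron
    zfs snapshot tank/home@auto-$(date +%Y%m%d-%H%M)

    # after an ungraceful reboot: list the snapshots, then roll back
    # (-r also destroys any snapshots newer than the one you roll back to)
    zfs list -t snapshot -o name -s creation -r tank/home
    zfs rollback -r tank/home@auto-20100331-1155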
On Wed, 31 Mar 2010, Tim Cook wrote:
>> If there is ever another OpenSolaris formal release, then the situation will be different.
>
> C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course
> it'd be way more fun if someone from Sun threw caution to the wind and told us what
> the hold-up is *cough*oracle*cough*.

Oracle is a total "cold boot" for me. Everything they have put on their web site seems carefully designed to cast fear and panic into the former Sun customer base and cause substantial doubt, dismay, and even terror. I don't know what I can and can't trust. Every bit of trust that Sun earned with me over the past 19 years is clean-slated.

Regardless, it seems likely that Oracle is taking time to change all of the copyrights, documentation, and logos to reflect the new ownership. They are probably re-evaluating which parts should be included for free in OpenSolaris. The name "Sun" is deeply embedded in Solaris. All of the Solaris 10 packages include "SUN" in their name.

Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.

Bob
On Wed, 31 Mar 2010, Edward Ned Harvey wrote:
>> Would your users be concerned if there was a possibility that
>> after extracting a 50 MB tarball files are incomplete, whole
>> subdirectories are missing, or file permissions are incorrect?
>
> Correction: "Would your users be concerned if there was a possibility that
> after extracting a 50 MB tarball *and having a server crash* then files could
> be corrupted as described above."
>
> If you disable the ZIL, the filesystem still stays correct in RAM, and the
> only way you lose any data such as you've described is to have an
> ungraceful power down or reboot.

Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue?

Reliable NFS synchronous writes are good for the system administrators.

Bob
On Wed, March 31, 2010 12:23, Bob Friesenhahn wrote:
> Yesterday I noticed that the Sun Studio 12 compiler (used to build
> OpenSolaris) now costs a minimum of $1,015/year. The "Premium"
> service plan costs $200 more.

I feel a great disturbance in the force. It is as if a great multitude of developers screamed and then went out and downloaded GCC.
On Wed, Mar 31, 2010 at 11:23 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Yesterday I noticed that the Sun Studio 12 compiler (used to build
> OpenSolaris) now costs a minimum of $1,015/year. The "Premium"
> service plan costs $200 more.

Where did you see that? It looks to be free to me:

"Sun Studio 12 Update 1 - FREE for SDN members. SDN members can download a free, full-license copy of Sun Studio 12 Update 1."

--Tim
On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote:> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10? Cheers, Chris
On Wed, Mar 31, 2010 at 11:39 AM, Chris Ridd <chrisridd at mac.com> wrote:
> On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote:
>> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.
>
> The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10?
>
> Cheers,
>
> Chris

It's still available in the opensolaris repo, and I see no license reference stating you have to have a support contract, so I'm guessing no...

"Several releases of Sun Studio Software are available in the OpenSolaris repositories. The following list shows you how to download and install each release, and where you can find the documentation for the release:

- Sun Studio 12 Update 1: The Sun Studio 12 Update 1 release is the latest full production release of Sun Studio software. It has recently been added to the OpenSolaris IPS repository. To install this release in your OpenSolaris 2009.06 environment using the Package Manager: [...]"

--Tim
On Wed, 31 Mar 2010, Chris Ridd wrote:
>> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.
>
> The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10?

There is no telling. Everything is subject to evaluation by Oracle and it is not clear which parts of the web site are confirmed and which parts are still subject to change. In the past it was free to join SDN, but if one was to put an 'M' in front of that SDN, then there would be a substantial yearly charge for membership (up to $10,939 USD per year according to Wikipedia). This is a world that Oracle has been commonly exposed to in the past. Not everyone who uses a compiler qualifies as a "developer".

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 31 Mar 2010, at 17:50, Bob Friesenhahn wrote:
> On Wed, 31 Mar 2010, Chris Ridd wrote:
>>> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.
>>
>> The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10?
>
> There is no telling. Everything is subject to evaluation by Oracle and it is not clear which parts of the web site are confirmed and which parts are still subject to change. In the past it was free to join SDN but if one was to put an 'M' in front of that SDN, then there would be a substantial yearly charge for membership (up to $10,939 USD per year according to Wikipedia). This is a world that Oracle has been commonly exposed to in the past. Not everyone who uses a compiler qualifies as a "developer".

Indeed, but Microsoft still give out free "express" versions of their tools. If memory serves, you're not allowed to distribute binaries built with them but otherwise they're not broken in any significant way. Maybe this will also be the difference between Sun Studio and Sun Studio Express.

Perhaps we should take this to tools-compilers.

Cheers,

Chris
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:

    rm> This is not true. If ZIL device would die *while pool is
    rm> imported* then ZFS would start using a ZIL within the pool and
    rm> continue to operate.

what you do not say is that a pool with a dead zil cannot be 'import -f'd. So, for example, if your rpool and slog are on the same SSD, and it dies, you have just lost your whole pool.
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:

    rm> the reason you get better performance out of the box on Linux
    rm> as NFS server is that it actually behaves like with disabled
    rm> ZIL

careful. Solaris people have been slinging mud at linux for things unfsd did in spite of the fact knfsd has been around for a decade. and ``has options to behave like the ZIL is disabled (sync/async in /etc/exports)'' != ``always behaves like the ZIL is disabled''.

If you are certain about Linux NFS servers not preserving data for hard mounts when the server reboots even with the 'sync' option which is the default, please confirm, but otherwise I do not believe you.

    rm> Which is an expected behavior when you break NFS requirements
    rm> as Linux does out of the box.

wrong. The default is 'sync' in /etc/exports. The default has changed, but the default is 'sync', and the whole thing is well-documented.

    rm> What would be useful though is to be able to easily disable
    rm> ZIL per dataset instead of OS wide switch.

yeah, Linux NFS servers have that granularity for their equivalent option.
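For reference, the per-export switch being argued about here lives in /etc/exports on the Linux server side; a minimal sketch (the paths and address range below are made up for illustration):

    # /etc/exports on a Linux NFS server
    /export/home     192.168.0.0/24(rw,sync,no_subtree_check)
    /export/scratch  192.168.0.0/24(rw,async,no_subtree_check)
    # 'sync'  = reply only after data/metadata reach stable storage (protocol-compliant)
    # 'async' = reply before flushing; faster, but breaks the server-reboot guarantee

    # apply and verify which behaviour each export actually got
    exportfs -ra
    exportfs -v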
Karsten Weiss wrote:> Knowing that 100s of users could do this in parallel with good performance > is nice but it does not improve the situation for the single user which only > cares for his own tar run. If there''s anything else we can do/try to improve > the single-threaded case I''m all ears.A MegaRAID card with write-back cache? It should also be cheaper than the F20. Wes Felter
Edward Ned Harvey <solaris2 <at> nedharvey.com> writes:
> Allow me to clarify a little further, why I care about this so much. I have a solaris file server, with all the company jewels on it. I had a pair of intel X.25 SSD mirrored log devices. One of them failed. The replacement device came with a newer version of firmware on it. Now, instead of appearing as 29.802 Gb, it appears at 29.801 Gb. I cannot zpool attach. New device is too small.
>
> So apparently I'm the first guy this happened to. Oracle is caught totally off guard. They're pulling their inventory of X25's from dispatch warehouses, and inventorying all the firmware versions, and trying to figure it all out. Meanwhile, I'm still degraded. Or at least, I think I am.
>
> Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until they can locate a drive with the correct firmware.) All the support people I have on the phone are just as scared as I am. "Well we could upgrade the firmware of your existing drive, but that'll reduce it by 0.001 Gb, and that might just create a time bomb to destroy your pool at a later date." So we don't do it.
>
> Nobody has suggested that I simply shutdown and remove my unmirrored SSD, and power back on.

We ran into something similar with these drives in an X4170 that turned out to be an issue of the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs were running the exact same version of firmware and recreated the volumes on new drives arriving from Sun, we got back into sync on the X25-E device sizes.
On 31/03/2010 17:31, Bob Friesenhahn wrote:
> Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue?
>
> Reliable NFS synchronous writes are good for the system administrators.

Well, it really depends on your environment. There is a place for Oracle database and there is a place for MySQL, then you don't really need to cluster everything, and then there are environments where disabling the ZIL is perfectly acceptable.

One of such cases is that you need to re-import a database or recover lots of files over NFS - your service is down and disabling the ZIL makes the recovery MUCH faster. Then there are cases when leaving the ZIL disabled is acceptable as well.

--
Robert Milkowski
http://milek.blogspot.com
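For completeness, on the builds being discussed the only switch is the OS-wide zil_disable tunable (the per-dataset control mentioned elsewhere in this thread had not integrated yet); a sketch of the two usual ways to flip it:

    # /etc/system -- persistent, takes effect at next boot, affects every pool/dataset
    set zfs:zil_disable = 1

    # or on a live system (reverts at reboot); filesystems must be re-mounted
    # before the change takes effect for them
    echo zil_disable/W0t1 | mdb -kw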
On 31/03/2010 17:22, Edward Ned Harvey wrote:
> The advice I would give is: Do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest snapshot ... and rollback once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

I don't really get it - rolling back to a last snapshot doesn't really improve things here, it actually makes it worse as now you are going to lose even more data.

Keep in mind that currently the maximum time after which ZFS commits a transaction is 30s - ZIL or not. So with a disabled ZIL, in the worst case scenario you should lose no more than the last 30-60s. You can tune it down if you want. Rolling back to a snapshot will only make it worse. Then also keep in mind that it is a worst case scenario here - it may well be there were no outstanding transactions at all - it all comes down basically to a risk assessment, impact assessment and a cost.

Unless you are talking about doing regular snapshots and making sure that the application is consistent while doing so - for example putting all Oracle tablespaces in hot backup mode and taking a snapshot... otherwise it doesn't really make sense.

--
Robert Milkowski
http://milek.blogspot.com
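The commit interval referred to above is controlled by the zfs_txg_timeout tunable on these builds (the name and default have moved around between releases, so treat this as a sketch rather than gospel); the value of 5 seconds is just an example:

    # /etc/system -- force a txg commit at least every 5 seconds instead of 30
    set zfs:zfs_txg_timeout = 5

    # inspect / change the live value
    echo zfs_txg_timeout/D | mdb -k
    echo zfs_txg_timeout/W0t5 | mdb -kw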
On 31/03/2010 21:38, Miles Nordin wrote:> rm> Which is an expected behavior when you break NFS requirements > rm> as Linux does out of the box. > > wrong. The default is ''sync'' in /etc/exports. The default has > changed, but the default is ''sync'', and the whole thing is > well-documented. >I double checked the documentation and you''re right - the default has changed to sync. I haven''t found in which RH version it happened but it doesn''t really matter. So yes, I was wrong - the current default it seems to be sync on Linux as well. -- Robert Milkowski http://milek.blogspot.com
On Mar 31, 2010, at 19:41, Robert Milkowski wrote:> I double checked the documentation and you''re right - the default > has changed to sync. > I haven''t found in which RH version it happened but it doesn''t > really matter.From the SourceForge site:> Since version 1.0.1 of the NFS utilities tarball has changed the > server export default to "sync", then, if no behavior is specified > in the export list (thus assuming the default behavior), a warning > will be generated at export time.http://nfs.sourceforge.net/
On Mar 31, 2010, at 5:39 AM, Robert Milkowski <milek at task.gda.pl> wrote:
>> On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
>> Use something other than Open/Solaris with ZFS as an NFS server? :)
>>
>> I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds of threads. Getting half-ways decent performance from NFS and ZFS is impossible unless you disable the ZIL.
>
> Well, for lots of environments disabling ZIL is perfectly acceptable. And frankly the reason you get better performance out of the box on Linux as NFS server is that it actually behaves like with disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using Linux here or any other OS which behaves in the same manner. Actually it makes it better as even if ZIL is disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits from ZFS.
>
> What would be useful though is to be able to easily disable ZIL per dataset instead of an OS wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later.

Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. It has been many years since they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault.

If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?

-Ross
On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:
> If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?

This is not true for sync data written locally, unless you disable the ZIL locally.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Mar 31, 2010, at 10:25 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:
>> If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?
>
> This is not true for sync data written locally, unless you disable the ZIL locally.

No, of course if it's written sync with ZIL; it just seems that over Solaris NFS all writes are delayed, not just sync writes.

-Ross
> I see the source for some confusion. On the ZFS Best Practices page:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>
> It says:
> Failure of the log device may cause the storage pool to be inaccessible if you are running the Solaris Nevada release prior to build 96 and a release prior to the Solaris 10 10/09 release.
>
> It also says:
> If a separate log device is not mirrored and the device that contains the log fails, storing log blocks reverts to the storage pool.

I have some more concrete data on this now. Running Solaris 10u8 (which is 10/09), fully updated last weekend. We want to explore the consequences of adding or failing a non-mirrored log device. We created a pool with a non-mirrored ZIL log device, and experimented with it:

(a) Simply yank out the non-mirrored log device while the system is live. The result was: Any zfs or zpool command would hang permanently. Even "zfs list" hangs permanently. The system cannot shutdown, cannot reboot, cannot "zfs send" or "zfs snapshot" or anything ... It's a bad state. You're basically hosed. Power cycle is the only option.

(b) After power cycling, the system won't boot. It gets part way through the boot process, and eventually just hangs there, infinitely cycling error messages about services that couldn't start. Random services, such as inetd, which seem unrelated to some random data pool that failed. So we power cycle again, and go into failsafe mode, to clean up and destroy the old messed up pool ... Boot up totally clean again, and create a new totally clean pool with a non-mirrored log device. Just to ensure we really are clean, we simply "zpool export" and "zpool import" with no trouble, and reboot once for good measure. "zfs list" and everything are all working great...

(c) Do a "zpool export." Obviously, the ZIL log device is clean and flushed at this point, not being used. We simply yank out the log device, and do "zpool import." Well ... Without that log device, I forget the terminology, it said something like "missing disk." Plain and simple, you *can* *not* import the pool without the log device. It does not say "to force use -f" and even if you specify the -f, it still just throws the same error message, missing disk or whatever. Won't import. Period.

... So, to anybody who said the failed log device will simply fail over to blocks within the main pool: Sorry. That may be true in some later version, but it is not the slightest bit true in the absolute latest solaris (proper) available today.

I'm going to venture a guess this is no longer a problem after zpool version 19. This is when "ZFS log device removal" was introduced. Unfortunately, the latest version of solaris only goes up to zpool version 15.
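For anyone who wants to repeat the experiment, the setup boils down to a pool with a single, unmirrored slog; a rough sketch with made-up device names (the failure behaviour is as described above, not something the commands themselves change):

    # create a pool with an unmirrored log device
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 log c2t0d0
    zpool status tank

    # step (c): export, physically remove the log device, try to re-import
    zpool export tank
    zpool import tank      # on s10u8 this fails with a missing-device error
    zpool import -f tank   # -f does not help either, as noted above

    # pool version 19 and later ("ZFS log device removal") at least allow
    zpool remove tank c2t0d0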
> A MegaRAID card with write-back cache? It should also be cheaper than > the F20.I haven''t posted results yet, but I just finished a few weeks of extensive benchmarking various configurations. I can say this: WriteBack cache is much faster than "naked" disks, but if you can buy an SSD or two for ZIL log device, the dedicated ZIL is yet again much faster than WriteBack. It doesn''t have to be F20. You could use the Intel X25 for example. If you''re running solaris proper, you better mirror your ZIL log device. If you''re running opensolaris ... I don''t know if that''s important. I''ll probably test it, just to be sure, but I might never get around to it because I don''t have a justifiable business reason to build the opensolaris machine just for this one little test. Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
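For reference, the mirrored-slog plus L2ARC layout being recommended here looks roughly like this (the controller/target names are placeholders):

    # add two SSDs as a mirrored ZIL log device
    zpool add tank log mirror c3t0d0 c3t1d0

    # remaining flash can go to L2ARC; cache devices need no redundancy
    zpool add tank cache c3t2d0 c3t3d0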
> We ran into something similar with these drives in an X4170 that turned > out to > be an issue of the preconfigured logical volumes on the drives. Once > we made > sure all of our Sun PCI HBAs where running the exact same version of > firmware > and recreated the volumes on new drives arriving from Sun we got back > into sync > on the X25-E devices sizes.Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive, and "create simple volume" in the storagetek raid utility, the new drive is 0.001 Gb smaller than the old drive. I''m still hosed. Are you saying I might benefit by sticking the SSD into some laptop, and zero''ing the disk? And then attach to the sun server? Are you saying I might benefit by finding some other way to make the drive available, instead of using the storagetek raid utility? Thanks for the suggestions...
On Mar 31, 2010, at 8:58 PM, Edward Ned Harvey wrote:
> Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive, and "create simple volume" in the storagetek raid utility, the new drive is 0.001 Gb smaller than the old drive. I'm still hosed.
>
> Are you saying I might benefit by sticking the SSD into some laptop, and zero'ing the disk? And then attach to the sun server?
>
> Are you saying I might benefit by finding some other way to make the drive available, instead of using the storagetek raid utility?

Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility called /opt/StorMan/arcconf and reports itself as the amazingly informative model number "Sun STK RAID INT", what worked for me was to run:

arcconf delete (to delete the pre-configured volume shipped on the drive)
arcconf create (to create a new volume)

What I observed was that

arcconf getconfig 1

would show the same physical device size for our existing drives and new ones from Sun, but they reported a slightly different logical volume size. I am fairly sure that was due to the Sun factory creating the initial volume with a different version of the HBA controller firmware than we were using to create our own volumes.

If I remember the sign correctly, the newer firmware creates larger logical volumes, and you really want to upgrade the firmware if you are going to be running multiple X25-E drives from the same controller.

I hope that helps.

--
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
On Mar 31, 2010, at 9:22 AM, Edward Ned Harvey wrote:>> Would your users be concerned if there was a possibility that >> after extracting a 50 MB tarball that files are incomplete, whole >> subdirectories are missing, or file permissions are incorrect? > > Correction: "Would your users be concerned if there was a possibility that > after extracting a 50MB tarball *and having a server crash* then files could > be corrupted as described above." > > If you disable the ZIL, the filesystem still stays correct in RAM, and the > only way you lose any data such as you''ve described, is to have an > ungraceful power down or reboot. > > The advice I would give is: Do zfs autosnapshots frequently (say ... every > 5 minutes, keeping the most recent 2 hours of snaps) and then run with no > ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest > snapshot ... and rollback once more for good measure. As long as you can > afford to risk 5-10 minutes of the most recent work after a crash, then you > can get a 10x performance boost most of the time, and no risk of the > aforementioned data corruption.This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Casper.Dik at Sun.COM (2010-Apr-01 07:09 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>If you disable the ZIL, the filesystem still stays correct in RAM, and the >only way you lose any data such as you''ve described, is to have an >ungraceful power down or reboot.>The advice I would give is: Do zfs autosnapshots frequently (say ... every >5 minutes, keeping the most recent 2 hours of snaps) and then run with no >ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest >snapshot ... and rollback once more for good measure. As long as you can >afford to risk 5-10 minutes of the most recent work after a crash, then you >can get a 10x performance boost most of the time, and no risk of the >aforementioned data corruption.Why do you need the rollback? The current filesystems have correct and consistent data; not different from the last two snapshots. (Snapshots can happen in the middle of untarring) The difference between running with or without ZIL is whether the client has lost data when the server reboots; not different from using Linux as an NFS server. Casper
Casper.Dik at Sun.COM (2010-Apr-01 08:43 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
> Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. It has been many years since they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault.

Is that what "sync" means in Linux? As NFS doesn't use "close" or "fsync", what exactly are the semantics?

(For NFSv2/v3 each *operation* is sync and the client needs to make sure it can continue; for NFSv4, some operations are async and the client needs to use COMMIT)

> If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?

If the system crashes the application is also gone, but if the server reboots, data should *never* be lost; the sync may just miss the window. The application continues to run so clearly we must handle this differently. What you're saying sounds like the kernel can forget what you wrote because you didn't call fsync().

Casper
> >If you disable the ZIL, the filesystem still stays correct in RAM, and > the > >only way you lose any data such as you''ve described, is to have an > >ungraceful power down or reboot. > > >The advice I would give is: Do zfs autosnapshots frequently (say ... > every > >5 minutes, keeping the most recent 2 hours of snaps) and then run with > no > >ZIL. If you have an ungraceful shutdown or reboot, rollback to the > latest > >snapshot ... and rollback once more for good measure. As long as you > can > >afford to risk 5-10 minutes of the most recent work after a crash, > then you > >can get a 10x performance boost most of the time, and no risk of the > >aforementioned data corruption. > > Why do you need the rollback? The current filesystems have correct and > consistent data; not different from the last two snapshots. > (Snapshots can happen in the middle of untarring) > > The difference between running with or without ZIL is whether the > client has lost data when the server reboots; not different from using > Linux as an NFS server.If you have an ungraceful shutdown in the middle of writing stuff, while the ZIL is disabled, then you have corrupt data. Could be files that are partially written. Could be wrong permissions or attributes on files. Could be missing files or directories. Or some other problem. Some changes from the last 1 second of operation before crash might be written, while some changes from the last 4 seconds might be still unwritten. This is data corruption, which could be worse than losing a few minutes of changes. At least, if you rollback, you know the data is consistent, and you know what you lost. You won''t continue having more losses afterward caused by inconsistent data on disk.
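For what it's worth, the rollback step of that recipe amounts to the following (the dataset and snapshot names here are made up; real names depend on how the auto-snapshot service is configured):

    # see what snapshots exist for the dataset
    zfs list -t snapshot -r tank/data

    # roll back to the chosen one; -r also destroys any snapshots newer than it
    zfs rollback -r tank/data@zfs-auto-snap_frequent-2010-04-01-12h05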
> > Can you elaborate? Just today, we got the replacement drive that has > > precisely the right version of firmware and everything. Still, when > we > > plugged in that drive, and "create simple volume" in the storagetek > raid > > utility, the new drive is 0.001 Gb smaller than the old drive. I''m > still > > hosed. > > > > Are you saying I might benefit by sticking the SSD into some laptop, > and > > zero''ing the disk? And then attach to the sun server? > > > > Are you saying I might benefit by finding some other way to make the > drive > > available, instead of using the storagetek raid utility? > > Assuming you are also using a PCI LSI HBA from Sun that is managed with > a utility called /opt/StorMan/arcconf and reports itself as the > amazingly > informative model number "Sun STK RAID INT" what worked for me was to > run, > arcconf delete (to delete the pre-configured volume shipped on the > drive) > arcconf create (to create a new volume) > > What I observed was that > arcconf getconfig 1 > would show the same physical device size for our existing drives and > new > ones from Sun, but they reported a slightly different logical volume > size. > I am fairly sure that was due to the Sun factory creating the initial > volume > with a different version of the HBA controller firmware then we where > using > to create our own volumes. > > If I remember the sign correctly, the newer firmware creates larger > logical > volumes, and you really want to upgrade the firmware if you are going > to > be running multiple X25-E drives from the same controller. > > I hope that helps.Uggh. This is totally different than my system. But thanks for writing. I''ll take this knowledge, and see if we can find some analogous situation with the StorageTek controller. It still may be helpful, so again, thanks.
Casper.Dik at Sun.COM (2010-Apr-01 11:19 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>If you have an ungraceful shutdown in the middle of writing stuff, while the >ZIL is disabled, then you have corrupt data. Could be files that are >partially written. Could be wrong permissions or attributes on files. >Could be missing files or directories. Or some other problem. > >Some changes from the last 1 second of operation before crash might be >written, while some changes from the last 4 seconds might be still >unwritten. This is data corruption, which could be worse than losing a few >minutes of changes. At least, if you rollback, you know the data is >consistent, and you know what you lost. You won''t continue having more >losses afterward caused by inconsistent data on disk.How exactly is this different from "rolling back to some other point of time?". I think you don''t quite understand how ZFS works; all operations are grouped in transaction groups; all the transactions in a particular group are commit in one operation. I don''t know what partial ordering ZFS uses when creating transaction groups, but a snapshot just picks one transaction group as the last group included in the snapshot. When the system reboots, ZFS picks the most recent, valid uberblock; so the data available is "correct upto transaction group N1". If you rollback to a snapshot, you get data "correct upto transaction group N2". But N2 < N1 so you lose more data. Why do you think that a "Snapshot" has a "better quality" than the last snapshot available? Casper
> >If you have an ungraceful shutdown in the middle of writing stuff, > while the > >ZIL is disabled, then you have corrupt data. Could be files that are > >partially written. Could be wrong permissions or attributes on files. > >Could be missing files or directories. Or some other problem. > > > >Some changes from the last 1 second of operation before crash might be > >written, while some changes from the last 4 seconds might be still > >unwritten. This is data corruption, which could be worse than losing > a few > >minutes of changes. At least, if you rollback, you know the data is > >consistent, and you know what you lost. You won''t continue having > more > >losses afterward caused by inconsistent data on disk. > > How exactly is this different from "rolling back to some other point of > time?". > > I think you don''t quite understand how ZFS works; all operations are > grouped in transaction groups; all the transactions in a particular > group > are commit in one operation. I don''t know what partial ordering ZFSDude, don''t be so arrogant. Acting like you know what I''m talking about better than I do. Face it that you have something to learn here. Yes, all the transactions in a transaction group are either committed entirely to disk, or not at all. But they''re not necessarily committed to disk in the same order that the user level applications requested. Meaning: If I have an application that writes to disk in "sync" mode intentionally ... perhaps because my internal file format consistency would be corrupt if I wrote out-of-order ... If the sysadmin has disabled ZIL, my "sync" write will not block, and I will happily issue more write operations. As long as the OS remains operational, no problem. The OS keeps the filesystem consistent in RAM, and correctly manages all the open file handles. But if the OS dies for some reason, some of my later writes may have been committed to disk while some of my earlier writes could be lost, which were still being buffered in system RAM for a later transaction group. This is particularly likely to happen, if my application issues a very small sync write, followed by a larger async write, followed by a very small sync write, and so on. Then the OS will buffer my small sync writes and attempt to aggregate them into a larger sequential block for the sake of accelerated performance. The end result is: My larger async writes are sometimes committed to disk before my small sync writes. But the only reason I would ever know or care about that would be if the ZIL were disabled, and the OS crashed. Afterward, my file has internal inconsistency. Perfect examples of applications behaving this way would be databases and virtual machines.> Why do you think that a "Snapshot" has a "better quality" than the last > snapshot available?If you rollback to a snapshot from several minutes ago, you can rest assured all the transaction groups that belonged to that snapshot have been committed. So although you''re losing the most recent few minutes of data, you can rest assured you haven''t got file corruption in any of the existing files.
> This approach does not solve the problem. When you do a snapshot, > the txg is committed. If you wish to reduce the exposure to loss of > sync data and run with ZIL disabled, then you can change the txg commit > interval -- however changing the txg commit interval will not eliminate > the > possibility of data loss.The default commit interval is what, 30 seconds? Doesn''t that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk? Therefore, any snapshot older than 30 seconds old is guaranteed to be consistent on disk. While anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before. If I''m wrong about this, please explain. I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. If you rollback to a snapshot that''s at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You''re acknowledging the loss of some known time worth of data. But you''re gaining a guarantee of internal file consistency.
> Is that what "sync" means in Linux?A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk.
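A minimal C sketch of that distinction (file names are illustrative and error handling is omitted): the O_DSYNC descriptor gives the blocking behaviour described above, the plain descriptor does not.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* synchronous: write() does not return until the data is on stable
           storage (this is the path the ZIL is supposed to make fast) */
        int sfd = open("journal", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        write(sfd, "commit record\n", 14);

        /* asynchronous: write() returns once the data is cached in memory;
           it reaches disk whenever the next transaction group commits */
        int afd = open("table", O_WRONLY | O_CREAT, 0644);
        write(afd, "bulk data\n", 10);
        /* an explicit fsync(afd) here would turn it into a sync write too */

        close(sfd);
        close(afd);
        return 0;
    }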
Casper.Dik at Sun.COM (2010-Apr-01 12:22 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>Dude, don''t be so arrogant. Acting like you know what I''m talking about >better than I do. Face it that you have something to learn here.You may say that, but then you post this:>> Why do you think that a "Snapshot" has a "better quality" than the last >> snapshot available? > >If you rollback to a snapshot from several minutes ago, you can rest assured >all the transaction groups that belonged to that snapshot have been >committed. So although you''re losing the most recent few minutes of data, >you can rest assured you haven''t got file corruption in any of the existing >files.But the actual fact is that there is *NO* difference between the last uberblock and an uberblock named as "snapshot-such-and-so". All changes made after the uberblock was written are discarded by rolling back. All the transaction groups referenced by "last uberblock" *are* written to disk. Disabling the ZIL makes sure that fsync() and sync() no longer work; whether you take a named snapshot or the uberblock is immaterial; your strategy will cause more data to be lost. Casper
Casper.Dik at Sun.COM (2010-Apr-01 12:42 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>> Is that what "sync" means in Linux? > >A sync write is one in which the application blocks until the OS acks that >the write has been committed to disk. An async write is given to the OS, >and the OS is permitted to buffer the write to disk at its own discretion. >Meaning the async write function call returns sooner, and the application is >free to continue doing other stuff, including issuing more writes. > >Async writes are faster from the point of view of the application. But sync >writes are done by applications which need to satisfy a race condition for >the sake of internal consistency. Applications which need to know their >next commands will not begin until after the previous sync write was >committed to disk.We''re talking about the "sync" for NFS exports in Linux; what do they mean with "sync" NFS exports? Casper
Casper.Dik at Sun.COM (2010-Apr-01 12:50 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>> This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss.
>
> The default commit interval is what, 30 seconds? Doesn't that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk?

When a system boots and it finds the snapshot, then all the data referred to by the snapshot is on disk. But the snapshot doesn't guarantee more than the last valid uberblock.

> Therefore, any snapshot older than 30 seconds old is guaranteed to be consistent on disk. While anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before.
>
> If I'm wrong about this, please explain.

When a pointer to data is committed to disk by ZFS, then the data is also on disk. (If the pointer is reachable from the uberblock, then the data is also on disk and reachable from the uberblock.) You don't need to wait 30 seconds. If it's there, it's there.

> I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file.
>
> If you rollback to a snapshot that's at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You're acknowledging the loss of some known time worth of data. But you're gaining a guarantee of internal file consistency.

I don't know what ZFS guarantees when you disable the zil; the one broken promise is that the data may not have been committed to stable storage when fsync() returns. I'm not sure whether there is a "barrier" when there is a sync()/fsync(); if that is the case, then ZFS is still safe for your application.

Casper
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:>> A MegaRAID card with write-back cache? It should also be cheaper than >> the F20. > > I haven''t posted results yet, but I just finished a few weeks of > extensive > benchmarking various configurations. I can say this: > > WriteBack cache is much faster than "naked" disks, but if you can > buy an SSD > or two for ZIL log device, the dedicated ZIL is yet again much > faster than > WriteBack. > > It doesn''t have to be F20. You could use the Intel X25 for > example. If > you''re running solaris proper, you better mirror your ZIL log > device. If > you''re running opensolaris ... I don''t know if that''s important. I''ll > probably test it, just to be sure, but I might never get around to it > because I don''t have a justifiable business reason to build the > opensolaris > machine just for this one little test. > > Seriously, all disks configured WriteThrough (spindle and SSD disks > alike) > using the dedicated ZIL SSD device, very noticeably faster than > enabling the > WriteBack.What do you get with both SSD ZIL and WriteBack disks enabled? I mean if you have both why not use both? Then both async and sync IO benefits. -Ross
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:>> We ran into something similar with these drives in an X4170 that >> turned >> out to >> be an issue of the preconfigured logical volumes on the drives. Once >> we made >> sure all of our Sun PCI HBAs where running the exact same version of >> firmware >> and recreated the volumes on new drives arriving from Sun we got back >> into sync >> on the X25-E devices sizes. > > Can you elaborate? Just today, we got the replacement drive that has > precisely the right version of firmware and everything. Still, when > we > plugged in that drive, and "create simple volume" in the storagetek > raid > utility, the new drive is 0.001 Gb smaller than the old drive. I''m > still > hosed. > > Are you saying I might benefit by sticking the SSD into some laptop, > and > zero''ing the disk? And then attach to the sun server? > > Are you saying I might benefit by finding some other way to make the > drive > available, instead of using the storagetek raid utility?I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit. -Ross
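A rough sketch of that trick, with made-up device names and sizes: slice the SSD a little below its nominal capacity with format(1M), then hand ZFS the slice instead of the whole disk.

    # in format(1M): select the SSD, label it, then partition -> modify and
    # size slice 0 to a round figure below the nominal capacity (e.g. 29gb)
    format -e c3t0d0

    # sanity-check the resulting slice geometry
    prtvtoc /dev/rdsk/c3t0d0s0

    # give ZFS the slice rather than the whole disk
    zpool add tank log c3t0d0s0

The usual trade-off applies: when handed a slice instead of a whole disk, ZFS will not manage the drive's write cache for you.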
On Apr 1, 2010, at 8:42 AM, Casper.Dik at Sun.COM wrote:> >>> Is that what "sync" means in Linux? >> >> A sync write is one in which the application blocks until the OS >> acks that >> the write has been committed to disk. An async write is given to >> the OS, >> and the OS is permitted to buffer the write to disk at its own >> discretion. >> Meaning the async write function call returns sooner, and the >> application is >> free to continue doing other stuff, including issuing more writes. >> >> Async writes are faster from the point of view of the application. >> But sync >> writes are done by applications which need to satisfy a race >> condition for >> the sake of internal consistency. Applications which need to know >> their >> next commands will not begin until after the previous sync write was >> committed to disk. > > > We''re talking about the "sync" for NFS exports in Linux; what do > they mean > with "sync" NFS exports?See section A1 in the FAQ: http://nfs.sourceforge.net/ -Ross
On 01/04/2010 14:49, Ross Walker wrote:>> We''re talking about the "sync" for NFS exports in Linux; what do they >> mean >> with "sync" NFS exports? > > See section A1 in the FAQ: > > http://nfs.sourceforge.net/I think B4 is the answer to Casper''s question: ---- BEGIN QUOTE ---- Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in /etc/exports. The name of this export option is "[a]sync" (note that there is also a client-side mount option by the same name, but it has a different function, and does not defeat NFS protocol compliance). When set to "sync," Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to "async," the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery. ---- END QUOTE ---- For more info the whole of section B4 though B6. -- Darren J Moffat
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat <darrenm at opensolaris.org> wrote:> On 01/04/2010 14:49, Ross Walker wrote: >>> >>> We''re talking about the "sync" for NFS exports in Linux; what do they >>> mean >>> with "sync" NFS exports? >> >> See section A1 in the FAQ: >> >> http://nfs.sourceforge.net/ > > I think B4 is the answer to Casper''s question: > > ---- BEGIN QUOTE ---- > Linux servers (although not the Solaris reference implementation) allow this > requirement to be relaxed by setting a per-export option in /etc/exports. > The name of this export option is "[a]sync" (note that there is also a > client-side mount option by the same name, but it has a different function, > and does not defeat NFS protocol compliance). > > When set to "sync," Linux server behavior strictly conforms to the NFS > protocol. This is default behavior in most other server implementations. > When set to "async," the Linux server replies to NFS clients before flushing > data or metadata modifying operations to permanent storage, thus improving > performance, but breaking all guarantees about server reboot recovery. > ---- END QUOTE ---- > > For more info the whole of section B4 though B6.True, I was thinking more of the protocol summary.> Is that what "sync" means in Linux? As NFS doesn''t use "close" or > "fsync", what exactly are the semantics. > > (For NFSv2/v3 each *operation* is sync and the client needs to make sure > it can continue; for NFSv4, some operations are async and the client > needs to use COMMIT)Actually the COMMIT command was introduced in NFSv3. The full details: NFS Version 3 introduces the concept of "safe asynchronous writes." A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write. Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply. Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.
On Thu, 1 Apr 2010, Edward Ned Harvey wrote:> If I''m wrong about this, please explain. > > I am envisioning a database, which issues a small sync write, followed by a > larger async write. Since the sync write is small, the OS would prefer to > defer the write and aggregate into a larger block. So the possibility of > the later async write being committed to disk before the older sync write is > a real risk. The end result would be inconsistency in my database file.Zfs writes data in transaction groups and each bunch of data which gets written is bounded by a transaction group. The current state of the data at the time the TXG starts will be the state of the data once the TXG completes. If the system spontaneously reboots then it will restart at the last completed TXG so any residual writes which might have occured while a TXG write was in progress will be discarded. Based on this, I think that your ordering concerns (sync writes getting to disk "faster" than async writes) are unfounded for normal file I/O. However, if file I/O is done via memory mapped files, then changed memory pages will not necessarily be written. The changes will not be known to ZFS until the kernel decides that a dirty page should be written or there is a conflicting traditional I/O which would update the same file data. Use of msync(3C) is necessary to assure that file data updated via mmap() will be seen by ZFS and comitted to disk in an orderly fashion. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
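A minimal C sketch of the msync() point, assuming a pre-existing file of at least 8 KB and omitting error checks:

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("mapped.dat", O_RDWR);
        char *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        memcpy(p, "update", 6);      /* dirties a page; the filesystem has not
                                        necessarily seen the change yet */
        msync(p, 8192, MS_SYNC);     /* blocks until the dirtied pages have been
                                        handed to the filesystem and committed */
        munmap(p, 8192);
        close(fd);
        return 0;
    }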
On 01/04/2010 13:01, Edward Ned Harvey wrote:>> Is that what "sync" means in Linux? >> > A sync write is one in which the application blocks until the OS acks that > the write has been committed to disk. An async write is given to the OS, > and the OS is permitted to buffer the write to disk at its own discretion. > Meaning the async write function call returns sooner, and the application is > free to continue doing other stuff, including issuing more writes. > > Async writes are faster from the point of view of the application. But sync > writes are done by applications which need to satisfy a race condition for > the sake of internal consistency. Applications which need to know their > next commands will not begin until after the previous sync write was > committed to disk. > >ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) -- Robert Milkowski http://milek.blogspot.com
Casper.Dik at Sun.COM (2010-Apr-01 15:47 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>On 01/04/2010 13:01, Edward Ned Harvey wrote: >>> Is that what "sync" means in Linux? >>> >> A sync write is one in which the application blocks until the OS acks that >> the write has been committed to disk. An async write is given to the OS, >> and the OS is permitted to buffer the write to disk at its own discretion. >> Meaning the async write function call returns sooner, and the application is >> free to continue doing other stuff, including issuing more writes. >> >> Async writes are faster from the point of view of the application. But sync >> writes are done by applications which need to satisfy a race condition for >> the sake of internal consistency. Applications which need to know their >> next commands will not begin until after the previous sync write was >> committed to disk. >> >> >ROTFL!!! > >I think you should explain it even further for Casper :) :) :) :) :) :) :) >:-) So what I *really* wanted to know what "sync" meant for the NFS server in the case of Linux. Apparently it means "implement the NFS protocol to the letter". I''m happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Casper
On Thu, 1 Apr 2010, Edward Ned Harvey wrote:> > Dude, don''t be so arrogant. Acting like you know what I''m talking about > better than I do. Face it that you have something to learn here.Geez!> Yes, all the transactions in a transaction group are either committed > entirely to disk, or not at all. But they''re not necessarily committed to > disk in the same order that the user level applications requested. Meaning: > If I have an application that writes to disk in "sync" mode intentionally > ... perhaps because my internal file format consistency would be corrupt if > I wrote out-of-order ... If the sysadmin has disabled ZIL, my "sync" write > will not block, and I will happily issue more write operations. As long as > the OS remains operational, no problem. The OS keeps the filesystem > consistent in RAM, and correctly manages all the open file handles. But if > the OS dies for some reason, some of my later writes may have been committed > to disk while some of my earlier writes could be lost, which were still > being buffered in system RAM for a later transaction group.The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. As you say, "OS keeps the filesystem consistent in RAM". There is no 1:1 ordering between application write requests and zfs writes and in fact, if the same portion of file is updated many times, or the file is created/deleted many times, zfs only writes the updated data which is current when the next TXG is written. For a synchronous write, zfs advances its index in the slog once the corresponding data has been committed in a TXG. In other words, the "sync" and "async" write paths are the same when it comes to writing final data to disk. There is however the recovery case where synchronous writes were affirmed which were not yet written in a TXG and the system spontaneously reboots. In this case the synchronous writes will occur based on the slog, and uncommitted async writes will have been lost. Perhaps this is the case you are worried about. It does seem like rollback to a snapshot does help here (to assure that sync & async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Casper.Dik at Sun.COM
2010-Apr-01 15:54 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>It does seem like rollback to a snapshot does help here (to assure that sync & async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times.

But doesn't that snapshot possibly have the same issues?

Casper
On Thu, 1 Apr 2010, Casper.Dik at Sun.COM wrote:
>> It does seem like rollback to a snapshot does help here (to assure that sync & async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times.
>
> But doesn't that snapshot possibly have the same issues?

No, at least not based on my understanding. My understanding is that zfs uses uniform prioritization of updates and performs writes in order (at least to the level of a TXG). If this is true, then each normal TXG will be a coherent representation of the filesystem. If the slog is used to recover uncommitted writes, then the TXG based on that may not match the in-memory filesystem at the time of the crash since async writes may have been lost.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
hello

i have had this problem this week. our zil ssd died (apt slc ssd 16gb). because we had no spare drive in stock, we ignored it.

then we decided to update our nexenta 3 alpha to beta, exported the pool and made a fresh install to have a clean system, and tried to import the pool. we only got an error message about a missing drive.

we googled about this and it seems there is no way to access the pool!!! (hope this will be fixed in future)

we had a backup and the data are not so important, but that could be a real problem: you have a valid zfs pool and you cannot access your data due to a missing zil.

gea
www.napp-it.org zfs server
--
This message posted from opensolaris.org
Hi Casper,

> :-)

Nice to see that your stream still carries just as far :-)

> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.

Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...

Anyway, we seem to be getting off topic here :-) The thread was started to get insight into the behaviour of the F20 as ZIL. _My_ particular interest would be to be able to answer why performance doesn't seem to scale up when adding vmod-s...

With kind regards,

Jeroen
--
This message posted from opensolaris.org
Jeroen Roodhart wrote:
> The thread was started to get insight into the behaviour of the F20 as ZIL. _My_ particular interest would be to be able to answer why performance doesn't seem to scale up when adding vmod-s...

My best guess would be latency. If you are latency bound, adding additional parallel devices with the same latency will make no difference. It will improve throughput, but may actually make latency worse (additional time to select which parallel device to use).

But one of the ZFS gurus may be able to provide a better answer, or some dtrace foo to confirm/deny my thesis.

--
Carson
> It doesn't have to be F20. You could use the Intel X25 for example.

The MLC-based disks are bound to be too slow (we tested with an OCZ Vertex Turbo). So you're stuck with the X25-E (which Sun stopped supporting for some reason). I believe most "normal" SSDs do have some sort of cache and usually no supercap or other backup power solution, so be wary of that. Having said all this, the new Sandforce-based SSDs look promising...

> If you're running solaris proper, you better mirror your ZIL log device.

Absolutely true. I forgot this 'cause we're running OSOL nv130... (we constantly seem to need features that haven't landed in Solaris proper :) ).

> If you're running opensolaris ... I don't know if that's important.

At least I can confirm the ability to add and remove ZIL devices on the fly with OSOL of a sufficiently recent build.

> I'll probably test it, just to be sure, but I might never get around to it because I don't have a justifiable business reason to build the opensolaris machine just for this one little test.

I plan to test this as well, though it won't be until late next week.

With kind regards,

Jeroen
--
This message posted from opensolaris.org
>>>>> "enh" == Edward Ned Harvey <solaris2 at nedharvey.com> writes:

enh> Dude, don't be so arrogant. Acting like you know what I'm
enh> talking about better than I do. Face it that you have
enh> something to learn here.

funny! AIUI you are wrong and Casper is right.

ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do.

The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a state not equal to a snapshot you might have hypothetically taken in the seconds leading up to the crash), because files that were recently fsync()'d may be of newer versions than files that weren't---that is, fsync() durably commits only the file it references, by copying that *part* of the in-RAM ZIL to the durable slog. fsync() is not equivalent to 'lockfs -fa' committing every file on the system (is it?). I guess I could be wrong about that.

If I'm right, this isn't a bad thing because apps that call fsync() are supposed to expect the inconsistency, but it's still important to understanding what's going on.
On 01/04/2010 20:58, Jeroen Roodhart wrote:
>> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.
>
> Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...

Which is to be expected, as it is not the nfs client which requests the behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't currently force the NFS server to work in async mode.

--
Robert Milkowski
http://milek.blogspot.com
Casper.Dik at Sun.COM
2010-Apr-02 09:21 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>On 01/04/2010 20:58, Jeroen Roodhart wrote:
>>> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.
>>
>> Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...
>
>Which is to be expected, as it is not the nfs client which requests the behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't currently force the NFS server to work in async mode.

The other part of the issue is that the Solaris clients have been developed against a "sync" server. The client writes behind more and continues caching the non-acked data. The Linux client has been developed against an "async" server and has some catching up to do.

Casper
Robert Milkowski writes:
> On 01/04/2010 20:58, Jeroen Roodhart wrote:
>>> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.
>>
>> Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...
>
> Which is to be expected, as it is not the nfs client which requests the behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't currently force the NFS server to work in async mode.

True, and there is an entrenched misconception (not you) that this is a ZFS-specific problem, which it is not. It's really an NFS protocol feature which can be circumvented using zil_disable, which therefore reinforces the misconception. It's further reinforced by testing an NFS server on disk drives with WCE=1 with a filesystem other than ZFS.

All the fast options cause the NFS client to become inconsistent after a server reboot. Whatever was being done in the moments prior to the server reboot will need to be wiped out by users if they are told that the server did reboot. That's manageable for home use, not for the enterprise.

-r
> > Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
>
> What do you get with both SSD ZIL and WriteBack disks enabled?
>
> I mean if you have both why not use both? Then both async and sync IO benefits.

Interesting, but unfortunately false. Soon I'll post the results here. I just need to package them in a way suitable to give the public, and stick it on a website. But I'm fighting IT fires for now and haven't had the time yet. Roughly speaking, the following are approximately representative. Of course it varies based on tweaks of the benchmark and stuff like that.

	Stripe 3 mirrors write through: 450-780 IOPS
	Stripe 3 mirrors write back: 1030-2130 IOPS
	Stripe 3 mirrors write back + SSD ZIL: 1220-2480 IOPS
	Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disk. And for some reason, having the WriteBack enabled while you have SSD ZIL actually hurts performance by approx 10%. You're better off to use the SSD ZIL with disks in Write Through mode.

That result is surprising to me. But I have a theory to explain it. When you have WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and another, until the HBA write cache is full. Now the HBA faces the task of writing all those tiny writes to disk, and the HBA must simply follow orders, writing a tiny chunk to the sector it said it would write, and so on. The HBA cannot effectively consolidate the small writes into a larger sequential block write. But if you have the WriteBack disabled, and you have an SSD for ZIL, then ZFS can log the tiny operation on SSD, and immediately return to the process: "Yes, it's on nonvolatile storage." So the application can issue another, and another, and another. ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Long story short, the evidence suggests if you have SSD ZIL, you're better off without WriteBack on the HBA. And I conjecture the reasoning behind it is because ZFS can write-buffer better than the HBA can.
> I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.

It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion.

If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.

I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
When we use one vmod, both machines are finished in about 6min45; zilstat maxes out at about 4200 IOPS. Using four vmods it takes about 6min55; zilstat maxes out at 2200 IOPS.

Can you try 4 concurrent tars to four different ZFS filesystems (same pool)?

-r
> > http://nfs.sourceforge.net/
>
> I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to disk, in what way "sync" and "async" writes are handled by the OS, and what happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async. Not NFS sync and async. I don't think anything relating to NFS is the answer to Casper's question, or else Casper was simply jumping context by asking it.

Don't get me wrong, I have no objection to his question or anything, it's just that the conversation has derailed and now people are talking about NFS sync/async instead of what happens when a C/C++ application is doing sync/async writes to a disabled ZIL.
> > I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file.
>
> Zfs writes data in transaction groups and each bunch of data which gets written is bounded by a transaction group. The current state of the data at the time the TXG starts will be the state of the data once the TXG completes. If the system spontaneously reboots then it will restart at the last completed TXG so any residual writes which might have occurred while a TXG write was in progress will be discarded. Based on this, I think that your ordering concerns (sync writes getting to disk "faster" than async writes) are unfounded for normal file I/O.

So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system?

The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines.
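To make the scenario under discussion concrete, here is a minimal C sketch (file names are hypothetical) of the pattern being debated: a small synchronous journal write followed by a larger asynchronous data write. The open question in the thread is whether, with the ZIL disabled, the second write can ever land in an earlier TXG than the first.

/* sketch of the write pattern under discussion (illustrative only) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int log_fd  = open("/tank/db/journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int data_fd = open("/tank/db/datafile", O_WRONLY | O_CREAT, 0644);
    if (log_fd < 0 || data_fd < 0) { perror("open"); return 1; }

    /* small synchronous write: the database depends on this record
     * being durable before the data pages that follow it */
    const char rec[] = "begin txn 42\n";
    write(log_fd, rec, strlen(rec));
    fsync(log_fd);                   /* blocks until on stable storage */

    /* larger asynchronous write: no fsync(); the OS may flush it
     * whenever it likes, in the same or a later TXG */
    char page[128 * 1024];
    memset(page, 0xab, sizeof(page));
    write(data_fd, page, sizeof(page));

    close(log_fd);
    close(data_fd);
    return 0;
}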
> hello
>
> i have had this problem this week. our zil ssd died (apt slc ssd 16gb). because we had no spare drive in stock, we ignored it.
>
> then we decided to update our nexenta 3 alpha to beta, exported the pool and made a fresh install to have a clean system, and tried to import the pool. we only got an error message about a missing drive.
>
> we googled about this and it seems there is no way to access the pool!!! (hope this will be fixed in future)
>
> we had a backup and the data are not so important, but that could be a real problem: you have a valid zfs pool and you cannot access your data due to a missing zil.

If you have a zpool less than version 19 (when the ability to remove log devices was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency. Normally you can find your current zpool version by doing "zpool upgrade," but you cannot now if you're in this failure state. Do not attempt "zfs send" or "zfs list" or any other zpool or zfs command. Instead, do "man zpool" and look for "zpool remove." If it says it supports removing log devices, then you had better use it to remove your log device. If it says it only supports removing hotspares or cache, then your zpool is lost permanently.

If you are running Solaris, take it as given that you do not have zpool version 19. If you are running OpenSolaris, I don't know at which point zpool version 19 was introduced.

Your only hope is to "zpool remove" the log device. Use tar or cp or something to try and salvage your data out of there. Your zpool is lost, and if it's functional at all right now, it won't stay that way for long. Your system will soon hang, and then you will not be able to import your pool. Ask me how I know.
> ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do.
>
> The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a

You're speaking the opposite of common sense. If disabling the ZIL makes the system faster *and* less prone to data corruption, please explain why we don't all disable the ZIL?
> If you have a zpool less than version 19 (when the ability to remove log devices was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency.

> Instead, do "man zpool" and look for "zpool remove." If it says it supports removing log devices, then you had better use it to remove your log device. If it says it only supports removing hotspares or cache, then your zpool is lost permanently.

I take it back. If you lost your log device on a zpool which is less than version 19, then you *might* have a possible hope if you migrate your disks to a later system. You *might* be able to "zpool import" on a later version of the OS.
Casper.Dik at Sun.COM
2010-Apr-02 13:00 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>> > http://nfs.sourceforge.net/
>>
>> I think B4 is the answer to Casper's question:
>
>We were talking about ZFS, and under what circumstances data is flushed to disk, in what way "sync" and "async" writes are handled by the OS, and what happens if you disable ZIL and lose power to your system.
>
>We were talking about C/C++ sync and async. Not NFS sync and async.

I don't think so.

http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg36783.html

(This discussion was started, I think, in the context of NFS performance.)

Casper
Casper.Dik at Sun.COM
2010-Apr-02 13:20 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier write can be in a later txg. A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar question to make sure I understand it correctly, and the answer was ("> " = Casper, the answer is from Neil Perrin):

> Is there a partial order defined for all filesystem operations?

File system operations will be written in order for all settings of the sync flag.

> Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a file,

(I assume by O_DATA you meant O_DSYNC).

> that later transactions will not be in an earlier transaction group?
> (Or is this already the case?)

This is already the case.

So what I assumed was true, but what you made me doubt, was apparently still true: later transactions cannot be committed in an earlier txg.

>If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference. When the system panics you know that data might be lost. The application also dies. (The snapshot and the last valid uberblock are equally valid.)

But for an application on an NFS client, without the ZIL data will be lost while the NFS client believes the data is written, and it will not try again. With the ZIL, when the NFS server says that data is written, then it is actually on stable storage.

>The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines.

So the question is: when is your data invalid? What happens with the data when the system dies before the fsync() call? What happens with the data when the system dies after the fsync() call? What happens with the data when the system dies after more I/O operations?

With the zil disabled, you call fsync() but you may encounter data from before the call to fsync(). That could happen before, so I assume you can actually recover from that situation.

Casper
> >Dude, don't be so arrogant. Acting like you know what I'm talking about better than I do. Face it that you have something to learn here.
>
> You may say that, but then you post this:

Acknowledged. I read something arrogant, and I replied even more arrogantly. That was dumb of me.
> Only a broken application uses sync writes sometimes, and async writes at other times.

Suppose there is a virtual machine, with virtual processes inside it. Some virtual process issues a sync write to the virtual OS; meanwhile another virtual process issues an async write. Then the virtual OS will sometimes issue sync writes and sometimes async writes to the host OS. Are you saying this makes qemu, and vbox, and vmware "broken applications?"
> The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work.

Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?"

Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else?

My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of the OS halting or an ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

Somebody (Casper?) said it before, and now I'm starting to realize ... this is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes.

If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.
Casper.Dik at Sun.COM
2010-Apr-02 15:04 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>Questions to answer would be:
>
>Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

There are quite a few "sync" writes, specifically when you mix in the NFS server.

>Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

What I quoted from the other discussion seems to be that later writes cannot be committed in an earlier TXG than your sync write or other earlier writes.

>I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

The "uberblock" is the "root of all the data". All the data in a ZFS pool is referenced by it; after the txg is in stable storage, then the uberblock is updated.

>At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else?

The current "zpool" and the filesystems as referenced by the last uberblock.

>My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of the OS halting or an ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

sync() is actually *async*, and returning from sync() says nothing about stable storage. After fsync() returns, it signals that all the data is in stable storage (except if you disable the ZIL), or, apparently, in Linux when the write caches for your disks are enabled (the default for PC drives). ZFS doesn't care about the write cache; it makes sure it is flushed. (There's fsync() and open(..., O_DSYNC|O_SYNC).)

>Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.
>
>Somebody (Casper?) said it before, and now I'm starting to realize ... this is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything.
>
>The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes.
>
>If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.

I believe that the writes are still ordered, so the consistency you want is actually delivered even without the ZIL enabled.

Casper
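As a concrete illustration of the distinction being drawn here, the following is a minimal C sketch (paths are hypothetical): sync(2) merely schedules a flush as far as POSIX is concerned, while fsync(2) and the O_DSYNC open flag block until the named file's data is on stable storage, and those are the calls a separate log device can accelerate.

/* sketch: three ways an application can ask for durability */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "record\n";

    /* 1. sync(): schedules a flush of all dirty data; per POSIX it may
     *    return before anything reaches disk (ZFS goes further and forces
     *    txg commits, but portable code cannot rely on that). */
    sync();

    /* 2. fsync(fd): blocks until this file's data is on stable storage. */
    int fd = open("/tank/fs/file1", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, buf, strlen(buf));
    fsync(fd);
    close(fd);

    /* 3. O_DSYNC: every write() on this descriptor is synchronous. */
    int dfd = open("/tank/fs/file2", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (dfd < 0) { perror("open"); return 1; }
    write(dfd, buf, strlen(buf));   /* returns only after the data is stable */
    close(dfd);

    return 0;
}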
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:
> So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

I am like a "pool" or "tank" of regurgitated zfs knowledge. I simply pay attention when someone who really knows explains something (e.g. Neil Perrin, as Casper referred to) so I can regurgitate it later. I try to do so faithfully. If I had behaved this way in school, I would have been a good student. Sometimes I am wrong or the design has somewhat changed since the original information was provided.

There are indeed popular filesystems (e.g. Linux EXT4) which write data to disk in a different order than chronologically requested, so it is good that you are paying attention to these issues. While in the slog-based recovery scenario it is possible for a TXG to be generated which lacks async data, this only happens after a system crash, and if all of the critical data is written as a sync request, it will be faithfully preserved.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>> I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion.
>
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.

Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs.

The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.

This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.

-Kyle

> I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
>> The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work.
>
> Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?"
>
> Questions to answer would be:
>
> Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will not use the ZIL; it will just start a new TXG, and could return before the writes are done. fsync() is what you are interested in.

> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

Writes from a TXG will not be used until the whole TXG is committed to disk. Everything from a half-written TXG will be ignored after a crash. This means that the order of writes within a TXG is not important. The only way to do a sync write without the ZIL is to start a new TXG after the write. That costs a lot, so we have the ZIL for sync writes.
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:
> were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.
>
> Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

You seem to be assuming that Solaris is an incoherent operating system. With ZFS, the filesystem in memory is coherent, and transaction groups are constructed in simple chronological order (capturing combined changes up to that point in time), without regard to SYNC options. The only possible exception to the coherency is for memory mapped files, where the mapped memory is a copy of data (originally) from the ZFS ARC and needs to be reconciled with the ARC if an application has dirtied it. This differs from UFS and the way Solaris worked prior to Solaris 10.

Synchronous writes are not "faster" than asynchronous writes. If you drop heavy and light objects from the same height, they fall at the same rate. This was proven long ago.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote:
>> I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion.
>
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.
>
> I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.

However, I believe there are some downsides to letting ZFS manage just a slice rather than an entire drive, but perhaps those do not apply as significantly to SSD devices?

Thanks

--
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
>>> Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
>>
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>>
>> I mean if you have both why not use both? Then both async and sync IO benefits.
>
> Interesting, but unfortunately false. Soon I'll post the results here. [...]
>
>	Stripe 3 mirrors write through: 450-780 IOPS
>	Stripe 3 mirrors write back: 1030-2130 IOPS
>	Stripe 3 mirrors write back + SSD ZIL: 1220-2480 IOPS
>	Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS
>
> Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disk. And for some reason, having the WriteBack enabled while you have SSD ZIL actually hurts performance by approx 10%. You're better off to use the SSD ZIL with disks in Write Through mode.
>
> [...] ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test, was the ZIL SSD included in the write-back? What I was proposing was write-back only on the disks, and the ZIL SSD with no write-back. Not all operations hit the ZIL, so it would still be nice to have the non-ZIL operations return quickly.

-Ross
On 02/04/2010 16:04, Casper.Dik at Sun.COM wrote:
> sync() is actually *async* and returning from sync() says nothing about

To clarify - in the case of ZFS, sync() is actually synchronous.

--
Robert Milkowski
http://milek.blogspot.com
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = 97696368 + (1953504 * (desired capacity in GBytes - 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066
--
This message posted from opensolaris.org
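For convenience, here is a small C sketch of that formula. Note the follow-up later in this thread: the IDEMA LBA1-02 formula is defined for capacities of 50 GB and up, so it does not cover a 32 GB device.

/* IDEMA LBA1-02: LBA count for a nominal capacity in GB (>= 50 GB) */
#include <stdio.h>

static long long idema_lba_count(double capacity_gb)
{
    /* 97,696,368 LBAs at 50 GB, plus 1,953,504 LBAs per additional GB */
    return 97696368LL + (long long)(1953504.0 * (capacity_gb - 50.0));
}

int main(void)
{
    double sizes[] = { 50.0, 500.0, 1000.0 };
    for (int i = 0; i < 3; i++)
        printf("%6.0f GB -> %lld LBAs (512-byte sectors)\n",
               sizes[i], idema_lba_count(sizes[i]));
    return 0;
}

For example, the 1000 GB case works out to 1,953,525,168 LBAs, i.e. exactly 1,000,204,886,016 bytes, which is why 1 TB drives from different vendors can mirror each other.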
>>>>> "enh" == Edward Ned Harvey <solaris2 at nedharvey.com> writes:

enh> If you have zpool less than version 19 (when ability to remove
enh> log device was introduced) and you have a non-mirrored log
enh> device that failed, you had better treat the situation as an
enh> emergency.

Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while.

It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated.

>> The on-disk log (slog or otherwise), if I understand right, can
>> actually make the filesystem recover to a crash-INconsistent
>> state

enh> You're speaking the opposite of common sense.

Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea.

enh> If disabling the ZIL makes the system faster *and* less prone
enh> to data corruption, please explain why we don't all disable
enh> the ZIL?

I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption.

Disabling the ZIL breaks the D requirement of ACID databases, which might screw up apps that replicate, or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA, but because, unlike non-COW filesystems, it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. You just get missing data that the app has a right to expect should be there. The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. So we get all this ``enterprise'' snobbery and so on.

``crash consistent'' is a technical term not a common-sense term, and I may have used it incorrectly:

http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html

With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power.

However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency. If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah, I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on a ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'd just read my posts with at least paragraph granularity instead of just picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated.

I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its file descriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis, which makes me think it's not beneficial (the benefit would be dramatic if it were!) and that fsync() works the way I think it does. It's a slightly more convoluted type of ``common sense'' than yours, but mine could still be wrong.
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald <kmcdonald at egenera.com> wrote:
> On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>> [...]
>> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.
>
> Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs.
>
> The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.
>
> This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.
>
> -Kyle

If I were forced to ignorantly cast a stone, it would be into Intel's lap (if the SSDs indeed came directly from Sun). Sun's "normal" drive vendors have been in this game for decades, and know the expectations. Intel, on the other hand, may not have quite the same QC in place yet.

--Tim
On Fri, Apr 2 at 11:14, Tirso Alonso wrote:
>> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors?
>
> There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):
>
> LBA count = 97696368 + (1953504 * (desired capacity in GBytes - 50.0))
>
> Sizes should match exactly if the manufacturer follows the standard.
>
> See:
> http://opensolaris.org/jive/message.jspa?messageID=393336#393336
> http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066

The problem is that it only applies to devices that are >= 50GB in size, and the X25 in question is only 32GB.

That being said, I'd be skeptical of either the sourcing of the parts, or else some other configuration feature on the drives (like HPA or DCO) that is changing the capacity. It's possible one of these is in effect.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On 04/02/10 08:24, Edward Ned Harvey wrote:
>> The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work.
>
> Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?"

I'm one of the ZFS developers. I wrote most of the zil code. Still I don't have all the answers. There are a lot of knowledgeable people on this alias. I usually monitor this alias and sometimes chime in when there's some misinformation being spread, but sometimes the volume is so high. Since I started this reply there have been 20 new posts on this thread alone!

> Questions to answer would be:
>
> Is a ZIL log device used only by sync() and fsync() system calls?

- The intent log (separate device(s) or not) is only used by fsync, O_DSYNC, O_SYNC, O_RSYNC. NFS commits are seen by ZFS as fsyncs. Note, sync(1m) and sync(2) do not use the intent log. They force transaction group (txg) commits on all pools. So zfs goes beyond the requirement for sync(), which only requires that the writing be scheduled but not necessarily completed before returning. The zfs interpretation is rather expensive, but the alternative seemed broken, so we fixed it.

> Is it ever used to accelerate async writes?

The zil is not used to accelerate async writes.

> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed, regardless of whether W1 or W2 are synchronous or asynchronous. Of course if the system crashes then the async operations might not be there.

> I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

- Kind of. The uberblock contains the root of the txg.

> At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else?

A txg is for the whole pool, which can contain many filesystems. The latest txg defines the current state of the pool and each individual fs.

> My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool.

Correct (except replace sync() with O_DSYNC, etc). This also assumes hardware that, for example, correctly handles the flushing of its caches.

> My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of the OS halting or an ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

The ZIL doesn't make such guarantees. It's the DMU that handles transactions and their grouping into txgs. It ensures that writes are committed in order by its transactional nature. The function of the zil is merely to ensure that synchronous operations are stable and replayed after a crash/power fail onto the latest txg.

> Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

No, disabling the ZIL does not disable the DMU.

> Somebody (Casper?) said it before, and now I'm starting to realize ... this is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything.

No, a snapshot forces a txg, which is a consistent, up-to-date view of the pool and its file systems. The zil is not involved.

See also http://blogs.sun.com/perrin/entry/the_lumberjack - which is a bit dated and simplistic but still largely true.

Neil.
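To illustrate the threading point made above, here is a minimal C sketch (file names are hypothetical): when two unsynchronized threads each issue a write, nothing guarantees which call reaches the filesystem first; only program order within a single thread is preserved.

/* sketch: write ordering is only defined within a single thread */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *writer(void *arg)
{
    const char *path = arg;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    const char rec[] = "record\n";
    write(fd, rec, strlen(rec));   /* arrival order vs. the other thread is undefined */
    fsync(fd);                     /* durable, but says nothing about the other file */
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    /* thread A is started first, yet once both are running the kernel
     * may service B's write before A's */
    pthread_create(&a, NULL, writer, "/tank/fs/file-a");
    pthread_create(&b, NULL, writer, "/tank/fs/file-b");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}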
Hi Jeroen,

Have you tried the DDRdrive from Christopher George <cgeorge at ddrdrive.com>? Looks to me like a much better fit for your application than the F20. It would not hurt to check it out. Looks to me like you need a product with low *latency* - and a RAM based cache would be a much better performer than any solution based solely on flash.

Let us know (on the list) how this works out for you.

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX
al at logical-approach.com  Voice: 214.233.5089  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Casper.Dik at Sun.COM
2010-Apr-03 10:28 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes.
>
>If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the "zil synchronicity" ARC case:

"What I mean is that the barrier semantic is implicit even with no ZIL at all. In ZFS, if event A happens before event B, and you lose power, then what you'll see on disk is either nothing, A, or both A and B. Never just B. It is impossible for us not to have at least barrier semantics."

So there's no chance that a *later* async write will overtake an earlier sync *or* async write.

Casper
Hi Al,

> Have you tried the DDRdrive from Christopher George <cgeorge at ddrdrive.com>?
> Looks to me like a much better fit for your application than the F20?
>
> It would not hurt to check it out. Looks to me like you need a product with low
> *latency* - and a RAM-based cache would be a much better performer than any solution
> based solely on flash.
>
> Let us know (on the list) how this works out for you.

Well, I did look at it, but at that time there was no Solaris support yet. Right now it seems there is only a beta driver? I kind of remember that if you'd want reliable fallback to nvram, you'd need a UPS feeding the card. I could be very wrong there, but the product documentation isn't very clear on this (at least to me ;) )

Also, we'd kind of like to have a SnOracle-supported option. But yeah, on paper it does seem it could be an attractive solution...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
> Well, I did look at it, but at that time there was no Solaris support yet. Right now it
> seems there is only a beta driver?

Correct, we just completed functional validation of the OpenSolaris driver. Our focus has now turned to performance tuning and benchmarking. We expect to formally introduce the DDRdrive X1 to the ZFS community later this quarter. It is our goal to focus exclusively on the dedicated ZIL device market going forward.

> I kind of remember that if you'd want reliable fallback to nvram, you'd need a UPS
> feeding the card.

Currently, a dedicated external UPS is required for correct operation. Based on community feedback, we will be offering automatic backup/restore prior to release. This guarantees the UPS will only be required for 60 seconds to successfully back up the drive contents on a host power or hardware failure. Dutifully, on the next reboot, the restore will occur prior to the OS loading, for seamless non-volatile operation. Also, we have heard loud and clear the requests for an internal power option. It is our intention that the X1 will be the first in a family of products all dedicated to ZIL acceleration, for not only OpenSolaris but also Solaris 10 and FreeBSD.

> Also, we'd kind of like to have a SnOracle-supported option.

Although a much smaller company, we believe our singular focus and absolute passion for ZFS and the potential of Hybrid Storage Pools will serve our customers well. We are actively designing our soon-to-be-available support plans. Your voice will be heard; please email directly at <cgeorge at ddrdrive dot com> with requests, comments and/or questions.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- 
This message posted from opensolaris.org
On 1 apr 2010, at 06.15, Stuart Anderson wrote:

> Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility
> called /opt/StorMan/arcconf and reports itself as the amazingly informative model
> number "Sun STK RAID INT", what worked for me was to run,
>   arcconf delete (to delete the pre-configured volume shipped on the drive)
>   arcconf create (to create a new volume)

Just to sort things out (or not? :-): I more than agree that this product is highly confusing, but I don't think there is anything LSI in or about that card. I believe it is an Adaptec card, developed, manufactured and supported by Intel for Adaptec, licensed (or something) to StorageTek, and later included in Sun machines (since Sun bought StorageTek, I suppose). Now we could add Oracle to this name-dropping inferno, if we wanted to.

I am not sure why they (Sun) put those in there; they don't seem very fast or smart or anything.

/ragge
On 2 apr 2010, at 22.47, Neil Perrin wrote:

>> Suppose there is an application which sometimes does sync writes, and sometimes async
>> writes. In fact, to make it easier, suppose two processes open two files, one of which
>> always writes asynchronously, and one of which always writes synchronously. Suppose the
>> ZIL is disabled. Is it possible for writes to be committed to disk out-of-order?
>> Meaning, can a large block async write be put into a TXG and committed to disk before a
>> small sync write to a different file is committed to disk, even though the small sync
>> write was issued by the application before the large async write? Remember, the point
>> is: ZIL is disabled. Question is whether the async could possibly be committed to disk
>> before the sync.
>
> Threads can be pre-empted in the OS at any time. So even though thread A issued W1
> before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2.
> Multi-threaded applications have to handle this.
>
> If this was a single thread issuing W1 then W2, then yes, the order is guaranteed
> regardless of whether W1 or W2 are synchronous or asynchronous.
> Of course if the system crashes then the async operations might not be there.

Could you please clarify this last paragraph a little: Do you mean that this is in the case where you have the ZIL enabled and the txg for W1 and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore only the sync writes are eventually there?

If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at ZFS before W2, and W2 arrives before the txg is committed, will both writes always be in the txg on disk?

If so, it would mean that ZFS itself never buffers up async writes into larger blurbs to write at a later txg, correct?

I take it that ZIL enabled or not does not make any difference here (we pretend the system did _not_ crash), correct?

Thanks!

/ragge
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote:

> On 2 apr 2010, at 22.47, Neil Perrin wrote:
>
>>> Suppose there is an application which sometimes does sync writes, and sometimes async
>>> writes. In fact, to make it easier, suppose two processes open two files, one of which
>>> always writes asynchronously, and one of which always writes synchronously. Suppose
>>> the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order?
>>> Meaning, can a large block async write be put into a TXG and committed to disk before
>>> a small sync write to a different file is committed to disk, even though the small
>>> sync write was issued by the application before the large async write? Remember, the
>>> point is: ZIL is disabled. Question is whether the async could possibly be committed
>>> to disk before the sync.
>>
>> Threads can be pre-empted in the OS at any time. So even though thread A issued W1
>> before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2.
>> Multi-threaded applications have to handle this.
>>
>> If this was a single thread issuing W1 then W2, then yes, the order is guaranteed
>> regardless of whether W1 or W2 are synchronous or asynchronous.
>> Of course if the system crashes then the async operations might not be there.
>
> Could you please clarify this last paragraph a little:
> Do you mean that this is in the case where you have the ZIL enabled and the txg for W1
> and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore
> only the sync writes are eventually there?

Yes. The ZIL needs to be replayed on import after an unclean shutdown.

> If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at ZFS
> before W2, and W2 arrives before the txg is committed, will both writes always be in
> the txg on disk?

Yes.

> If so, it would mean that ZFS itself never buffers up async writes into larger blurbs
> to write at a later txg, correct?

Correct.

> I take it that ZIL enabled or not does not make any difference here (we pretend the
> system did _not_ crash), correct?

For import following a clean shutdown, there are no transactions in the ZIL to apply. For async-only workloads, there are no transactions in the ZIL to apply. Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 4 apr 2010, at 06.01, Richard Elling wrote:

Thank you for your reply! Just wanted to make sure.

> Do not assume that power outages are the only cause of unclean shutdowns.
> -- richard

Thanks, I have seen that mistake several times with other (file)systems, and hope I'll never ever make it myself! :-)

/ragge s
> Hmm, when you did the write-back test was the ZIL SSD included in the write-back?
>
> What I was proposing was write-back only on the disks, and ZIL SSD with no write-back.

The tests I did were:
  All disks write-through
  All disks write-back
  With/without SSD for ZIL
and all the permutations of the above. So, unfortunately, no, I didn't test with WriteBack enabled only for the spindles and WriteThrough on the SSD.

It has been suggested, and this is actually what I now believe based on my experience, that precisely the opposite would be the better configuration: spindles configured WriteThrough while the SSD is configured WriteBack would, I believe, be optimal.

If I get the opportunity to test further, I'm interested and I will. But who knows when/if that will happen.
> Actually, it's my experience that Sun (and other vendors) do exactly that for you when
> you buy their parts - at least for rotating drives, I have no experience with SSDs.
>
> The Sun disk label shipped on all the drives is set up to make the drive the standard
> size for that Sun part number. They have to do this since they (for many reasons) have
> many sources (diff. vendors, even diff. parts from the same vendor) for the actual
> disks they use for a particular Sun part number.

Actually, if there is an fdisk partition and/or disk label on a drive when it arrives, I'm pretty sure that's irrelevant. When I first connect a new drive to the HBA, the HBA has to sign and initialize the drive at a lower level than what the OS normally sees. So unless I do some sort of special operation to tell the HBA to preserve/import a foreign disk, the HBA will make the disk blank before the OS sees it anyway.
Richard Elling
2010-Apr-05 03:32 UTC
[zfs-discuss] writeback vs writethrough [was: Sun Flash Accelerator F20 numbers]
On Apr 2, 2010, at 5:03 AM, Edward Ned Harvey wrote:

>>> Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
>>> using the dedicated ZIL SSD device, very noticeably faster than enabling the
>>> WriteBack.
>>
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>>
>> I mean if you have both why not use both? Then both async and sync IO benefits.
>
> Interesting, but unfortunately false. Soon I'll post the results here. I just need to
> package them in a way suitable to give the public, and stick it on a website. But I'm
> fighting IT fires for now and haven't had the time yet.
>
> Roughly speaking, the following are approximately representative. Of course it varies
> based on tweaks of the benchmark and stuff like that.
>   Stripe 3 mirrors write through:            450-780 IOPS
>   Stripe 3 mirrors write back:              1030-2130 IOPS
>   Stripe 3 mirrors write back + SSD ZIL:    1220-2480 IOPS
>   Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS

Thanks for sharing these interesting numbers.

> Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4
> times faster than naked disk. And for some reason, having the WriteBack enabled while
> you have SSD ZIL actually hurts performance by approx 10%. You're better off to use
> the SSD ZIL with disks in WriteThrough mode.

YMMV. The write workload for ZFS is best characterized by looking at the txg commit. In a very short period of time ZFS sends a lot [1] of write I/O to the vdevs. It is not surprising that this can blow through the relatively small caches on controllers. Once you blow through the cache, then the [in]efficiency of the disks behind the cache is experienced as well as the [in]efficiency of the cache controller. Alas, little public information seems to be published regarding how those caches work.

Changing to write-through effectively changes the G/M/1 queue [2] at the controller to a G/M/n queue at the disks. Sorta like:

1. write-back controller
   (ZFS) N*#vdev I/Os --> controller --> disks
   (ZFS) M/M/n        --> G/M/1      --> M/M/n

2. write-through controller
   (ZFS) N*#vdev I/Os --> disks
   (ZFS) M/M/n        --> G/M/n

This can simply be a case of the middleman becoming the bottleneck.

[1] a "lot" means up to 35 I/Os per vdev for older releases, 4-10 I/Os per vdev for more recent releases
[2] queuing theory enthusiasts will note that ZFS writes do not exhibit an exponential arrival rate at the controller or disks except for sync writes.

> That result is surprising to me. But I have a theory to explain it. When you have
> WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the
> OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and
> another, until the HBA write cache is full. Now the HBA faces the task of writing all
> those tiny writes to disk, and the HBA must simply follow orders, writing a tiny chunk
> to the sector it said it would write, and so on. The HBA cannot effectively consolidate
> the small writes into a larger sequential block write. But if you have the WriteBack
> disabled, and you have an SSD for ZIL, then ZFS can log the tiny operation on SSD, and
> immediately return to the process: "Yes, it's on nonvolatile storage." So the
> application can issue another, and another, and another. ZFS is smart enough to
> aggregate all these tiny write operations into a single larger sequential write before
> sending it to the spindle disks.

I agree, though this paragraph has 3 different thoughts embedded. Taken separately:
  1. queuing surprises people :-)
  2. writeback inserts a middleman with its own queue
  3. separate logs radically change the write workload seen by the controller and disks

> Long story short, the evidence suggests if you have SSD ZIL, you're better off without
> WriteBack on the HBA. And I conjecture the reasoning behind it is because ZFS can write
> buffer better than the HBA can.

I think the way the separate log works is orthogonal. However, not having a separate log can influence the ability of the controller and disks to respond to read requests during this workload.

Perhaps this is a long way around to saying that a well tuned system will have harmony among its parts.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
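A simplified way to see the "middleman becomes the bottleneck" effect described above, using M/M/1 in place of G/M/1 purely as an illustration (the real arrival process is burstier, as footnote [2] notes), and treating the write-back controller as a single server with service rate $\mu$ under offered load $\lambda$: its mean response time is

$$W_{M/M/1} \;=\; \frac{1}{\mu - \lambda},$$

which grows without bound as $\lambda \to \mu$, whereas the write-through path spreads the same burst across $n$ disks and stays stable up to roughly $\lambda < n\mu$. A txg commit that dumps many I/Os per vdev into the single controller queue can therefore saturate the middleman long before the disks themselves would saturate.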
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:

>> Actually, it's my experience that Sun (and other vendors) do exactly that for you
>> when you buy their parts - at least for rotating drives, I have no experience with
>> SSDs.
>>
>> The Sun disk label shipped on all the drives is set up to make the drive the standard
>> size for that Sun part number. They have to do this since they (for many reasons)
>> have many sources (diff. vendors, even diff. parts from the same vendor) for the
>> actual disks they use for a particular Sun part number.
>
> Actually, if there is an fdisk partition and/or disk label on a drive when it arrives,
> I'm pretty sure that's irrelevant. When I first connect a new drive to the HBA, the
> HBA has to sign and initialize the drive at a lower level than what the OS normally
> sees. So unless I do some sort of special operation to tell the HBA to preserve/import
> a foreign disk, the HBA will make the disk blank before the OS sees it anyway.

That may be true. Though these days they may be spec'ing the drives to the manufacturers at an even lower level.

So does your HBA have newer firmware now than it did when the first disk was connected? Maybe it's the HBA that is handling the new disks differently now than it did when the first one was plugged in?

Can you down-rev the HBA FW? Do you have another HBA that might still have the older rev you could test it on?

-Kyle
> From: Kyle McDonald [mailto:kmcdonald at egenera.com]
>
> So does your HBA have newer firmware now than it did when the first disk was connected?
> Maybe it's the HBA that is handling the new disks differently now than it did when the
> first one was plugged in?
>
> Can you down-rev the HBA FW? Do you have another HBA that might still have the older
> rev you could test it on?

I'm planning to get the support guys more involved tomorrow; things have been pretty stagnant for several days now, and I think it's time to start putting more effort into this. Long story short, I don't know yet. But there is one glaring clue: prior to OS installation, I don't know how to configure the HBA. This means the HBA must have been preconfigured with the factory-installed disks, and I followed a different process with my new disks, because I was using the GUI within the OS.

My best hope right now is to find some other way to configure the HBA, possibly through the ILOM, but I already searched there and looked at everything. Maybe I have to shut down (power cycle) the system and attach a keyboard & monitor. I don't know yet...
Hi Roch,

> Can you try 4 concurrent tars to four different ZFS filesystems (same pool).

Hmmm, you're on to something here:

http://www.science.uva.nl/~jeroen/zil_compared_e1000_iostat_iops_svc_t_10sec_interval.pdf

In short: when using two exported file systems, total time goes down to around 4 minutes (IOPS maxes out at around 5500 when adding all four vmods together). When using four file systems, total time goes down to around 3min30s (IOPS maxing out at about 9500).

I figured it is either NFS or a per-file-system data structure in the ZFS/ZIL interface. To rule out NFS, I tried exporting two directories using "default NFS" shares (via /etc/dfs/dfstab entries). To my surprise this seems to bypass the ZIL altogether (dropping to 100 IOPS, which results from our RAIDZ2 configuration). So clearly "ZFS sharenfs" is more than a nice front end for NFS configuration :).

But back to your suggestion: you clearly had a hypothesis behind your question. Care to elaborate?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
>> We ran into something similar with these drives in an X4170 that turned out to be an
>> issue of the preconfigured logical volumes on the drives. Once we made sure all of our
>> Sun PCI HBAs were running the exact same version of firmware and recreated the volumes
>> on new drives arriving from Sun, we got back into sync on the X25-E device sizes.
>
> Can you elaborate? Just today, we got the replacement drive that has precisely the
> right version of firmware and everything. Still, when we plugged in that drive and
> did "create simple volume" in the StorageTek RAID utility, the new drive is 0.001 GB
> smaller than the old drive. I'm still hosed.
>
> Are you saying I might benefit by sticking the SSD into some laptop and zeroing the
> disk? And then attaching it to the Sun server?
>
> Are you saying I might benefit by finding some other way to make the drive available,
> instead of using the StorageTek RAID utility?
>
> Thanks for the suggestions...

Sorry for the double post. Since the wrong-sized drive was discussed in two separate threads, I want to stick a link here to the other one, where the question was answered. Just in case anyone comes across this discussion by search or whatever...

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039669.html
Hi list,

> If you're running solaris proper, you better mirror your ZIL log device.
...
> I plan to get to test this as well, won't be until late next week though.

Running OSOL nv130. Powered off the machine, removed the F20 and powered back on. The machine boots OK and comes up "normally" with the following message in 'zpool status':

...
  pool: mypool
 state: FAULTED
status: An intent log record could not be read.
        Waiting for administrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
        or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      FAULTED      0     0     0  bad intent log
...

Nice! Running a later version of ZFS seems to lessen the need for ZIL mirroring...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
> On Behalf Of Jeroen Roodhart
>
>> If you're running solaris proper, you better mirror your ZIL log device.
> ...
>> I plan to get to test this as well, won't be until late next week though.
>
> Running OSOL nv130. Powered off the machine, removed the F20 and powered back on.
> The machine boots OK and comes up "normally" [...]
>
> Nice! Running a later version of ZFS seems to lessen the need for ZIL mirroring...

Yes. Since zpool version 19 - which is not available in any version of Solaris proper yet, and is not available in OSOL 2009.06 unless you update to "developer builds" - you have the ability to "zpool remove" log devices. And if a log device fails during operation, the system is supposed to fall back and just start using ZIL blocks from the main pool instead.

So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you care about using your pool. And for zpool >=19 the recommendation would be ... don't mirror your log device. If you have more than one, just add them both unmirrored.

I edited the ZFS Best Practices guide yesterday to reflect these changes.

I always have a shade of doubt about things that are "supposed to" do something. Later this week, I am building an OSOL machine, updating it, adding an unmirrored log device, starting a sync-write benchmark (to ensure the log device is heavily in use), and then I'm going to yank out the log device and see what happens.
On 7 apr 2010, at 14.28, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
>> On Behalf Of Jeroen Roodhart
>>
>>> If you're running solaris proper, you better mirror your ZIL log device.
>> ...
>>> I plan to get to test this as well, won't be until late next week though.
>>
>> Running OSOL nv130. Powered off the machine, removed the F20 and powered back on.
>> The machine boots OK and comes up "normally" [...]
>>
>> Nice! Running a later version of ZFS seems to lessen the need for ZIL mirroring...
>
> Yes. Since zpool version 19 - which is not available in any version of Solaris proper
> yet, and is not available in OSOL 2009.06 unless you update to "developer builds" -
> you have the ability to "zpool remove" log devices. And if a log device fails during
> operation, the system is supposed to fall back and just start using ZIL blocks from
> the main pool instead.
>
> So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you
> care about using your pool. And for zpool >=19 the recommendation would be ... don't
> mirror your log device. If you have more than one, just add them both unmirrored.

Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds before the crash, you don't have to mirror your log device.

For a file server, mail server, etc. etc., where things are stored and supposed to be available later, you almost certainly want redundancy on your slog too. (There may be file servers where this doesn't apply, but they are special cases that should not be mentioned in the general documentation.)

> I edited the ZFS Best Practices guide yesterday to reflect these changes.

I'd say that "In zpool version 19 or greater, it is recommended not to mirror log devices." is not very good advice and should be changed.

/ragge
On 07/04/2010 13:58, Ragnar Sundblad wrote:

> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
> before the crash, you don't have to mirror your log device.
>
> For a file server, mail server, etc. etc., where things are stored and supposed to be
> available later, you almost certainly want redundancy on your slog too. (There may be
> file servers where this doesn't apply, but they are special cases that should not be
> mentioned in the general documentation.)

While I agree with you, I want to mention that it is all about understanding a risk. In this case, not only does your server have to crash in such a way that data has not been synced (sudden power loss, for example), but there would also have to be some data committed to the slog device(s) which was not yet written to the main pool, and when your server restarts the slog device would have to have completely died as well. Other than that, you are fine even with an unmirrored slog device.

-- 
Robert Milkowski
http://milek.blogspot.com
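To make the risk calculus above explicit, here is a back-of-the-envelope framing (no measured numbers, just the structure of the argument): losing committed sync data with an unmirrored slog requires the coincidence of two events, so roughly

$$P(\text{sync-data loss}) \;\approx\; P(\text{unclean shutdown with unreplayed slog records}) \times P(\text{slog unreadable at import} \mid \text{such a shutdown}),$$

and the product is only small if the two events are approximately independent; a shared power or hardware fault can correlate them, which is precisely the case where a mirrored slog earns its keep.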
On Wed, 7 Apr 2010, Ragnar Sundblad wrote:

>> So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you
>> care about using your pool. And for zpool >=19 the recommendation would be ... don't
>> mirror your log device. If you have more than one, just add them both unmirrored.
>
> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
> before the crash, you don't have to mirror your log device.

It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims "success". If the log device fails to read (oops!), then a mirror would be quite useful.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 07/04/2010 15:35, Bob Friesenhahn wrote:

> On Wed, 7 Apr 2010, Ragnar Sundblad wrote:
>>> So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you
>>> care about using your pool. And for zpool >=19 the recommendation would be ... don't
>>> mirror your log device. If you have more than one, just add them both unmirrored.
>>
>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
>> before the crash, you don't have to mirror your log device.
>
> It is also worth pointing out that in normal operation the slog is essentially a
> write-only device which is only read at boot time. The writes are assumed to work if
> the device claims "success". If the log device fails to read (oops!), then a mirror
> would be quite useful.

It is only read at boot if there is uncommitted data on it - during normal reboots ZFS won't read data from the slog.

-- 
Robert Milkowski
http://milek.blogspot.com
On Wed, 7 Apr 2010, Robert Milkowski wrote:

> It is only read at boot if there is uncommitted data on it - during normal reboots ZFS
> won't read data from the slog.

How does ZFS know if there is uncommitted data on the slog device without reading it? The minimal read would be quite small, but it seems that a read is still required.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 04/07/10 09:19, Bob Friesenhahn wrote:

> On Wed, 7 Apr 2010, Robert Milkowski wrote:
>> It is only read at boot if there is uncommitted data on it - during normal reboots
>> ZFS won't read data from the slog.
>
> How does ZFS know if there is uncommitted data on the slog device without reading it?
> The minimal read would be quite small, but it seems that a read is still required.
>
> Bob

If there's ever been synchronous activity, then there is an empty tail block ("stubby") that will be read even after a clean shutdown.

Neil.
> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>
> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
> before the crash, you don't have to mirror your log device.

If you have a system crash *and* a failed log device at the same time, this is an important consideration. But if you have either a system crash or a failed log device that don't happen at the same time, then your sync writes are safe, right up to the nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.

> I'd say that "In zpool version 19 or greater, it is recommended not to mirror log
> devices." is not very good advice and should be changed.

See above. Still disagree?

If desired, I could clarify the statement by basically pasting what's written above.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
> On Behalf Of Bob Friesenhahn
>
> It is also worth pointing out that in normal operation the slog is essentially a
> write-only device which is only read at boot time. The writes are assumed to work if
> the device claims "success". If the log device fails to read (oops!), then a mirror
> would be quite useful.

An excellent point.

BTW, does the system *ever* read from the log device during normal operation? Such as, perhaps, during a scrub? It really would be nice to detect, in advance, the failure of log devices that are claiming to write correctly but which are really unreadable.
On 04/07/10 10:18, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
>> On Behalf Of Bob Friesenhahn
>>
>> It is also worth pointing out that in normal operation the slog is essentially a
>> write-only device which is only read at boot time. The writes are assumed to work if
>> the device claims "success". If the log device fails to read (oops!), then a mirror
>> would be quite useful.
>
> An excellent point.
>
> BTW, does the system *ever* read from the log device during normal operation? Such as,
> perhaps, during a scrub? It really would be nice to detect, in advance, the failure of
> log devices that are claiming to write correctly but which are really unreadable.

A scrub will read the log blocks, but only for unplayed logs. Because of the transient nature of the log, and because it operates outside of the transaction group model, it's hard to read the in-flight log blocks to validate them.

There have previously been suggestions to read slogs periodically. I don't know if there's a CR raised for this though.

Neil.
On Wed, 7 Apr 2010, Neil Perrin wrote:> There have previously been suggestions to read slogs periodically. I > don''t know if there''s a CR raised for this though.Roch wrote up CR 6938883 "Need to exercise read from slog dynamically" Regards, markm
On Wed, 7 Apr 2010, Edward Ned Harvey wrote:

>> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>>
>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
>> before the crash, you don't have to mirror your log device.
>
> If you have a system crash *and* a failed log device at the same time, this is an
> important consideration. But if you have either a system crash or a failed log device
> that don't happen at the same time, then your sync writes are safe, right up to the
> nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.

The point is that the slog is a write-only device, and a device which fails such that it acks each write but fails to read the data that it "wrote" could silently fail at any time during the normal operation of the system. It is not necessary for the slog device to fail at the exact same time that the system spontaneously reboots. I don't know if Solaris implements a background scrub of the slog as a normal course of operation, which would cause a device with this sort of failure to be exposed quickly.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Wed, 7 Apr 2010, Edward Ned Harvey wrote:

> BTW, does the system *ever* read from the log device during normal operation? Such as,
> perhaps, during a scrub? It really would be nice to detect, in advance, the failure of
> log devices that are claiming to write correctly but which are really unreadable.

To make matters worse, an SSD with a large cache might satisfy such reads from its cache, so a "scrub" of the (possibly) tiny bit of pending synchronous writes may not validate anything. A lightly loaded slog should usually be empty. We already know that some (many?) SSDs are not very good about persisting writes to FLASH, even after acking a cache flush request.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Apr 7, 2010, at 10:19 AM, Bob Friesenhahn wrote:

> On Wed, 7 Apr 2010, Edward Ned Harvey wrote:
>>> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>>>
>>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30
>>> seconds before the crash, you don't have to mirror your log device.
>>
>> If you have a system crash *and* a failed log device at the same time, this is an
>> important consideration. But if you have either a system crash or a failed log device
>> that don't happen at the same time, then your sync writes are safe, right up to the
>> nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.
>
> The point is that the slog is a write-only device, and a device which fails such that
> it acks each write but fails to read the data that it "wrote" could silently fail at
> any time during the normal operation of the system. It is not necessary for the slog
> device to fail at the exact same time that the system spontaneously reboots. I don't
> know if Solaris implements a background scrub of the slog as a normal course of
> operation, which would cause a device with this sort of failure to be exposed quickly.

You are playing against marginal returns. An ephemeral storage requirement is very different from a permanent storage requirement. For permanent storage services, scrubs work well -- you can have good assurance that if you read the data once, then you will likely be able to read the same data again, with some probability based on the expected decay of the data. For ephemeral data, you do not read the same data more than once, so there is no correlation between reading once and reading again later. In other words, testing the readability of an ephemeral storage service is like a cat chasing its tail. IMHO, this is particularly problematic for contemporary SSDs that implement wear leveling.

<sidebar>
For clusters the same sort of problem exists for path monitoring. If you think about paths (networks, SANs, cups-n-strings) then there is no assurance that a failed transfer means all subsequent transfers will also fail. Some other permanence test is required to predict future transfer failures. s/fail/pass/g
</sidebar>

Bottom line: if you are more paranoid, mirror the separate log devices and sleep through the night. Pleasant dreams! :-)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
>>>>> "jr" == Jeroen Roodhart <j.r.roodhart at uva.nl> writes:jr> Running OSOL nv130. Power off the machine, removed the F20 and jr> power back on. Machines boots OK and comes up "normally" with jr> the following message in ''zpool status'': yeah, but try it again and this time put rpool on the F20 as well and try to import the pool from a LiveCD: if you lose zpool.cache at this stage, your pool is toast.</end repeat mode> -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100407/3f4076b4/attachment.bin>
On 7 apr 2010, at 18.13, Edward Ned Harvey wrote:

>> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>>
>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
>> before the crash, you don't have to mirror your log device.
>
> If you have a system crash *and* a failed log device at the same time, this is an
> important consideration. But if you have either a system crash or a failed log device
> that don't happen at the same time, then your sync writes are safe, right up to the
> nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.

Right, but if you have a power or a hardware problem, chances are that more things really break at the same time, including the slog device(s).

>> I'd say that "In zpool version 19 or greater, it is recommended not to mirror log
>> devices." is not very good advice and should be changed.
>
> See above. Still disagree?
>
> If desired, I could clarify the statement by basically pasting what's written above.

I believe that for a mail server, an NFS server (to be spec compliant), a general-purpose file server and the like, where the last written data is as important as older data (maybe even more so), it would be wise to have at least as good redundancy on the slog as on the data disks. If one can stand the (pretty small) risk of losing the last transaction group before a crash, at the moment typically up to the last 30 seconds of changes, you may have less redundancy on the slog. (And if you don't care at all, like on a web cache perhaps, you could of course disable the ZIL altogether - that is kind of the other end of the scale, which puts this in perspective.)

As Robert M so wisely and simply put it: it is all about understanding a risk. I think the documentation should help people make educated decisions, though I am not right now sure how to put the words to describe this in an easily understandable way.

/ragge