Karl Wagner
2011-Jan-18 14:40 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
Hi all

This is just an off-the-cuff idea at the moment, but I would like to sound it out.

Consider the situation where someone has a large amount of off-site data storage (of the order of 100s of TB or more). They have a slow network link to this storage.

My idea is that this could be used to build the main vdevs for a ZFS pool. On top of this, an array of disks (of the order of TBs to 10s of TB) is available locally, which can be used as L2ARC. There are also smaller, faster arrays (of the order of 100s of GB) which, in my mind, could be used as a ZIL.

Now, in this theoretical situation, in-play read data is kept on the L2ARC, and can be accessed about as fast as if this array were just used as the main pool vdevs. Written data goes to the ZIL, and is then sent down the slow link to the offsite storage. Rarely used data is still available as if on site (it shows up in the same file structure), but is effectively "archived" to the offsite storage.

Now, here comes the problem. According to what I have read, the maximum size for the ZIL is approx 50% of the physical memory in the system, which would be too small for this particular situation. Also, you cannot mirror the L2ARC, which would have dire performance consequences in the case of a disk failure in the L2ARC. I also believe (correct me if I am wrong) that the L2ARC is invalidated on reboot, so it would have to "warm up" again. And finally, if the network link were to die, I am assuming the entire zpool would become unavailable. This is a setup which I can see many use cases for, but it introduces too many failure modes.

What I would like to see is an extension to ZFS's hierarchical storage environment, such that an additional layer can be put behind the main pool vdevs as an "archive" store (i.e. it goes [ARC]->[L2ARC/ZIL]->[main]->[archive]). Infrequently used files/blocks could be pushed into this storage, but would appear to be available as normal. It would, for example, allow old snapshot data to be pushed down, as this is very rarely going to be used, or files which must be archived for legal reasons. It would also utilise the available bandwidth more efficiently, as only data being specifically sent to it would need transferring. In the case where the archive storage becomes unavailable, there would be a number of possible actions (e.g. error on access, block on access, make the files "disappear" temporarily).

I know there are already solutions out there which do similar jobs. The company I work for uses one which pushes "archive" data to a tape stacker and pulls it back when accessed. But I think this is a ripe candidate for becoming part of the ZFS stack.

So, what does everyone think?

Rgds
Karl
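P.S. For concreteness, the kind of layout I have in mind might be put together roughly like this. This is only a sketch: the device names are invented, it assumes the remote storage can be presented to the host as block devices (e.g. iSCSI LUNs reached over the slow link), and I have not tested it.

    # Remote LUNs (across the slow link) form the main vdevs.
    zpool create tank raidz2 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0

    # Local TB-scale disks act as L2ARC (cache devices cannot be mirrored).
    zpool add tank cache c2t0d0 c2t1d0

    # A small, fast local mirror acts as the ZIL (log device).
    zpool add tank log mirror c3t0d0 c3t1d0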
Edward Ned Harvey
2011-Jan-19 01:41 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Karl Wagner
>
> Consider the situation where someone has a large amount of off-site data
> storage (of the order of 100s of TB or more). They have a slow network link
> to this storage.
>
> My idea is that this could be used to build the main vdevs for a ZFS pool.
> On top of this, an array of disks (of the order of TBs to 10s of TB) is
> available locally, which can be used as L2ARC. There are also smaller,
> faster arrays (of the order of 100s of GB) which, in my mind, could be used
> as a ZIL.
>
> Now, in this theoretical situation, in-play read data is kept on the L2ARC,
> and can be accessed about as fast as if this array was just used as the main
> pool vdevs. Written data goes to the ZIL, and is then sent down the slow link
> to the offsite storage. Rarely used data is still available as if on site
> (shows up in the same file structure), but is effectively "archived" to the
> offsite storage.
>
> Now, here comes the problem. According to what I have read, the maximum size
> for the ZIL is approx 50% of the physical memory in the system, which would

Here's the bigger problem: you seem to be thinking of the ZIL as a write buffer. This is not the case. The ZIL only allows sync writes to become async writes, which are buffered in RAM. Depending on your system, it will refuse to buffer more than 5 or 30 seconds of async writes, and your async writes are still going to be slow.

Also, the L2ARC is not persistent, and there is a maximum fill rate (which I don't know much about). So populating the L2ARC might not happen as fast as you want, and every time you reboot it will have to be repopulated.

If at all possible, instead of using the remote storage as the primary storage, you can use the remote storage to receive incremental periodic snapshots, and that would perform optimally, because the remote storage is then isolated from rapid volatile changes. The zfs send | zfs receive datastreams will be full of large sequential blocks and not small random IO.

Most likely you will gain performance by enabling both compression and dedup. But of course, that depends on the nature of your data.

> And finally, if the network link was to die, I am assuming the entire zpool
> would become unavailable.

The behavior in this situation is configurable via "failmode". The default is "wait", which essentially pauses the filesystem until the disks become available again. Unfortunately, until the disks become available again, the system can become ... pretty undesirable to use, and possibly require a power cycle.

You can also use "panic" or "continue", which you can read about in the zpool manpage if you want.

> vdevs as an "archive" store (i.e. it goes
> [ARC]->[L2ARC/ZIL]->[main]->[archive]). Infrequently used files/blocks could

You're pretty much describing precisely what I'm suggesting... using zfs send | zfs receive.

I suppose the difference between what you're suggesting and what I'm suggesting is the separation of two pools versus "misrepresenting" the remote storage as part of the local pool, etc. That's a pretty major architectural change.
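For the record, the periodic send/receive approach I mean is roughly this (the pool, dataset, and host names below are made up for illustration):

    # Take a new snapshot and push only the delta since the last one
    # across the slow link to the remote pool.
    zfs snapshot localpool/data@2011-01-19
    zfs send -i localpool/data@2011-01-18 localpool/data@2011-01-19 | \
        ssh remotehost zfs receive -F remotepool/data

    # Depending on the data, compression and dedup on the receiving
    # datasets may help.
    ssh remotehost zfs set compression=on remotepool/data
    ssh remotehost zfs set dedup=on remotepool/data

    # The failmode behavior mentioned above is a pool property, e.g.:
    zpool set failmode=continue somepool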
Karl Wagner
2011-Jan-19 18:05 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
> -----Original Message-----
> From: Edward Ned Harvey
> [mailto:opensolarisisdeadlongliveopensolaris at nedharvey.com]
> Sent: 19 January 2011 01:42
> To: 'Karl Wagner'; zfs-discuss at opensolaris.org
> Subject: RE: [zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow
> storage
>
> [...]
>
> You're pretty much describing precisely what I'm suggesting... using zfs
> send | zfs receive.
>
> I suppose the difference between what you're suggesting and what I'm
> suggesting is the separation of two pools versus "misrepresenting" the
> remote storage as part of the local pool, etc. That's a pretty major
> architectural change.

I understand the use of ZFS send/receive. The only real problem with this approach is that the data, assuming you snapshot it, send it out, then delete it, is no longer available locally. The idea is to have all data appear to be locally available.

To think of it another way, suppose you have SSD-based ZIL and L2ARC and some fast SCSI discs. This gives you a nice, fast pool, but not enough storage. So you want to add some "slow" large SATA discs to the pool. As things stand, to the best of my knowledge, data would mostly be written to the slower discs, as they are larger so ZFS chooses them first. At best they would be written to fairly equally at first. But this is not what you really want. You could set up another pool and manually push old data to it, but you then have the problem of either making users find their data, or automating the process yourself.

ZFS already deals with the concept of tiered/hierarchical storage, but what I am suggesting is a natural extension of this idea to additional levels. I don't think it is a major architectural change, more a slight implementation change with a few additional features.

Thanks for your comments anyway. The reason I posted this in the first place was to see what people thought :)

Regards
Karl
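P.S. To make the "second pool plus manual migration" workaround concrete, here is a rough sketch of what I mean (pool, dataset, and mountpoint names are invented): push a cold dataset to an "archive" pool, then remount it at the old path so users still find it where they expect.

    # Copy the cold dataset to the archive pool.
    zfs snapshot fastpool/projects/old@migrate
    zfs send fastpool/projects/old@migrate | zfs receive archive/projects-old

    # Free the space on the fast pool, then make the archived copy
    # appear at the original location.
    zfs destroy -r fastpool/projects/old
    zfs set mountpoint=/export/projects/old archive/projects-old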
Brandon High
2011-Jan-19 19:07 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
On Tue, Jan 18, 2011 at 6:40 AM, Karl Wagner <karl at mouse-hole.com> wrote:
> What I would like to see is an extension to ZFS's hierarchical storage
> environment, such that an additional layer can be put behind the main pool
> vdevs as an "archive" store (i.e. it goes
> [ARC]->[L2ARC/ZIL]->[main]->[archive]). Infrequently used files/blocks could
> be pushed into this storage, but appear to be available as normal. It would,
> for example, allow old snapshot data to be pushed down, as this is very
> rarely going to be used, or files which must be archived for legal reasons.
> It would also utilise the bandwidth available more efficiently, as only data
> being specifically sent to it would need transferring.

This is exactly what an HSM does. Take a look at SamFS / QFS.

-B

--
Brandon High : bhigh at freaks.com