Karl Wagner
2011-Jan-18 14:40 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
Hi all

This is just an off-the-cuff idea at the moment, but I would like to sound it out.

Consider the situation where someone has a large amount of off-site data storage (of the order of 100s of TB or more). They have a slow network link to this storage.

My idea is that this could be used to build the main vdevs for a ZFS pool. On top of this, an array of disks (of the order of TBs to 10s of TB) is available locally, which can be used as L2ARC. There are also smaller, faster arrays (of the order of 100s of GB) which, in my mind, could be used as a ZIL.

Now, in this theoretical situation, in-play read data is kept on the L2ARC, and can be accessed about as fast as if this array were just used as the main pool vdevs. Written data goes to the ZIL, and is then sent down the slow link to the offsite storage. Rarely used data is still available as if on site (it shows up in the same file structure), but is effectively "archived" to the offsite storage.

Now, here comes the problem. According to what I have read, the maximum size for the ZIL is approx 50% of the physical memory in the system, which would be too small for this particular situation. Also, you cannot mirror the L2ARC, which would have dire performance consequences in the case of a disk failure in the L2ARC. I also believe (correct me if I am wrong) that the L2ARC is invalidated on reboot, so it would have to "warm up" again. And finally, if the network link were to die, I am assuming the entire zpool would become unavailable. This is a setup which I can see many use cases for, but it introduces too many failure modes.

What I would like to see is an extension to ZFS's hierarchical storage environment, such that an additional layer can be put behind the main pool vdevs as an "archive" store (i.e. it goes [ARC]->[L2ARC/ZIL]->[main]->[archive]). Infrequently used files/blocks could be pushed into this storage, but would appear to be available as normal. It would, for example, allow old snapshot data to be pushed down, as this is very rarely going to be used, or files which must be archived for legal reasons. It would also utilise the available bandwidth more efficiently, as only data being specifically sent to it would need transferring. In the case where the archive storage becomes unavailable, there would be a number of possible actions (e.g. error on access, block on access, make the files "disappear" temporarily).

I know there are already solutions out there which do similar jobs. The company I work for uses one which pushes "archive" data to a tape stacker and pulls it back when accessed. But I think this is a ripe candidate for becoming part of the ZFS stack.

So, what does everyone think?

Rgds
Karl
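P.S. For concreteness, the kind of layout I have in mind might be put together roughly like this. This is only a sketch: the device names are invented, it assumes the remote storage can be presented to the host as block devices (e.g. iSCSI LUNs reached over the slow link), and I have not tested it.

    # Remote LUNs (across the slow link) form the main vdevs.
    zpool create tank raidz2 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0

    # Local TB-scale disks act as L2ARC (cache devices cannot be mirrored).
    zpool add tank cache c2t0d0 c2t1d0

    # A small, fast local mirror acts as the ZIL (log device).
    zpool add tank log mirror c3t0d0 c3t1d0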
Edward Ned Harvey
2011-Jan-19 01:41 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Karl Wagner
>
> Consider the situation where someone has a large amount of off-site data
> storage (of the order of 100s of TB or more). They have a slow network link
> to this storage.
>
> My idea is that this could be used to build the main vdevs for a ZFS pool.
> On top of this, an array of disks (of the order of TBs to 10s of TB) is
> available locally, which can be used as L2ARC. There are also smaller,
> faster arrays (of the order of 100s of GB) which, in my mind, could be used
> as a ZIL.
>
> Now, in this theoretical situation, in-play read data is kept on the L2ARC,
> and can be accessed about as fast as if this array was just used as the main
> pool vdevs. Written data goes to the ZIL, and is then sent down the slow link
> to the offsite storage. Rarely used data is still available as if on site
> (shows up in the same file structure), but is effectively "archived" to the
> offsite storage.
>
> Now, here comes the problem. According to what I have read, the maximum size
> for the ZIL is approx 50% of the physical memory in the system, which would

Here's the bigger problem: you seem to be thinking of the ZIL as a write buffer. This is not the case. The ZIL only allows sync writes to become async writes, which are buffered in RAM. Depending on your system, it will refuse to buffer more than 5 or 30 seconds of async writes, and your async writes are still going to be slow.

Also, the L2ARC is not persistent, and there is a maximum fill rate (which I don't know much about). So populating the L2ARC might not happen as fast as you want, and every time you reboot it will have to be repopulated.

If at all possible, instead of using the remote storage as the primary storage, you can use the remote storage to receive incremental periodic snapshots, and that would perform optimally, because the remote storage is then isolated from rapid volatile changes. The zfs send | zfs receive datastreams will be full of large sequential blocks and not small random IO.

Most likely you will gain performance by enabling both compression and dedup. But of course, that depends on the nature of your data.

> And finally, if the network link was to die, I am assuming the entire zpool
> would become unavailable.

The behavior in this situation is configurable via "failmode". The default is "wait", which essentially pauses the filesystem until the disks become available again. Unfortunately, until the disks become available again, the system can become ... pretty undesirable to use, and possibly require a power cycle.

You can also use "panic" or "continue", which you can read about in the zpool manpage if you want.

> vdevs as an "archive" store (i.e. it goes
> [ARC]->[L2ARC/ZIL]->[main]->[archive]). Infrequently used files/blocks could

You're pretty much describing precisely what I'm suggesting... using zfs send | zfs receive.

I suppose the difference between what you're suggesting and what I'm suggesting is the separation of two pools versus "misrepresenting" the remote storage as part of the local pool, etc. That's a pretty major architectural change.
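For the record, the periodic send/receive approach I mean is roughly this (the pool, dataset, and host names below are made up for illustration):

    # Take a new snapshot and push only the delta since the last one
    # across the slow link to the remote pool.
    zfs snapshot localpool/data@2011-01-19
    zfs send -i localpool/data@2011-01-18 localpool/data@2011-01-19 | \
        ssh remotehost zfs receive -F remotepool/data

    # Depending on the data, compression and dedup on the receiving
    # datasets may help.
    ssh remotehost zfs set compression=on remotepool/data
    ssh remotehost zfs set dedup=on remotepool/data

    # The failmode behavior mentioned above is a pool property, e.g.:
    zpool set failmode=continue somepool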
Karl Wagner
2011-Jan-19 18:05 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
> -----Original Message-----
> From: Edward Ned Harvey
> [mailto:opensolarisisdeadlongliveopensolaris at nedharvey.com]
> Sent: 19 January 2011 01:42
> To: 'Karl Wagner'; zfs-discuss at opensolaris.org
> Subject: RE: [zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow
> storage
>
> [...]
>
> You're pretty much describing precisely what I'm suggesting... using zfs
> send | zfs receive.
>
> I suppose the difference between what you're suggesting and what I'm
> suggesting is the separation of two pools versus "misrepresenting" the
> remote storage as part of the local pool, etc. That's a pretty major
> architectural change.

I understand the use of ZFS send/receive. The only real problem with this approach is that the data, assuming you snapshot it, send it out, then delete it, is no longer available locally. The idea is to have all data appear to be locally available.

To think of it another way, suppose you have SSD-based ZIL and L2ARC and some fast SCSI discs. This gives you a nice, fast pool, but not enough storage. So you want to add some "slow" large SATA discs to the pool. As things stand, to the best of my knowledge, data would mostly be written to the slower discs, as they are larger so ZFS chooses them first. At best they would be written to fairly equally at first. But this is not what you really want. You could set up another pool and manually push old data to it, but you then have the problem of either making users find their data, or automating the process yourself.

ZFS already deals with the concept of tiered/hierarchical storage, but what I am suggesting is a natural extension of this idea to additional levels. I don't think it is a major architectural change, more a slight implementation change with a few additional features.

Thanks for your comments anyway. The reason I posted this in the first place was to see what people thought :)

Regards
Karl
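P.S. To make the "second pool plus manual migration" workaround concrete, here is a rough sketch of what I mean (pool, dataset, and mountpoint names are invented): push a cold dataset to an "archive" pool, then remount it at the old path so users still find it where they expect.

    # Copy the cold dataset to the archive pool.
    zfs snapshot fastpool/projects/old@migrate
    zfs send fastpool/projects/old@migrate | zfs receive archive/projects-old

    # Free the space on the fast pool, then make the archived copy
    # appear at the original location.
    zfs destroy -r fastpool/projects/old
    zfs set mountpoint=/export/projects/old archive/projects-old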
Brandon High
2011-Jan-19 19:07 UTC
[zfs-discuss] Request for comments: L2ARC, ZIL, RAM, and slow storage
On Tue, Jan 18, 2011 at 6:40 AM, Karl Wagner <karl at mouse-hole.com> wrote:
> What I would like to see is an extension to ZFS's hierarchical storage
> environment, such that an additional layer can be put behind the main pool
> vdevs as an "archive" store (i.e. it goes
> [ARC]->[L2ARC/ZIL]->[main]->[archive]). Infrequently used files/blocks could
> be pushed into this storage, but appear to be available as normal. It would,
> for example, allow old snapshot data to be pushed down, as this is very
> rarely going to be used, or files which must be archived for legal reasons.
> It would also utilise the bandwidth available more efficiently, as only data
> being specifically sent to it would need transferring.

This is exactly what an HSM does. Take a look at SamFS / QFS.

-B

--
Brandon High : bhigh at freaks.com