Greets,

I have read a couple of earlier posts by Jeff and Mark Maybee explaining how ARC
reference counting works. These posts did help clarify this piece of code (a bit
complex, to say the least). I would like to solicit more comments elucidating ARC
reference counting.

The usage pattern I see in arc.c does not seem to follow a simple pattern, like the
one we see in VFS. The crux of the matter: who are the "users" that own these
references?

Perhaps I am misreading the code, but this is how it looks to me:

arc_buf_hdr->b_count keeps track of simultaneous readers of an arc_buf_hdr who
called with a non-null callback parameter. The expected scenario is simultaneous
access from 2 or more files (which are clones or snapshots).

The reason for the ref count and the buffer cloning: to maintain cache integrity
when multiple readers access the same ARC cache entry, and one of these modifies
the entry and ultimately ends up releasing the ARC entry from the cache. At such
time, one of the "users" needs an anonymous entry that can be dirtied and written
out to a different spot, while the other needs an unchanged ARC entry.

This was probably expected to be a relatively rare occurrence, and it is not
expected that a large number of simultaneous readers would access the same ARC
entry.

I am a bit puzzled why a new ARC entry could not be cloned in arc_release, but
perhaps there is a good reason why pre-allocating this data is better than the JIT
alternative.

I notice that the prefetch functions (dbuf_prefetch and traverse_prefetcher) call
arc_read_nolock without a callback parameter. I wonder if this could create a
problem if a prefetch function and a "regular" read simultaneously access the same
ARC cache entry. (Cloning in this case would not happen, so one thread could end up
releasing the entry from the cache while the other is messing with it.)

Comments and corrections to my interpretation are cordially solicited.
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> Greets,
>
> I have read a couple of earlier posts by Jeff and Mark Maybee explaining how ARC
> reference counting works. These posts did help clarify this piece of code (a bit
> complex, to say the least). I would like to solicit more comments elucidating ARC
> reference counting.
>
> The usage pattern I see in arc.c does not seem to follow a simple pattern, like
> the one we see in VFS. The crux of the matter: who are the "users" that own these
> references?
>
> Perhaps I am misreading the code, but this is how it looks to me:
>
> arc_buf_hdr->b_count keeps track of simultaneous readers of an arc_buf_hdr who
> called with a non-null callback parameter. The expected scenario is simultaneous
> access from 2 or more files (which are clones or snapshots).

Correct.

> The reason for the ref count and the buffer cloning: to maintain cache integrity
> when multiple readers access the same ARC cache entry, and one of these modifies
> the entry and ultimately ends up releasing the ARC entry from the cache.

Correct.

> At such time, one of the "users" needs an anonymous entry that can be dirtied
> and written out to a different spot, while the other needs an unchanged ARC
> entry.

Correct.

> This was probably expected to be a relatively rare occurrence, and it is not
> expected that a large number of simultaneous readers would access the same ARC
> entry.

Correct.

> I am a bit puzzled why a new ARC entry could not be cloned in arc_release, but
> perhaps there is a good reason why pre-allocating this data is better than the
> JIT alternative.

We have already handed out a reference to the data at the point that the buffer is
being released, so we cannot allocate a new block "JIT".

> I notice that the prefetch functions (dbuf_prefetch and traverse_prefetcher) call
> arc_read_nolock without a callback parameter. I wonder if this could create a
> problem if a prefetch function and a "regular" read simultaneously access the
> same ARC cache entry. (Cloning in this case would not happen, so one thread could
> end up releasing the entry from the cache while the other is messing with it.)

No, a "regular" read request on a prefetch buffer will end up adding a callback
record.

> Comments and corrections to my interpretation are cordially solicited.

Hope that helps.

-Mark
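To make the pattern Mark confirms above a bit more concrete, the sketch below shows
the general shape of the idea: one shared header counts how many per-reader buffers
have been handed out, each reader with a callback gets its own copy of the data,
and "releasing" a buffer simply detaches it from the shared header so it can be
dirtied on its own. The types and names (sketch_hdr_t, sketch_buf_t, and so on) are
invented stand-ins for illustration, not the actual arc.c declarations, and locking
is omitted.

    /*
     * Simplified sketch of the reference-count-plus-clone pattern discussed
     * above.  Hypothetical types; the real ARC structures are more involved.
     */
    #include <stdlib.h>
    #include <string.h>

    typedef struct sketch_hdr {
            void    *h_data;        /* the cached block contents */
            size_t  h_size;
            int     h_count;        /* number of reader buffers handed out */
    } sketch_hdr_t;

    typedef struct sketch_buf {
            sketch_hdr_t    *b_hdr;
            void            *b_data;   /* this reader's copy of the data */
    } sketch_buf_t;

    /* A reader that supplies a callback gets its own buffer (and data copy). */
    sketch_buf_t *
    sketch_add_reader(sketch_hdr_t *hdr)
    {
            sketch_buf_t *buf = malloc(sizeof (*buf));

            buf->b_hdr = hdr;
            buf->b_data = malloc(hdr->h_size);
            (void) memcpy(buf->b_data, hdr->h_data, hdr->h_size);
            hdr->h_count++;         /* the real code serializes this with a lock */
            return (buf);
    }

    /*
     * "Release" one reader's buffer so it can be dirtied independently: it is
     * detached from the shared header, while the remaining readers keep the
     * unchanged cached copy.
     */
    void
    sketch_release(sketch_buf_t *buf)
    {
            buf->b_hdr->h_count--;
            buf->b_hdr = NULL;      /* this buffer is now "anonymous" */
    }

This also suggests why the copy has to exist before release time: the reader's
b_data pointer was handed out when the buffer was added, so there is no safe point
later at which to substitute a freshly allocated block.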
Thanks for replying.

>> I am a bit puzzled why a new ARC entry could not be cloned in arc_release...
>
> We have already handed out a reference to the data at the point that the buffer
> is being released, so we cannot allocate a new block "JIT".

I think I get it. Each colliding thread needs a separate copy of the buffer from
the point of the read collision, so that they can dirty these if they so desire.
You have to be prepared for the worst-case scenario, where all threads end up
modifying their buffer. We may end up with N different buffer contents for N
possible colliding threads. COW is a bit of a mind-bender to get used to.

So if we have a number of threads colliding on read, we have to maintain all of the
cloned data in the ARC, even if none of the threads ends up modifying their buffer.
--
This message posted from opensolaris.org
I started to look at ref counting to convince myself that the db_buf field in a
cached dmu_buf_impl_t object is guaranteed to point at a valid arc_buf_t.

I have seen a "deadbeef" crash on a busy system when zfs_write() is pre-pagefaulting
in the file's pages.

The page fault handler eventually winds its way to dbuf_hold_impl, which manages to
find a cached dmu_buf_impl_t record. This record, however, points to a freed
arc_buf_t via its b_data field. The field is not null, but it points to a freed
object, hence the crash upon trying to lock the rwlock of the alleged arc_buf.

Ref counting should prevent something like this, correct?
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> I started to look at ref counting to convince myself that the db_buf field in a
> cached dmu_buf_impl_t object is guaranteed to point at a valid arc_buf_t.
>
> I have seen a "deadbeef" crash on a busy system when zfs_write() is
> pre-pagefaulting in the file's pages.
>
> The page fault handler eventually winds its way to dbuf_hold_impl, which manages
> to find a cached dmu_buf_impl_t record. This record, however, points to a freed
> arc_buf_t via its b_data field. The field is not null, but it points to a freed
> object, hence the crash upon trying to lock the rwlock of the alleged arc_buf.
>
> Ref counting should prevent something like this, correct?

Correct. If you are running recent bits and have a core file please file a bug on
this.

-Mark
Hello,

I believe the following is true, correct me if it is not:

If more than one object references a block (e.g. 2 files have the same block open),
there must be multiple clones of the arc_buf_t (and associated dmu_buf_impl_t)
records present, one for each of the objects. This is always so, even if the block
is not modified, "just in case the block should end up being modified". So: if
there are 100 files accessing the same block in the same txg, there will be 100
clones of the data, even if none of the files ultimately modifies this block. Seems
a bit wasteful.

This does not feel like COW to me; rather, "copy always, just in case", at least in
the arc/dmu realm.

I fail to see why the above scenario should not be able to get by with a single,
shared, reference-counted record. A clone would only have to be made of a block if
a given file decides to modify the block. As it is, reference counting is
significantly complicated by mixing it with this pre-cloning.

On to some code comprehension questions:

It seems to me that the conceptual model of a file in the dmu layer is: a number of
dmu buffers, hanging off of a dnode (i.e. the per-dnode list formed via the db_link
"list enabler"). Not all blocks of the file are in this list, only the "active"
ones. I take "active" to mean "recently accessed".

There is a somewhat opaque aspect to dmu that is missing from the otherwise
excellent data structure chart. I am talking about dirty buffer management.

db_data_pending? db_last_dirty? db_dirtycnt? Could someone provide the 10K mile
overview on dirty buffers?

The dbuf_states are a bit of a mystery:

What is the difference between "DB_READ" and "DB_FILL"?

My guess, maybe the data is coming from a different direction into the cache.
From below: read from disk (maybe).
From above: nascent data coming from an application (newly created data?).

I am guessing DB_NOFILL is a short-circuit path to throw obsoleted data away. It
would be nice to comment the states (beyond an unexplained state transition
diagram).

ZFS would be more approachable to newcomers if the code was a bit more commented.
I am not talking about copious comments, just every field in the major data
structures, and at minimum a one-liner per function as to what the function does.

Yes, given enough perseverance and a lot of time one can figure everything out from
studying the usage patterns, but the pain of this could be lessened.

The more people understand ZFS, the stronger it will become.
--
This message posted from opensolaris.org
Hi Jeremy,

Jeremy Archer wrote:
> Hello,
>
> I believe the following is true, correct me if it is not:
>
> If more than one object references a block (e.g. 2 files have the same block
> open), there must be multiple clones of the arc_buf_t (and associated
> dmu_buf_impl_t) records present, one for each of the objects. This is always so,
> even if the block is not modified, "just in case the block should end up being
> modified". So: if there are 100 files accessing the same block in the same txg,
> there will be 100 clones of the data, even if none of the files ultimately
> modifies this block. Seems a bit wasteful.
>
> This does not feel like COW to me; rather, "copy always, just in case", at least
> in the arc/dmu realm.

Correct. Memory management is not currently "COW". Note that, currently, this is
not a significant issue for most environments. The only way to "share" a data block
is if the same file is accessed simultaneously from multiple clones and/or
snapshots. This is rare. However, with the advent of dedup (coming soon), this will
become a bigger issue.

> I fail to see why the above scenario should not be able to get by with a single,
> shared, reference-counted record. A clone would only have to be made of a block
> if a given file decides to modify the block. As it is, reference counting is
> significantly complicated by mixing it with this pre-cloning.

The "simple solution" you propose is actually quite complicated to implement. We
are working on it though.

> On to some code comprehension questions:
>
> It seems to me that the conceptual model of a file in the dmu layer is: a number
> of dmu buffers, hanging off of a dnode (i.e. the per-dnode list formed via the
> db_link "list enabler"). Not all blocks of the file are in this list, only the
> "active" ones. I take "active" to mean "recently accessed".
>
> There is a somewhat opaque aspect to dmu that is missing from the otherwise
> excellent data structure chart. I am talking about dirty buffer management.
>
> db_data_pending? db_last_dirty? db_dirtycnt? Could someone provide the 10K mile
> overview on dirty buffers?

Dirty buffers are buffers that have been modified, and so must be written to stable
storage. Because IO is staged to disk, we have to manage these buffers in separate
lists: a dirty buffer goes onto the list corresponding to the txg it belongs to.

> The dbuf_states are a bit of a mystery:
>
> What is the difference between "DB_READ" and "DB_FILL"?
>
> My guess, maybe the data is coming from a different direction into the cache.
> From below: read from disk (maybe).

Yes.

> From above: nascent data coming from an application (newly created data?).

Yes.

> I am guessing DB_NOFILL is a short-circuit path to throw obsoleted data away. It
> would be nice to comment the states (beyond an unexplained state transition
> diagram).

This is used when we are pre-allocating space. It is only used in special
circumstances (i.e., creating a swap device).

> ZFS would be more approachable to newcomers if the code was a bit more commented.
> I am not talking about copious comments, just every field in the major data
> structures, and at minimum a one-liner per function as to what the function does.
>
> Yes, given enough perseverance and a lot of time one can figure everything out
> from studying the usage patterns, but the pain of this could be lessened.
>
> The more people understand ZFS, the stronger it will become.

I agree. We haven't been as good as we should about commenting.
You are welcome to submit code updates that improve this. :-)

-Mark
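Mark's description of dirty buffer staging can be pictured roughly as below: a
buffer dirtied in txg N is filed on the list belonging to N, so each txg can be
synced out on its own. This is an invented, simplified sketch (hypothetical
sketch_* names, a fixed-size array of lists, no locking), not the actual
dbuf.c/dnode.c code.

    /*
     * Simplified sketch of per-txg dirty-buffer staging.  Hypothetical types
     * and names, not the real ZFS structures.
     */
    #include <stdint.h>

    #define SKETCH_TXG_SIZE 4               /* txgs "in flight" at once */
    #define SKETCH_TXG_MASK (SKETCH_TXG_SIZE - 1)

    typedef struct sketch_dirty {
            uint64_t                dr_txg;     /* txg this change belongs to */
            void                    *dr_data;   /* the modified buffer contents */
            struct sketch_dirty     *dr_next;   /* next dirty buffer, same txg */
    } sketch_dirty_t;

    /* One dirty list per in-flight txg. */
    sketch_dirty_t *dirty_lists[SKETCH_TXG_SIZE];

    void
    sketch_dirty_buffer(sketch_dirty_t *dr, uint64_t txg)
    {
            int idx = (int)(txg & SKETCH_TXG_MASK);

            dr->dr_txg = txg;
            dr->dr_next = dirty_lists[idx];     /* real code uses list_t + locks */
            dirty_lists[idx] = dr;
    }

    /* When txg N syncs, only the buffers dirtied in txg N are written out. */
    sketch_dirty_t *
    sketch_list_for_txg(uint64_t txg)
    {
            return (dirty_lists[txg & SKETCH_TXG_MASK]);
    }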
Thanks for the explanation.

> .... a dirty buffer goes onto the list corresponding to the txg it belongs to.

Ok. I see that all dirty buffers are put on a per-txg list. This is for easy
synchronization, makes sense.

The per dmu_buf_impl_t details are a bit fuzzy.

I see there can be more than one dirty buffer per dmu_buf_impl_t. Are these
multiple "edited versions" of the same buffer that occurred in the same txg?

Looks like the dirty buffer (leaf) records point to an arc_buf_t, which is
presumably yet un-cached (until it is synched out).

db_data_pending points to the most current of these. Seems that db_data_pending
always points to the last (newest) dirty buffer for the dmu buffer.

At the moment, the purpose of maintaining the per-dbuf dirty list eludes me.
(Maybe to dispose of these except for the latest one, which will be written out to
disk.)

Also: what is BP_IS_HOLE? I see it has no birth_txg.
--
This message posted from opensolaris.org
On Mon, Jun 22, 2009 at 02:28:25PM -0700, Jeremy Archer wrote:
> Also: what is BP_IS_HOLE? I see it has no birth_txg.

A "HOLE" BP is similar to a NULL pointer; it doesn't point to anything, and is used
to indicate there is no data underneath that block pointer. The representation is
all-zeros. Since any real block has to have a (non-zero) birth txg, we only test
birth_txg.

An example of the use of holes is in a file's data tree. A file which was created
like:

    fd = open("foo", O_CREAT, 0777);
    pwrite(fd, "foo", 3, 1024*1024);
    close(fd);

will only have one block in its data tree; the rest of the pointers will be HOLEs.
When compression is turned on, any data block of all-zeros will be transformed into
a HOLE, instead of being written to disk.

The SEEK_HOLE and SEEK_DATA arguments to lseek(2) can be used to find the actual
data in a file.

Cheers,
- jonathan
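As a small usage example of that last point, the sketch below walks the data
regions of a sparse file with lseek(2). It assumes a platform that provides
SEEK_DATA and SEEK_HOLE (Solaris does); "foo" is just the example file from above,
and error handling is kept minimal.

    /*
     * Print the data regions of a (possibly sparse) file using SEEK_DATA
     * and SEEK_HOLE.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
            int fd = open("foo", O_RDONLY);
            off_t data = 0, hole;

            if (fd < 0)
                    return (1);

            /* Find each run of data: next data at/after 'data', then the hole after it. */
            while ((data = lseek(fd, data, SEEK_DATA)) != -1) {
                    hole = lseek(fd, data, SEEK_HOLE);
                    (void) printf("data: offset %lld, length %lld\n",
                        (long long)data, (long long)(hole - data));
                    data = hole;
            }
            /* lseek() returns -1 with errno == ENXIO once there is no more data. */

            (void) close(fd);
            return (0);
    }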
Thanks for clarifying the HOLEs.

Why are there (possibly) multiple dbuf_dirty_record_t's hanging off of a given
dmu_buf_impl_t?
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> Thanks for the explanation.
>
>> .... a dirty buffer goes onto the list corresponding to the txg it belongs to.
>
> Ok. I see that all dirty buffers are put on a per-txg list. This is for easy
> synchronization, makes sense.
>
> The per dmu_buf_impl_t details are a bit fuzzy.
>
> I see there can be more than one dirty buffer per dmu_buf_impl_t. Are these
> multiple "edited versions" of the same buffer that occurred in the same txg?

If a dbuf is "dirtied" in more than one txg, then it will have multiple dirty
records. So these are the "edited versions" separated by transaction group
boundaries.

> Looks like the dirty buffer (leaf) records point to an arc_buf_t, which is
> presumably yet un-cached (until it is synched out).

Correct. It becomes "cached" when it obtains an "identity", which does not happen
until we write it out.

> db_data_pending points to the most current of these. Seems that db_data_pending
> always points to the last (newest) dirty buffer for the dmu buffer.

No, db_data_pending points to the record that is currently being written.

> At the moment, the purpose of maintaining the per-dbuf dirty list eludes me.
> (Maybe to dispose of these except for the latest one, which will be written out
> to disk.)

See my comment above. We must maintain separate versions or else get "future leaks"
(record data written in txg n+1 in txg n) and possibly trash our checksum (we must
not change the data once the checksum is calculated).

> Also: what is BP_IS_HOLE? I see it has no birth_txg.
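A rough picture of what Mark describes: each dbuf keeps one dirty record per txg in
which it was modified (newest first), and a db_data_pending-style pointer marks the
record currently being synced out. The sketch below uses invented stand-in names,
not the real dbuf.h declarations, and omits all locking and reference counting.

    /*
     * Simplified sketch of a per-dbuf dirty record chain: one record per
     * txg in which the buffer was dirtied.  Hypothetical types and names.
     */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct sketch_dirty_record {
            uint64_t                        dr_txg;    /* txg of this version */
            void                            *dr_data;  /* this txg's data */
            struct sketch_dirty_record      *dr_next;  /* older record, if any */
    } sketch_dirty_record_t;

    typedef struct sketch_dbuf {
            void                    *db_data;          /* open-txg contents */
            sketch_dirty_record_t   *db_last_dirty;    /* newest dirty record */
            sketch_dirty_record_t   *db_data_pending;  /* record being written */
    } sketch_dbuf_t;

    /*
     * Dirtying the dbuf in a txg it has not been dirtied in yet creates a new
     * record; dirtying it again in the same txg reuses the existing one.
     */
    sketch_dirty_record_t *
    sketch_dbuf_dirty(sketch_dbuf_t *db, sketch_dirty_record_t *newdr,
        uint64_t txg)
    {
            if (db->db_last_dirty != NULL && db->db_last_dirty->dr_txg == txg)
                    return (db->db_last_dirty);

            newdr->dr_txg = txg;
            newdr->dr_next = db->db_last_dirty;
            db->db_last_dirty = newdr;
            return (newdr);
    }

    /* When a txg syncs, its matching record becomes the pending one. */
    void
    sketch_dbuf_sync(sketch_dbuf_t *db, uint64_t txg)
    {
            sketch_dirty_record_t *dr;

            for (dr = db->db_last_dirty; dr != NULL; dr = dr->dr_next) {
                    if (dr->dr_txg == txg) {
                            db->db_data_pending = dr;
                            break;
                    }
            }
    }

Keeping the versions separate is what prevents the "future leaks" Mark mentions:
data written in txg n must not be silently replaced by data belonging to txg n+1,
and it must not change after its checksum has been computed.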
Awesome, thanks.
--
This message posted from opensolaris.org
Hello Jonathan,

You are saying that the SEEK_HOLE and SEEK_DATA are implemented on the "birth_txg"
and not on the "fill_count"?

Leal
[ http://www.eall.com.br/blog ]
--
This message posted from opensolaris.org
On Wed, Jun 24, 2009 at 10:58:49AM -0700, Marcelo Leal wrote:
> Hello Jonathan,
>
> You are saying that the SEEK_HOLE and SEEK_DATA are implemented on the
> "birth_txg" and not on the "fill_count"?

Um, no, I don't think so. I was just saying that most people checking for holes use
"blk_birth" to check for them. Since all the fields are zero, that's not always the
case.

For SEEK_*, it's actually based on the fill value.

uts/common/fs/zfs/dnode.c:

    static int
    dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
        int lvl, uint64_t blkfill, uint64_t txg)
    {
    ...
            if (bp[i].blk_fill >= minfill &&
                bp[i].blk_fill <= maxfill &&
                (hole || bp[i].blk_birth > txg))
                    break;
    ...

Cheers,
- jonathan
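For reference, the hole test itself boils down to checking that the block pointer
was never born. The sketch below paraphrases that check with invented names; it is
not a verbatim copy of the ZFS headers, and the real blkptr_t carries more fields
than shown here.

    /*
     * Rough sketch of the hole test being discussed: a block pointer that
     * was never written has a zero birth txg, so "is this a hole?" reduces
     * to a birth-txg check.  Illustrative only.
     */
    #include <stdint.h>

    typedef struct sketch_blkptr {
            uint64_t blk_birth;     /* txg the block was written in; 0 = never */
            uint64_t blk_fill;      /* non-hole blocks beneath this bp */
            /* dvas, checksum, compression, etc. elided */
    } sketch_blkptr_t;

    #define SKETCH_BP_IS_HOLE(bp)   ((bp)->blk_birth == 0)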
I am pondering the fact that the txg is part of the hash key for the ARC. It seems
to me this has a profound implication: read caching is per txg.

If I read a record into the cache in txg N, and the current txg is closed, this
record becomes effectively un-cached in the new (N+1) txg. (Well, there may be some
on-going operation that still references the txg N entry, but for read caching I
believe this is correct.)

So the next read will have to read it from disk again, and put it into the ARC
again, this time with txg N+1. Now we have 2 ARC entries for the same piece of
data. The entry for txg N will get aged out of the cache eventually, but in theory
there may be potentially many entries of the same (identical data content and
location) data block present in the ARC, if we keep re-reading the same block.

Is my understanding correct?
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> I am pondering the fact that the txg is part of the hash key for the ARC. It
> seems to me this has a profound implication: read caching is per txg.

I assume you are looking at BUF_HASH_INDEX() and buf_hash() in
$SRC/uts/common/fs/zfs/arc.c, right?

Note that the txg is the *birth* txg of the block, which doesn't change. It is that
txg that is used in the hash, not the txg that we are doing the read of the block
in.

--
Darren J Moffat
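The point Darren makes can be sketched like this: the cache key is built from where
the block lives on disk (its DVA, plus the pool) and the txg it was born in, so a
DVA that gets reused by a later write hashes to a different cache entry.
sketch_buf_hash() below is an invented stand-in for illustration, not the actual
buf_hash()/BUF_HASH_INDEX() code in arc.c; the mixing constant is arbitrary.

    /*
     * Sketch: identify a cached block by pool, DVA, and birth txg.
     * Hypothetical names; not the real ARC hash.
     */
    #include <stdint.h>

    typedef struct sketch_dva {
            uint64_t dva_word[2];   /* encoded vdev/offset, as in a real DVA */
    } sketch_dva_t;

    uint64_t
    sketch_buf_hash(uint64_t spa, const sketch_dva_t *dva, uint64_t birth_txg)
    {
            uint64_t h = spa;

            /* Mix in the on-disk address and the birth txg. */
            h ^= dva->dva_word[0];
            h ^= dva->dva_word[1];
            h ^= birth_txg;
            h *= 0x9e3779b97f4a7c15ULL;     /* arbitrary odd constant */

            return (h);
    }

With such a key, {DVA 5, birth txg 5} and {DVA 5, birth txg 15} are distinct cache
entries, as in the example Jonathan gives later in the thread.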
Hi Darren,

Thanks for replying.

I am confused then. What is the purpose of incorporating the birth txg in the ARC
hash?

How do 2 different files find a shared block?

Obviously, the first one has to read it from disk, and subsequently remember the
birth txg in addition to the dva.

How does the 2nd file know what is the birth txg of the desired block? Does each
file have to do an initial read from disk to learn the birth txg of the block?

I am looking for easy-to-understand metaphors, such as "cache entries are valid for
the current txg only"... but it looks like I am out of luck in this case.
--
This message posted from opensolaris.org
The dva is just part of a block pointer blkptr_t. It's the blkptr_t that is stored
on disk and used to traverse the pool. It also contains such details as the block
size, ditto dvas (redundant blocks), the birth txg, block checksum, endianness, ...

Neil.

Jeremy Archer wrote:
> Hi Darren,
>
> Thanks for replying.
>
> I am confused then. What is the purpose of incorporating the birth txg in the
> ARC hash?
>
> How do 2 different files find a shared block?
>
> Obviously, the first one has to read it from disk, and subsequently remember the
> birth txg in addition to the dva.
>
> How does the 2nd file know what is the birth txg of the desired block? Does each
> file have to do an initial read from disk to learn the birth txg of the block?
>
> I am looking for easy-to-understand metaphors, such as "cache entries are valid
> for the current txg only"... but it looks like I am out of luck in this case.
On Tue, Jun 30, 2009 at 09:08:17AM -0700, Jeremy Archer wrote:
> Hi Darren,
>
> Thanks for replying.
>
> I am confused then. What is the purpose of incorporating the birth txg in the
> ARC hash?

Consider:

    TXG   action
    5     write of first block of File A, assigned DVA 5, birth TXG 5
    10    file A is deleted
    15    write to first block of File B, assigned DVA 5, birth TXG 15

The two blocks are distinct, and are cached separately in the ARC.

Cheers,
- jonathan

> How do 2 different files find a shared block?
>
> Obviously, the first one has to read it from disk, and subsequently remember the
> birth txg in addition to the dva.
>
> How does the 2nd file know what is the birth txg of the desired block? Does each
> file have to do an initial read from disk to learn the birth txg of the block?
>
> I am looking for easy-to-understand metaphors, such as "cache entries are valid
> for the current txg only"... but it looks like I am out of luck in this case.
> TXG   action
> 5     write of first block of File A, assigned DVA 5, birth TXG 5
> 10    file A is deleted
> 15    write to first block of File B, assigned DVA 5, birth TXG 15
>
> The two blocks are distinct, and are cached separately in the ARC.

Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid. So why do we
need it?
--
This message posted from opensolaris.org
On 30.06.09 21:10, Jeremy Archer wrote:
>> TXG   action
>> 5     write of first block of File A, assigned DVA 5, birth TXG 5
>> 10    file A is deleted
>> 15    write to first block of File B, assigned DVA 5, birth TXG 15
>>
>> The two blocks are distinct, and are cached separately in the ARC.
>
> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
> So why do we need it?

Let's suppose the file is removed from the directory but is still open by some
process...
OK..

> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
> So why do we need it?

>> Let's suppose the file is removed from the directory but is still open by some
>> process...

On a traditional file system, you would not be able to delete an open file. Which I
believe is good.

So, on ZFS you say, you can write to a file that you opened in txg 5 that was
already deleted in txg 10? When txg 5 completes, you will have the updated content,
and when txg 10 finishes, you will have a deleted file.

I guess transactional semantics are hard to get used to.
--
This message posted from opensolaris.org
On Tue, 30 Jun 2009 19:58:22 +0200, Jeremy Archer <j4rch3r at gmail.com> wrote:
>>> Let's suppose the file is removed from the directory but is still open by some
>>> process...
>
> On a traditional file system, you would not be able to delete an open file.
> Which I believe is good.

Even in traditional file systems you can always remove/delete an open file; zfs is
no different here. That is just a namespace operation and removes the file name
from the namespace. The process that has the file open can still use it and its
content. Only the last close will eventually remove the file and its content from
disk.

process 1:

    opteron.batschul./export/home/batschul.=> cat > xyz

process 2:

    opteron.batschul./export/home/batschul.=> ls -la xyz
    -rw-r--r--   1 batschul other          0 Jun 30 20:02 xyz
    opteron.batschul./export/home/batschul.=> fuser xyz
    xyz:     3432o    -> process 1

now remove it:

    opteron.batschul./export/home/batschul.=> rm xyz
    opteron.batschul./export/home/batschul.=> ls -la xyz
    xyz: No such file or directory

this was UFS btw ;-)

---
frankB
On Ter, 2009-06-30 at 10:58 -0700, Jeremy Archer wrote:
> OK..
>
>> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
>> So why do we need it?
>
>>> Let's suppose the file is removed from the directory but is still open by some
>>> process...

I think both of you (Jeremy and Victor) are getting a bit confused (or maybe it's
me?).

Even if file A is removed from the directory but is still open, this wouldn't cause
a block of file B to be assigned the same DVA as a block from the (unlinked from
the directory, but still existent on-disk) file A, as in the example provided.

The blocks of the unlinked file A would only be freed when the file is closed by
the process (assuming no snapshot is present), at which point there's no need to
access its cached blocks anymore.

If a snapshot is taken after file A is written but before it is unlinked, then the
blocks of file B would never be assigned the same DVAs as the blocks of file A,
since the blocks of file A would only be freed after the snapshot is deleted.

> On a traditional file system, you would not be able to delete an open file.
> Which I believe is good.

Actually, I think you can do that on most UNIX filesystems.

> So, on ZFS you say, you can write to a file that you opened in txg 5 that was
> already deleted in txg 10?

If by deleted you mean unlinked from the directory but still open by a process,
then yes (as in most UNIX filesystems).

> When txg 5 completes, you will have the updated content, and when txg 10
> finishes, you will have a deleted file.

The file will still be there in txg 10, but linked to a special ZAP, hidden from
the namespace.

As to why the ARC can cache two blocks with the same DVA allocated in different
txgs (which means one of them would be invalid?), that I don't know...

Cheers,
Ricardo
On 30.06.09 22:27, Ricardo M. Correia wrote:
> On Ter, 2009-06-30 at 10:58 -0700, Jeremy Archer wrote:
>> OK..
>>
>>> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
>>> So why do we need it?
>>>> Let's suppose the file is removed from the directory but is still open by
>>>> some process...
>
> I think both of you (Jeremy and Victor) are getting a bit confused (or maybe
> it's me?).

Agree, my example was not good for the proposed scenario. Need to think more before
typing. Maybe it is just time to go home...