Greets,

I have read a couple of earlier posts by Jeff and Mark Maybee explaining how ARC
reference counting works. These posts did help clarify this piece of code (a bit
complex, to say the least). I would like to solicit more comments elucidating ARC
reference counting.

The usage pattern I see in arc.c does not seem to follow a simple pattern, like the
one we see in VFS. The crux of the matter: who are the "users" that own these
references?

Perhaps I am misreading the code, but this is how it looks to me:

arc_buf_hdr->b_count keeps track of simultaneous readers of an arc_buf_hdr who
called with a non-null callback parameter. The expected scenario is simultaneous
access from 2 or more files (which are clones or snapshots).

The reason for the ref count and the buffer cloning: to maintain cache integrity
when multiple readers access the same ARC cache entry, and one of these modifies
the entry and ultimately ends up releasing the ARC entry from the cache. At such
time, one of the "users" needs an anonymous entry that can be dirtied and written
out to a different spot, while the other needs an unchanged ARC entry.

This was probably expected to be a relatively rare occurrence, and it is not
expected that a large number of simultaneous readers would access the same ARC
entry.

I am a bit puzzled why a new ARC entry could not be cloned in arc_release, but
perhaps there is a good reason why pre-allocating this data is better than the JIT
alternative.

I notice that the prefetch functions (dbuf_prefetch and traverse_prefetcher) call
arc_read_nolock without a callback parameter. I wonder if this could create a
problem if a prefetch function and a "regular" read simultaneously access the same
ARC cache entry. (Cloning in this case would not happen, so one thread could end up
releasing the entry from the cache while the other is messing with it.)

Comments and corrections to my interpretation are cordially solicited.
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> Greets,
>
> I have read a couple of earlier posts by Jeff and Mark Maybee explaining how ARC
> reference counting works. These posts did help clarify this piece of code (a bit
> complex, to say the least). I would like to solicit more comments elucidating ARC
> reference counting.
>
> The usage pattern I see in arc.c does not seem to follow a simple pattern, like
> the one we see in VFS. The crux of the matter: who are the "users" that own these
> references?
>
> Perhaps I am misreading the code, but this is how it looks to me:
>
> arc_buf_hdr->b_count keeps track of simultaneous readers of an arc_buf_hdr who
> called with a non-null callback parameter. The expected scenario is simultaneous
> access from 2 or more files (which are clones or snapshots).

Correct.

> The reason for the ref count and the buffer cloning: to maintain cache integrity
> when multiple readers access the same ARC cache entry, and one of these modifies
> the entry and ultimately ends up releasing the ARC entry from the cache.

Correct.

> At such time, one of the "users" needs an anonymous entry that can be dirtied
> and written out to a different spot, while the other needs an unchanged ARC
> entry.

Correct.

> This was probably expected to be a relatively rare occurrence, and it is not
> expected that a large number of simultaneous readers would access the same ARC
> entry.

Correct.

> I am a bit puzzled why a new ARC entry could not be cloned in arc_release, but
> perhaps there is a good reason why pre-allocating this data is better than the
> JIT alternative.

We have already handed out a reference to the data at the point that the buffer is
being released, so we cannot allocate a new block "JIT".

> I notice that the prefetch functions (dbuf_prefetch and traverse_prefetcher) call
> arc_read_nolock without a callback parameter. I wonder if this could create a
> problem if a prefetch function and a "regular" read simultaneously access the
> same ARC cache entry. (Cloning in this case would not happen, so one thread could
> end up releasing the entry from the cache while the other is messing with it.)

No, a "regular" read request on a prefetch buffer will end up adding a callback
record.

> Comments and corrections to my interpretation are cordially solicited.

Hope that helps.

-Mark
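To make the pattern Mark confirms above a bit more concrete, the sketch below shows
the general shape of the idea: one shared header counts how many per-reader buffers
have been handed out, each reader with a callback gets its own copy of the data,
and "releasing" a buffer simply detaches it from the shared header so it can be
dirtied on its own. The types and names (sketch_hdr_t, sketch_buf_t, and so on) are
invented stand-ins for illustration, not the actual arc.c declarations, and locking
is omitted.

    /*
     * Simplified sketch of the reference-count-plus-clone pattern discussed
     * above.  Hypothetical types; the real ARC structures are more involved.
     */
    #include <stdlib.h>
    #include <string.h>

    typedef struct sketch_hdr {
            void    *h_data;        /* the cached block contents */
            size_t  h_size;
            int     h_count;        /* number of reader buffers handed out */
    } sketch_hdr_t;

    typedef struct sketch_buf {
            sketch_hdr_t    *b_hdr;
            void            *b_data;   /* this reader's copy of the data */
    } sketch_buf_t;

    /* A reader that supplies a callback gets its own buffer (and data copy). */
    sketch_buf_t *
    sketch_add_reader(sketch_hdr_t *hdr)
    {
            sketch_buf_t *buf = malloc(sizeof (*buf));

            buf->b_hdr = hdr;
            buf->b_data = malloc(hdr->h_size);
            (void) memcpy(buf->b_data, hdr->h_data, hdr->h_size);
            hdr->h_count++;         /* the real code serializes this with a lock */
            return (buf);
    }

    /*
     * "Release" one reader's buffer so it can be dirtied independently: it is
     * detached from the shared header, while the remaining readers keep the
     * unchanged cached copy.
     */
    void
    sketch_release(sketch_buf_t *buf)
    {
            buf->b_hdr->h_count--;
            buf->b_hdr = NULL;      /* this buffer is now "anonymous" */
    }

This also suggests why the copy has to exist before release time: the reader's
b_data pointer was handed out when the buffer was added, so there is no safe point
later at which to substitute a freshly allocated block.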
Thanks for replying.

>> I am a bit puzzled why a new ARC entry could not be cloned in arc_release...
>
> We have already handed out a reference to the data at the point that the buffer
> is being released, so we cannot allocate a new block "JIT".

I think I get it. Each colliding thread needs a separate copy of the buffer from
the point of the read collision, so that they can dirty these if they so desire.
You have to be prepared for the worst-case scenario, where all threads end up
modifying their buffer. We may end up with N different buffer contents for N
possible colliding threads. COW is a bit of a mind-bender to get used to.

So if we have a number of threads colliding on read, we have to maintain all of the
cloned data in the ARC, even if none of the threads ends up modifying their buffer.
--
This message posted from opensolaris.org
I started to look at ref counting to convince myself that the db_buf field in a
cached dmu_buf_impl_t object is guaranteed to point at a valid arc_buf_t.

I have seen a "deadbeef" crash on a busy system when zfs_write() is pre-pagefaulting
in the file's pages.

The page fault handler eventually winds its way to dbuf_hold_impl, which manages to
find a cached dmu_buf_impl_t record. This record, however, points to a freed
arc_buf_t via its b_data field. The field is not null, but it points to a freed
object, hence the crash upon trying to lock the rwlock of the alleged arc_buf.

Ref counting should prevent something like this, correct?
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> I started to look at ref counting to convince myself that the db_buf field in a
> cached dmu_buf_impl_t object is guaranteed to point at a valid arc_buf_t.
>
> I have seen a "deadbeef" crash on a busy system when zfs_write() is
> pre-pagefaulting in the file's pages.
>
> The page fault handler eventually winds its way to dbuf_hold_impl, which manages
> to find a cached dmu_buf_impl_t record. This record, however, points to a freed
> arc_buf_t via its b_data field. The field is not null, but it points to a freed
> object, hence the crash upon trying to lock the rwlock of the alleged arc_buf.
>
> Ref counting should prevent something like this, correct?

Correct. If you are running recent bits and have a core file please file a bug on
this.

-Mark
Hello,

I believe the following is true, correct me if it is not:

If more than one object references a block (e.g. 2 files have the same block open),
there must be multiple clones of the arc_buf_t (and associated dmu_buf_impl_t)
records present, one for each of the objects. This is always so, even if the block
is not modified, "just in case the block should end up being modified". So: if
there are 100 files accessing the same block in the same txg, there will be 100
clones of the data, even if none of the files ultimately modifies this block. Seems
a bit wasteful.

This does not feel like COW to me; rather, "copy always, just in case", at least in
the arc/dmu realm.

I fail to see why the above scenario should not be able to get by with a single,
shared, reference-counted record. A clone would only have to be made of a block if
a given file decides to modify the block. As it is, reference counting is
significantly complicated by mixing it with this pre-cloning.

On to some code comprehension questions:

It seems to me that the conceptual model of a file in the dmu layer is: a number of
dmu buffers, hanging off of a dnode (i.e. the per-dnode list formed via the db_link
"list enabler"). Not all blocks of the file are in this list, only the "active"
ones. I take "active" to mean "recently accessed".

There is a somewhat opaque aspect to dmu that is missing from the otherwise
excellent data structure chart. I am talking about dirty buffer management.

db_data_pending? db_last_dirty? db_dirtycnt? Could someone provide the 10K mile
overview on dirty buffers?

The dbuf_states are a bit of a mystery:

What is the difference between "DB_READ" and "DB_FILL"?

My guess, maybe the data is coming from a different direction into the cache.
From below: read from disk (maybe).
From above: nascent data coming from an application (newly created data?).

I am guessing DB_NOFILL is a short-circuit path to throw obsoleted data away. It
would be nice to comment the states (beyond an unexplained state transition
diagram).

ZFS would be more approachable to newcomers if the code was a bit more commented.
I am not talking about copious comments, just every field in the major data
structures, and at minimum a one-liner per function as to what the function does.

Yes, given enough perseverance and a lot of time one can figure everything out from
studying the usage patterns, but the pain of this could be lessened.

The more people understand ZFS, the stronger it will become.
--
This message posted from opensolaris.org
Hi Jeremy,

Jeremy Archer wrote:
> Hello,
>
> I believe the following is true, correct me if it is not:
>
> If more than one object references a block (e.g. 2 files have the same block
> open), there must be multiple clones of the arc_buf_t (and associated
> dmu_buf_impl_t) records present, one for each of the objects. This is always so,
> even if the block is not modified, "just in case the block should end up being
> modified". So: if there are 100 files accessing the same block in the same txg,
> there will be 100 clones of the data, even if none of the files ultimately
> modifies this block. Seems a bit wasteful.
>
> This does not feel like COW to me; rather, "copy always, just in case", at least
> in the arc/dmu realm.

Correct. Memory management is not currently "COW". Note that, currently, this is
not a significant issue for most environments. The only way to "share" a data block
is if the same file is accessed simultaneously from multiple clones and/or
snapshots. This is rare. However, with the advent of dedup (coming soon), this will
become a bigger issue.

> I fail to see why the above scenario should not be able to get by with a single,
> shared, reference-counted record. A clone would only have to be made of a block
> if a given file decides to modify the block. As it is, reference counting is
> significantly complicated by mixing it with this pre-cloning.

The "simple solution" you propose is actually quite complicated to implement. We
are working on it though.

> On to some code comprehension questions:
>
> It seems to me that the conceptual model of a file in the dmu layer is: a number
> of dmu buffers, hanging off of a dnode (i.e. the per-dnode list formed via the
> db_link "list enabler"). Not all blocks of the file are in this list, only the
> "active" ones. I take "active" to mean "recently accessed".
>
> There is a somewhat opaque aspect to dmu that is missing from the otherwise
> excellent data structure chart. I am talking about dirty buffer management.
>
> db_data_pending? db_last_dirty? db_dirtycnt? Could someone provide the 10K mile
> overview on dirty buffers?

Dirty buffers are buffers that have been modified, and so must be written to stable
storage. Because IO is staged to disk, we have to manage these buffers in separate
lists: a dirty buffer goes onto the list corresponding to the txg it belongs to.

> The dbuf_states are a bit of a mystery:
>
> What is the difference between "DB_READ" and "DB_FILL"?
>
> My guess, maybe the data is coming from a different direction into the cache.
> From below: read from disk (maybe).

Yes.

> From above: nascent data coming from an application (newly created data?).

Yes.

> I am guessing DB_NOFILL is a short-circuit path to throw obsoleted data away. It
> would be nice to comment the states (beyond an unexplained state transition
> diagram).

This is used when we are pre-allocating space. It is only used in special
circumstances (i.e., creating a swap device).

> ZFS would be more approachable to newcomers if the code was a bit more commented.
> I am not talking about copious comments, just every field in the major data
> structures, and at minimum a one-liner per function as to what the function does.
>
> Yes, given enough perseverance and a lot of time one can figure everything out
> from studying the usage patterns, but the pain of this could be lessened.
>
> The more people understand ZFS, the stronger it will become.

I agree. We haven't been as good as we should about commenting.
You are welcome to submit code updates that improve this. :-)

-Mark
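Mark's description of dirty buffer staging can be pictured roughly as below: a
buffer dirtied in txg N is filed on the list belonging to N, so each txg can be
synced out on its own. This is an invented, simplified sketch (hypothetical
sketch_* names, a fixed-size array of lists, no locking), not the actual
dbuf.c/dnode.c code.

    /*
     * Simplified sketch of per-txg dirty-buffer staging.  Hypothetical types
     * and names, not the real ZFS structures.
     */
    #include <stdint.h>

    #define SKETCH_TXG_SIZE 4               /* txgs "in flight" at once */
    #define SKETCH_TXG_MASK (SKETCH_TXG_SIZE - 1)

    typedef struct sketch_dirty {
            uint64_t                dr_txg;     /* txg this change belongs to */
            void                    *dr_data;   /* the modified buffer contents */
            struct sketch_dirty     *dr_next;   /* next dirty buffer, same txg */
    } sketch_dirty_t;

    /* One dirty list per in-flight txg. */
    sketch_dirty_t *dirty_lists[SKETCH_TXG_SIZE];

    void
    sketch_dirty_buffer(sketch_dirty_t *dr, uint64_t txg)
    {
            int idx = (int)(txg & SKETCH_TXG_MASK);

            dr->dr_txg = txg;
            dr->dr_next = dirty_lists[idx];     /* real code uses list_t + locks */
            dirty_lists[idx] = dr;
    }

    /* When txg N syncs, only the buffers dirtied in txg N are written out. */
    sketch_dirty_t *
    sketch_list_for_txg(uint64_t txg)
    {
            return (dirty_lists[txg & SKETCH_TXG_MASK]);
    }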
Thanks for the explanation.

> .... a dirty buffer goes onto the list corresponding to the txg it belongs to.

Ok. I see that all dirty buffers are put on a per-txg list. This is for easy
synchronization, makes sense.

The per dmu_buf_impl_t details are a bit fuzzy.

I see there can be more than one dirty buffer per dmu_buf_impl_t. Are these
multiple "edited versions" of the same buffer that occurred in the same txg?

Looks like the dirty buffer (leaf) records point to an arc_buf_t, which is
presumably yet un-cached (until it is synched out).

db_data_pending points to the most current of these. Seems that db_data_pending
always points to the last (newest) dirty buffer for the dmu buffer.

At the moment, the purpose of maintaining the per-dbuf dirty list eludes me.
(Maybe to dispose of these except for the latest one, which will be written out to
disk.)

Also: what is BP_IS_HOLE? I see it has no birth_txg.
--
This message posted from opensolaris.org
On Mon, Jun 22, 2009 at 02:28:25PM -0700, Jeremy Archer wrote:
> Also: what is BP_IS_HOLE? I see it has no birth_txg.

A "HOLE" BP is similar to a NULL pointer; it doesn't point to anything, and is used
to indicate there is no data underneath that block pointer. The representation is
all-zeros. Since any real block has to have a (non-zero) birth txg, we only test
birth_txg.

An example of the use of holes is in a file's data tree. A file which was created
like:

    fd = open("foo", O_CREAT, 0777);
    pwrite(fd, "foo", 3, 1024*1024);
    close(fd);

will only have one block in its data tree; the rest of the pointers will be HOLEs.
When compression is turned on, any data block of all-zeros will be transformed into
a HOLE, instead of being written to disk.

The SEEK_HOLE and SEEK_DATA arguments to lseek(2) can be used to find the actual
data in a file.

Cheers,
- jonathan
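As a small usage example of that last point, the sketch below walks the data
regions of a sparse file with lseek(2). It assumes a platform that provides
SEEK_DATA and SEEK_HOLE (Solaris does); "foo" is just the example file from above,
and error handling is kept minimal.

    /*
     * Print the data regions of a (possibly sparse) file using SEEK_DATA
     * and SEEK_HOLE.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
            int fd = open("foo", O_RDONLY);
            off_t data = 0, hole;

            if (fd < 0)
                    return (1);

            /* Find each run of data: next data at/after 'data', then the hole after it. */
            while ((data = lseek(fd, data, SEEK_DATA)) != -1) {
                    hole = lseek(fd, data, SEEK_HOLE);
                    (void) printf("data: offset %lld, length %lld\n",
                        (long long)data, (long long)(hole - data));
                    data = hole;
            }
            /* lseek() returns -1 with errno == ENXIO once there is no more data. */

            (void) close(fd);
            return (0);
    }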
Thanks for clarifying the HOLEs.

Why are there (possibly) multiple dbuf_dirty_record_t's hanging off of a given
dmu_buf_impl_t?
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> Thanks for the explanation.
>
>> .... a dirty buffer goes onto the list corresponding to the txg it belongs to.
>
> Ok. I see that all dirty buffers are put on a per-txg list. This is for easy
> synchronization, makes sense.
>
> The per dmu_buf_impl_t details are a bit fuzzy.
>
> I see there can be more than one dirty buffer per dmu_buf_impl_t. Are these
> multiple "edited versions" of the same buffer that occurred in the same txg?

If a dbuf is "dirtied" in more than one txg, then it will have multiple dirty
records. So these are the "edited versions" separated by transaction group
boundaries.

> Looks like the dirty buffer (leaf) records point to an arc_buf_t, which is
> presumably yet un-cached (until it is synched out).

Correct. It becomes "cached" when it obtains an "identity", which does not happen
until we write it out.

> db_data_pending points to the most current of these. Seems that db_data_pending
> always points to the last (newest) dirty buffer for the dmu buffer.

No, db_data_pending points to the record that is currently being written.

> At the moment, the purpose of maintaining the per-dbuf dirty list eludes me.
> (Maybe to dispose of these except for the latest one, which will be written out
> to disk.)

See my comment above. We must maintain separate versions or else get "future leaks"
(record data written in txg n+1 in txg n) and possibly trash our checksum (we must
not change the data once the checksum is calculated).

> Also: what is BP_IS_HOLE? I see it has no birth_txg.
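A rough picture of what Mark describes: each dbuf keeps one dirty record per txg in
which it was modified (newest first), and a db_data_pending-style pointer marks the
record currently being synced out. The sketch below uses invented stand-in names,
not the real dbuf.h declarations, and omits all locking and reference counting.

    /*
     * Simplified sketch of a per-dbuf dirty record chain: one record per
     * txg in which the buffer was dirtied.  Hypothetical types and names.
     */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct sketch_dirty_record {
            uint64_t                        dr_txg;    /* txg of this version */
            void                            *dr_data;  /* this txg's data */
            struct sketch_dirty_record      *dr_next;  /* older record, if any */
    } sketch_dirty_record_t;

    typedef struct sketch_dbuf {
            void                    *db_data;          /* open-txg contents */
            sketch_dirty_record_t   *db_last_dirty;    /* newest dirty record */
            sketch_dirty_record_t   *db_data_pending;  /* record being written */
    } sketch_dbuf_t;

    /*
     * Dirtying the dbuf in a txg it has not been dirtied in yet creates a new
     * record; dirtying it again in the same txg reuses the existing one.
     */
    sketch_dirty_record_t *
    sketch_dbuf_dirty(sketch_dbuf_t *db, sketch_dirty_record_t *newdr,
        uint64_t txg)
    {
            if (db->db_last_dirty != NULL && db->db_last_dirty->dr_txg == txg)
                    return (db->db_last_dirty);

            newdr->dr_txg = txg;
            newdr->dr_next = db->db_last_dirty;
            db->db_last_dirty = newdr;
            return (newdr);
    }

    /* When a txg syncs, its matching record becomes the pending one. */
    void
    sketch_dbuf_sync(sketch_dbuf_t *db, uint64_t txg)
    {
            sketch_dirty_record_t *dr;

            for (dr = db->db_last_dirty; dr != NULL; dr = dr->dr_next) {
                    if (dr->dr_txg == txg) {
                            db->db_data_pending = dr;
                            break;
                    }
            }
    }

Keeping the versions separate is what prevents the "future leaks" Mark mentions:
data written in txg n must not be silently replaced by data belonging to txg n+1,
and it must not change after its checksum has been computed.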
Awesome, thanks.
--
This message posted from opensolaris.org
Hello Jonathan,

You are saying that the SEEK_HOLE and SEEK_DATA are implemented on the "birth_txg"
and not on the "fill_count"?

Leal
[ http://www.eall.com.br/blog ]
--
This message posted from opensolaris.org
On Wed, Jun 24, 2009 at 10:58:49AM -0700, Marcelo Leal wrote:
> Hello Jonathan,
>
> You are saying that the SEEK_HOLE and SEEK_DATA are implemented on the
> "birth_txg" and not on the "fill_count"?

Um, no, I don't think so. I was just saying that most people checking for holes use
"blk_birth" to check for them. Since all the fields are zero, that's not always the
case.

For SEEK_*, it's actually based on the fill value.

uts/common/fs/zfs/dnode.c:

    static int
    dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
        int lvl, uint64_t blkfill, uint64_t txg)
    {
    ...
            if (bp[i].blk_fill >= minfill &&
                bp[i].blk_fill <= maxfill &&
                (hole || bp[i].blk_birth > txg))
                    break;
    ...

Cheers,
- jonathan
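For reference, the hole test itself boils down to checking that the block pointer
was never born. The sketch below paraphrases that check with invented names; it is
not a verbatim copy of the ZFS headers, and the real blkptr_t carries more fields
than shown here.

    /*
     * Rough sketch of the hole test being discussed: a block pointer that
     * was never written has a zero birth txg, so "is this a hole?" reduces
     * to a birth-txg check.  Illustrative only.
     */
    #include <stdint.h>

    typedef struct sketch_blkptr {
            uint64_t blk_birth;     /* txg the block was written in; 0 = never */
            uint64_t blk_fill;      /* non-hole blocks beneath this bp */
            /* dvas, checksum, compression, etc. elided */
    } sketch_blkptr_t;

    #define SKETCH_BP_IS_HOLE(bp)   ((bp)->blk_birth == 0)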
I am pondering the fact that the txg is part of the hash key for the ARC. It seems
to me this has a profound implication: read caching is per txg.

If I read a record into the cache in txg N, and the current txg is closed, this
record becomes effectively un-cached in the new (N+1) txg. (Well, there may be some
on-going operation that still references the txg N entry, but for read caching I
believe this is correct.)

So the next read will have to read it from disk again, and put it into the ARC
again, this time with txg N+1. Now we have 2 ARC entries for the same piece of
data. The entry for txg N will get aged out of the cache eventually, but in theory
there may be potentially many entries of the same (identical data content and
location) data block present in the ARC, if we keep re-reading the same block.

Is my understanding correct?
--
This message posted from opensolaris.org
Jeremy Archer wrote:
> I am pondering the fact that the txg is part of the hash key for the ARC. It
> seems to me this has a profound implication: read caching is per txg.

I assume you are looking at BUF_HASH_INDEX() and buf_hash() in
$SRC/uts/common/fs/zfs/arc.c, right?

Note that the txg is the *birth* txg of the block, which doesn't change. It is that
txg that is used in the hash, not the txg that we are doing the read of the block
in.

--
Darren J Moffat
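The point Darren makes can be sketched like this: the cache key is built from where
the block lives on disk (its DVA, plus the pool) and the txg it was born in, so a
DVA that gets reused by a later write hashes to a different cache entry.
sketch_buf_hash() below is an invented stand-in for illustration, not the actual
buf_hash()/BUF_HASH_INDEX() code in arc.c; the mixing constant is arbitrary.

    /*
     * Sketch: identify a cached block by pool, DVA, and birth txg.
     * Hypothetical names; not the real ARC hash.
     */
    #include <stdint.h>

    typedef struct sketch_dva {
            uint64_t dva_word[2];   /* encoded vdev/offset, as in a real DVA */
    } sketch_dva_t;

    uint64_t
    sketch_buf_hash(uint64_t spa, const sketch_dva_t *dva, uint64_t birth_txg)
    {
            uint64_t h = spa;

            /* Mix in the on-disk address and the birth txg. */
            h ^= dva->dva_word[0];
            h ^= dva->dva_word[1];
            h ^= birth_txg;
            h *= 0x9e3779b97f4a7c15ULL;     /* arbitrary odd constant */

            return (h);
    }

With such a key, {DVA 5, birth txg 5} and {DVA 5, birth txg 15} are distinct cache
entries, as in the example Jonathan gives later in the thread.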
Hi Darren,

Thanks for replying.

I am confused then. What is the purpose of incorporating the birth txg in the ARC
hash?

How do 2 different files find a shared block?

Obviously, the first one has to read it from disk, and subsequently remember the
birth txg in addition to the dva.

How does the 2nd file know what is the birth txg of the desired block? Does each
file have to do an initial read from disk to learn the birth txg of the block?

I am looking for easy-to-understand metaphors, such as "cache entries are valid for
the current txg only"... but it looks like I am out of luck in this case.
--
This message posted from opensolaris.org
The dva is just part of a block pointer blkptr_t. It's the blkptr_t that is stored
on disk and used to traverse the pool. It also contains such details as the block
size, ditto dvas (redundant blocks), the birth txg, block checksum, endianness, ...

Neil.

Jeremy Archer wrote:
> Hi Darren,
>
> Thanks for replying.
>
> I am confused then. What is the purpose of incorporating the birth txg in the
> ARC hash?
>
> How do 2 different files find a shared block?
>
> Obviously, the first one has to read it from disk, and subsequently remember the
> birth txg in addition to the dva.
>
> How does the 2nd file know what is the birth txg of the desired block? Does each
> file have to do an initial read from disk to learn the birth txg of the block?
>
> I am looking for easy-to-understand metaphors, such as "cache entries are valid
> for the current txg only"... but it looks like I am out of luck in this case.
On Tue, Jun 30, 2009 at 09:08:17AM -0700, Jeremy Archer wrote:
> Hi Darren,
>
> Thanks for replying.
>
> I am confused then. What is the purpose of incorporating the birth txg in the
> ARC hash?

Consider:

    TXG   action
    5     write of first block of File A, assigned DVA 5, birth TXG 5
    10    file A is deleted
    15    write to first block of File B, assigned DVA 5, birth TXG 15

The two blocks are distinct, and are cached separately in the ARC.

Cheers,
- jonathan

> How do 2 different files find a shared block?
>
> Obviously, the first one has to read it from disk, and subsequently remember the
> birth txg in addition to the dva.
>
> How does the 2nd file know what is the birth txg of the desired block? Does each
> file have to do an initial read from disk to learn the birth txg of the block?
>
> I am looking for easy-to-understand metaphors, such as "cache entries are valid
> for the current txg only"... but it looks like I am out of luck in this case.
> TXG   action
> 5     write of first block of File A, assigned DVA 5, birth TXG 5
> 10    file A is deleted
> 15    write to first block of File B, assigned DVA 5, birth TXG 15
>
> The two blocks are distinct, and are cached separately in the ARC.

Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid. So why do we
need it?
--
This message posted from opensolaris.org
On 30.06.09 21:10, Jeremy Archer wrote:
>> TXG   action
>> 5     write of first block of File A, assigned DVA 5, birth TXG 5
>> 10    file A is deleted
>> 15    write to first block of File B, assigned DVA 5, birth TXG 15
>>
>> The two blocks are distinct, and are cached separately in the ARC.
>
> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
> So why do we need it?

Let's suppose the file is removed from the directory but is still open by some
process...
OK..

> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
> So why do we need it?

>> Let's suppose the file is removed from the directory but is still open by some
>> process...

On a traditional file system, you would not be able to delete an open file. Which I
believe is good.

So, on ZFS you say, you can write to a file that you opened in txg 5 that was
already deleted in txg 10? When txg 5 completes, you will have the updated content,
and when txg 10 finishes, you will have a deleted file.

I guess transactional semantics are hard to get used to.
--
This message posted from opensolaris.org
On Tue, 30 Jun 2009 19:58:22 +0200, Jeremy Archer <j4rch3r at gmail.com> wrote:
>>> Let's suppose the file is removed from the directory but is still open by some
>>> process...
>
> On a traditional file system, you would not be able to delete an open file.
> Which I believe is good.

Even in traditional file systems you can always remove/delete an open file; zfs is
no different here. That is just a namespace operation and removes the file name
from the namespace. The process that has the file open can still use it and its
content. Only the last close will eventually remove the file and its content from
disk.

process 1:

    opteron.batschul./export/home/batschul.=> cat > xyz

process 2:

    opteron.batschul./export/home/batschul.=> ls -la xyz
    -rw-r--r--   1 batschul other          0 Jun 30 20:02 xyz
    opteron.batschul./export/home/batschul.=> fuser xyz
    xyz:     3432o    -> process 1

now remove it:

    opteron.batschul./export/home/batschul.=> rm xyz
    opteron.batschul./export/home/batschul.=> ls -la xyz
    xyz: No such file or directory

this was UFS btw ;-)

---
frankB
On Ter, 2009-06-30 at 10:58 -0700, Jeremy Archer wrote:
> OK..
>
>> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
>> So why do we need it?
>
>>> Let's suppose the file is removed from the directory but is still open by some
>>> process...

I think both of you (Jeremy and Victor) are getting a bit confused (or maybe it's
me?).

Even if file A is removed from the directory but is still open, this wouldn't cause
a block of file B to be assigned the same DVA as a block from the (unlinked from
the directory, but still existent on-disk) file A, as in the example provided.

The blocks of the unlinked file A would only be freed when the file is closed by
the process (assuming no snapshot is present), at which point there's no need to
access its cached blocks anymore.

If a snapshot is taken after file A is written but before it is unlinked, then the
blocks of file B would never be assigned the same DVAs as the blocks of file A,
since the blocks of file A would only be freed after the snapshot is deleted.

> On a traditional file system, you would not be able to delete an open file.
> Which I believe is good.

Actually, I think you can do that on most UNIX filesystems.

> So, on ZFS you say, you can write to a file that you opened in txg 5 that was
> already deleted in txg 10?

If by deleted you mean unlinked from the directory but still open by a process,
then yes (as in most UNIX filesystems).

> When txg 5 completes, you will have the updated content, and when txg 10
> finishes, you will have a deleted file.

The file will still be there in txg 10, but linked to a special ZAP, hidden from
the namespace.

As to why the ARC can cache two blocks with the same DVA allocated in different
txgs (which means one of them would be invalid?), that I don't know...

Cheers,
Ricardo
On 30.06.09 22:27, Ricardo M. Correia wrote:
> On Ter, 2009-06-30 at 10:58 -0700, Jeremy Archer wrote:
>> OK..
>>
>>> Once the file is deleted, ARC entry {dva 5, txg 5} is no longer valid.
>>> So why do we need it?
>>>> Let's suppose the file is removed from the directory but is still open by
>>>> some process...
>
> I think both of you (Jeremy and Victor) are getting a bit confused (or maybe
> it's me?).

Agree, my example was not good for the proposed scenario. Need to think more before
typing. Maybe it is just time to go home...