thr3ads.net - Btrfs devel - basic questions regarding COW in Btrfs [Feb 2013]

Aastha Mehta

2013-Feb-20 17:28 UTC

basic questions regarding COW in Btrfs

Hello,

I am trying to understand the COW mechanism in Btrfs. Is it correct to
say that unless nodatacow option is specified, Btrfs always performs
COW for all the data+metadata extents used in the system?

I saw that COWing is implemented in btrfs_cow_block() function, which
is called at the time of searching a slot for a particular item, while
inserting into a new slot, committing transactions, while creating
pending snapshots and few other places.

However, while tracing through the complete write path, I could not
quite figure out when extents actually get COWed. Could you please
point me to the place where COWing takes place? Is there any time
when, for performance or any other reasons, the extents are not COWed
but overwritten in place (apart from the explicit nodatacow flag being
set during mount)?

Thanks,
Aastha.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Feb-20 17:54 UTC

Re: basic questions regarding COW in Btrfs

On Wed, Feb 20, 2013 at 10:28:10AM -0700, Aastha Mehta wrote:
> Hello,
> 
> I am trying to understand the COW mechanism in Btrfs. Is it correct to
> say that unless nodatacow option is specified, Btrfs always performs
> COW for all the data+metadata extents used in the system?
> 
So we always cow the metadata, but yes nodatacow means we don't cow the actual
data in the data extents.
> I saw that COWing is implemented in btrfs_cow_block() function, which
> is called at the time of searching a slot for a particular item, while
> inserting into a new slot, committing transactions, while creating
> pending snapshots and few other places.
> 
> However, while tracing through the complete write path, I could not
> quite figure out when extents actually get COWed. Could you please
> point me to the place where COWing takes place? Is there any time
> when, for performance or any other reasons, the extents are not COWed
> but overwritten in place (apart from the explicit nodatacow flag being
> set during mount)?
You'll want to look at the tree operation ->fill_delalloc().  That's where we do
cow_file_range().  We allocate new space and write.  When we finish the ordered
io we do btrfs_drop_extents() on the range we just wrote, which will free up any
existing extents, and then insert our new file extent.  Thanks,

Josef
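
The difference between the default COW data path and nodatacow described above can be sketched with a toy model. This is not kernel code: ToyDisk, ToyFile and their fields are hypothetical names invented for the illustration, and the allocator is a trivial counter. It only shows that a COW write lands in newly allocated space and the mapping is switched afterwards, while a nodatacow overwrite reuses the existing location.

```python
class ToyDisk:
    """Block store with a bump allocator; stands in for the extent allocator."""

    def __init__(self):
        self.blocks = {}      # physical address -> bytes
        self.next_free = 0

    def allocate(self):
        addr = self.next_free
        self.next_free += 1
        return addr


class ToyFile:
    """Maps logical block numbers to physical addresses (the 'file extents')."""

    def __init__(self, disk, nodatacow=False):
        self.disk = disk
        self.nodatacow = nodatacow
        self.extents = {}     # logical block -> physical address
        self.freed = []       # physical addresses released by COW

    def write(self, logical, data):
        old = self.extents.get(logical)
        if self.nodatacow and old is not None:
            # In-place overwrite: the mapping is left untouched.
            self.disk.blocks[old] = data
            return old
        # COW: allocate new space and write the data there first ...
        new = self.disk.allocate()
        self.disk.blocks[new] = data
        # ... then drop the old extent for this range and insert the new one.
        if old is not None:
            self.freed.append(old)
        self.extents[logical] = new
        return new


disk = ToyDisk()

cow_file = ToyFile(disk)
first = cow_file.write(0, b"v1")
second = cow_file.write(0, b"v2")      # lands at a different address

nocow_file = ToyFile(disk, nodatacow=True)
a = nocow_file.write(0, b"v1")
b = nocow_file.write(0, b"v2")         # same address, overwritten in place
```

In the COW case the old block still holds b"v1" after the second write; it is merely no longer referenced by the file.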

Aastha Mehta

2013-Feb-21 17:32 UTC

Re: basic questions regarding COW in Btrfs

Thanks a lot for the prompt response. I had seen that, but I am still
not sure where it really happens within fill_delalloc. Could you help
me a little further along that path?

Secondly, I am now confused about the difference between
btree_writepages and btrfs_writepages/btrfs_writepage. I thought
btrfs_writepages was for writing the pages holding inodes, and
btree_writepages for writing the other indirect and leaf extents of the
btree. It then seems that write operations update the file system data
structures in a top-down manner, i.e. first changing the inode and then
the data extents. Is that correct?

Thirdly, it seems that the old extents may be dropped before the new
extents are flushed to disk. What would happen if the write fails
before the disk commit? What am I missing here?

Thanks,
Aastha.


Aastha Mehta

2013-Feb-23 09:33 UTC

Re: basic questions regarding COW in Btrfs

A gentle reminder on this one.

Thanks,
Aastha.

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasthakm@mpi-sws.org

Aastha Mehta

2013-Feb-25 15:15 UTC

Re: basic questions regarding COW in Btrfs

Thanks again Josef.

I understood that cow_file_range is called for a regular file. Just to
clarify, in cow_file_range is cow done at the time of reserving
extents in the extent btree for the io to be done in this delalloc? I
see the following comment above find_free_extent() which is called
while trying to reserve extents:

/*
 * walks the btree of allocated extents and find a hole of a given size.
 * The key ins is changed to record the hole:
 * ins->objectid == block start
 * ins->flags = BTRFS_EXTENT_ITEM_KEY
 * ins->offset == number of blocks
 * Any available blocks before search_start are skipped.
 */

This seems to be the only place where a cow might be done, because a
key is being inserted into an extent which modifies it.

Thanks,
Aastha.

On 24 February 2013 02:39, Josef Bacik <josef@toxicpanda.com> wrote:
> On Thu, Feb 21, 2013 at 12:32 PM, Aastha Mehta <aasthakm@gmail.com> wrote:
>>
>> Thanks a lot for the prompt response. I had seen that, but I am still
>> not sure of where it really
>> happens within fill_delalloc. Could you help me a little further in that
>> path?
>>
>
> So we check the properties of the inode and do one of 3 things: either we
> call cow_file_range directly in the case of a normal file,
> run_delalloc_nocow in the case of a file with prealloc extents or NOCOW, or
> we do the compression dance.  We make an ordered extent for this range and
> return.  And then the normal io path happens.
>
>>
>> Secondly, now I am confused between the btree_writepages and
>> btrfs_writepages/btrfs_writepage
>> methods. I thought btrfs_writepages was for writing the pages holding
>> inodes and btree_writepages
>> for writing the other indirect and leaf extents of the btree. Then, it
>> seems that the write operations
>> lead to update of the file system data structures in a top-down
>> manner, i.e. first changing the inode
>> and then the data extents. Is that correct?
>
>
> You are right that btrfs_writepages/writepage are for normal files and
> btree_writepages is for the metadata.  The write operations do start in data
> and then modify metadata later down the line if that is what you are getting
> at.
>
>>
>> Thirdly, it seems that the old extents maybe dropped before the new
>> extents are flushed to the disk.
>> What would happen if the write fails before the disk commit? What am I
>> missing here?
>>
>
> Yeah, the metadata isn't updated until the data is on the disk.  In
> ->fill_delalloc we set up a btrfs_ordered_extent that describes the range of
> the dirty pages we are writing.  When we've written all these pages we run
> btrfs_finish_ordered_io, which will drop the old extent entries if there are
> any and then add the new extent entries and update the references and such.
> So if something fails we just continue to point to the original file extent
> entries and return an EIO; we maintain consistency by making sure the
> metadata is updated only after the data is written out.  I hope that helps.
> Thanks,
>
> Josef
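
The ordering rule in the last quoted paragraph (data first, metadata only once the data is on disk, EIO otherwise) can be illustrated with a small sketch. Everything here is a hypothetical stand-in: ToyOrderedExtent and submit_write are invented names, the "disk" is a dict, and the failure is injected with a flag rather than coming from real I/O.

```python
import errno


class ToyOrderedExtent:
    """Describes a range of dirty pages headed for a newly allocated extent."""

    def __init__(self, logical, new_addr, data):
        self.logical = logical
        self.new_addr = new_addr
        self.data = data


def submit_write(file_extents, disk, ordered, fail=False):
    """Write the data, then update the mapping only if the write succeeded.

    Returns 0 on success or -errno.EIO on failure, in kernel style.
    """
    if fail:
        # The data never reached the disk: leave the file extent mapping
        # alone so it keeps pointing at the original extent.
        return -errno.EIO
    disk[ordered.new_addr] = ordered.data
    # "Finish ordered io": replace the old entry (if any) with the new one.
    file_extents[ordered.logical] = ordered.new_addr
    return 0


disk = {100: b"old"}
file_extents = {0: 100}

rc_fail = submit_write(file_extents, disk,
                       ToyOrderedExtent(0, 200, b"new"), fail=True)
mapping_after_failure = dict(file_extents)

rc_ok = submit_write(file_extents, disk, ToyOrderedExtent(0, 200, b"new"))
```

After the failed attempt the file still maps to address 100 and reads back the old data; only the successful attempt switches the mapping to 200.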



Josef Bacik

2013-Feb-25 17:27 UTC

Re: basic questions regarding COW in Btrfs

On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
> Thanks again Josef.
> 
> I understood that cow_file_range is called for a regular file. Just to
> clarify, in cow_file_range is cow done at the time of reserving
> extents in the extent btree for the io to be done in this delalloc? I
> see the following comment above find_free_extent() which is called
> while trying to reserve extents:
> 
> /*
>  * walks the btree of allocated extents and find a hole of a given size.
>  * The key ins is changed to record the hole:
>  * ins->objectid == block start
>  * ins->flags = BTRFS_EXTENT_ITEM_KEY
>  * ins->offset == number of blocks
>  * Any available blocks before search_start are skipped.
>  */
> 
> This seems to be the only place where a cow might be done, because a
> key is being inserted into an extent which modifies it.
>
The key isn't inserted at this time, it's just returned with those values for us
to do as we please.  There is no update of the btree until
insert_reserved_file_extent/btrfs_mark_extent_written in btrfs_finish_ordered_io.
Thanks,

Josef 
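
To make the distinction concrete, here is a rough sketch of a search that only reports a hole and a separate step that records it. It is a toy under stated assumptions: find_free_range and insert_reserved are hypothetical names, the allocated space is a sorted list of (start, length) pairs rather than a btree, and the disk is treated as unbounded.

```python
def find_free_range(allocated, size, search_start=0):
    """Return (start, size) of the first hole of at least `size` blocks.

    `allocated` is a sorted list of (start, length) ranges. The list is
    only read, never modified: the caller decides what to do with the
    result. Anything before `search_start` is skipped.
    """
    cursor = search_start
    for start, length in allocated:
        end = start + length
        if end <= cursor:
            continue                 # range lies entirely before the cursor
        if start - cursor >= size:
            return (cursor, size)    # hole between cursor and this range
        cursor = max(cursor, end)
    return (cursor, size)            # space after the last allocated range


def insert_reserved(allocated, reserved):
    """Separate, later step: actually record the reservation."""
    allocated.append(reserved)
    allocated.sort()


allocated = [(0, 4), (10, 2)]
before = list(allocated)

hole = find_free_range(allocated, size=3)
unchanged = allocated == before      # the search alone modified nothing

insert_reserved(allocated, hole)     # only now does the structure change
```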

Aastha Mehta

2013-Feb-25 19:00 UTC

Re: basic questions regarding COW in Btrfs

Ah okay, I now see how it works. Thanks a lot for your response.

Regards,
Aastha.



Alex Lyakas

2013-Mar-02 21:07 UTC

Re: basic questions regarding COW in Btrfs

Hi Josef,
I hope it's ok to piggy back on this thread for the following question:

I see that in btrfs_cross_ref_exist()=>check_committed_ref() path,
there is the following check:

if (btrfs_extent_generation(leaf, ei) <=
    btrfs_root_last_snapshot(&root->root_item))
	goto out;

So this basically means that after we have taken a snap of a subvol,
then all subvol's extents must be COW'ed, even if we delete the snap a
minute later.
I wonder, why is that so?
Is this because file extents can be shared indirectly, like when we
create a snap, we only COW the root and only mark all root's
*immediate* children shared in the extent tree?
Can the new backref walking code be used here to check more
accurately, if the extent is shared by anybody else?

Thanks,
Alex.




Aastha Mehta

2013-Mar-03 15:41 UTC

Re: basic questions regarding COW in Btrfs

Hi Josef,

I have some more questions following up on my previous e-mails.
I now do somewhat understand the place where extent entries get
cow'ed. But I am unclear about the order of operations.

Is it correct that the data extent is written first, then the pointer in
the indirect block needs to be updated, so then it is cowed and
written to disk and so on recursively up the tree? Or is the entire
path from leaf to node that is going to be affected by the write cowed
first and then all the cowed extents are written to the disk and then
the rest of the metadata pointers (for example, in the checksum tree,
extent tree, etc., I am not sure about this)?

Also, I need to understand specifically how the data (leaf nodes) of a
file is written to disk vs. the metadata including the indirect nodes
of the file. In extent_writepage I only know the pages of a file that
are to be written. I guess I can identify metadata pages based on the
inode of the page's owner. But is it possible to distinguish the pages
available in the extent_writepage path as belonging to a leaf node or
an internal node for a file? If it cannot be identified at this point,
where earlier in the path can this be decided?

Many thanks,
Aastha.


Josef Bacik

2013-Mar-03 23:42 UTC

Re: basic questions regarding COW in Btrfs

On Sat, Mar 2, 2013 at 4:07 PM, Alex Lyakas <alex.btrfs@zadarastorage.com> wrote:
> Hi Josef,
> I hope it's ok to piggy back on this thread for the following question:
>
> I see that in btrfs_cross_ref_exist()=>check_committed_ref() path,
> there is the following check:
>
> if (btrfs_extent_generation(leaf, ei) <=
>     btrfs_root_last_snapshot(&root->root_item))
>         goto out;
>
> So this basically means that after we have taken a snap of a subvol,
> then all subvol's extents must be COW'ed, even if we delete the snap a
> minute later.
> I wonder, why is that so?
> Is this because file extents can be shared indirectly, like when we
> create a snap, we only COW the root and only mark all root's
> *immediate* children shared in the extent tree?

Yes that's exactly it.  We have no way of knowing that there are no
snapshots left for this particular root so if there ever was a
snapshot we have to err on the side of caution.

> Can the new backref walking code be used here to check more
> accurately, if the extent is shared by anybody else?
>

Probably, if we could figure out if there is a way for more than one
root to point to this extent then yes this would be ideal so we don't
have to force COW in cases we would rather not.  Thanks,

Josef
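
The conservative rule discussed here boils down to one comparison, which can be sketched as follows. This models only the generation check quoted in the thread, not the rest of check_committed_ref(); may_skip_cow is a hypothetical helper name and the generation numbers are made up.

```python
def may_skip_cow(extent_generation, last_snapshot):
    """Return True only if the extent is newer than the last snapshot.

    An extent created at or before the generation of the most recent
    snapshot might be shared with that snapshot, so it must be COWed.
    Deleting the snapshot later does not change the recorded generation,
    which is why the answer stays conservative.
    """
    if extent_generation <= last_snapshot:
        return False    # possibly shared: force COW
    return True


# Hypothetical numbers: the subvolume was last snapshotted at generation 50.
last_snapshot = 50
old_extent = may_skip_cow(extent_generation=40, last_snapshot=last_snapshot)
new_extent = may_skip_cow(extent_generation=51, last_snapshot=last_snapshot)
```

Here the extent from generation 40 must be COWed, while the one from generation 51 was created after the snapshot and cannot be referenced by it.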

Josef Bacik

2013-Mar-03 23:52 UTC

Re: basic questions regarding COW in Btrfs

On Sun, Mar 3, 2013 at 10:41 AM, Aastha Mehta <aasthakm@gmail.com> wrote:
> Hi Josef,
>
> I have some more questions following up on my previous e-mails.
> I now do somewhat understand the place where extent entries get
> cow'ed. But I am unclear about the order of operations.
>
> Is it correct that the data extent written first, then the pointer in
> the indirect block needs to be updated, so then it is cowed and
> written to disk and so on recursively up the tree? Or is the entire
> path from leaf to node that is going to be affected by the write cowed
> first and then all the cowed extents are written to the disk and then
> the rest of the metadata pointers, (for example, in checksum tree,
> extent tree, etc., I am not sure about this)?
The second one.  We COW the entire path from root to leaf as things
need COW'ing.  We start a transaction, we insert the file extent
entries, we add the checksums, and we add the delayed ref updates to
the extent tree.  The delayed things are guaranteed to happen in that
transaction so we have consistency there.  The COW'ing from top to
bottom works like that for all trees.
>
> Also, I need to understand specifically how the data (leaf nodes) of a
> file is written to disk v/s the metadata including the indirect nodes
> of the file. In extent_writepage I only know the pages of a file that
> are to be written. I guess, I can identify metadata pages based on the
> inode of the page's owner. But is it possible to distinguish the pages
> available in extent_writepage path as belonging to the leaf node or
> internal node for a file? If it cannot be identified at this point,
> where earlier in the path can this be decided?
>
So they are different things, and they could change from the time we
write to the time that the write completes because of COW.  Also keep
in mind that the metadata (the file extent items and such) for the
inodes is not stored specifically within the inode; it is stored
inside the same tree that the inode resides in.  So you can have a
leaf node with multiple inodes and extents for those different inodes.
And so any sort of random things can happen: other inodes can be
deleted and this inode's metadata will be shifted into a new leaf, or
another inode could be added and this inode's data could be pushed off
into an adjacent leaf.  The only way to know which leaf/page the inode
is associated with is to search for whatever you are looking for in
the tree, and then while you are holding all of the locks and
reference counting you can be sure that those pages contain the
metadata you are looking for, but once you let that go there are no
guarantees.

So as far as how it is written to disk, that is where transactions
come in.  We track all the dirty metadata pages we have per
transaction, and then at transaction commit time we make sure that all
of those pages are written to disk and then we commit our super to
point to the new root of the tree root, which in turn points at all of
our new roots because of COW.  These pages can be written before the
commit though because of memory pressure, and if they are written and
then modified again within the same transaction we will re-cow them
to make sure we don't have any partial-page updates.  Keeping track of
where a specific inode's metadata is contained is a tricky business.
Let me know if that helped.  Thanks,

Josef
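
The root-to-leaf COW and the commit step described above can be illustrated with a small path-copying tree. This is only a sketch of the general technique: Node, cow_update, lookup and Superblock are hypothetical names, nodes are plain dictionaries of children rather than btree blocks, and locking, dirty-page tracking and re-COW within a transaction are left out.

```python
class Node:
    """Tree node, treated as immutable once it is part of a committed tree."""

    def __init__(self, children=None, value=None):
        self.children = children or {}
        self.value = value


def cow_update(node, path, value):
    """Return a new root with `value` stored at `path`.

    Every node on the path is copied; subtrees that are not on the path
    are shared with the old tree, which stays fully intact.
    """
    if not path:
        return Node(value=value)
    head, rest = path[0], path[1:]
    child = node.children.get(head, Node())
    new_children = dict(node.children)          # shallow copy of this node
    new_children[head] = cow_update(child, rest, value)
    return Node(children=new_children, value=node.value)


def lookup(node, path):
    for key in path:
        node = node.children[key]
    return node.value


class Superblock:
    """Holds the single pointer that defines which tree is current."""

    def __init__(self, root):
        self.root = root


old_root = Node(children={
    "a": Node(children={"x": Node(value=1), "y": Node(value=2)}),
    "b": Node(children={"z": Node(value=3)}),
})
sb = Superblock(old_root)

# Transaction: build the new tree off to the side ...
new_root = cow_update(sb.root, ["a", "x"], 99)
visible_before_commit = lookup(sb.root, ["a", "x"])

# ... and commit by switching the superblock to the new root.
sb.root = new_root
visible_after_commit = lookup(sb.root, ["a", "x"])
```

Until the superblock pointer is switched, readers of the committed tree still see the old value, and the subtree under "b" is shared between the old and new roots rather than copied.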

Aastha Mehta

2013-Mar-05 00:57 UTC

Re: basic questions regarding COW in Btrfs

I must admit, it is quite convoluted :-)

Please tell me if I understand this. A file system tree (containing
the inodes, the extents of all the inodes, etc.) is itself laid out in
the leaf extents of another big tree, which is the root tree. This is
why you say that inode and other such metadata may be lying in the
leaf nodes. Correct?

I did not completely understand what you meant when you said that the
metadata (the file extent items and such) for the inodes are stored
inside the same tree that the inode resides in. I thought the
btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
the actual data of a file.

Okay, now I am not even sure if in btrfs there is something like an
indirect block for a huge file. In file systems with fixed block size,
one can hold only as many pointers to data blocks and hence when the
file size grows indirects are added in the file's tree. Is there any
equivalent indirect extent required for huge files in btrfs, or do all
the files fit within one level? If there are indirects, what item type
do they have? Would something like btrfs_get_extent() be useful to get
the indirect extents of a file?

Too many questions, sorry :(

Thanks.


Josef Bacik

2013-Mar-05 01:51 UTC

Re: basic questions regarding COW in Btrfs

On Mon, Mar 4, 2013 at 7:57 PM, Aastha Mehta <aasthakm@gmail.com> wrote:
> I must admit, it is quite convoluted :-)
>
> Please tell me if I understand this. A file system tree (containing
> the inodes, the extents of all the inodes, etc.) is itself laid out in
> the leaf extents of another big tree, which is the root tree. This is
> why you say that inode and other such metadata may be lying in the
> leaf nodes. Correct?
>
Sort of.  We have lots of trees, but the inode data is laid out in
what we refer to as fs trees.  All these trees are just b-trees that
have different data in them.  The fs trees hold inode
items, directory items, file extent items, xattr items and orphan
items.  So any given leaf in this tree could have any number of those
items in it referring to any number of inodes.  You could have

[inode item for inode 1][file extent item for inode 1][inode item for
inode 2][xattr for inode 2][file extent item for inode 2]

all contained within one leaf.  Does that make sense?
> I did not completely understand what you meant when you said that the
> metadata (the file extent items and such) for the inodes are stored
> inside the same tree that the inode resides in. I thought the
> btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
> the actual data of a file.
>
Yes the btrfs_file_extent_item points to a [offset, size] pair that
describes a data extent.
> Okay, now I am not even sure if in btrfs there is something like an
> indirect block for a huge file. In file systems with fixed block size,
> one can hold only as many pointers to data blocks and hence when the
> file size grows indirects are added in the file's tree. Is there any
> equivalent indirect extent required for huge files in btrfs, or do all
> the files fit within one level? If there are indirects, what item type
> do they have? Would something like btrfs_get_extent() be useful to get
> the indirect extents of a file?
>
So there are no indirects; there are just btrfs_file_extent_items held
within the btree that describe all of the extents belonging to a
particular file.  So you can have (in the case of large fragmented
files) hundreds of leaves within the btree that contain only
btrfs_file_extent_items for all the ranges of a file.
btrfs_get_extent() just looks up the relevant btrfs_file_extent_item
for the range that you are asking about, and maps it to our internal
extent_map structure.  Hth,

Josef
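
The lookup described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not how btrfs_get_extent()
is implemented: the FileExtent record and its field names are
simplified assumptions, and a sorted list stands in for the b-tree.
Because every extent of a file is its own item keyed by file offset,
finding the extent for a position is a single search for the last item
starting at or before that position, with no indirect blocks to walk.

```python
import bisect
from typing import NamedTuple, Optional, Sequence


class FileExtent(NamedTuple):
    """Hypothetical, simplified view of one file extent item."""
    file_offset: int  # where the extent starts within the file
    disk_bytenr: int  # where the data lives on disk
    num_bytes: int    # length of the extent


def lookup_extent(extents: Sequence[FileExtent], pos: int) -> Optional[FileExtent]:
    """Find the extent covering file position pos.

    extents must be sorted by file_offset, the way the items would be
    ordered by key in the tree. Returns None if pos falls in a hole or
    past the last extent.
    """
    starts = [e.file_offset for e in extents]
    i = bisect.bisect_right(starts, pos) - 1  # last extent starting <= pos
    if i < 0:
        return None
    extent = extents[i]
    if pos >= extent.file_offset + extent.num_bytes:
        return None  # pos is beyond the end of that extent
    return extent


# Example data (made up): two extents with a hole between them.
extents = [
    FileExtent(file_offset=0, disk_bytenr=1_000_000, num_bytes=4096),
    FileExtent(file_offset=8192, disk_bytenr=5_000_000, num_bytes=8192),
]

print(lookup_extent(extents, 100))
print(lookup_extent(extents, 4096))
print(lookup_extent(extents, 9000))
```

A large fragmented file just means a longer list, which in the real
tree means more leaves holding file extent items; the lookup itself
stays the same.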

Aastha Mehta

2013-Mar-05 01:55 UTC

Re: basic questions regarding COW in Btrfs

Okay, that makes a lot more sense to me now.

Thank you very much.

Regards,
Aastha.

On 5 March 2013 02:51, Josef Bacik <josef@toxicpanda.com> wrote:
> On Mon, Mar 4, 2013 at 7:57 PM, Aastha Mehta <aasthakm@gmail.com> wrote:
>> I must admit, it is quite convoluted :-)
>>
>> Please tell me if I understand this. A file system tree (containing
>> the inodes, the extents of all the inodes, etc.) is itself laid out in
>> the leaf extents of another big tree, which is the root tree. This is
>> why you say that inode and other such metadata may be lying in the
>> leaf nodes. Correct?
>>
>
> Sort of.  We have lots of trees, but the inode data is laid out in
> what we refer to as fs trees.  All these trees are just b-trees that
> have different data in them.  The fs trees hold inode items, directory
> items, file extent items, xattr items and orphan items.  So any given
> leaf in an fs tree could hold any number of those items, referring to
> any number of inodes.  You could have
>
> [inode item for inode 1][file extent item for inode 1][inode item for
> inode 2][xattr for inode 2][file extent item for inode 2]
>
> all contained within one leaf.  Does that make sense?
>
>> I did not completely understand what you meant when you said that the
>> metadata (the file extent items and such) for the inodes are stored
>> inside the same tree that the inode resides in. I thought the
>> btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
>> the actual data of a file.
>>
>
> Yes, the btrfs_file_extent_item points to an [offset, size] pair that
> describes a data extent.
>
>> Okay, now I am not even sure if in btrfs there is something like an
>> indirect block for a huge file. In file systems with fixed block size,
>> one can hold only as many pointers to data blocks and hence when the
>> file size grows indirects are added in the file's tree. Is there any
>> equivalent indirect extent required for huge files in btrfs, or do all
>> the files fit within one level? If there are indirects, what item type
>> do they have? Would something like btrfs_get_extent() be useful to get
>> the indirect extents of a file?
>>
>
> So there are no indirects; there are just btrfs_file_extent_items held
> within the btree that describe all of the extents belonging to a
> particular file.  So you can have (in the case of large fragmented
> files) hundreds of leaves within the btree that contain only
> btrfs_file_extent_items for all the ranges of a file.
> btrfs_get_extent() just looks up the relevant btrfs_file_extent_item
> for the range that you are asking about, and maps it to our internal
> extent_map structure.  Hth,
>
> Josef
