Hello,

I am trying to understand the COW mechanism in Btrfs. Is it correct to
say that unless nodatacow option is specified, Btrfs always performs
COW for all the data+metadata extents used in the system?

I saw that COWing is implemented in btrfs_cow_block() function, which
is called at the time of searching a slot for a particular item, while
inserting into a new slot, committing transactions, while creating
pending snapshots and few other places.

However, while tracing through the complete write path, I could not
quite figure out when extents actually get COWed. Could you please
point me to the place where COWing takes place? Is there any time
when, for performance or any other reasons, the extents are not COWed
but overwritten in place (apart from the explicit nodatacow flag being
set during mount)?

Thanks,
Aastha.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Wed, Feb 20, 2013 at 10:28:10AM -0700, Aastha Mehta wrote:
> Hello,
>
> I am trying to understand the COW mechanism in Btrfs. Is it correct to
> say that unless nodatacow option is specified, Btrfs always performs
> COW for all the data+metadata extents used in the system?
>

So we always cow the metadata, but yes, nodatacow means we don't cow the
actual data in the data extents.

> I saw that COWing is implemented in btrfs_cow_block() function, which
> is called at the time of searching a slot for a particular item, while
> inserting into a new slot, committing transactions, while creating
> pending snapshots and few other places.
>
> However, while tracing through the complete write path, I could not
> quite figure out when extents actually get COWed. Could you please
> point me to the place where COWing takes place? Is there any time
> when, for performance or any other reasons, the extents are not COWed
> but overwritten in place (apart from the explicit nodatacow flag being
> set during mount)?

You'll want to look at the tree operation ->fill_delalloc(). That's
where we do cow_file_range(). We allocate new space and write. When we
finish the ordered io we do btrfs_drop_extents() on the range we just
wrote, which will free up any existing extents in that range, and then
insert our new file extent. Thanks,

Josef
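The sequence in the reply above (allocate new space, write, then drop the old extent and insert the new one) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `toy_fs`, `toy_alloc`, `toy_store`, `toy_cow_write` and `toy_nocow_write` are hypothetical names, a file is reduced to a single one-block extent, and none of this is the kernel implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_NBLOCKS    8
#define TOY_BLOCK_SIZE 16

/* Hypothetical toy model, not kernel code: a tiny disk and one file extent. */
struct toy_fs {
    char disk[TOY_NBLOCKS][TOY_BLOCK_SIZE];
    int  in_use[TOY_NBLOCKS];  /* 1 while a block is allocated */
    int  file_block;           /* block the file points at, -1 if none */
};

void toy_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof(*fs));
    fs->file_block = -1;
}

/* Claim the first free block; returns -1 when the disk is full. */
int toy_alloc(struct toy_fs *fs)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        if (!fs->in_use[i]) {
            fs->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Copy a string into one block, truncating it to the block size. */
void toy_store(struct toy_fs *fs, int block, const char *data)
{
    assert(block >= 0 && block < TOY_NBLOCKS);
    snprintf(fs->disk[block], TOY_BLOCK_SIZE, "%s", data);
}

/*
 * COW write: allocate new space, write the data there, and only then
 * free the old extent and point the file at the new one.
 * Returns the new block number, or -1 if no space is left.
 */
int toy_cow_write(struct toy_fs *fs, const char *data)
{
    int new_block = toy_alloc(fs);

    if (new_block < 0)
        return -1;
    toy_store(fs, new_block, data);        /* data goes to new space */
    if (fs->file_block >= 0)
        fs->in_use[fs->file_block] = 0;    /* drop the old extent */
    fs->file_block = new_block;            /* insert the new file extent */
    return new_block;
}

/* nodatacow-style write: overwrite the existing block in place. */
int toy_nocow_write(struct toy_fs *fs, const char *data)
{
    if (fs->file_block < 0)
        return toy_cow_write(fs, data);    /* nothing to overwrite yet */
    toy_store(fs, fs->file_block, data);
    return fs->file_block;
}
```

In this model a second `toy_cow_write()` lands in a different block and leaves the previous contents untouched until that block is reused, while `toy_nocow_write()` returns the same block number it started with.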

Thanks a lot for the prompt response. I had seen that, but I am still
not sure where it really happens within fill_delalloc. Could you help
me a little further in that path?

Secondly, now I am confused between the btree_writepages and
btrfs_writepages/btrfs_writepage methods. I thought btrfs_writepages was
for writing the pages holding inodes and btree_writepages for writing
the other indirect and leaf extents of the btree. Then, it seems that
the write operations lead to update of the file system data structures
in a top-down manner, i.e. first changing the inode and then the data
extents. Is that correct?

Thirdly, it seems that the old extents may be dropped before the new
extents are flushed to the disk. What would happen if the write fails
before the disk commit? What am I missing here?

Thanks,
Aastha.

A gentle reminder on this one.

Thanks,
Aastha.

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasthakm@mpi-sws.org

Thanks again Josef.

I understood that cow_file_range is called for a regular file. Just to
clarify, in cow_file_range is cow done at the time of reserving
extents in the extent btree for the io to be done in this delalloc? I
see the following comment above find_free_extent() which is called
while trying to reserve extents:

/*
 * walks the btree of allocated extents and find a hole of a given size.
 * The key ins is changed to record the hole:
 * ins->objectid == block start
 * ins->flags = BTRFS_EXTENT_ITEM_KEY
 * ins->offset == number of blocks
 * Any available blocks before search_start are skipped.
 */

This seems to be the only place where a cow might be done, because a
key is being inserted into an extent which modifies it.

Thanks,
Aastha.

On 24 February 2013 02:39, Josef Bacik <josef@toxicpanda.com> wrote:
> On Thu, Feb 21, 2013 at 12:32 PM, Aastha Mehta <aasthakm@gmail.com> wrote:
>>
>> Thanks a lot for the prompt response. I had seen that, but I am still
>> not sure of where it really happens within fill_delalloc. Could you
>> help me a little further in that path?
>>
>
> So we check the properties of the inode and do one of 3 things: either
> we call cow_file_range directly in the case of a normal file,
> run_delalloc_nocow in the case of a file with prealloc extents or
> NOCOW, or we do the compression dance. We make an ordered extent for
> this range and return. And then the normal io path happens.
>
>>
>> Secondly, now I am confused between the btree_writepages and
>> btrfs_writepages/btrfs_writepage methods. I thought btrfs_writepages
>> was for writing the pages holding inodes and btree_writepages for
>> writing the other indirect and leaf extents of the btree. Then, it
>> seems that the write operations lead to update of the file system
>> data structures in a top-down manner, i.e. first changing the inode
>> and then the data extents. Is that correct?
>>
>
> You are right that btrfs_writepages/writepage are for normal files and
> btree_writepages is for the metadata. The write operations do start in
> data and then modify metadata later down the line, if that is what you
> are getting at.
>
>>
>> Thirdly, it seems that the old extents may be dropped before the new
>> extents are flushed to the disk. What would happen if the write fails
>> before the disk commit? What am I missing here?
>>
>
> Yeah, the metadata isn't updated until the data is on the disk. In
> ->fill_delalloc we set up a btrfs_ordered_extent that describes the
> range of the dirty pages we are writing. When we've written all these
> pages we run btrfs_finish_ordered_io, which will drop the old extent
> entries if there are any and then add the new extent entries and
> update the references and such. So if something fails we just continue
> to point to the original file extent entries and return an EIO. We
> maintain consistency by making sure the metadata is updated only after
> the data is written out. I hope that helps. Thanks,
>
> Josef

On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
> Thanks again Josef.
>
> I understood that cow_file_range is called for a regular file. Just to
> clarify, in cow_file_range is cow done at the time of reserving
> extents in the extent btree for the io to be done in this delalloc? I
> see the following comment above find_free_extent() which is called
> while trying to reserve extents:
>
> /*
>  * walks the btree of allocated extents and find a hole of a given size.
>  * The key ins is changed to record the hole:
>  * ins->objectid == block start
>  * ins->flags = BTRFS_EXTENT_ITEM_KEY
>  * ins->offset == number of blocks
>  * Any available blocks before search_start are skipped.
>  */
>
> This seems to be the only place where a cow might be done, because a
> key is being inserted into an extent which modifies it.
>

The key isn't inserted at this time, it's just returned with those
values for us to do as we please. There is no update of the btree until
insert_reserved_extent/btrfs_mark_extent_written in
btrfs_finish_ordered_io. Thanks,

Josef
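The distinction drawn above, that reserving space only fills in a key while the file's extent entry changes later when the ordered io finishes, can be illustrated with a minimal sketch. Everything here is an assumption made for the example: `toy_key`, `toy_file_extent`, `toy_ordered_extent`, `toy_reserve_extent` and `toy_finish_ordered_io` are hypothetical stand-ins, the allocator is a simple bump pointer, and the failure case models only the "keep pointing at the original extent and return EIO" behaviour described in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the key that the reservation fills in. */
struct toy_key {
    uint64_t objectid;  /* block start */
    uint64_t offset;    /* number of blocks */
};

/* What the file currently points at (its extent entry). */
struct toy_file_extent {
    uint64_t disk_start;
    uint64_t num_blocks;
};

/* Describes one in-flight write of dirty pages. */
struct toy_ordered_extent {
    struct toy_key reserved;  /* space handed out by the reservation */
    int io_failed;            /* nonzero if writing the data failed */
};

/*
 * Reservation: only records the hole in *ins and advances a toy bump
 * allocator. No extent entry of any file is touched here.
 */
void toy_reserve_extent(uint64_t *next_free, uint64_t num_blocks,
                        struct toy_key *ins)
{
    assert(num_blocks > 0);
    ins->objectid = *next_free;
    ins->offset = num_blocks;
    *next_free += num_blocks;
}

/*
 * Completion: runs after the data write. On failure the file keeps its
 * original extent and the caller sees -EIO; on success the old entry is
 * replaced by the newly written extent.
 */
int toy_finish_ordered_io(struct toy_file_extent *fe,
                          const struct toy_ordered_extent *oe)
{
    if (oe->io_failed)
        return -EIO;
    fe->disk_start = oe->reserved.objectid;
    fe->num_blocks = oe->reserved.offset;
    return 0;
}
```

The point of splitting the two steps in the sketch is that a failed write never leaves the file pointing at space whose data did not reach the disk.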

Ah okay, I now see how it works. Thanks a lot for your response.

Regards,
Aastha.

Hi Josef,
I hope it's ok to piggyback on this thread for the following question:

I see that in the btrfs_cross_ref_exist()=>check_committed_ref() path,
there is the following check:

    if (btrfs_extent_generation(leaf, ei) <=
        btrfs_root_last_snapshot(&root->root_item))
            goto out;

So this basically means that after we have taken a snap of a subvol,
then all of the subvol's extents must be COW'ed, even if we delete the
snap a minute later.
I wonder, why is that so?
Is this because file extents can be shared indirectly, like when we
create a snap, we only COW the root and only mark all root's
*immediate* children shared in the extent tree?

Can the new backref walking code be used here to check more
accurately, if the extent is shared by anybody else?

Thanks,
Alex.

Hi Josef,

I have some more questions following up on my previous e-mails.
I now do somewhat understand the place where extent entries get
cow'ed. But I am unclear about the order of operations.

Is it correct that the data extent is written first, then the pointer in
the indirect block needs to be updated, so then it is cowed and
written to disk and so on recursively up the tree? Or is the entire
path from leaf to node that is going to be affected by the write cowed
first and then all the cowed extents are written to the disk and then
the rest of the metadata pointers (for example, in checksum tree,
extent tree, etc., I am not sure about this)?

Also, I need to understand specifically how the data (leaf nodes) of a
file is written to disk vs. the metadata including the indirect nodes
of the file. In extent_writepage I only know the pages of a file that
are to be written. I guess I can identify metadata pages based on the
inode of the page's owner. But is it possible to distinguish the pages
available in extent_writepage path as belonging to the leaf node or
internal node for a file? If it cannot be identified at this point,
where earlier in the path can this be decided?

Many thanks,
Aastha.

On Sat, Mar 2, 2013 at 4:07 PM, Alex Lyakas <alex.btrfs@zadarastorage.com> wrote:
> Hi Josef,
> I hope it's ok to piggyback on this thread for the following question:
>
> I see that in the btrfs_cross_ref_exist()=>check_committed_ref() path,
> there is the following check:
>
>     if (btrfs_extent_generation(leaf, ei) <=
>         btrfs_root_last_snapshot(&root->root_item))
>             goto out;
>
> So this basically means that after we have taken a snap of a subvol,
> then all of the subvol's extents must be COW'ed, even if we delete the
> snap a minute later.
> I wonder, why is that so?
> Is this because file extents can be shared indirectly, like when we
> create a snap, we only COW the root and only mark all root's
> *immediate* children shared in the extent tree?

Yes, that's exactly it. We have no way of knowing that there are no
snapshots left for this particular root, so if there ever was a snapshot
we have to err on the side of caution.

> Can the new backref walking code be used here to check more
> accurately, if the extent is shared by anybody else?
>

Probably. If we could figure out whether there is a way for more than
one root to point to this extent then yes, this would be ideal, so we
don't have to force COW in cases we would rather not. Thanks,

Josef

On Sun, Mar 3, 2013 at 10:41 AM, Aastha Mehta <aasthakm@gmail.com> wrote:
> Hi Josef,
>
> I have some more questions following up on my previous e-mails.
> I now do somewhat understand the place where extent entries get
> cow'ed. But I am unclear about the order of operations.
>
> Is it correct that the data extent is written first, then the pointer in
> the indirect block needs to be updated, so then it is cowed and
> written to disk and so on recursively up the tree? Or is the entire
> path from leaf to node that is going to be affected by the write cowed
> first and then all the cowed extents are written to the disk and then
> the rest of the metadata pointers (for example, in checksum tree,
> extent tree, etc., I am not sure about this)?

The second one. We COW the entire path from root to leaf as things
need COW'ing. We start a transaction, we insert the file extent
entries, we add the checksums, and we add the delayed ref updates to
the extent tree. The delayed things are guaranteed to happen in that
transaction so we have consistency there. The COW'ing from top to
bottom works like that for all trees.

>
> Also, I need to understand specifically how the data (leaf nodes) of a
> file is written to disk vs. the metadata including the indirect nodes
> of the file. In extent_writepage I only know the pages of a file that
> are to be written. I guess I can identify metadata pages based on the
> inode of the page's owner. But is it possible to distinguish the pages
> available in extent_writepage path as belonging to the leaf node or
> internal node for a file? If it cannot be identified at this point,
> where earlier in the path can this be decided?
>

So they are different things, and they could change from the time we
write to the time that the write completes because of COW. Also keep
in mind that the metadata (the file extent items and such) for the
inodes is not stored specifically within the inode; it is stored
inside the same tree that the inode resides in. So you can have a
leaf node with multiple inodes and extents for those different inodes.
And so any sort of random things can happen: other inodes can be
deleted and this inode's metadata will be shifted into a new leaf, or
another inode could be added and this inode's data could be pushed off
into an adjacent leaf. The only way to know which leaf/page the inode
is associated with is to search for whatever you are looking for in
the tree, and then while you are holding all of the locks and
reference counts you can be sure that those pages contain the
metadata you are looking for, but once you let that go there are no
guarantees.

So as far as how it is written to disk, that is where transactions
come in. We track all the dirty metadata pages we have per
transaction, and then at transaction commit time we make sure that all
of those pages are written to disk and then we commit our super to
point to the new root of the tree root, which in turn points at all of
our new roots because of COW. These pages can be written before the
commit though because of memory pressure, and if they are written and
then modified again within the same transaction we will re-cow them
to make sure we don't have any partial-page updates. Keeping track of
where a specific inode's metadata is contained is a tricky business.
Let me know if that helped. Thanks,

Josef

I must admit, it is quite convoluted :-)

Please tell me if I understand this. A file system tree (containing
the inodes, the extents of all the inodes, etc.) is itself laid out in
the leaf extents of another big tree, which is the root tree. This is
why you say that inode and other such metadata may be lying in the
leaf nodes. Correct?

I did not completely understand what you meant when you said that the
metadata (the file extent items and such) for the inodes are stored
inside the same tree that the inode resides in. I thought the
btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
the actual data of a file.

Okay, now I am not even sure if in btrfs there is something like an
indirect block for a huge file. In file systems with fixed block size,
one can hold only as many pointers to data blocks and hence when the
file size grows indirects are added in the file's tree. Is there any
equivalent indirect extent required for huge files in btrfs, or do all
the files fit within one level? If there are indirects, what item type
do they have? Would something like btrfs_get_extent() be useful to get
the indirect extents of a file?

Too many questions, sorry :(

Thanks.

On Mon, Mar 4, 2013 at 7:57 PM, Aastha Mehta <aasthakm@gmail.com> wrote:
> I must admit, it is quite convoluted :-)
>
> Please tell me if I understand this. A file system tree (containing
> the inodes, the extents of all the inodes, etc.) is itself laid out in
> the leaf extents of another big tree, which is the root tree. This is
> why you say that inode and other such metadata may be lying in the
> leaf nodes. Correct?
>

Sort of. We have lots of trees, but the inode data is laid out in
what we refer to as fs trees. All these trees are just b-trees that
have different data in them. The fs trees hold inode items, directory
items, file extent items, xattr items and orphan items. So any given
leaf in this tree could have any number of those items in it, referring
to any number of inodes. You could have

  [inode item for inode 1][file extent item for inode 1][inode item for
  inode 2][xattr for inode 2][file extent item for inode 2]

all contained within one leaf. Does that make sense?

> I did not completely understand what you meant when you said that the
> metadata (the file extent items and such) for the inodes are stored
> inside the same tree that the inode resides in. I thought the
> btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
> the actual data of a file.
>

Yes, the btrfs_file_extent_item points to an [offset, size] pair that
describes a data extent.

> Okay, now I am not even sure if in btrfs there is something like an
> indirect block for a huge file. In file systems with fixed block size,
> one can hold only as many pointers to data blocks and hence when the
> file size grows indirects are added in the file's tree. Is there any
> equivalent indirect extent required for huge files in btrfs, or do all
> the files fit within one level? If there are indirects, what item type
> do they have? Would something like btrfs_get_extent() be useful to get
> the indirect extents of a file?
>

So there are no indirects, there are just btrfs_file_extent_items that
are held within the btree that describe all of the extents that relate
to a particular file. So you can have (in the case of large
fragmented files) hundreds of leaves within the btree that just
contain btrfs_file_extent_items for all the ranges of a file.
btrfs_get_extent() just looks up the relevant btrfs_file_extent_item
for the range that you are wondering about, and maps it to our
extent_map structure internally. Hth,

Josef

Okay, that makes a lot more sense to me now. Thank you very much.

Regards,
Aastha.
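To make the leaf layout from Josef's last reply concrete, here is a minimal sketch of one fs-tree leaf holding items for two inodes, with a lookup that finds the file extent item covering a given file offset. All names (`toy_item`, `toy_item_type`, `toy_get_extent`) and the field layout are assumptions for illustration. The real btrfs_get_extent() searches a b-tree and fills an extent_map, whereas this version simply scans an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical item kinds; the real tree uses numeric key types. */
enum toy_item_type { TOY_INODE_ITEM, TOY_XATTR_ITEM, TOY_EXTENT_DATA };

/* One item in a leaf. Extent fields are only meaningful for extent items. */
struct toy_item {
    uint64_t objectid;        /* inode number the item belongs to */
    enum toy_item_type type;
    uint64_t offset;          /* extent items: file offset where it starts */
    uint64_t disk_start;      /* extent items: where the data lives on disk */
    uint64_t len;             /* extent items: length in bytes */
};

/*
 * Find the file extent item of inode `ino` that covers `file_off`.
 * Returns NULL when no extent item covers that offset. There is no
 * indirect level: every range of the file is described directly by an
 * extent item sitting next to items of other inodes.
 */
const struct toy_item *toy_get_extent(const struct toy_item *leaf,
                                      size_t nitems, uint64_t ino,
                                      uint64_t file_off)
{
    assert(leaf != NULL);
    for (size_t i = 0; i < nitems; i++) {
        const struct toy_item *item = &leaf[i];

        if (item->objectid != ino || item->type != TOY_EXTENT_DATA)
            continue;
        if (file_off >= item->offset && file_off < item->offset + item->len)
            return item;
    }
    return NULL;
}
```

A large fragmented file would simply contribute many more extent items, possibly spread over many leaves, rather than an extra level of indirection.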