Folks,

I am writing an article for Linux.com to answer Henry Newman's article at http://www.enterprisestorageforum.com/sans/features/article.php/3749926 concerning Linux and massive filesystems. Is there someone here who can field some questions about BTRFS?

Thanks!
Tom King
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
> All the issues he complains about actually are solved by XFS, and XFS actually does better in
> exactly these environments than either zfs on Solaris or JFS2 on AIX.

I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext." Another detail he brought out was sending data and metadata to different devices in those environments, and he referenced RT XFS. Otherwise, having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?

Do y'all foresee btrfs being used in exabyte installations?
Does/Will btrfs have RAID awareness in that it will align "the superblock and metadata to the RAID stripe"?
What is the largest block allocation available?
Will btrfs be T10 DIF/block protect aware?
I remember reading that CRFS relies on btrfs, but will btrfs support NFS, specifically version 4.1?

Thanks!
Tom King
Thomas King wrote:
>> All the issues he complains about actually are solved by XFS, and XFS actually does better in
>> exactly these environments than either zfs on Solaris or JFS2 on AIX.
>
> I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext." Another detail he brought out was sending data and metadata to different devices in those environments and referenced RT XFS. Otherwise having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?
>
> Do y'all foresee btrfs being used in exabyte installations?
> Does/Will btrfs have RAID awareness in that it will align "the superblock and metadata to the RAID stripe"?
> What is the largest block allocation available?
> Will btrfs be T10 DIF/block protect aware?
> I remember reading that CRFS relies on btrfs, but will btrfs support NFS, specifically version 4.1?

You don't mention what I believe is the *key* issue (and I don't think the author did either, but I skimmed his article): data integrity. I'm not talking about blatant failures or a known need for an fsck, but rather silent corruption. Where I work, we are considering multi-petabyte scenarios, and with the specs of current drives, we are talking hundreds of silent errors per read of the volume of data, which is unacceptable. With large filesystems (and he's talking 100 PB, etc.), this is the #1 issue for me.

-Joe
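To put rough numbers on the concern above: the expected count of bad bits in one full read is the number of bits read times the per-bit error rate. The sketch below assumes an illustrative rate of one error per 10^14 bits read. That figure is an assumption chosen for the arithmetic, not a number from this thread or from any particular drive's spec sheet, and `expected_errors` is a hypothetical helper name.

```python
def expected_errors(volume_bytes: int, bit_error_rate: float) -> float:
    """Expected number of bad bits when reading volume_bytes once."""
    return volume_bytes * 8 * bit_error_rate


PB = 10**15  # one decimal petabyte, in bytes

# ASSUMPTION: one error per 1e14 bits read (illustrative only).
ASSUMED_RATE = 1e-14

for size_pb in (1, 5, 100):
    errors = expected_errors(size_pb * PB, ASSUMED_RATE)
    print(f"{size_pb:>3} PB read once: about {errors:.0f} expected errors")
```

Under that assumed rate, a 5 PB volume comes out at about 400 expected errors per full read, which is the "hundreds" order of magnitude mentioned above. A different assumed rate scales the result linearly.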
Hi.

On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King (kingttx@tomslinux.homelinux.org) wrote:
> I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext." Another detail he brought out was sending data and metadata to different devices in those environments and referenced RT XFS. Otherwise having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?

Right now btrfs can be created on top of multiple devices. AFAIK, there are no policies on how to place data and metadata between them.

> Do y'all foresee btrfs being used in exabyte installations?
> Does/Will btrfs have RAID awareness in that it will align "the superblock and metadata to the RAID stripe"?
> What is the largest block allocation available?
> Will btrfs be T10 DIF/block protect aware?
> I remember reading that CRFS relies on btrfs, but will btrfs support NFS, specifically version 4.1?

The original author does not believe in networked filesystems as a key method of organizing large storage :)
The changes a filesystem needs in order to be exported via NFS are quite simple, so that should not be a problem.

--
Evgeniy Polyakov
>>>>> "Joe" == Joe Peterson <lavajoe@gentoo.org> writes:

Joe> You don't mention what I believe is the *key* issue (and I don't
Joe> think the author did either, but I skimmed his article): data
Joe> integrity. I'm not talking about blatant failures or known need
Joe> for an fsck, but rather silent corruption.

We're very concerned about data integrity. With btrfs everything is checksummed at the logical level. This allows you to detect data corruption, repair bad blocks using redundant, good copies, perform data scrubbing, etc.

A related, but orthogonal data integrity measure is the T10 DIF infrastructure that I am working on. DIF enables protection at the sector level and includes stuff like a data checksum and a locality check which ensures that the sector ends up in the right place on disk. If there is a mismatch, the I/O will be rejected by either the HBA or the storage device. That allows us to catch a lot of the corruption scenarios where we accidentally write bad stuff to disk.

Right now the DIF checksum is added at the block layer level. Work is in progress to move it up into the filesystems and from there into user space. Eventually we'd like to be able to generate the checksum in the application and pass it along the I/O path all the way out to the physical disk.

--
Martin K. Petersen
Oracle Linux Engineering
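The two DIF checks described above (a data checksum plus a locality check) can be sketched in a few lines. This is a minimal illustration, assuming the Type 1 layout as I understand it: 8 bytes of protection information per 512-byte sector, made up of a 16-bit guard tag (a CRC of the data using polynomial 0x8BB7), a 16-bit application tag, and a 32-bit reference tag holding the low 32 bits of the target LBA. The function names are hypothetical, and nothing here is the kernel's block-layer code.

```python
import struct

SECTOR_SIZE = 512


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_protection_info(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: guard tag, application tag, reference tag."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("sector must be exactly 512 bytes")
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify(sector: bytes, lba: int, pi: bytes) -> bool:
    """Return True only if both the data checksum and the locality check pass."""
    guard, _app_tag, ref = struct.unpack(">HHI", pi)
    data_ok = guard == crc16_t10dif(sector)      # catches corrupted contents
    place_ok = ref == (lba & 0xFFFFFFFF)         # catches a misdirected write
    return data_ok and place_ok
```

In this model, a device or HBA that receives a sector whose tuple fails `verify` would reject the I/O rather than commit bad data, which is the behaviour described in the message above.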
Hi,

On Tue, Jun 3, 2008 at 4:52 PM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Hi.
>
> On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King (kingttx@tomslinux.homelinux.org) wrote:
>> I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext." Another detail he brought out was sending data and metadata to different devices in those environments and referenced RT XFS. Otherwise having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?
>
> Right now btrfs can be created on top of multiple devices.
> AFAIK, there are no policies on how to put data and metadata between them.

But it does allow you to specify different replication/striping policies for metadata and data. For example: configure a raid0 with N drives, but mirror the metadata across all of them.

>> Do y'all foresee btrfs being used in exabyte installations?
>> Does/Will btrfs have RAID awareness in that it will align "the superblock and metadata to the RAID stripe"?

This is a feature that is intended to be provided in the future; it was discussed in the #btrfs@freenode.org IRC channel. There isn't code for this currently.

--
Miguel Sousa Filipe
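The example policy above (stripe the data, mirror the metadata) can be pictured with a toy placement function. This is only a sketch of the idea under simple assumptions: `place_block` is a hypothetical name, devices are plain indices, and striping is modelled as round-robin by block index. It does not reflect how the btrfs allocator actually chooses locations.

```python
def place_block(kind: str, index: int, num_devices: int) -> list[int]:
    """Return the device indices that should hold a given block."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    if kind == "metadata":
        # Mirrored: every device gets a copy.
        return list(range(num_devices))
    if kind == "data":
        # Striped (raid0-style): exactly one device, chosen round-robin.
        return [index % num_devices]
    raise ValueError(f"unknown block kind: {kind!r}")


if __name__ == "__main__":
    for i in range(4):
        print("data", i, "->", place_block("data", i, 3))
    print("metadata 0 ->", place_block("metadata", 0, 3))
```

The point of the split is that losing one device costs some data stripes but leaves a complete copy of the metadata on each surviving device.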
Martin K. Petersen wrote:
> We're very concerned about data integrity. With btrfs everything is
> checksummed at the logical level. This allows you to detect data
> corruption, repair bad blocks using redundant, good copies, perform
> data scrubbing, etc.

That's the main reason I am interested in btrfs, actually. :)

> A related, but orthogonal data integrity measure is the T10 DIF
> infrastructure that I am working on. DIF enables protection at the
> sector level and includes stuff like a data checksum and a locality
> check which ensures that the sector ends up the right place on disk.

Great! Really great to hear that this issue is being actively worked on.

> Right now the DIF checksum is added at the block layer level. Work is
> in progress to move it up into the filesystems and from there into
> user space. Eventually we'd like to be able to generate the checksum
> in the application and pass it along the I/O path all the way out to
> the physical disk.

Yep, end-to-end is a great idea. Kudos to this and to btrfs!

-Joe
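The checksum-and-repair idea quoted above (detect corruption, then fix the bad block from a redundant good copy) can be modelled with a toy two-copy store. This is a minimal sketch, not btrfs code: `MirroredStore` is a hypothetical class, and `zlib.crc32` stands in for whatever checksum the filesystem really uses.

```python
import zlib


class MirroredStore:
    """Toy two-copy block store that keeps one checksum per logical block."""

    def __init__(self) -> None:
        self.copies = [{}, {}]   # two independent "devices"
        self.checksums = {}      # block number -> checksum of the good data

    def write(self, block_no: int, data: bytes) -> None:
        self.checksums[block_no] = zlib.crc32(data)
        for copy in self.copies:
            copy[block_no] = data

    def read(self, block_no: int) -> bytes:
        expected = self.checksums[block_no]
        for copy in self.copies:
            data = copy[block_no]
            if zlib.crc32(data) == expected:
                # Found a good copy: rewrite any copy that fails its checksum.
                for other in self.copies:
                    if zlib.crc32(other[block_no]) != expected:
                        other[block_no] = data
                return data
        raise OSError(f"block {block_no}: every copy failed its checksum")
```

A scrub in this model is just a loop that calls `read` on every block number, so silent corruption is found and repaired before a second copy has a chance to go bad.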
On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King wrote:
>> All the issues he complains about actually are solved by XFS, and XFS actually does better in
>> exactly these environments than either zfs on Solaris or JFS2 on AIX.
>
> I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext."

I think it would be best to let the XFS developers answer this part. But, XFS is designed for and used in massive installations, and I think it represents a scalability goal for Btrfs.

> Another detail he brought out was sending data and metadata to different devices in those environments and referenced RT XFS. Otherwise having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?

Btrfs can duplicate metadata via the internal raid1 and raid10 code. On single spindles it will duplicate metadata as well. This is different from RT XFS, which I do not understand well.

There is no code today in btrfs to force data and metadata to different devices, but the disk format has the bits it needs to make that happen. I think it is an oversimplification to say that splitting the two between devices changes the chances of a corruption, or changes the time a repair takes.

Btrfs does split data and metadata allocations, grouping metadata together in large chunks on the drive. This does make FS check/repair faster by reducing seeks between metadata blocks.

> Do y'all foresee btrfs being used in exabyte installations?

Yes

> Does/Will btrfs have RAID awareness in that it will align "the superblock and metadata to the RAID stripe"?

Today the superblock is not stripe aligned, but it will be in a future release that supports super block duplication. At least, the blocks that are frequently written will be stripe aligned.

> What is the largest block allocation available?

2^64 bytes. But, in COW filesystems massive extents have different costs than they do in traditional filesystems. It isn't always a good idea to make a huge extent.

> Will btrfs be T10 DIF/block protect aware?

I work closely with Martin, and we'll leverage the T10 DIF code as much as possible.

> I remember reading that CRFS relies on btrfs, but will btrfs support NFS, specifically version 4.1?

We'll definitely support NFS. It doesn't work today, but it will before 1.0.

-chris
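Two of the answers above reduce to simple arithmetic: stripe alignment means placing a block at an offset that is a whole multiple of the stripe size, and 2^64 bytes is 16 EiB. The sketch below shows both. The 64 KiB strip and four data disks are made-up example values, and `align_up` and `is_aligned` are hypothetical helper names rather than anything from btrfs.

```python
def is_aligned(offset: int, stripe_size: int) -> bool:
    """True if offset sits exactly on a stripe boundary."""
    return offset % stripe_size == 0


def align_up(offset: int, stripe_size: int) -> int:
    """Round offset up to the next multiple of stripe_size."""
    return -(-offset // stripe_size) * stripe_size


KIB = 1024
EIB = 2**60

# ASSUMPTION: example geometry only, 64 KiB per disk across 4 data disks.
FULL_STRIPE = 64 * KIB * 4   # 256 KiB

if __name__ == "__main__":
    wanted = 300 * KIB
    placed = align_up(wanted, FULL_STRIPE)
    print(f"offset {wanted} aligned: {is_aligned(wanted, FULL_STRIPE)}")
    print(f"next aligned offset: {placed} ({placed // KIB} KiB)")
    print(f"2^64 bytes = {2**64 // EIB} EiB")
```

An aligned write of a full stripe touches each disk once, while an unaligned one straddles two stripes, which is why the frequently written blocks are the ones worth aligning.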
On Tue, Jun 3, 2008 at 11:37 PM, Thomas King <kingttx@tomslinux.homelinux.org> wrote:
>> All the issues he complains about actually are solved by XFS, and XFS actually does better in
>> exactly these environments than either zfs on Solaris or JFS2 on AIX.
>
> I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext." Another detail he brought out was sending data and metadata to different devices in those environments and referenced RT XFS. Otherwise having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?
>
> Do y'all foresee btrfs being used in exabyte installations?
> Does/Will btrfs have RAID awareness in that it will align "the superblock and metadata to the RAID stripe"?
> What is the largest block allocation available?
> Will btrfs be T10 DIF/block protect aware?
> I remember reading that CRFS relies on btrfs, but will btrfs support NFS, specifically version 4.1?

I would also like to comment that btrfs is ready for the future of storage: the solid state drive. Btrfs performs well on both HDD and SSD. AFAIK, the ssd option of btrfs only affects the block allocation behavior. However, with a hybrid combination of HDD and SSD under the multi-device support of btrfs, there can be more interesting optimizations that exploit the physical characteristics of each device.

--
Dongjun
> On Tue, Jun 03, 2008 at 09:37:27AM -0500, Thomas King wrote:
>>> All the issues he complains about actually are solved by XFS, and XFS actually does better in
>>> exactly these environments than either zfs on Solaris or JFS2 on AIX.
>>
>> I asked the author that question and he states XFS is actually a pretty good answer to most of those issues but believes it still falls short where "the metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext."
>
> I think it would be best to let the XFS developers answer this part.
> But, XFS is designed for and used in massive installations, and I think
> it represents a scalability goal for Btrfs.
>
>> Another detail he brought out was sending data and metadata to different devices in those environments and referenced RT XFS. Otherwise having them on the same device increases the possibility of corruption and/or a longer filesystem check/repair. Will btrfs offer something like this in the future?
>
> Btrfs can duplicate metadata via the internal raid1 and raid10 code. On
> single spindles it will duplicate metadata as well. This is different
> from RT XFS which I do not understand well.
>
> There is not code today in btrfs to force data and metadata to different
> devices, but the disk format has the bits it needs to make that happen.
> I think it is an oversimplification to say that splitting the two
> between devices changes the chances of a corruption, or changes the time
> a repair takes.
>
> Btrfs does split data and metadata allocations, grouping metadata
> together in large chunks on the drive. This does make FS check/repair
> faster by reducing seeks between metadata blocks.
>
>> Do y'all foresee btrfs being used in exabyte installations?
>
> Yes
>
>> Does/Will btrfs have RAID awareness in that it will align "the
>> superblock and metadata to the RAID stripe"?
>
> Today the superblock is not stripe aligned, but it will be in a future
> release that supports super block duplication. At least, the
> blocks that are frequently written will be striped aligned.
>
>> What is the largest block allocation available?
>
> 2^64 bytes. But, in COW filesystems massive extents have different
> costs than they do in traditional filesystems. It isn't always a good
> idea to make a huge extent.
>
>> Will btrfs be T10 DIF/block protect aware?
>
> I work closely with Martin, and we'll leverage the T10 DIF code as much
> as possible.
>
>> I remember reading that CRFS relies on btrfs, but will btrfs support NFS,
>> specifically version 4.1?
>
> We'll definitely support NFS. It doesn't work today, but it will before
> 1.0.
>
> -chris

Chris,

Thanks a ton for answering all these questions. I've asked the XFS developers about what was discussed here and they gave some excellent info as well.

Enjoy your day!
Tom King
> I also would like to comment that btrfs is ready for the future storage
> - the solid state drive. Btrfs performs well on both HDD and SSD.

SSD is still very expensive when compared to traditional hard disks.

*If* btrfs supported compression, I would second your opinion that btrfs is (will be, when it's stable) ready for the future of storage.

--
Tomasz Chmielewski
http://wpkg.org
> SSD is still very expensive when compared to traditional hard disks.

When measured by GB/$, sure. Many data centers, though, care more about (ops/sec) / ($ * power * heat). SSDs look much more compelling by that metric.

- z