Dear all,

While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.

One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).

There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.

Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.

Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.

One more question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself. From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.
I'm not a coder, but I have been heavily involved in the storage industry for the past 15 years, so these are just some of the things I come across in real-life enterprise customer environments; just some of my mind spinnings. There are some more, but those would best be covered in another topic.

Let me know your thoughts.

Kind regards,

Erwin van Londen
Systems Engineer
HITACHI DATA SYSTEMS
Level 4, 441 St. Kilda Rd.
Melbourne, Victoria, 3004
Australia
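To make the "low-priority thread that zeros out non-used blocks" idea concrete, here is a minimal user-space sketch (nothing that exists in btrfs today): it fills the free space of a mounted filesystem with an all-zero file, syncs so the zeros actually reach the array, then deletes the file so the space goes back to the filesystem. The file name and the 1 MiB write size are arbitrary choices; in practice such a tool would be run at idle I/O priority (e.g. under ionice) so it stays low-priority.

```c
/* Hypothetical sketch only: zero-fill free space so a thin-provisioned
 * array that reclaims runs of zeros can release the pages, then delete
 * the fill file. Plain POSIX I/O, not btrfs-specific. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "./zerofill.tmp";
    static char buf[1 << 20];              /* 1 MiB of zeros per write */
    memset(buf, 0, sizeof(buf));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Keep writing zeros until the filesystem reports ENOSPC. */
    for (;;) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno != ENOSPC)
                perror("write");
            break;
        }
    }

    fsync(fd);      /* make sure the zeros actually hit the storage */
    close(fd);
    unlink(path);   /* hand the space back to the filesystem */
    return 0;
}
```

Run as, say, ./zerofill /mnt/btrfs/zerofill.tmp. As the rest of the thread points out, issuing a real discard (TRIM/UNMAP) is the better long-term answer than writing zeros.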
Dear Erwin,

Erwin van Londen wrote (ao):
> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

SSDs would also benefit from such a feature, as they wouldn't need to copy deleted data when erasing blocks. Could the storage use the ATA/SCSI commands TRIM, UNMAP and DISCARD for that?

I have one question on thin provisioning: if Windows XP performs a defrag on a 20 GB 'virtual' size LUN with 2 GB in actual use, will the volume grow to 20 GB on the storage and never shrink afterwards, while the client still has only 2 GB in use? That would make thin provisioning on virtual desktops less useful.

Do you have any numbers on the performance impact of thin provisioning? I can imagine that thin provisioning causes on-storage fragmentation of disk images, which would kill any OS optimisations like grouping often-read files.

With kind regards, Sander

--
Humilis IT Services and Solutions
http://www.humilis.net
Erwin van Londen wrote:
> Dear all,
>
> While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.
>
> One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).
>
> There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.

I agree - it would be great to have a standard way to invoke built-in snapshots in enterprise arrays. Does HDS export something that we could invoke from kernel context to perform a snapshot?

> Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.

Effectively, you prefetch and pin these blocks in cache? Is this something we can preconfigure via some interface per LUN?

> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.

> Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.
>
> Furthermore, one question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself.
> From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.
>
> I'm not a coder, but I have been heavily involved in the storage industry for the past 15 years, so these are just some of the things I come across in real-life enterprise customer environments; just some of my mind spinnings.
>
> There are some more, but those would best be covered in another topic.
>
> Let me know your thoughts.
>
> Kind regards,
>
> Erwin van Londen
> Systems Engineer
> HITACHI DATA SYSTEMS
> Level 4, 441 St. Kilda Rd.
> Melbourne, Victoria, 3004
> Australia

Erwin, the real key here is to figure out what standard interfaces we can use to invoke these functions (from kernel context). Having a background in storage myself, the challenge in taking advantage of these advanced array features has been that each vendor has their own back-door APIs to control them. Linux likes to take advantage of features that are well supported by multiple vendors.

If you have HDS engineers interested in exploring this or spilling details on how to trigger these, it would be a great conversation (but not just for btrfs - you would need to talk to the broader SCSI, LVM, etc. lists as well :-))

Ric
Sander wrote:
> Dear Erwin,
>
> Erwin van Londen wrote (ao):
>
>> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> SSDs would also benefit from such a feature, as they wouldn't need to copy deleted data when erasing blocks.

The joy of the proposed T10 UNMAP commands is that you send one block down with special bits set which lets the target know you don't need the whole block range. No real data movement to or from the storage device (other than that one special sector).

> Could the storage use the ATA/SCSI commands TRIM, UNMAP and DISCARD for that?

That is the current plan, and it is being implemented in a way that should let us transparently (from a btrfs level) take advantage of arrays that do UNMAP, or SSD TRIM-enabled devices, with no fs-level changes. Currently, as far as I know, we have no thin-enabled devices to play with.

> I have one question on thin provisioning: if Windows XP performs a defrag on a 20 GB 'virtual' size LUN with 2 GB in actual use, will the volume grow to 20 GB on the storage and never shrink afterwards, while the client still has only 2 GB in use?

That is the inverse of what would happen - the Windows people will probably hack defrag to emit its equivalent of block discard commands for the old blocks after defragging a file (just guessing here). With a per-fs defrag pass, you could use this kind of hook to resync the actual allocated block state between your fs image and the storage.

> This would make thin provisioning on virtual desktops less useful.
>
> Do you have any numbers on the performance impact of thin provisioning? I can imagine that thin provisioning causes on-storage fragmentation of disk images, which would kill any OS optimisations like grouping often-read files.
>
> With kind regards, Sander

Big arrays have virtual block ranges anyway; this is just a different layer of abstraction that is invisible to us. Thin-enabled devices might have a performance hit for small discard commands (that would impact truncate/unlink performance). I suspect the performance impact will vary a lot depending on how the feature is coded internally in the storage, but we will only know when we get one to play with :-)

ric
On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.

We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.

I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.

--
David Woodhouse
Open Source Technology Centre
David.Woodhouse@intel.com
Intel Corporation
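For reference, the discard plumbing mentioned here is already reachable from user space on kernels that have it: the block layer exposes a BLKDISCARD ioctl taking a byte offset/length pair, which gets translated into whatever the low-level driver supports (ATA TRIM today, SCSI UNMAP/WRITE SAME once that support lands). A minimal sketch, assuming a device and kernel with discard support and 512-byte-aligned ranges:

```c
/* Minimal sketch: issue a block-layer discard from user space via the
 * BLKDISCARD ioctl. Assumes the kernel and device support discard. */
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <device> <offset-bytes> <length-bytes>\n",
                argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t range[2];
    range[0] = strtoull(argv[2], NULL, 0);   /* start offset in bytes */
    range[1] = strtoull(argv[3], NULL, 0);   /* length in bytes */

    /* The block layer maps this onto TRIM/UNMAP/WRITE SAME depending on
     * what the underlying driver advertises. */
    if (ioctl(fd, BLKDISCARD, range) < 0) {
        perror("BLKDISCARD");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```

Anything in the range is gone for good, so this should only ever be pointed at ranges the filesystem genuinely no longer uses - which is exactly the bookkeeping btrfs does when it issues TRIM on freed extents.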
David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
>
>>> New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>>
>> Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.
>
> We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.
>
> I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.

What the SCSI spec says is that you can use WRITE SAME with a discard bit set. What the array would do with that is array-dependent - it could in fact write that same block out to each of the blocks if it chooses to do so. The intention would be, of course, to manipulate internal array tracking so that you do no IO.

We should avoid sending that command to arrays that don't really implement the unmap part; it could take a long time to complete a single largish discard request :-)

The nice part of the write-same-with-unmap flavour of the T10 command is that it is very clear about the semantics of what you should get back,

Ric
On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> Dear all,
>
> While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.
>
> One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).
>
> There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.

Storage-based snapshotting would definitely be useful for replication in btrfs, and in that case we could wire it up from userland. Basically there is a point during commit where a storage snapshot could be taken and be fully consistent.

Outside of replication, though, I'm not sure exactly where storage-based snapshotting would come in. It wouldn't really be compatible with the snapshots btrfs is already doing (but I'm always open to more ideas).

> Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.

This would be very useful, although I would tend to export it to btrfs as a second LUN. My long-term goal is to have code in btrfs that supports a super-fast staging LUN, which might be an SSD or cache carved out of a high-end array.

> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

Other people have replied about the trim commands, which btrfs can issue on every block it frees. But another way to look at this is that btrfs already is thinly provisioned.
When you add storage to btrfs, it allocates from that storage in 1GB chunks, and then hands those over to the FS allocation code for more fine-grained use. It may make sense to talk about how that can fit in with your own thin provisioning.

> Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.
>
> Furthermore, one question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself. From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.

This is true. Things like replication and failover are the best plans for it today.

Thanks for your interest, we're always looking for ways to better utilize high-end storage features.

-chris
On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
> >
> > Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.
>
> We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.

It seems like the dust has settled ... I just need to check that my code still conforms to the spec. Understandably, I've been focused on TRIM ;-)

> I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.

Yeah, WRITE SAME with the discard bit. A bit of a crappy way to go, to be sure. I'm not exactly sure how we're supposed to decide whether to issue an UNMAP or a WRITE SAME command. Perhaps if I read the spec properly it'll tell me.

I just had a quick chat with someone from another storage vendor who doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes, their device will notice that and unmap the LBAs in question.

Something for the plane on Sunday anyway.
On Fri, 2009-04-03 at 06:27 -0700, Matthew Wilcox wrote:
> On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> > On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > > New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
> > >
> > > Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.
> >
> > We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.
>
> It seems like the dust has settled ... I just need to check that my code still conforms to the spec. Understandably, I've been focused on TRIM ;-)
>
> > I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.
>
> Yeah, WRITE SAME with the discard bit. A bit of a crappy way to go, to be sure. I'm not exactly sure how we're supposed to decide whether to issue an UNMAP or a WRITE SAME command. Perhaps if I read the spec properly it'll tell me.

Actually, the point about WRITE SAME is that it's a far smaller patch to the standards (just a couple of bits). Plus it gets around the problem of what the array returns when an unmapped block is requested (which occupies pages in the UNMAP proposal), so from that point of view it seems very logical.

> I just had a quick chat with someone from another storage vendor who doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes, their device will notice that and unmap the LBAs in question.

I actually already looked at using WRITE SAME in sd.c ... it turns out to be surprisingly little work ... the thing you'll like about it is that there are no extents to worry about, and if you plan on writing all zeros you can keep a static zeroed data buffer around for the purpose ...

James
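To illustrate what such a command looks like on the wire, here is a hedged sketch of hand-building a WRITE SAME(16) CDB with the UNMAP bit set and pushing it through the SG_IO ioctl from user space. The opcode (0x93) and the UNMAP bit position (bit 3 of byte 1) follow my reading of the SBC-3 drafts discussed in this thread and should be checked against the final spec; the sketch also assumes a 512-byte logical block size and a device node that accepts SG_IO.

```c
/* Hedged sketch: WRITE SAME(16) with the UNMAP bit via SG_IO.
 * Opcode/bit layout per draft SBC-3; verify against the final spec.
 * Assumes 512-byte logical blocks. Do NOT run against an array that
 * does not implement the unmap behaviour (see the discussion above). */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <device> <lba> <nblocks>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t lba = strtoull(argv[2], NULL, 0);
    uint32_t nblocks = strtoul(argv[3], NULL, 0);

    unsigned char data[512];        /* the single all-zero "same" block */
    unsigned char cdb[16];
    unsigned char sense[32];
    memset(data, 0, sizeof(data));
    memset(cdb, 0, sizeof(cdb));

    cdb[0] = 0x93;                  /* WRITE SAME(16) */
    cdb[1] = 1 << 3;                /* UNMAP bit */
    for (int i = 0; i < 8; i++)     /* bytes 2..9: logical block address */
        cdb[2 + i] = (lba >> (8 * (7 - i))) & 0xff;
    for (int i = 0; i < 4; i++)     /* bytes 10..13: number of blocks */
        cdb[10 + i] = (nblocks >> (8 * (3 - i))) & 0xff;

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_TO_DEV;
    io.dxferp = data;
    io.dxfer_len = sizeof(data);
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 60000;             /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); close(fd); return 1; }
    if (io.status != 0)
        fprintf(stderr, "SCSI status 0x%x (device may not support WRITE SAME/UNMAP)\n",
                io.status);
    close(fd);
    return 0;
}
```

As noted earlier in the thread, a target that does not implement the unmap behaviour is allowed to literally write the zeroed block across the whole range, so this is not something to fire blindly at an arbitrary array.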
On Fri, 2009-04-03 at 07:43 -0400, Ric Wheeler wrote:
> Erwin van Londen wrote:
> > Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.

So we went around the block several times for the upcoming Linux Filesystem and Storage workshop, trying to see if anyone from the array vendors might be interested in discussing thin provisioning. The general result was no, since travel is tight. The upshot will be that most of our discard infrastructure will be focussed on SSD TRIM, but we'll try to preserve the TP option for arrays ... there are still private conversations going on with various people who know the UNMAP/WRITE SAME requirements of the various arrays at the various vendors.

James
Chris,

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Chris Mason
> Sent: Saturday, 4 April 2009 12:23 AM
> To: Erwin van Londen
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: btrfs for enterprise raid arrays
>
> On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> > Dear all,
> >
> > While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.
> >
> > One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).
> >
> > There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.
>
> Storage-based snapshotting would definitely be useful for replication in btrfs, and in that case we could wire it up from userland. Basically there is a point during commit where a storage snapshot could be taken and be fully consistent.
>
> Outside of replication, though, I'm not sure exactly where storage-based snapshotting would come in. It wouldn't really be compatible with the snapshots btrfs is already doing (but I'm always open to more ideas).

There is a Linux interface, though I don't think it is open source (unfortunately), which runs from userland but "talks" directly to the array through a so-called "command device". From a usability perspective you could mount this snapshot/shadow image on a second server and process data from there (backups etc.).

> > Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.
>
> This would be very useful, although I would tend to export it to btrfs as a second LUN. My long-term goal is to have code in btrfs that supports a super-fast staging LUN, which might be an SSD or cache carved out of a high-end array.

The problem with that is addressability, especially if you have a significant number of volumes attached to a host and are using FC multi-pathing tools underneath. From an administrative point of view this will complicate things a lot. The option that I mentioned is transparent to the admin, who only needs to add the number of blocks to the mount command or fstab.
Bear in mind that this method (staging in cache) is still a lot faster than having flash drives, since there is no back-end traffic going on: all I/Os only touch cache and front-end ports. As I said, there is also the option to put a full volume in cache, however from a financial point of view this becomes expensive. That's one of the reasons why we came up with the partial option.

> > Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> Other people have replied about the trim commands, which btrfs can issue on every block it frees. But another way to look at this is that btrfs already is thinly provisioned. When you add storage to btrfs, it allocates from that storage in 1GB chunks, and then hands those over to the FS allocation code for more fine-grained use.
>
> It may make sense to talk about how that can fit in with your own thin provisioning.

The problem is that the arrays have to be pre-configured for volume-level allocation. Whether this is a normal volume or a thin-provisioned volume doesn't matter. You're right that if the array had interfaces so it could dynamically allocate blocks from those pools as soon as btrfs addressed them, that would be fantastic. Unfortunately, today that's not the case and it's one-way traffic - not only on our arrays but on those of other vendors as well. So a thin-provisioned volume still presents a fixed number of blocks to the host, although in the back it only allocates those blocks as soon as they get "touched". After that, those blocks remain reserved for that volume from the pool unless something tells the array to free them up. Currently the only way, from an array perspective, to release those pages is when the array sees a consecutive run of zeros. I'm currently not aware that the array software honours the trim commands, but I can have a look around.

> > Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.
> >
> > Furthermore, one question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself. From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes.
> > (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.
>
> This is true. Things like replication and failover are the best plans for it today.
>
> Thanks for your interest, we're always looking for ways to better utilize high-end storage features.

No problem. My interest is in the adoption level as well, and given that large companies will mostly be the ones utilizing these arrays, they are the ones who will benefit most from a filesystem that gives them the flexibility and robustness we're targeting with btrfs.

> -chris