Dear all,

While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.

One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).

There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.

Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.

Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.

One more question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself. From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.
I'm not a coder, but I have been heavily involved in the storage industry for the past 15 years, so these are just some of the things I come across in real-life enterprise customer environments; just some of my mind spinnings. There are some more, but those would best be covered in another topic.

Let me know your thoughts.

Kind regards,

Erwin van Londen
Systems Engineer
HITACHI DATA SYSTEMS
Level 4, 441 St. Kilda Rd.
Melbourne, Victoria, 3004
Australia
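To make the "low-priority thread that zeros out non-used blocks" idea concrete, here is a minimal user-space sketch (nothing that exists in btrfs today): it fills the free space of a mounted filesystem with an all-zero file, syncs so the zeros actually reach the array, then deletes the file so the space goes back to the filesystem. The file name and the 1 MiB write size are arbitrary choices; in practice such a tool would be run at idle I/O priority (e.g. under ionice) so it stays low-priority.

```c
/* Hypothetical sketch only: zero-fill free space so a thin-provisioned
 * array that reclaims runs of zeros can release the pages, then delete
 * the fill file. Plain POSIX I/O, not btrfs-specific. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "./zerofill.tmp";
    static char buf[1 << 20];              /* 1 MiB of zeros per write */
    memset(buf, 0, sizeof(buf));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Keep writing zeros until the filesystem reports ENOSPC. */
    for (;;) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno != ENOSPC)
                perror("write");
            break;
        }
    }

    fsync(fd);      /* make sure the zeros actually hit the storage */
    close(fd);
    unlink(path);   /* hand the space back to the filesystem */
    return 0;
}
```

Run as, say, ./zerofill /mnt/btrfs/zerofill.tmp. As the rest of the thread points out, issuing a real discard (TRIM/UNMAP) is the better long-term answer than writing zeros.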
Dear Erwin,

Erwin van Londen wrote (ao):
> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

SSDs would also benefit from such a feature, as they wouldn't need to copy deleted data when erasing blocks. Could the storage use the ATA/SCSI commands TRIM, UNMAP and DISCARD for that?

I have one question on thin provisioning: if Windows XP performs a defrag on a 20 GB 'virtual' size LUN with 2 GB in actual use, will the volume grow to 20 GB on the storage and never shrink afterwards, while the client still has only 2 GB in use? That would make thin provisioning on virtual desktops less useful.

Do you have any numbers on the performance impact of thin provisioning? I can imagine that thin provisioning causes on-storage fragmentation of disk images, which would kill any OS optimisations like grouping often-read files.

With kind regards, Sander

--
Humilis IT Services and Solutions
http://www.humilis.net
Erwin van Londen wrote:
> Dear all,
>
> While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.
>
> One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).
>
> There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.

I agree - it would be great to have a standard way to invoke built-in snapshots in enterprise arrays. Does HDS export something that we could invoke from kernel context to perform a snapshot?

> Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.

Effectively, you prefetch and pin these blocks in cache? Is this something we can preconfigure via some interface per LUN?

> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.

> Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.
>
> Furthermore, one question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself.
> From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.
>
> I'm not a coder, but I have been heavily involved in the storage industry for the past 15 years, so these are just some of the things I come across in real-life enterprise customer environments; just some of my mind spinnings.
>
> There are some more, but those would best be covered in another topic.
>
> Let me know your thoughts.
>
> Kind regards,
>
> Erwin van Londen
> Systems Engineer
> HITACHI DATA SYSTEMS
> Level 4, 441 St. Kilda Rd.
> Melbourne, Victoria, 3004
> Australia

Erwin, the real key here is to figure out what standard interfaces we can use to invoke these functions (from kernel context). Having a background in storage myself, the challenge in taking advantage of these advanced array features has been that each vendor has their own back-door APIs to control them. Linux likes to take advantage of features that are well supported by multiple vendors.

If you have HDS engineers interested in exploring this or spilling details on how to trigger these, it would be a great conversation (but not just for btrfs - you would need to talk to the broader SCSI, LVM, etc. lists as well :-))

Ric
Sander wrote:
> Dear Erwin,
>
> Erwin van Londen wrote (ao):
>
>> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> SSDs would also benefit from such a feature, as they wouldn't need to copy deleted data when erasing blocks.

The joy of the proposed T10 UNMAP commands is that you send one block down with special bits set which lets the target know you don't need the whole block range. No real data movement to or from the storage device (other than that one special sector).

> Could the storage use the ATA/SCSI commands TRIM, UNMAP and DISCARD for that?

That is the current plan, and it is being implemented in a way that should let us transparently (from a btrfs level) take advantage of arrays that do UNMAP, or SSD TRIM-enabled devices, with no fs-level changes. Currently, as far as I know, we have no thin-enabled devices to play with.

> I have one question on thin provisioning: if Windows XP performs a defrag on a 20 GB 'virtual' size LUN with 2 GB in actual use, will the volume grow to 20 GB on the storage and never shrink afterwards, while the client still has only 2 GB in use?

That is the inverse of what would happen - the Windows people will probably hack defrag to emit its equivalent of block discard commands for the old blocks after defragging a file (just guessing here). With a per-fs defrag pass, you could use this kind of hook to resync the actual allocated block state between your fs image and the storage.

> This would make thin provisioning on virtual desktops less useful.
>
> Do you have any numbers on the performance impact of thin provisioning? I can imagine that thin provisioning causes on-storage fragmentation of disk images, which would kill any OS optimisations like grouping often-read files.
>
> With kind regards, Sander

Big arrays have virtual block ranges anyway; this is just a different layer of abstraction that is invisible to us. Thin-enabled devices might have a performance hit for small discard commands (that would impact truncate/unlink performance). I suspect the performance impact will vary a lot depending on how the feature is coded internally in the storage, but we will only know when we get one to play with :-)

ric
On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.

We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.

I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.

--
David Woodhouse
Open Source Technology Centre
David.Woodhouse@intel.com
Intel Corporation
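For reference, the discard plumbing mentioned here is already reachable from user space on kernels that have it: the block layer exposes a BLKDISCARD ioctl taking a byte offset/length pair, which gets translated into whatever the low-level driver supports (ATA TRIM today, SCSI UNMAP/WRITE SAME once that support lands). A minimal sketch, assuming a device and kernel with discard support and 512-byte-aligned ranges:

```c
/* Minimal sketch: issue a block-layer discard from user space via the
 * BLKDISCARD ioctl. Assumes the kernel and device support discard. */
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <device> <offset-bytes> <length-bytes>\n",
                argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t range[2];
    range[0] = strtoull(argv[2], NULL, 0);   /* start offset in bytes */
    range[1] = strtoull(argv[3], NULL, 0);   /* length in bytes */

    /* The block layer maps this onto TRIM/UNMAP/WRITE SAME depending on
     * what the underlying driver advertises. */
    if (ioctl(fd, BLKDISCARD, range) < 0) {
        perror("BLKDISCARD");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```

Anything in the range is gone for good, so this should only ever be pointed at ranges the filesystem genuinely no longer uses - which is exactly the bookkeeping btrfs does when it issues TRIM on freed extents.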
David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
>
>>> New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>>
>> Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.
>
> We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.
>
> I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.

What the SCSI spec says is that you can use WRITE SAME with a discard bit set. What the array would do with that is array-dependent - it could in fact write that same block out to each of the blocks if it chooses to do so. The intention would be, of course, to manipulate internal array tracking so that you do no IO.

We should avoid sending that command to arrays that don't really implement the unmap part; it could take a long time to complete a single largish discard request :-)

The nice part of the write-same-with-unmap flavour of the T10 command is that it is very clear about the semantics of what you should get back,

Ric
On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> Dear all,
>
> While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.
>
> One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).
>
> There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.

Storage-based snapshotting would definitely be useful for replication in btrfs, and in that case we could wire it up from userland. Basically there is a point during commit where a storage snapshot could be taken and be fully consistent.

Outside of replication, though, I'm not sure exactly where storage-based snapshotting would come in. It wouldn't really be compatible with the snapshots btrfs is already doing (but I'm always open to more ideas).

> Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.

This would be very useful, although I would tend to export it to btrfs as a second LUN. My long-term goal is to have code in btrfs that supports a super-fast staging LUN, which might be an SSD or cache carved out of a high-end array.

> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?

Other people have replied about the trim commands, which btrfs can issue on every block it frees. But another way to look at this is that btrfs already is thinly provisioned.
When you add storage to btrfs, it allocates from that storage in 1GB chunks, and then hands those over to the FS allocation code for more fine-grained use. It may make sense to talk about how that can fit in with your own thin provisioning.

> Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.
>
> Furthermore, one question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself. From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.

This is true. Things like replication and failover are the best plans for it today.

Thanks for your interest, we're always looking for ways to better utilize high-end storage features.

-chris
On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
> >
> > Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.
>
> We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.

It seems like the dust has settled ... I just need to check that my code still conforms to the spec. Understandably, I've been focused on TRIM ;-)

> I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.

Yeah, WRITE SAME with the discard bit. A bit of a crappy way to go, to be sure. I'm not exactly sure how we're supposed to decide whether to issue an UNMAP or a WRITE SAME command. Perhaps if I read the spec properly it'll tell me.

I just had a quick chat with someone from another storage vendor who doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes, their device will notice that and unmap the LBAs in question.

Something for the plane on Sunday anyway.
On Fri, 2009-04-03 at 06:27 -0700, Matthew Wilcox wrote:
> On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> > On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > > New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
> > >
> > > Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.
> >
> > We've already made btrfs support TRIM, and Matthew has patches which hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard once the dust settles on the spec.
>
> It seems like the dust has settled ... I just need to check that my code still conforms to the spec. Understandably, I've been focused on TRIM ;-)
>
> > I don't think I've seen anybody talking about deliberately writing zeroes instead of just issuing a discard command, though. That doesn't seem like a massively cunning plan.
>
> Yeah, WRITE SAME with the discard bit. A bit of a crappy way to go, to be sure. I'm not exactly sure how we're supposed to decide whether to issue an UNMAP or a WRITE SAME command. Perhaps if I read the spec properly it'll tell me.

Actually, the point about WRITE SAME is that it's a far smaller patch to the standards (just a couple of bits). Plus it gets around the problem of what the array returns when an unmapped block is requested (which occupies pages in the UNMAP proposal), so from that point of view it seems very logical.

> I just had a quick chat with someone from another storage vendor who doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes, their device will notice that and unmap the LBAs in question.

I actually already looked at using WRITE SAME in sd.c ... it turns out to be surprisingly little work ... the thing you'll like about it is that there are no extents to worry about, and if you plan on writing all zeros you can keep a static zeroed data buffer around for the purpose ...

James
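To illustrate what such a command looks like on the wire, here is a hedged sketch of hand-building a WRITE SAME(16) CDB with the UNMAP bit set and pushing it through the SG_IO ioctl from user space. The opcode (0x93) and the UNMAP bit position (bit 3 of byte 1) follow my reading of the SBC-3 drafts discussed in this thread and should be checked against the final spec; the sketch also assumes a 512-byte logical block size and a device node that accepts SG_IO.

```c
/* Hedged sketch: WRITE SAME(16) with the UNMAP bit via SG_IO.
 * Opcode/bit layout per draft SBC-3; verify against the final spec.
 * Assumes 512-byte logical blocks. Do NOT run against an array that
 * does not implement the unmap behaviour (see the discussion above). */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <device> <lba> <nblocks>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t lba = strtoull(argv[2], NULL, 0);
    uint32_t nblocks = strtoul(argv[3], NULL, 0);

    unsigned char data[512];        /* the single all-zero "same" block */
    unsigned char cdb[16];
    unsigned char sense[32];
    memset(data, 0, sizeof(data));
    memset(cdb, 0, sizeof(cdb));

    cdb[0] = 0x93;                  /* WRITE SAME(16) */
    cdb[1] = 1 << 3;                /* UNMAP bit */
    for (int i = 0; i < 8; i++)     /* bytes 2..9: logical block address */
        cdb[2 + i] = (lba >> (8 * (7 - i))) & 0xff;
    for (int i = 0; i < 4; i++)     /* bytes 10..13: number of blocks */
        cdb[10 + i] = (nblocks >> (8 * (3 - i))) & 0xff;

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_TO_DEV;
    io.dxferp = data;
    io.dxfer_len = sizeof(data);
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 60000;             /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); close(fd); return 1; }
    if (io.status != 0)
        fprintf(stderr, "SCSI status 0x%x (device may not support WRITE SAME/UNMAP)\n",
                io.status);
    close(fd);
    return 0;
}
```

As noted earlier in the thread, a target that does not implement the unmap behaviour is allowed to literally write the zeroed block across the whole range, so this is not something to fire blindly at an arbitrary array.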
On Fri, 2009-04-03 at 07:43 -0400, Ric Wheeler wrote:
> Erwin van Londen wrote:
> > Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> Patches have been floating around to support this - see the recent patches around "DISCARD" on linux-ide and lkml. It would be great to get access to a box that implemented the T10 proposed UNMAP commands that we could test against.

So we went around the block several times for the upcoming Linux Filesystem and Storage workshop, trying to see if anyone from the array vendors might be interested in discussing thin provisioning. The general result was no, since travel is tight. The upshot will be that most of our discard infrastructure will be focussed on SSD TRIM, but we'll try to preserve the TP option for arrays ... there are still private conversations going on with various people who know the UNMAP/WRITE SAME requirements of the various arrays at the various vendors.

James
Chris,

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Chris Mason
> Sent: Saturday, 4 April 2009 12:23 AM
> To: Erwin van Londen
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: btrfs for enterprise raid arrays
>
> On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> > Dear all,
> >
> > While going through the archived mailing list and crawling the wiki I didn't find any clues as to whether there will be any optimizations in Btrfs to make efficient use of functions and features that exist today on enterprise-class storage arrays.
> >
> > One exception is the ssd option, which I think can improve read and write I/O; however, when attached to a storage array it doesn't really matter from an OS perspective, since the OS can't look behind the array's front-end interface anyway (whether that is FC, iSCSI or anything else).
> >
> > There are, however, more options we could think of. Almost all storage arrays these days have the capability to replicate a volume (or part of it, in COW cases) either within the system or remotely. It would be handy if a Btrfs-formatted volume could make use of those features, since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from host-based snapshots, but integration would be very nice.
>
> Storage-based snapshotting would definitely be useful for replication in btrfs, and in that case we could wire it up from userland. Basically there is a point during commit where a storage snapshot could be taken and be fully consistent.
>
> Outside of replication, though, I'm not sure exactly where storage-based snapshotting would come in. It wouldn't really be compatible with the snapshots btrfs is already doing (but I'm always open to more ideas).

There is a Linux interface, though I don't think it is open source (unfortunately), which runs from userland but "talks" directly to the array through a so-called "command device". From a usability perspective you could mount this snapshot/shadow image on a second server and process data from there (backups etc.).

> > Furthermore, some enterprise arrays have a feature that allows full or partial staging of data in cache. By this I mean that when a volume contains a certain number of blocks, you can define the first X blocks to be pre-staged in cache, which gives you extremely high I/O rates on those first blocks. An option related to the -ssd parameter could be a mount command like "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the staged area and can perhaps optimize the locality of frequently used blocks for performance.
>
> This would be very useful, although I would tend to export it to btrfs as a second LUN. My long-term goal is to have code in btrfs that supports a super-fast staging LUN, which might be an SSD or cache carved out of a high-end array.

The problem with that is addressability, especially if you have a significant number of volumes attached to a host and are using FC multi-pathing tools underneath. From an administrative point of view this will complicate things a lot. The option that I mentioned is transparent to the admin, who only needs to add the number of blocks to the mount command or fstab.
Bear in mind that this method (staging in cache) is still a lot faster than having flash drives, since there is no back-end traffic going on: all I/Os only touch cache and front-end ports. As I said, there is also the option to put a full volume in cache, however from a financial point of view this becomes expensive. That's one of the reasons why we came up with the partial option.

> > Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end, on the physical layer, the array configures, let's say, a 1 TB volume and virtually provisions 5 TB to the host. On writes it dynamically allocates more pages in the pool, up to the 5 TB point. Now, if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is that just some pointers in the inodes get deleted, so from an array perspective there is still data in those locations and it will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if they see a certain number of consecutive zeros and return it to the volume pool. Are there any thoughts on writing a low-priority thread that zeros out those "non-used" blocks?
>
> Other people have replied about the trim commands, which btrfs can issue on every block it frees. But another way to look at this is that btrfs already is thinly provisioned. When you add storage to btrfs, it allocates from that storage in 1GB chunks, and then hands those over to the FS allocation code for more fine-grained use.
>
> It may make sense to talk about how that can fit in with your own thin provisioning.

The problem is that the arrays have to be pre-configured for volume-level allocation. Whether this is a normal volume or a thin-provisioned volume doesn't matter. You're right that if the array had interfaces so it could dynamically allocate blocks from those pools as soon as btrfs addressed them, that would be fantastic. Unfortunately, today that's not the case and it's one-way traffic - not only on our arrays but on those of other vendors as well. So a thin-provisioned volume still presents a fixed number of blocks to the host, although in the back it only allocates those blocks as soon as they get "touched". After that, those blocks remain reserved for that volume from the pool unless something tells the array to free them up. Currently the only way, from an array perspective, to release those pages is when the array sees a consecutive run of zeros. I'm currently not aware that the array software honours the trim commands, but I can have a look around.

> > Given the scalability targets of Btrfs, it will most likely be heavily used in enterprise environments once it reaches a stable code level. If we were able to interface with these array-based features, that would be very beneficial.
> >
> > Furthermore, one question also pops to mind: looking at the scalability of Btrfs and its targeted capacity levels, I think we will run into problems with the capabilities of the server hardware itself. From what I can see now, it will not be designed as a distributed filesystem with an integrated distributed lock manager to scale out over multiple nodes.
> > (I know Oracle is working on a similar thing, but this might make things more complicated than they already are.) This might impose some serious issues on recovery scenarios like backup/restore, since it will take quite some time to back up or restore a multi-PB system when it resides on just one physical host, even when we're talking high-end P-series, I25K or Superdome class.
>
> This is true. Things like replication and failover are the best plans for it today.
>
> Thanks for your interest, we're always looking for ways to better utilize high-end storage features.

No problem. My interest is in the adoption level as well, and given that large companies will mostly be the ones utilizing these arrays, they are the ones who will benefit most from a filesystem that gives them the flexibility and robustness we're targeting with btrfs.

> -chris