I'm looking to try BTRFS on a SSD, and I would like to know what SSD optimizations it applies. Is there a comprehensive list of what the ssd mount option does? How are the blocks and metadata arranged? Are there options available comparable to ext2/ext3 to help reduce wear and improve performance?

Specifically, on ext2 (journal means more writes, so I don't use ext3 on SSDs, since fsck typically only takes a few seconds when access time is < 100us), I usually apply the

    -b 4096 -E stripe-width=(erase_block/4096)

parameters to mkfs in order to reduce multiple erase cycles on the same underlying block.

Are there similar optimizations available in BTRFS?

Gordan
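For concreteness, assuming a hypothetical 128 KiB erase block (the real figure varies by drive and is rarely published), the formula above works out to 131072 / 4096 = 32:

    # stripe-width is given in filesystem blocks; 128 KiB / 4 KiB = 32
    mkfs.ext2 -b 4096 -E stripe-width=32 /dev/sdX

(/dev/sdX is a placeholder for the actual device, and the erase block size is an assumption you would need to confirm for your drive.)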
Hi there,

On Wed, Mar 10, 2010 at 8:49 PM, Gordan Bobic <gordan@bobich.net> wrote:
> [...]
> Are there similar optimizations available in BTRFS?

There is an SSD mount option available[1].

Cheers,
Marcus

[1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options
Erm... You know, sorry for the noise.

Cheers,
Marcus
On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
> optimizations it applies. Is there a comprehensive list of what the ssd mount
> option does? How are the blocks and metadata arranged? Are there options
> available comparable to ext2/ext3 to help reduce wear and improve
> performance?
>
> Specifically, on ext2 (journal means more writes, so I don't use ext3 on
> SSDs, since fsck typically only takes a few seconds when access time is <
> 100us), I usually apply the
>     -b 4096 -E stripe-width=(erase_block/4096)
> parameters to mkfs in order to reduce the multiple erase cycles on the same
> underlying block.
>
> Are there similar optimizations available in BTRFS?

I think you'll get more out of btrfs, but another thing you can look into is ext4 without the journal. Support was added for that recently (thanks to Google).
Marcus Fritzsch wrote:
> Hi there,
>
> On Wed, Mar 10, 2010 at 8:49 PM, Gordan Bobic <gordan@bobich.net> wrote:
>> [...]
>> Are there similar optimizations available in BTRFS?
>
> There is an SSD mount option available[1].
>
> [1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options

But what _exactly_ does it do? Is there a way to leverage any knowledge of erase block size at file system creation time? Are there any special parameters that might affect locations of superblocks and metadata? Is there a way to ensure they don't span erase block boundaries?

What about ATA TRIM command support? Is this available? Is it included in the version in Fedora 13?

Gordan
Mike Fedyk wrote:
> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
>> optimizations it applies. Is there a comprehensive list of what the ssd mount
>> option does? How are the blocks and metadata arranged? Are there options
>> available comparable to ext2/ext3 to help reduce wear and improve
>> performance?
>>
>> Specifically, on ext2 (journal means more writes, so I don't use ext3 on
>> SSDs, since fsck typically only takes a few seconds when access time is <
>> 100us), I usually apply the
>>     -b 4096 -E stripe-width=(erase_block/4096)
>> parameters to mkfs in order to reduce the multiple erase cycles on the same
>> underlying block.
>>
>> Are there similar optimizations available in BTRFS?
>
> I think you'll get more out of btrfs, but another thing you can look
> into is ext4 without the journal. Support was added for that recently
> (thanks to Google).

How is this different to using mkfs.ext2 from e4fsprogs?

And while I appreciate hopeful remarks along the lines of "I think you'll get more out of btrfs", I am really after specifics of what the ssd mount option does, and what features comparable to the optimizations that can be done with ext2/3/4 (e.g. the mentioned stripe-width option) are available to get the best possible alignment of data and metadata, to increase both performance and life expectancy of a SSD.

Also, for drives that don't support TRIM, is there a way to make the FS apply aggressive re-use of erased space (in order to help the drive's internal wear-leveling)?

I have looked through the documentation and the wiki, but it provides very little of actual substance.

Gordan
Hello Gordan,

Gordan Bobic wrote (ao):
> Mike Fedyk wrote:
>> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>>> Are there options available comparable to ext2/ext3 to help reduce
>>> wear and improve performance?

With SSDs you don't have to worry about wear.

> And while I appreciate hopeful remarks along the lines of "I think
> you'll get more out of btrfs", I am really after specifics of what
> the ssd mount option does, and what features comparable to the
> optimizations that can be done with ext2/3/4 (e.g. the mentioned
> stripe-width option) are available to get the best possible
> alignment of data and metadata to increase both performance and life
> expectancy of a SSD.

Alignment is about the partition, not the fs, and thus taken care of with fdisk and the like. If you don't create a partition, the fs is aligned with the SSD.

> Also, for drives that don't support TRIM, is there a way to make the
> FS apply aggressive re-use of erased space (in order to help the
> drive's internal wear-leveling)?

TRIM has nothing to do with wear-leveling (although it helps reduce wear). TRIM lets the OS tell the disk which blocks are not in use anymore, and thus don't have to be copied during a rewrite of the blocks. Wear-leveling is the SSD making sure all blocks are more or less equally written, to avoid continuous load on the same blocks.

	Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net
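As a sketch of the partition side of this: a first partition starting on a 1 MiB boundary is a multiple of any common erase block size. The erase block size itself is an assumption you would have to confirm for your drive, and /dev/sdX is a placeholder:

    # start the first (and only) partition at 1 MiB
    parted /dev/sdX mklabel msdos
    parted /dev/sdX mkpart primary 1MiB 100%

Older fdisk defaults started the first partition at sector 63, which is not aligned to any of these sizes.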
On Wed, Mar 10, 2010 at 11:13 PM, Gordan Bobic <gordan@bobich.net> wrote:
> Marcus Fritzsch wrote:
>> Hi there,
>>
>> On Wed, Mar 10, 2010 at 8:49 PM, Gordan Bobic <gordan@bobich.net> wrote:
>>> [...]
>>> Are there similar optimizations available in BTRFS?
>>
>> There is an SSD mount option available[1].
>>
>> [1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options
>
> But what _exactly_ does it do?

Chris explains the change to favour spatial locality in allocator behaviour with '-o ssd'. '-o ssd_spread' does the opposite, for devices where RMW cycles are a higher penalty. Elsewhere, IIRC, Chris also said BTRFS attempts to submit 128KB BIOs where possible (or is that wishful thinking?):

http://markmail.org/message/4sq4uco2lghgxzzz
-- 
Daniel J Blueman
On Thursday 11 March 2010 08:38:53 Sander wrote:
> Hello Gordan,
>
> Gordan Bobic wrote (ao):
>> Mike Fedyk wrote:
>>> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>> wear and improve performance?
>
> With SSDs you don't have to worry about wear.

Sorry, but you do have to worry about wear. I was able to destroy a relatively new SD card (2007 or early 2008) just by writing to the first 10MiB over and over again for two or three days. The end of the card still works without problems, but about 10 sectors at the beginning give write errors.

And with journaled file systems that write over and over again to the same spot, you do have to worry about wear leveling. It depends on the underlying block allocation algorithm, but I'm sure that most of the cheap SSDs do wear leveling only inside big blocks, not across the whole drive, making it much easier to hit the 10,000-100,000 erase cycle boundary.

Still, I think that if you can prolong the life of hardware without noticeable performance degradation, you should do it. Just because it may help a drive with some defects last those 3-5 years between upgrades without any problems.

>> And while I appreciate hopeful remarks along the lines of "I think
>> you'll get more out of btrfs", I am really after specifics of what
>> the ssd mount option does, and what features comparable to the
>> optimizations that can be done with ext2/3/4 (e.g. the mentioned
>> stripe-width option) are available to get the best possible
>> alignment of data and metadata to increase both performance and life
>> expectancy of a SSD.
>
> Alignment is about the partition, not the fs, and thus taken care of
> with fdisk and the like.
>
> If you don't create a partition, the fs is aligned with the SSD.

But it does not align internal FS structures to the SSD erase block size, and that's what Gordan asked for. And sorry Gordan, I don't know. But there's a 'ssd_spread' option that tries to allocate blocks as far as possible (within reason) from each other. That should, in most cases, make the fs structures reside on an erase block by themselves. I'm afraid that you'll need to dive into the code to learn about block alignment, or one of the developers will need to provide us with the info.

>> Also, for drives that don't support TRIM, is there a way to make the
>> FS apply aggressive re-use of erased space (in order to help the
>> drive's internal wear-leveling)?
>
> TRIM has nothing to do with wear-leveling (although it helps reduce
> wear).
> TRIM lets the OS tell the disk which blocks are not in use anymore, and
> thus don't have to be copied during a rewrite of the blocks.
> Wear-leveling is the SSD making sure all blocks are more or less equally
> written to avoid continuous load on the same blocks.

Isn't this all about wear leveling? TRIM has no meaning for magnetic media. It's used to tell the drive which parts of the medium contain only junk data and can be used in block rotation, making wear-leveling easier and more effective.

-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
On Thu, 11 Mar 2010 11:59:57 +0100 Hubert Kario <hka@qbs.com.pl> wrote:
> On Thursday 11 March 2010 08:38:53 Sander wrote:
>>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>>> wear and improve performance?
>>
>> With SSDs you don't have to worry about wear.
>
> Sorry, but you do have to worry about wear. I was able to destroy a relatively
> new SD card (2007 or early 2008) just by writing to the first 10MiB over and
> over again for two or three days. The end of the card still works without
> problems but about 10 sectors at the beginning give write errors.

Sorry, the topic was SSD, not SD. SSDs have controllers that contain heavy closed magic to circumvent all kinds of troubles you get when using classical flash and SD cards.

Honestly, I would just drop the idea of an SSD option, simply because the vendors implement all kinds of neat strategies in their devices. So in the end you cannot really tell if the option does something constructive and not destructive in combination with an SSD controller.

Of course you may well discuss an option for passive flash devices like ide-CF/SD or the like. There is no controller involved, so your fs implementation may well work out.

-- 
Regards,
Stephan
On Thu, 11 Mar 2010 08:38:53 +0100, Sander <sander@humilis.net> wrote:
>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>> wear and improve performance?
>
> With SSDs you don't have to worry about wear.

And if you believe that, you clearly swallowed the marketing spiel hook, line and sinker, without enough real-world experience to show you otherwise. But I'm not going to go off on a tangent now enumerating various victories of marketing over mathematics and empirical evidence relating to currently popular technologies.

In short - I have several dead SSDs of various denominations that demonstrate otherwise - all within their warranty period, and not having been used in pathologically write-heavy environments. You do have to worry about wear. Operations that increase wear also reduce speed (erasing a block is slow, and if the disk is fully tainted you cannot write without erasing), so you doubly have to worry about it.

Also remember that hardware sectors are 512 bytes, and FS blocks tend to be 4096 bytes. It is thus entirely plausible that if you aren't careful you'll end up with FS blocks straddling erase block boundaries. If that happens, you'll make wear twice as bad, because you are facing a situation where you may need to erase and write two blocks rather than one. Half the performance, twice the wear.

>> And while I appreciate hopeful remarks along the lines of "I think
>> you'll get more out of btrfs", I am really after specifics of what
>> the ssd mount option does, and what features comparable to the
>> optimizations that can be done with ext2/3/4 (e.g. the mentioned
>> stripe-width option) are available to get the best possible
>> alignment of data and metadata to increase both performance and life
>> expectancy of a SSD.
>
> Alignment is about the partition, not the fs, and thus taken care of
> with fdisk and the like.
>
> If you don't create a partition, the fs is aligned with the SSD.

I'm talking about internal FS data structures, not the partition alignment.

>> Also, for drives that don't support TRIM, is there a way to make the
>> FS apply aggressive re-use of erased space (in order to help the
>> drive's internal wear-leveling)?
>
> TRIM has nothing to do with wear-leveling (although it helps reduce
> wear).

That's self-contradictory. If it helps reduce wear, it has something to do with wear leveling.

> TRIM lets the OS tell the disk which blocks are not in use anymore, and
> thus don't have to be copied during a rewrite of the blocks.
> Wear-leveling is the SSD making sure all blocks are more or less equally
> written to avoid continuous load on the same blocks.

And thus it is impossible to do wear leveling, once all blocks have been written to, without TRIM. So I'd say that in the long term, without TRIM there is no wear leveling. That makes them pretty related.

So considering that there are various nuggets of opinion floating around (correct or otherwise) saying that ext4 has support for TRIM, I'd like to know whether there is similar support in BTRFS at the moment?

Gordan
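To make the straddling arithmetic concrete, assuming a hypothetical 128 KiB (131072-byte) erase block:

    63 * 512   = 32256       32256 / 4096      = 7.875   (legacy partition start at LBA 63, misaligned)
    2048 * 512 = 1048576     1048576 / 131072  = 8       (1 MiB start, aligned)

With the misaligned start, the 4096-byte FS block that sits across each erase block boundary spans two erase blocks; with the aligned start, no FS block ever does.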
On Thu, 11 Mar 2010 10:35:45 +0000, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>>>> [...]
>>>> Are there similar optimizations available in BTRFS?
>>>
>>> There is an SSD mount option available[1].
>>>
>>> [1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options
>>
>> But what _exactly_ does it do?
>
> Chris explains the change to favour spatial locality in allocator
> behaviour with '-o ssd'. '-o ssd_spread' does the opposite, for devices
> where RMW cycles are a higher penalty. Elsewhere IIRC, Chris also said
> BTRFS attempts to submit 128KB BIOs where possible (or is that wishful
> thinking?):
>
> http://markmail.org/message/4sq4uco2lghgxzzz

Thanks, that's useful info. What about FS block and metadata alignment, though? Is there a way to leverage the knowledge of erase block size in order to reduce wear and increase performance?

Gordan
On Thu, 11 Mar 2010 11:59:57 +0100, Hubert Kario <hka@qbs.com.pl> wrote:
> Still, I think that if you can prolong the life of hardware without
> noticeable performance degradation, you should do it. Just because it may
> help a drive with some defects last those 3-5 years between upgrades
> without any problems.

I couldn't agree more. Not only that, but working with wear leveling in mind (especially on devices without TRIM support) can increase performance, too, by avoiding having to wait for the slow erase operation on writes.

>>> Also, for drives that don't support TRIM, is there a way to make the
>>> FS apply aggressive re-use of erased space (in order to help the
>>> drive's internal wear-leveling)?
>>
>> TRIM has nothing to do with wear-leveling (although it helps reduce
>> wear).
>> TRIM lets the OS tell the disk which blocks are not in use anymore, and
>> thus don't have to be copied during a rewrite of the blocks.
>> Wear-leveling is the SSD making sure all blocks are more or less equally
>> written to avoid continuous load on the same blocks.
>
> Isn't this all about wear leveling? TRIM has no meaning for magnetic
> media.

I fully agree that it's important for wear leveling on flash media, but from the security point of view, I think TRIM would be a useful feature on all storage media. If the erased blocks were trimmed, it would provide a potentially useful feature of securely erasing the sectors that are no longer used. It would be useful and much more transparent than the secure erase features that only operate on the entire disk. Just MHO.

Gordan
On Thu, 11 Mar 2010 12:31:03 +0100, Stephan von Krawczynski <skraw@ithnet.com> wrote:
>>>>>> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>>>>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>>>>> wear and improve performance?
>>>>
>>>> With SSDs you don't have to worry about wear.
>>>
>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>> 10MiB over and over again for two or three days. The end of the card
>>> still works without problems but about 10 sectors at the beginning give
>>> write errors.
>
> Sorry, the topic was SSD, not SD.

SD == SSD with an SD interface.

> SSDs have controllers that contain heavy
> closed magic to circumvent all kinds of troubles you get when using
> classical flash and SD cards.

There is absolutely no basis for thinking that SD cards don't contain wear leveling logic. The SD standard, and thus SD cards, support a lot of fancy copy protection capabilities, which means there is a lot of firmware involvement on SD cards. It is unlikely that any reputable SD card manufacturer wouldn't also build wear leveling logic into it.

> Honestly I would just drop the idea of an SSD option simply because the
> vendors implement all kinds of neat strategies in their devices. So in the
> end you cannot really tell if the option does something constructive and
> not destructive in combination with an SSD controller.

You can make an educated guess. For starters, given that visible sector sizes are not equal to FS block sizes, it means that FS blocks can straddle erase block boundaries without the flash controller, no matter how fancy, being able to determine this. Thus, at the very least, aligning FS structures so that they do not straddle erase block boundaries is useful in ALL cases. Thinking otherwise is just sticking your head in the sand because you cannot be bothered to think.

> Of course you may well discuss an option for passive flash devices
> like ide-CF/SD or the like. There is no controller involved so your fs
> implementation may well work out.

I suggest you educate yourself on the nature of IDE and CF (which is just IDE with a different connector). There most certainly are controllers involved. The days when disks (mechanical or solid state) didn't integrate controllers ended with MFM/RLL and ESDI disks some 20+ years ago.

Gordan
On Thu, 11 Mar 2010 12:17:30 +0000 Gordan Bobic <gordan@bobich.net> wrote:
> On Thu, 11 Mar 2010 12:31:03 +0100, Stephan von Krawczynski
> <skraw@ithnet.com> wrote:
>>>> With SSDs you don't have to worry about wear.
>>>
>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>> 10MiB over and over again for two or three days. The end of the card
>>> still works without problems but about 10 sectors at the beginning give
>>> write errors.
>>
>> Sorry, the topic was SSD, not SD.
>
> SD == SSD with an SD interface.

That really is quite a statement. You really talk of a few-bucks SD card (like the one in my Android phone) as an SSD comparable with an Intel XE, only with a different interface? Come on, stay serious. The product is not only made of SLCs and some raw logic.

>> SSDs have controllers that contain heavy
>> closed magic to circumvent all kinds of troubles you get when using
>> classical flash and SD cards.
>
> There is absolutely no basis for thinking that SD cards don't contain wear
> leveling logic. The SD standard, and thus SD cards, support a lot of fancy
> copy protection capabilities, which means there is a lot of firmware
> involvement on SD cards. It is unlikely that any reputable SD card
> manufacturer wouldn't also build wear leveling logic into it.

I really don't guess about what is built into an SD or even CF card. But we hopefully agree that there is a significant difference compared to a product that calls itself a _disk_.

>> Honestly I would just drop the idea of an SSD option simply because the
>> vendors implement all kinds of neat strategies in their devices. So in the
>> end you cannot really tell if the option does something constructive and
>> not destructive in combination with an SSD controller.
>
> You can make an educated guess. For starters, given that visible sector
> sizes are not equal to FS block sizes, it means that FS blocks can
> straddle erase block boundaries without the flash controller, no matter how
> fancy, being able to determine this. Thus, at the very least, aligning FS
> structures so that they do not straddle erase block boundaries is useful in
> ALL cases. Thinking otherwise is just sticking your head in the sand
> because you cannot be bothered to think.

And your guess is that Intel engineers had no clue when designing the XE, including its controller? You think they did not know what you and me know, and therefore pray every day that some smart fs designer falls from heaven and saves their product from dying in between? Really?

>> Of course you may well discuss an option for passive flash devices
>> like ide-CF/SD or the like. There is no controller involved so your fs
>> implementation may well work out.
>
> I suggest you educate yourself on the nature of IDE and CF (which is just
> IDE with a different connector). There most certainly are controllers
> involved. The days when disks (mechanical or solid state) didn't integrate
> controllers ended with MFM/RLL and ESDI disks some 20+ years ago.

I suggest you don't talk to someone administering some hundred boxes based on CF and SSD media for _years_ about the pros and cons of the respective implementation and its long-term usage. Sorry, the world is not built out of paper; sometimes you meet the hard facts. And one of them is that the ssd option in the fs is very likely already overrun by the SSD controller designers and mostly _superfluous_. The market has already decided to make SSDs compatible with standard fs layouts.

-- 
Regards,
Stephan
On Thu, 11 Mar 2010 13:59:09 +0100, Stephan von Krawczynski <skraw@ithnet.com> wrote:
>>>>> With SSDs you don't have to worry about wear.
>>>>
>>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>>> 10MiB over and over again for two or three days. The end of the card
>>>> still works without problems but about 10 sectors at the beginning give
>>>> write errors.
>>>
>>> Sorry, the topic was SSD, not SD.
>>
>> SD == SSD with an SD interface.
>
> That really is quite a statement. You really talk of a few-bucks SD card
> (like the one in my Android phone) as an SSD comparable with an Intel XE,
> only with a different interface? Come on, stay serious. The product is not
> only made of SLCs and some raw logic.

I am saying that there is no reason for the firmware in an SD card not to be as advanced. If the manufacturer has some advanced logic in their SATA SSD, I cannot see any valid engineering reason not to apply the same logic in an SD product.

>>> SSDs have controllers that contain heavy
>>> closed magic to circumvent all kinds of troubles you get when using
>>> classical flash and SD cards.
>>
>> There is absolutely no basis for thinking that SD cards don't contain wear
>> leveling logic. The SD standard, and thus SD cards, support a lot of fancy
>> copy protection capabilities, which means there is a lot of firmware
>> involvement on SD cards. It is unlikely that any reputable SD card
>> manufacturer wouldn't also build wear leveling logic into it.
>
> I really don't guess about what is built into an SD or even CF card. But we
> hopefully agree that there is a significant difference compared to a
> product that calls itself a _disk_.

We don't agree on that. Not at all. I don't see any reason why a CF card and an IDE SSD made by the same manufacturer should have any difference between them other than capacity and the physical package.

>>> Honestly I would just drop the idea of an SSD option simply because the
>>> vendors implement all kinds of neat strategies in their devices. So in the
>>> end you cannot really tell if the option does something constructive
>>> and not destructive in combination with an SSD controller.
>>
>> You can make an educated guess. For starters, given that visible sector
>> sizes are not equal to FS block sizes, it means that FS blocks can
>> straddle erase block boundaries without the flash controller, no matter
>> how fancy, being able to determine this. Thus, at the very least, aligning
>> FS structures so that they do not straddle erase block boundaries is useful
>> in ALL cases. Thinking otherwise is just sticking your head in the sand
>> because you cannot be bothered to think.
>
> And your guess is that Intel engineers had no clue when designing the XE,
> including its controller? You think they did not know what you and me know,
> and therefore pray every day that some smart fs designer falls from heaven
> and saves their product from dying in between? Really?

I am saying that there are problems that CANNOT be solved on the disk firmware level. Some problems HAVE to be addressed higher up the stack.

>>> Of course you may well discuss an option for passive flash devices
>>> like ide-CF/SD or the like. There is no controller involved so your fs
>>> implementation may well work out.
>>
>> I suggest you educate yourself on the nature of IDE and CF (which is just
>> IDE with a different connector). There most certainly are controllers
>> involved. The days when disks (mechanical or solid state) didn't
>> integrate controllers ended with MFM/RLL and ESDI disks some 20+ years ago.
>
> I suggest you don't talk to someone administering some hundred boxes based
> on CF and SSD media for _years_ about the pros and cons of the respective
> implementation and its long-term usage.
> Sorry, the world is not built out of paper; sometimes you meet the hard
> facts. And one of them is that the ssd option in the fs is very likely
> already overrun by the SSD controller designers and mostly _superfluous_.
> The market has already decided to make SSDs compatible with standard fs
> layouts.

Seems to me that you haven't done any analysis of comparative long-term failure rates between SSDs used with default layouts (Default? Really? You mean you don't apply any special partitioning on your hundreds of servers?) and those with carefully aligned FS-es. Just because the defaults may be good enough for your use case doesn't mean that somebody with a use case that's harder on the flash will observe the same reliability, or deem the unoptimized performance figures good enough.

Gordan
On Thursday 11 March 2010 14:20:23 Gordan Bobic wrote:
> On Thu, 11 Mar 2010 13:59:09 +0100, Stephan von Krawczynski
> <skraw@ithnet.com> wrote:
>>>>>> With SSDs you don't have to worry about wear.
>>>>>
>>>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>>>> 10MiB over and over again for two or three days. The end of the card
>>>>> still works without problems but about 10 sectors at the beginning give
>>>>> write errors.
>>>>
>>>> Sorry, the topic was SSD, not SD.
>>>
>>> SD == SSD with an SD interface.
>>
>> That really is quite a statement. You really talk of a few-bucks SD card
>> (like the one in my Android phone) as an SSD comparable with an Intel XE,
>> only with a different interface? Come on, stay serious. The product is not
>> only made of SLCs and some raw logic.
>
> I am saying that there is no reason for the firmware in an SD card not to
> be as advanced. If the manufacturer has some advanced logic in their SATA
> SSD, I cannot see any valid engineering reason not to apply the same logic
> in an SD product.

The _SD_standard_ states that the media has to implement wear-leveling, so any card with an SD logo implements it.

As I stated previously, the algorithms used in SD cards may not be as advanced as those in top-of-the-line Intel SSDs, but I bet they don't differ by much from the ones used in the cheapest SSD drives.

Besides, why shouldn't we help the drive firmware by
- writing the data only in erase-block sizes
- trying to write blocks that are smaller than the erase-block in a way that won't cross the erase-block boundary
- using TRIM on deallocated parts of the drive

This will not only increase the life of the SSD but also increase its performance.

>>>> Honestly I would just drop the idea of an SSD option simply because the
>>>> vendors implement all kinds of neat strategies in their devices. So in the
>>>> end you cannot really tell if the option does something constructive
>>>> and not destructive in combination with an SSD controller.
>>>
>>> You can make an educated guess. For starters, given that visible sector
>>> sizes are not equal to FS block sizes, it means that FS blocks can
>>> straddle erase block boundaries without the flash controller, no matter
>>> how fancy, being able to determine this. Thus, at the very least, aligning
>>> FS structures so that they do not straddle erase block boundaries is useful
>>> in ALL cases. Thinking otherwise is just sticking your head in the sand
>>> because you cannot be bothered to think.
>>
>> And your guess is that Intel engineers had no clue when designing the XE,
>> including its controller? You think they did not know what you and me know,
>> and therefore pray every day that some smart fs designer falls from heaven
>> and saves their product from dying in between? Really?
>
> I am saying that there are problems that CANNOT be solved on the disk
> firmware level. Some problems HAVE to be addressed higher up the stack.

Exactly. You can't assume that the SSD's firmware understands any and all file system layouts, especially if they are on fragmented LVM or other logical volume manager partitions.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Quality Management System compliant with ISO 9001:2000
On Wed, Mar 10, 2010 at 07:49:34PM +0000, Gordan Bobic wrote:
> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
> optimizations it applies. Is there a comprehensive list of what the ssd
> mount option does? How are the blocks and metadata arranged? Are
> there options available comparable to ext2/ext3 to help reduce wear
> and improve performance?
>
> Specifically, on ext2 (journal means more writes, so I don't use
> ext3 on SSDs, since fsck typically only takes a few seconds when
> access time is < 100us), I usually apply the
>     -b 4096 -E stripe-width=(erase_block/4096)
> parameters to mkfs in order to reduce the multiple erase cycles on
> the same underlying block.
>
> Are there similar optimizations available in BTRFS?

All devices (raid, ssd, single spindle) tend to benefit from big chunks of writes going down close together on disk. This is true for different reasons on each one, but it is still the easiest way to optimize writes. COW filesystems like btrfs are very well suited to sending down lots of big writes because we're always reallocating things.

For traditional storage, we also need to keep blocks from one file (or files in a directory) close together to reduce seeks during reads. SSDs have no such restrictions, and so the mount -o ssd related options in btrfs focus on tossing out tradeoffs that slow down writes in hopes of reading faster later.

Someone already mentioned the mount -o ssd and ssd_spread options. Mount -o ssd is targeted at faster SSDs that are good at wear leveling and generally just benefit from having a bunch of data sent down close together. In mount -o ssd, you might find a write pattern like this:

    block N, N+2, N+3, N+4, N+6, N+7, N+16, N+17, N+18, N+19, N+20 ...

It's a largely contiguous chunk of writes, but there may be gaps. Good ssds don't really care about the gaps, and they benefit more from the fact that we're preferring to reuse blocks that had once been written than to go off and find completely contiguous areas of the disk to write (which are more likely to have never been written at all).

mount -o ssd_spread is much more strict. You'll get N, N+1, N+2, N+3, N+4, N+5 etc, because crummy ssds really do care about the gaps.

Now, btrfs could go off and probe for the erasure size and work very hard to align things to it. As someone said, alignment of the partition table is very important here as well. But for modern ssds this generally matters much less than just doing big ios and letting the little log structured squirrel inside the device figure things out.

For trim, we do have mount -o discard. It does introduce a run time performance hit (this varies wildly from device to device) and we're tuning things as discard capable devices become more common. If anyone is looking for a project, it would be nice to have an ioctl that triggers free space discards in bulk.

-chris
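For reference, the options discussed above would typically be used like this (device and mount point are placeholders):

    # reasonable SSD with competent wear leveling
    mount -o ssd /dev/sdb1 /mnt/btrfs

    # cheap flash that is sensitive to gaps in the write pattern
    mount -o ssd_spread /dev/sdb1 /mnt/btrfs

    # additionally issue TRIM for freed space, at some runtime cost
    mount -o ssd,discard /dev/sdb1 /mnt/btrfs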
Stephan von Krawczynski wrote (ao):
> Honestly I would just drop the idea of an SSD option simply because the
> vendors implement all kinds of neat strategies in their devices. So in the
> end you cannot really tell if the option does something constructive and
> not destructive in combination with an SSD controller.

My understanding of the ssd mount option is also that the fs doesn't try to do all kinds of smart (and potentially expensive) things which make sense for rotating media, to reduce seeks and the like.

	Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net
On Thu, 11 Mar 2010 15:01:55 +0100 Hubert Kario <hka@qbs.com.pl> wrote:
> [...]
> The _SD_standard_ states that the media has to implement wear-leveling,
> so any card with an SD logo implements it.
>
> As I stated previously, the algorithms used in SD cards may not be as
> advanced as those in top-of-the-line Intel SSDs, but I bet they don't
> differ by much from the ones used in the cheapest SSD drives.

Well, we are all pretty sure about that. And that is exactly the reason why these are not surviving the market pressure. Why should one care about bad products that will possibly already be extinct, because of their bad performance, by the time the fs is production ready some day?

> Besides, why shouldn't we help the drive firmware by
> - writing the data only in erase-block sizes
> - trying to write blocks that are smaller than the erase-block in a way
>   that won't cross the erase-block boundary

Because if the designing engineer of a good SSD controller wasn't able to cope with that, he will have no chance to design a second one.

> - using TRIM on deallocated parts of the drive

Another story. That is a designed part of a software interface between fs and drive bios, on which both agreed in its usage pattern. Whereas the above points are pure guesswork based on dumb and old hardware and its behaviour.

> This will not only increase the life of the SSD but also increase its
> performance.

TRIM: maybe yes. Rest: pure handwaving.

> [...]
>>> And your guess is that Intel engineers had no clue when designing the XE,
>>> including its controller? You think they did not know what you and me
>>> know, and therefore pray every day that some smart fs designer falls from
>>> heaven and saves their product from dying in between? Really?
>>
>> I am saying that there are problems that CANNOT be solved on the disk
>> firmware level. Some problems HAVE to be addressed higher up the stack.
>
> Exactly. You can't assume that the SSD's firmware understands any and all
> file system layouts, especially if they are on fragmented LVM or other
> logical volume manager partitions.

Hopefully the firmware understands exactly no fs layout at all. That would be braindead. Instead it should understand how to arrange incoming and outgoing data in a way that its own technical requirements are met as perfectly as possible. This is no spinning disk; it is completely irrelevant what the data layout looks like as long as the controller finds its way through and copes best with read/write/erase cycles. It may well use additional RAM for caching and data reordering.

Do you really believe ascending block numbers are placed in ascending addresses inside the disk (as an example)? Why should they be? What does that mean for fs block ordering? If you don't know anyway what a controller does to your data ordering, how do you want to help it with its job?

Please accept that we are _not_ talking about trivial flash mem here, or pseudo-SSDs consisting of SD cards. The market has already evolved better products. The dinosaurs are extinct even if some are still looking alive.

-- 
Regards,
Stephan
Gordan Bobic wrote:
>> TRIM lets the OS tell the disk which blocks are not in use anymore, and
>> thus don't have to be copied during a rewrite of the blocks.
>> Wear-leveling is the SSD making sure all blocks are more or less equally
>> written to avoid continuous load on the same blocks.
>
> And thus it is impossible to do wear leveling, once all blocks have been
> written to, without TRIM. So I'd say that in the long term, without
> TRIM there is no wear leveling. That makes them pretty related.

I'm no expert on SSDs, however:

1- I think the SSD would rewrite once-written blocks to other locations, so as to reuse the same physical blocks for wear levelling. The written-once blocks are very good candidates because their write count is "1".

2- I think SSDs show you a smaller usable size than what they physically have. In this way they always have some more blocks to move data to, so as to free blocks which have a low write count.

3- If you think #2 is not enough, you can leave part of the SSD disk unused, by leaving unused space after the last partition.

Actually, after considering #1 and #2, I don't think TRIM is really needed for SSDs - are you sure it is really needed? I think it's more a kind of optimization, but it needs to be very fast for it to be useful as an optimization: faster than an internal block rewrite by the SSD wear levelling, and so fast as a SATA/SAS command that the computer is not significantly slowed down by using it. Instead, IIRC, I read something about it being slow, and maybe it was even requiring FUA or a barrier or flush? I don't remember exactly.

There is one place where TRIM would be very useful though, and it's not for SSDs, but in virtualization: if the Virtual Machine frees space, the VM file system should use TRIM to signal to the host that some space is unused. The host should have a way to tell its filesystem that the VM disk file has a new hole in that position, so that disk space can be freed on the host for use by another VM. This would allow much greater overcommit of disk space to virtual machines.

There's probably no need for "TRIM" support itself on the host filesystem, but another mechanism is needed that allows sparsifying an existing file by creating a hole in it (which I think is not possible with the filesystem syscalls we have now, correct me if I'm wrong). There *is* need for TRIM support in the guest filesystem though.
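For what it's worth, the sparsifying mechanism asked for here would look roughly like the sketch below on the host side; whether a given kernel and filesystem actually support punching holes is exactly the open question, and the file name is made up:

    # punch a 1 MiB hole at a 16 MiB offset in a VM disk image,
    # returning the underlying space to the host filesystem
    fallocate --punch-hole --offset $((16*1024*1024)) --length $((1024*1024)) vm-disk.img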
On Thu, 11 Mar 2010 16:35:33 +0100, Stephan von Krawczynski <skraw@ithnet.com> wrote:
>> Besides, why shouldn't we help the drive firmware by
>> - writing the data only in erase-block sizes
>> - trying to write blocks that are smaller than the erase-block in a way
>>   that won't cross the erase-block boundary
>
> Because if the designing engineer of a good SSD controller wasn't able to
> cope with that, he will have no chance to design a second one.

You seem to be confusing quality of implementation with theoretical possibility.

>> This will not only increase the life of the SSD but also increase its
>> performance.
>
> TRIM: maybe yes. Rest: pure handwaving.
>
>>>> And your guess is that Intel engineers had no clue when designing the XE,
>>>> including its controller? You think they did not know what you and me
>>>> know, and therefore pray every day that some smart fs designer falls from
>>>> heaven and saves their product from dying in between? Really?
>>>
>>> I am saying that there are problems that CANNOT be solved on the disk
>>> firmware level. Some problems HAVE to be addressed higher up the stack.
>>
>> Exactly. You can't assume that the SSD's firmware understands any and all
>> file system layouts, especially if they are on fragmented LVM or other
>> logical volume manager partitions.
>
> Hopefully the firmware understands exactly no fs layout at all. That would
> be braindead. Instead it should understand how to arrange incoming and
> outgoing data in a way that its own technical requirements are met as
> perfectly as possible. This is no spinning disk; it is completely irrelevant
> what the data layout looks like as long as the controller finds its way
> through and copes best with read/write/erase cycles. It may well use
> additional RAM for caching and data reordering.
> Do you really believe ascending block numbers are placed in ascending
> addresses inside the disk (as an example)? Why should they be? What does that
> mean for fs block ordering? If you don't know anyway what a controller does
> to your data ordering, how do you want to help it with its job?
> Please accept that we are _not_ talking about trivial flash mem here or
> pseudo-SSDs consisting of SD cards. The market has already evolved better
> products. The dinosaurs are extinct even if some are still looking alive.

I am assuming that you are being deliberately facetious here (the alternative is less kind). The simple fact is that you cannot come up with some magical data (re)ordering method that nullifies the problems of common use-cases that are quite nasty for flash based media.

For example - you have a disk that has had all its addressable blocks tainted. A new write comes in - what do you do with it? Worse, a write comes in spanning two erase blocks as a consequence of the data re-alignment in the firmware. You have no choice but to wipe them both and re-write the data. You'd be better off not doing the magic and assuming that the FS is sensibly aligned. Having a large chunk of spare non-addressable space for this doesn't necessarily help you, either, unless it is about the same size as the addressable space (worst case scenario; if you accept that the vast majority of FS-es use 4KB block sizes, you can cut a corner there by a factor of 8). All of that adds to cost - flash is still expensive.

The bottom line is that you _cannot_ solve wear-leveling completely just in firmware. There is no doubt you can get some of the way there, but it is mathematically impossible to solve completely without intervention from further up the stack. Since some black-box firmware optimizations may quite conceivably make the wear problem worse, it makes perfect sense to just hopefully assume that the FS is trying to help - it's unlikely to make things worse and may well make things a lot better.

Gordan
On Thu, 11 Mar 2010 14:42:40 +0100, Asdo <asdo@shiftmail.org> wrote:
> 1- I think the SSD would rewrite once-written blocks to other locations,
> so as to reuse the same physical blocks for wear levelling. The
> written-once blocks are very good candidates because their write count
> is "1".

There are likely to be millions of blocks with the same write count. How do you pick the optimal ones?

> 2- I think SSDs show you a smaller usable size than what they physically
> have. In this way they always have some more blocks to move data to,
> so as to free blocks which have a low write count.

I'm pretty sure that is the case, too. However, to be able to deal with the worst case scenario you would have to effectively double the amount of flash (only expose half of it as addressable). With some corner cutting and assumptions about file systems' block sizes (usually 4KB these days), you can cut a corner, but that's dodgy until we get bigger hardware sectors as standard.

> 3- If you think #2 is not enough, you can leave part of the SSD disk
> unused, by leaving unused space after the last partition.

That is true, but I'd rather apply some higher-level logic to this rather than expect the firmware to make a guess about things it has absolutely no way of knowing for sure.

> Actually, after considering #1 and #2, I don't think TRIM is really
> needed for SSDs - are you sure it is really needed?

I don't think there's any doubt that TRIM helps flash longevity.

> I think it's more a
> kind of optimization, but it needs to be very fast for it to be useful
> as an optimization: faster than an internal block rewrite by the SSD
> wear levelling, and so fast as a SATA/SAS command that the computer is not
> significantly slowed down by using it. Instead, IIRC, I read something
> about it being slow, and maybe it was even requiring FUA or a barrier or
> flush? I don't remember exactly.

There is no obligation on the part of the disk to do anything in response to the TRIM command, IIRC. It is advisory. It doesn't have to clear the blocks online. In fact, TRIM is sector based, and it is unlikely an SSD would act until it can sensibly free an entire erase block.

> There is one place where TRIM would be very useful though, and it's not
> for SSDs, but in virtualization: if the Virtual Machine frees
> space, the VM file system should use TRIM to signal to the host that
> some space is unused. The host should have a way to tell its filesystem
> that the VM disk file has a new hole in that position, so that disk
> space can be freed on the host for use by another VM. This would allow
> much greater overcommit of disk space to virtual machines.

Indeed, I brought this very point up on the KVM mailing list a while back.

Gordan
On Thu, 11 Mar 2010 09:21:30 -0500, Chris Mason <chris.mason@oracle.com> wrote:
> On Wed, Mar 10, 2010 at 07:49:34PM +0000, Gordan Bobic wrote:
>> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
>> optimizations it applies. Is there a comprehensive list of what the ssd
>> mount option does? How are the blocks and metadata arranged? Are
>> there options available comparable to ext2/ext3 to help reduce wear
>> and improve performance?
>>
>> Specifically, on ext2 (journal means more writes, so I don't use
>> ext3 on SSDs, since fsck typically only takes a few seconds when
>> access time is < 100us), I usually apply the
>>     -b 4096 -E stripe-width=(erase_block/4096)
>> parameters to mkfs in order to reduce the multiple erase cycles on
>> the same underlying block.
>>
>> Are there similar optimizations available in BTRFS?
>
> All devices (raid, ssd, single spindle) tend to benefit from big chunks
> of writes going down close together on disk. This is true for different
> reasons on each one, but it is still the easiest way to optimize writes.
> COW filesystems like btrfs are very well suited to sending down lots of big
> writes because we're always reallocating things.

Doesn't this mean _more_ writes? If that's the case, then that would make btrfs a _bad_ choice for flash based media with limited write cycles.

> For traditional storage, we also need to keep blocks from one file (or
> files in a directory) close together to reduce seeks during reads. SSDs
> have no such restrictions, and so the mount -o ssd related options in
> btrfs focus on tossing out tradeoffs that slow down writes in hopes of
> reading faster later.
>
> Someone already mentioned the mount -o ssd and ssd_spread options.
> Mount -o ssd is targeted at faster SSDs that are good at wear leveling and
> generally just benefit from having a bunch of data sent down close
> together. In mount -o ssd, you might find a write pattern like this:
>
>     block N, N+2, N+3, N+4, N+6, N+7, N+16, N+17, N+18, N+19, N+20 ...
>
> It's a largely contiguous chunk of writes, but there may be gaps. Good
> ssds don't really care about the gaps, and they benefit more from the
> fact that we're preferring to reuse blocks that had once been written
> than to go off and find completely contiguous areas of the disk to
> write (which are more likely to have never been written at all).
>
> mount -o ssd_spread is much more strict. You'll get N, N+1, N+2, N+3, N+4, N+5
> etc, because crummy ssds really do care about the gaps.
>
> Now, btrfs could go off and probe for the erasure size and work very
> hard to align things to it. As someone said, alignment of the partition
> table is very important here as well. But for modern ssds this generally
> matters much less than just doing big ios and letting the little log
> structured squirrel inside the device figure things out.

Thanks, that's quite helpful. Can you provide any insight into the alignment of FS structures in such a way that they do not straddle erase block boundaries?

> For trim, we do have mount -o discard. It does introduce a run time
> performance hit (this varies wildly from device to device) and we're
> tuning things as discard capable devices become more common. If anyone
> is looking for a project, it would be nice to have an ioctl that triggers
> free space discards in bulk.

Are you saying that -o discard implements TRIM support?

Gordan
On Thu, Mar 11, 2010 at 04:03:59PM +0000, Gordan Bobic wrote:> On Thu, 11 Mar 2010 16:35:33 +0100, Stephan von Krawczynski > <skraw@ithnet.com> wrote: > > >> Besides, why shouldn''t we help the drive firmware by > >> - writing the data only in erase-block sizes > >> - trying to write blocks that are smaller than the erase-block in a way > >> that won''t cross the erase-block boundary > > > > Because if the designing engineer of a good SSD controller wasn''t able > to > > cope with that he will have no chance to design a second one. > > You seem to be confusing quality of implementation with theoretical > possibility. > > >> This will not only increase the life of the SSD but also increase its > >> performance. > > > > TRIM: maybe yes. Rest: pure handwaving. > > > >> [...] > >> > > And your guess is that intel engineers had no glue when designing > >> > > the XE > >> > > including its controller? You think they did not know what you and > me > >> > > know and > >> > > therefore pray every day that some smart fs designer falls from > >> > > heaven > >> > > and saves their product from dying in between? Really? > >> > > >> > I am saying that there are problems that CANNOT be solved on the disk > >> > firmware level. Some problems HAVE to be addressed higher up the > stack. > >> > >> Exactly, you can''t assume that the SSDs firmware understands any and > all > >> file > >> system layouts, especially if they are on fragmented LVM or other > >> logical > >> volume manager partitions. > > > > Hopefully the firmware understands exactly no fs layout at all. That > would > > be > > braindead. Instead it should understand how to arrange incoming and > > outgoing > > data in a way that its own technical requirements are met as perfect as > > possible. This is no spinning disk, it is completely irrelevant what the > > data > > layout looks like as long as the controller finds its way through and > copes > > best with read/write/erase cycles. It may well use additional RAM for > > caching and data reordering. > > Do you really believe ascending block numbers are placed in ascending > > addresses inside the disk (as an example)? Why should they? What does > that > > mean for fs block ordering? If you don''t know anyway what a controller > > does to > > your data ordering, how do you want to help it with its job? > > Please accept that we are _not_ talking about trivial flash mem here or > > pseudo-SSDs consisting of sd cards. The market has already evolved > better > > products. The dinosaurs are extincted even if some are still looking > alive. > > I am assuming that you are being deliberately facetious here (the > alternative is less kind). The simple fact is that you cannot come up with > some magical data (re)ordering method that nullifies problems of common > use-cases that are quite nasty for flash based media. > > For example - you have a disk that has had all it''s addressable blocks > tainted. A new write comes in - what do you do with it? Worse, a write > comes in spanning two erase blocks as a consequence of the data > re-alignment in the firmware. You have no choice but to wipe them both and > re-write the data. You''d be better off not doing the magic and assuming > that the FS is sensibly aligned.Ok, how exactly would the FS help here? We have a device with a 256kb erasure size, and userland does a 4k write followed by an fsync. If the FS were to be smart and know about the 256kb requirement, it would do a read/modify/write cycle somewhere and then write the 4KB. 
The underlying implementation is the same in the device. It picks a destination, reads it, then writes it back. You could argue (and many people do) that this operation is risky and has a good chance of destroying old data. Perhaps we're best off if the FS does the rmw cycle instead into an entirely safe location. It's a great place for research and people are definitely looking at it.

But with all of that said, it has nothing to do with alignment or trim. Modern SSDs are RAID devices with a large stripe size, and someone somewhere is going to do a read/modify/write to service any small write. You can force this up to the FS or the application; it'll happen somewhere.

The filesystem metadata writes are a very small percentage of the problem overall. Sure, we can do better and try to force larger metadata blocks. This was the whole point behind btrfs' support for large tree blocks, which I'll be enabling again shortly.

-chris
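As a concrete illustration of the larger-metadata-blocks idea Chris mentions, the tree block size is something set at filesystem creation time. A minimal sketch, assuming a mkfs.btrfs build recent enough to expose the node/leaf size knobs, a kernel that can mount the result, and a hypothetical /dev/sdb1 (check mkfs.btrfs --help before relying on the exact option names):

    # create a btrfs filesystem with 16 KiB tree blocks instead of the 4 KiB default
    # (-n sets the node size, -l the leaf size; the 16384 value is only an example)
    mkfs.btrfs -n 16384 -l 16384 /dev/sdb1

Whether this helps on a given SSD depends entirely on how its controller groups writes, so treat it as something to benchmark rather than a guaranteed win.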
>>>>> "Gordan" == Gordan Bobic <gordan@bobich.net> writes:

Gordan> I fully agree that it's important for wear leveling on flash media, but from the security point of view, I think TRIM would be a useful feature on all storage media. If the erased blocks were trimmed it would provide a potentially useful feature of securely erasing the sectors that are no longer used. It would be useful and much more transparent than the secure erase features that only operate on the entire disk. Just MHO.

Except there are no guarantees that TRIM does anything, even if the drive claims to support it. There are a couple of IDENTIFY DEVICE knobs that indicate whether the drive deterministically returns data after a TRIM, and whether the resulting data is zeroes. We query these values and report them to the filesystem. However, testing revealed several devices that reported the right thing but which did in fact return the old data afterwards.

--
Martin K. Petersen
Oracle Linux Engineering
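For anyone wanting to see what their own drive advertises, the fields Martin describes show up in the ATA identify data that hdparm can dump. A minimal sketch, assuming a hypothetical /dev/sda and a reasonably recent hdparm; the exact output wording varies between drives and hdparm versions:

    # dump the identify data and look for the TRIM-related capability lines,
    # e.g. "Data Set Management TRIM supported" and, on drives that claim it,
    # "Deterministic read ZEROs after TRIM"
    hdparm -I /dev/sda | grep -i trim

As Martin notes, these are only claims made by the firmware; some drives have been observed returning stale data even when they advertise deterministic behaviour.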
On Thu, Mar 11, 2010 at 04:18:48PM +0000, Gordan Bobic wrote:> On Thu, 11 Mar 2010 09:21:30 -0500, Chris Mason <chris.mason@oracle.com> > wrote: > > On Wed, Mar 10, 2010 at 07:49:34PM +0000, Gordan Bobic wrote: > >> I''m looking to try BTRFS on a SSD, and I would like to know what SSD > >> optimizations it applies. Is there a comprehensive list of what ssd > >> mount option does? How are the blocks and metadata arranged? Are > >> there options available comparable to ext2/ext3 to help reduce wear > >> and improve performance? > >> > >> Specifically, on ext2 (journal means more writes, so I don''t use > >> ext3 on SSDs, since fsck typically only takes a few seconds when > >> access time is < 100us), I usually apply the > >> -b 4096 -E stripe-width = (erase_block/4096) > >> parameters to mkfs in order to reduce the multiple erase cycles on > >> the same underlying block. > >> > >> Are there similar optimizations available in BTRFS? > > > > All devices (raid, ssd, single spindle) tend to benefit from big chunks > > of writes going down close together on disk. This is true for different > > reasons on each one, but it is still the easiest way to optimize writes. > > COW filesystems like btrfs are very well suited to send down lots of big > > writes because we''re always reallocating things. > > Doesn''t this mean _more_ writes? If that''s the case, then that would make > btrfs a _bad_ choice for flash based media with limite write cycles.It just means that when we do write, we don''t overwrite the existing data in the file. We allocate a new block instead and write there (freeing the old one) . This gives us a lot of control over grouping writes together, instead of being restricted to the layout from when the file was first created. It also fragments the files much more, but this isn''t an issue on ssd.> > > For traditional storage, we also need to keep blocks from one file (or > > files in a directory) close together to reduce seeks during reads. SSDs > > have no such restrictions, and so the mount -o ssd related options in > > btrfs focus on tossing out tradeoffs that slow down writes in hopes of > > reading faster later. > > > > Someone already mentioned the mount -o ssd and ssd_spread options. > > Mount -o ssd is targeted at faster SSD that is good at wear leveling and > > generally just benefits from having a bunch of data sent down close > > together. In mount -o ssd, you might find a write pattern like this: > > > > block N, N+2, N+3, N+4, N+6, N+7, N+16, N+17, N+18, N+19, N+20 ... > > > > It''s a largely contiguous chunk of writes, but there may be gaps. Good > > ssds don''t really care about the gaps, and they benefit more from the > > fact that we''re preferring to reuse blocks that had once been written > > than to go off and find completely contiguous areas of the disk to > > write (which are more likely to have never been written at all). > > > > mount -o ssd_spread is much more strict. You''ll get N,N+2,N+3,N+4,N+5 > > etc because crummy ssds really do care about the gaps. > > > > Now, btrfs could go off and probe for the erasure size and work very > > hard to align things to it. As someone said, alignment of the partition > > table is very important here as well. But for modern ssd this generally > > matters much less than just doing big ios and letting the little log > > structured squirrel inside the device figure things out. > > Thanks, that''s quite helpful. 
> Can you provide any insight into alignment of FS structures in such a way that they do not straddle erase block boundaries?

We align on 4k (but partition alignment can defeat this). We don't attempt to understand or guess at erasure blocks. Unless the filesystem completely takes over the FTL duties, I don't think it makes sense to do more than send large writes whenever we can. The raid 5/6 patches will add more knobs for strict alignment, but I'd be very surprised if they made a big difference on modern SSDs.

> > For trim, we do have mount -o discard. It does introduce a run time performance hit (this varies wildly from device to device) and we're tuning things as discard capable devices become more common. If anyone is looking for a project it would be nice to have an ioctl that triggers free space discards in bulk.
>
> Are you saying that -o discard implements trim support?

Yes, it sends trim/discards down to devices that support it.

-chris
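Putting the options from this exchange together, a minimal sketch of how they are used, assuming a hypothetical /dev/sdb1 mounted on /mnt (which option is appropriate depends on the drive, as Chris describes above):

    # relaxed SSD allocation for drives with competent wear leveling
    mount -o ssd /dev/sdb1 /mnt
    # stricter, more contiguous allocation for weaker drives
    mount -o ssd_spread /dev/sdb1 /mnt
    # additionally send TRIM/discard for freed blocks on drives that support it
    # (the run-time discard Chris mentions; it can cost performance on some devices)
    mount -o ssd,discard /dev/sdb1 /mnt

Since partition alignment can defeat the filesystem's own 4k alignment, it is also worth checking where the partition actually starts, for example with fdisk in sector units:

    # print the partition table with start offsets in 512-byte sectors
    fdisk -l -u /dev/sdb

The option names come from the discussion above; the device paths and the combinations shown are only examples to benchmark, not a recommendation for any particular drive.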
>>>>> "Gordan" == Gordan Bobic <gordan@bobich.net> writes:Gordan> SD == SSD with an SD interface. No, not really. It is true that conceivably you could fit a sophisticated controller in an SD card form factor. But fact is that takes up space which could otherwise be used for flash. There may also be power consumption/heat dissipation concerns. Most SD card controllers have very, very simple wear leveling that in most cases rely on the filesystem being FAT. These cards are aimed at cameras, MP3 players, etc. after all. And consequently it''s trivial to wear out an SD card by writing a block over and over. The same is kind of true for Compact Flash. There are two types of cards, I prefer to think of them as camera grade and industrial. Camera grade CF is really no different from SD cards or any other consumer flash form factor. Industrial CF cards have controllers with sophisticated wear leveling. Usually this is not quite as clever as a "big" SSD, but it is close enough that you can treat the device as a real disk drive. I.e. it has multiple channels working in parallel unlike the consumer devices. As a result of the smarter controller logic and the bigger bank of spare flash, industrial cards are much smaller in capacity. Typically in the 1-4 GB range. But they are in many cases indistinguishable from a real SSD in terms of performance and reliability. Gordan> You can make an educated guess. For starters given that visible Gordan> sector sizes are not equal to FS block sizes, it means that FS Gordan> block sizes can straddle erase block boundaries without the Gordan> flash controller, no matter how fancy, being able to determine Gordan> this. Thus, at the very least, aligning FS structures so that Gordan> they do not straddle erase block boundaries is useful in ALL Gordan> cases. Thinking otherwise is just sticking your head in the sand Gordan> because you cannot be bothered to think. There are no means of telling what the erase block size is. None. We have no idea. The vendors won''t talk. It''s part of their IP. Also, there is no point in being hung up on the whole erase block thing. Only crappy SSDs use block mapping where that matters. These devices will die a horrible death soon enough. Good SSDs use a technique akin to logging filesystems in which the erase block size and all other other physical characteristics don''t matter (from a host perspective). -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 11 Mar 2010 15:39:05 +0100 Sander <sander@humilis.net> wrote:> Stephan von Krawczynski wrote (ao): > > Honestly I would just drop the idea of an SSD option simply because the > > vendors implement all kinds of neat strategies in their devices. So in the end > > you cannot really tell if the option does something constructive and not > > destructive in combination with a SSD controller. > > My understanding of the ssd mount option is also that the fs doens''t try > to do all kinds of smart (and potential expensive) things which make > sense for rotating media to reduce seeks and the like. > > SanderSuch an optimization sounds valid on first sight. But re-think closely: how does the fs really know about seeks needed during some operation? If your disk is a single plate one your seeks are completely different from multi plate. So even a simple case is more or less unpredictable. If you consider a RAID or SAN as device base it should be clear that trying to optimize for certain device types is just a fake. What does that tell you? The optimization was a pure loss of work hours in the first place. In fact if you look at this list a lot of talks going on are highly academic and have no real usage scenario. Sometimes trying to be super-smart is indeed not useful (for a fs) ... -- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote:> On Thu, 11 Mar 2010 15:39:05 +0100 > Sander <sander@humilis.net> wrote: > > > Stephan von Krawczynski wrote (ao): > > > Honestly I would just drop the idea of an SSD option simply because the > > > vendors implement all kinds of neat strategies in their devices. So in the end > > > you cannot really tell if the option does something constructive and not > > > destructive in combination with a SSD controller. > > > > My understanding of the ssd mount option is also that the fs doens''t try > > to do all kinds of smart (and potential expensive) things which make > > sense for rotating media to reduce seeks and the like. > > > > Sander > > Such an optimization sounds valid on first sight. But re-think closely: how > does the fs really know about seeks needed during some operation?Well the FS makes a few assumptions (in the nonssd case). First it assumes the storage is not a memory device. If things would fit in memory we wouldn''t need filesytems in the first place. Then it assumes that adjacent blocks are cheap to read and blocks that are far away are expensive to read. Given expensive raid controllers, cache, and everything else, you''re correct that sometimes this assumption is wrong. But, on average seeking hurts. Really a lot. We try to organize files such that files that are likely to be read together are found together on disk. Btrfs is fairly good at this during file creation and not as good as ext*/xfs as files over overwritten and modified again and again (due to cow). If you turn mount -o ssd on for your drive and do a test, you might not notice much difference right away. ssds tend to be pretty good right out of the box. Over time it tends to help, but it is a very hard thing to benchmark in general. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 11 March 2010 17:19:32 Chris Mason wrote:> On Thu, Mar 11, 2010 at 04:03:59PM +0000, Gordan Bobic wrote: > > On Thu, 11 Mar 2010 16:35:33 +0100, Stephan von Krawczynski > > > > <skraw@ithnet.com> wrote: > > >> Besides, why shouldn''t we help the drive firmware by > > >> - writing the data only in erase-block sizes > > >> - trying to write blocks that are smaller than the erase-block in a > > >> way that won''t cross the erase-block boundary > > > > > > Because if the designing engineer of a good SSD controller wasn''t able > > > > to > > > > > cope with that he will have no chance to design a second one. > > > > You seem to be confusing quality of implementation with theoretical > > possibility. > > > > >> This will not only increase the life of the SSD but also increase its > > >> performance. > > > > > > TRIM: maybe yes. Rest: pure handwaving. > > > > > >> [...] > > >> > > >> > > And your guess is that intel engineers had no glue when designing > > >> > > the XE > > >> > > including its controller? You think they did not know what you and > > > > me > > > > >> > > know and > > >> > > therefore pray every day that some smart fs designer falls from > > >> > > heaven > > >> > > and saves their product from dying in between? Really? > > >> > > > >> > I am saying that there are problems that CANNOT be solved on the > > >> > disk firmware level. Some problems HAVE to be addressed higher up > > >> > the > > > > stack. > > > > >> Exactly, you can''t assume that the SSDs firmware understands any and > > > > all > > > > >> file > > >> system layouts, especially if they are on fragmented LVM or other > > >> logical > > >> volume manager partitions. > > > > > > Hopefully the firmware understands exactly no fs layout at all. That > > > > would > > > > > be > > > braindead. Instead it should understand how to arrange incoming and > > > outgoing > > > data in a way that its own technical requirements are met as perfect as > > > possible. This is no spinning disk, it is completely irrelevant what > > > the data > > > layout looks like as long as the controller finds its way through and > > > > copes > > > > > best with read/write/erase cycles. It may well use additional RAM for > > > caching and data reordering. > > > Do you really believe ascending block numbers are placed in ascending > > > addresses inside the disk (as an example)? Why should they? What does > > > > that > > > > > mean for fs block ordering? If you don''t know anyway what a controller > > > does to > > > your data ordering, how do you want to help it with its job? > > > Please accept that we are _not_ talking about trivial flash mem here or > > > pseudo-SSDs consisting of sd cards. The market has already evolved > > > > better > > > > > products. The dinosaurs are extincted even if some are still looking > > > > alive.You seem to be forgetting that CEOs like to save 10 cents per drive to show "millions of dollars saved" by their work, I highly doubt that we won''t see SSDs with half assed wear leveling implementations 10 years from now. And no, I don''t think that the linear storage that we see at the ATA level is any linear on the drive itself. But erase blocks are still erase blocks. I highly doubt that the abstraction layer works over sector sizes (512B) and not over whole erase block sizes -- just because it would make it much more complicated, thus slower. 
This way, even if the writes to the flash cells are made in fashion similar to a LogFS, one will still get r/m/w cycle if the write is 512B in size on a block that has also other data.> > > > I am assuming that you are being deliberately facetious here (the > > alternative is less kind). The simple fact is that you cannot come up > > with some magical data (re)ordering method that nullifies problems of > > common use-cases that are quite nasty for flash based media. > > > > For example - you have a disk that has had all it''s addressable blocks > > tainted. A new write comes in - what do you do with it? Worse, a write > > comes in spanning two erase blocks as a consequence of the data > > re-alignment in the firmware. You have no choice but to wipe them both > > and re-write the data. You''d be better off not doing the magic and > > assuming that the FS is sensibly aligned. > > Ok, how exactly would the FS help here? We have a device with a 256kb > erasure size, and userland does a 4k write followed by an fsync.I assume here that the FS knows about erasure size and does implement TRIM.> If the FS were to be smart and know about the 256kb requirement, it > would do a read/modify/write cycle somewhere and then write the 4KB.If all the free blocks have been TRIMmed, FS should pick a completely free erasure size block and write those 4KiB of data. Correct implementation of wear leveling in the drive should notice that the write is entirely inside a free block and make just a write cycle adding zeros to the end of supplied data.> The underlying implementation is the same in the device. It picks a > destination, reads it then writes it back. You could argue (and many > people do) that this operation is risky and has a good chance of > destroying old data. Perhaps we''re best off if the FS does the rmw > cycle instead into an entirely safe location.And IMO that''s the idea behind TRIM -- not to force the device do do rmw cycles, only write cycle or erase cycle, provided there''s free space and the free space doesn''t have considerably more write cycles than the already allocated data.> > It''s a great place for research and people are definitely looking at it. > > But with all of that said, it has nothing to do with alignment or trim. > Modern ssds are a raid device with a large stripe size, and someone > somewhere is going to do a read/modify/write to service any small write. > You can force this up to the FS or the application, it''ll happen > somewhere.Yes, and if the parition is full rmw will happen in the drive. But if the partition is far from full, free space is TRIMmed then than the r/m/w cycle will happen inside btrfs and the SSD won''t have to do its magic -- making the process faster. The effect will be a FS that behaves consistently over a broad range of SSDs, provided there''s free space left.> The filesystem metadata writes are a very small percentage of the > problem overall. Sure we can do better and try to force larger metadata > blocks. This was the whole point behind btrfs'' support for large tree > blocks, which I''ll be enabling again shortly.-- Hubert Kario QBS - Quality Business Software ul. Ksawerów 30/85 02-656 Warszawa POLAND tel. +48 (22) 646-61-51, 646-74-24 fax +48 (22) 646-61-50 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 12, 2010 at 02:07:40AM +0100, Hubert Kario wrote:> > > For example - you have a disk that has had all it''s addressable blocks > > > tainted. A new write comes in - what do you do with it? Worse, a write > > > comes in spanning two erase blocks as a consequence of the data > > > re-alignment in the firmware. You have no choice but to wipe them both > > > and re-write the data. You''d be better off not doing the magic and > > > assuming that the FS is sensibly aligned. > > > > Ok, how exactly would the FS help here? We have a device with a 256kb > > erasure size, and userland does a 4k write followed by an fsync. > > I assume here that the FS knows about erasure size and does implement TRIM. > > > If the FS were to be smart and know about the 256kb requirement, it > > would do a read/modify/write cycle somewhere and then write the 4KB. > > If all the free blocks have been TRIMmed, FS should pick a completely free > erasure size block and write those 4KiB of data. > > Correct implementation of wear leveling in the drive should notice that the > write is entirely inside a free block and make just a write cycle adding zeros > to the end of supplied data. > > > The underlying implementation is the same in the device. It picks a > > destination, reads it then writes it back. You could argue (and many > > people do) that this operation is risky and has a good chance of > > destroying old data. Perhaps we''re best off if the FS does the rmw > > cycle instead into an entirely safe location. > > And IMO that''s the idea behind TRIM -- not to force the device do do rmw > cycles, only write cycle or erase cycle, provided there''s free space and the > free space doesn''t have considerably more write cycles than the already > allocated data. > > > > > It''s a great place for research and people are definitely looking at it. > > > > But with all of that said, it has nothing to do with alignment or trim. > > Modern ssds are a raid device with a large stripe size, and someone > > somewhere is going to do a read/modify/write to service any small write. > > You can force this up to the FS or the application, it''ll happen > > somewhere. > > Yes, and if the parition is full rmw will happen in the drive. But if the > partition is far from full, free space is TRIMmed then than the r/m/w cycle > will happen inside btrfs and the SSD won''t have to do its magic -- making the > process faster.The filesystem cannot do read/modify/write faster or better than the drive. The drive is pushing data around internally and the FS has to pull it in and out of the sata bus. The drive is much faster. The FS can be safer than the drive because it is able to do more consistency checks on the data as it reads. But this also has a cost because the crcs for the blocks might not be adjacent to the block. If the FS is the FTL, we don''t need trim because the FS already knows which blocks are in use. So, there isn''t as much complexity in finding the free erasure block. The FS FTL does win there. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario <hka@qbs.com.pl> wrote:> > [...] > > If the FS were to be smart and know about the 256kb requirement, it > > would do a read/modify/write cycle somewhere and then write the 4KB. > > If all the free blocks have been TRIMmed, FS should pick a completely free > erasure size block and write those 4KiB of data. > > Correct implementation of wear leveling in the drive should notice that the > write is entirely inside a free block and make just a write cycle adding zeros > to the end of supplied data.Your assumption here is that your _addressed_ block layout is completely identical to the SSDs "disk" layout. Else you cannot know where a "free erasure block" is located and how to address it from FS. I really wonder what this assumption is based on. You still think a SSD is a true disk with linear addressing. I doubt that very much. Even on true spinning disks your assumption is wrong for relocated sectors. Which basically means that every disk controller firmware fiddles around with the physical layout since decades. Please accept that you cannot do a disks'' job in FS. The more advanced technology gets the more disks become black boxes with a defined software interface. Use this interface and drop the idea of having inside knowledge of such a device. That''s other peoples'' work. If you want to design smart SSD controllers hire at a company that builds those. -- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday 12 March 2010 10:15:28 Stephan von Krawczynski wrote:
> On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario <hka@qbs.com.pl> wrote:
> > > [...]
> > > If the FS were to be smart and know about the 256kb requirement, it would do a read/modify/write cycle somewhere and then write the 4KB.
> >
> > If all the free blocks have been TRIMmed, FS should pick a completely free erasure size block and write those 4KiB of data.
> >
> > Correct implementation of wear leveling in the drive should notice that the write is entirely inside a free block and make just a write cycle adding zeros to the end of supplied data.
>
> Your assumption here is that your _addressed_ block layout is completely identical to the SSDs "disk" layout. Else you cannot know where a "free erasure block" is located and how to address it from FS. I really wonder what this assumption is based on. You still think a SSD is a true disk with linear addressing. I doubt that very much.

I made no such assumptions. I'm sure that the linearity on the ATA LBA level isn't so linear on the device level, especially after wear-leveling takes its toll, but I assume that the smallest block of data that the translation layer can address is erase-block sized and that all the erase-blocks are equal in size. Otherwise the algorithm would be needlessly complicated, which would make it both slower and more error prone.

> Even on true spinning disks your assumption is wrong for relocated sectors.

Which we don't have to worry about, because if the drive has fewer than 5 of 'em the impact of hitting them is marginal, and if there are more, the user has a much more pressing problem than the performance of the drive or FS.

> Which basically means that every disk controller firmware fiddles around with the physical layout since decades. Please accept that you cannot do a disks' job in FS. The more advanced technology gets the more disks become black boxes with a defined software interface. Use this interface and drop the idea of having inside knowledge of such a device. That's other peoples' work. If you want to design smart SSD controllers hire at a company that builds those.

And I don't think that doing the disks' job in the FS is a good idea, but I think that we should be able to minimise the impact of the translation layer. The way to do this is to treat the device as a block device with sectors the size of erase-blocks. That's nothing too fancy, don't you think?

--
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
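One thing that can be done today without any knowledge of the drive internals is to at least start partitions on a boundary that is a multiple of the likely erase-block sizes. A minimal sketch with parted, assuming a hypothetical /dev/sdb and using 1 MiB as a convenient multiple of the commonly cited 128-512 KiB erase blocks (note that this wipes the existing partition table):

    # new GPT label, first partition starting at 1 MiB and spanning the rest of the device
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB 100%

As Martin points out elsewhere in the thread, the real erase-block size is usually not published, so this only guarantees that the filesystem's own 4k alignment is not thrown off by an odd partition start; it cannot guarantee alignment to the actual erase blocks.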
On Thu, 11 Mar 2010 13:00:17 -0500 Chris Mason <chris.mason@oracle.com> wrote:> On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote: > > On Thu, 11 Mar 2010 15:39:05 +0100 > > Sander <sander@humilis.net> wrote: > > > > > Stephan von Krawczynski wrote (ao): > > > > Honestly I would just drop the idea of an SSD option simply because the > > > > vendors implement all kinds of neat strategies in their devices. So in the end > > > > you cannot really tell if the option does something constructive and not > > > > destructive in combination with a SSD controller. > > > > > > My understanding of the ssd mount option is also that the fs doens''t try > > > to do all kinds of smart (and potential expensive) things which make > > > sense for rotating media to reduce seeks and the like. > > > > > > Sander > > > > Such an optimization sounds valid on first sight. But re-think closely: how > > does the fs really know about seeks needed during some operation? > > Well the FS makes a few assumptions (in the nonssd case). First it > assumes the storage is not a memory device. If things would fit in > memory we wouldn''t need filesytems in the first place.Ok, here is the bad news. This assumption everything from right to completely wrong, and you cannot really tell the mainstream answer. Two examples from opposite parts of the technology world: - History: way back in the 80''s there was a 3rd party hardware for C=1541 (floppy drive for C=64) that read in the complete floppy and served all incoming requests from the ram buffer. So your assumption can already be wrong for a trivial floppy drive from ancient times. - Nowadays: being a linux installation today chances are that the matrix has you. Quite a lot of installations are virtualized. So your storage is a virtual one either, which means it is likely being a fs buffer from the host system, i.e. RAM. And sorry to say: "if things would fit in memory" you probably still need a fs simply because there is no actual way to organize data (be it executable or not) in RAM without a fs layer. You can''t save data without an abstract file data type. To have one accessible you need a fs. Btw the other way round is as interesting: there is currently no fs for linux that knows how to execute in place. Meaning if you really had only RAM and you have a fs to organize your data it would be just logical to have ways to _not_ load data (in other parts of the RAM), but to use it in its original storage (RAM-)space.> Then it assumes that adjacent blocks are cheap to read and blocks that > are far away are expensive to read. Given expensive raid controllers, > cache, and everything else, you''re correct that sometimes this > assumption is wrong.As already mentioned this assumption may be completely wrong even without a raid controller, being within a virtual environment. Even far away blocks can be one byte away in the next fs buffer of the underlying host fs (assuming your device is in fact a file on the host;-).> But, on average seeking hurts. Really a lot.Yes, seeking hurts. But there is no way to know if there is seeking at all. On the other hand, if your storage is a netblock device seeking on the server is probably your smallest problem, compared to the network latency in between.> We try to organize files such that files that are likely to be read > together are found together on disk. 
Btrfs is fairly good at this > during file creation and not as good as ext*/xfs as files over > overwritten and modified again and again (due to cow).You are basically saying that btrfs perfectly organizes write-once devices ;-)> If you turn mount -o ssd on for your drive and do a test, you might not > notice much difference right away. ssds tend to be pretty good right > out of the box. Over time it tends to help, but it is a very hard thing > to benchmark in general.Honestly, this sounds like "I give up" to me ;-) You just said that generally it is "very hard to benchmark". Which means "nobody can see or feel it in real world" in non-tech language. Please understand that I am the last one critizing your and others'' brillant work and the time you spend for btrfs. Only I do believe that if you spent one hour on some fs like glusterfs for every 10 hours you spend on btrfs you would be both king and queen for the linux HA community :-) (but probably unemployed, so I can''t really beat you for it)> -chris-- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 12 Mar 2010 17:00:08 +0100 Hubert Kario <hka@qbs.com.pl> wrote:> > Even on true > > spinning disks your assumption is wrong for relocated sectors. > > Which we don''t have to worry about because if the drive has less than 5 of > ''em, the impact of hitting them is marginal and if there are more, the user > has much more pressing problem than the performance of the drive or FS.Are you really sure that a drive firmware tells you about the true number of relocated sectors? I mean if it makes the product look better in comparison to another product, are you really sure that the firmware will not tell you what you expect to see only to make you content and happy with your drive?> > Which > > basically means that every disk controller firmware fiddles around with > > the physical layout since decades. Please accept that you cannot do a > > disks'' job in FS. The more advanced technology gets the more disks become > > black boxes with a defined software interface. Use this interface and drop > > the idea of having inside knowledge of such a device. That''s other > > peoples'' work. If you want to design smart SSD controllers hire at a > > company that builds those. > > And I don''t think that doing disks'' job in the FS is good idea, but I think > that we should be able to minimise the impact of the translation layer. > > The way to do this, is to threat the device as a block device with sectors the > size of erase-blocks. That''s nothing too fancy, don''t you think?I don''t believe anyone is able to tell the size of erase-blocks of some device - current and future - for sure. I do believe that making this guess only reduces the future design options for new devices - if its creators care at all about your guess. Why not let the fs designer take his creative options in fs layer and let the device designer use his brain on the device level and all meet at the predefined software interface in between - and nowhere _else_.> -- > Hubert Kario-- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Saturday 13 March 2010 18:02:10 Stephan von Krawczynski wrote:> On Fri, 12 Mar 2010 17:00:08 +0100 > Hubert Kario <hka@qbs.com.pl> wrote: > > > Even on true > > > spinning disks your assumption is wrong for relocated sectors. > > > > Which we don''t have to worry about because if the drive has less than 5 > > of ''em, the impact of hitting them is marginal and if there are more, > > the user has much more pressing problem than the performance of the > > drive or FS. > > Are you really sure that a drive firmware tells you about the true number > of relocated sectors? I mean if it makes the product look better in > comparison to another product, are you really sure that the firmware will > not tell you what you expect to see only to make you content and happy > with your drive?because Joe Sixpack reads SMART values, and even if he does, he will be much more angry when a drive that has no or few relocations fails, that when a drive that reports that''s failing fails. If the drive arrives with badsectors, it goes where it came from the same day if it meets an IT guy worth its salt, any IT guy knows that some HDDs develop badsectors no matter the make and model, but if they do, you replace them. And as the Google disk survey showed, the SMART has very high percentage of Type I errors, but very few Type II errors. But we''re off-topic here> > > Which > > > basically means that every disk controller firmware fiddles around with > > > the physical layout since decades. Please accept that you cannot do a > > > disks'' job in FS. The more advanced technology gets the more disks > > > become black boxes with a defined software interface. Use this > > > interface and drop the idea of having inside knowledge of such a > > > device. That''s other peoples'' work. If you want to design smart SSD > > > controllers hire at a company that builds those. > > > > And I don''t think that doing disks'' job in the FS is good idea, but I > > think that we should be able to minimise the impact of the translation > > layer. > > > > The way to do this, is to threat the device as a block device with > > sectors the size of erase-blocks. That''s nothing too fancy, don''t you > > think? > > I don''t believe anyone is able to tell the size of erase-blocks of some > device - current and future - for sure.Well, if the engeneer that designed it doesn''t know this, I don''t know how he got his degree. Just because it isn''t publicised now, doesn''t mean it won''t be in near future. Besides that, to detect how big the erase-blocks are in size is easy, if they have any impact on the performance, if they don''t have any impact (whatever the reason) tunning for their size is pointless anyway.> I do believe that making this > guess only reduces the future design options for new devices - if its > creators care at all about your guess.Did I, or any one else, say that we want to hardwire a specific erase-block size to the design of the FS?! That would be utter stupidity!> Why not let the fs designer take his creative options in fs layer and let > the device designer use his brain on the device level and all meet at the > predefined software interface in between - and nowhere _else_.We (well, at least Gordon and I) just want a "stripe_width" option added to the mkfs.btrfs, just like it is there for ext2/3/4, reiserfs, xfs and jfs to name a few. It would need very few additional tweaks to make it SSD friendly, hardly any considering how -o ssd or -o ssd_spread already work. 
You're forgetting there's an elephant in the room that won't talk to devices that don't have sectors 512B in size. If not for it, there wouldn't even _be_ SSDs with 512B sectors. It's not the way Flash memory works. The 512B abstraction is there to be compatible, to work with one current OS; it's not there because it better describes the way Flash memory works or is the best way to address the data on the device itself. There are already consumer HDDs with 4kiB sector size, so the situation is getting better. We can only hope that in a few years' time SSDs will have sectors the size of erase-blocks. But in the meantime, stripe_width would be enough.

Besides, the stripe_width option will be useful not only for SSDs but also in environments where btrfs is on a device that is a RAID5/6 array (reconfiguring a server with many virtual machines is far from easy and sometimes just can't be done because of heterogeneous virtualised OSs that need the data protection provided by lower layers).

--
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
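For reference, this is roughly what the knob being asked for looks like on the filesystems Hubert lists as already having it. A minimal sketch, assuming a hypothetical /dev/sdb1 and a guessed 256 KiB erase block (drives do not advertise the real value, so the numbers are assumptions):

    # ext4: stride and stripe-width are given in filesystem blocks,
    # so 256 KiB / 4 KiB = 64 blocks
    mkfs.ext4 -b 4096 -E stride=64,stripe-width=64 /dev/sdb1

xfs exposes analogous stripe unit/width settings at mkfs time. Nothing comparable is exposed by mkfs.btrfs as of this thread; the example is only meant to show the kind of interface being requested.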
On Saturday 13 March 2010 17:43:59 Stephan von Krawczynski wrote:> On Thu, 11 Mar 2010 13:00:17 -0500 > Chris Mason <chris.mason@oracle.com> wrote: > > On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote: > > > On Thu, 11 Mar 2010 15:39:05 +0100 > > > Sander <sander@humilis.net> wrote: > > > > Stephan von Krawczynski wrote (ao): > > > > > Honestly I would just drop the idea of an SSD option simply because > > > > > the vendors implement all kinds of neat strategies in their > > > > > devices. So in the end you cannot really tell if the option does > > > > > something constructive and not destructive in combination with a > > > > > SSD controller. > > > > > > > > My understanding of the ssd mount option is also that the fs doens''t > > > > try to do all kinds of smart (and potential expensive) things which > > > > make sense for rotating media to reduce seeks and the like. > > > > > > > > Sander > > > > > > Such an optimization sounds valid on first sight. But re-think closely: > > > how does the fs really know about seeks needed during some operation? > > > > Well the FS makes a few assumptions (in the nonssd case). First it > > assumes the storage is not a memory device. If things would fit in > > memory we wouldn''t need filesytems in the first place. > > Ok, here is the bad news. This assumption everything from right to > completely wrong, and you cannot really tell the mainstream answer. > Two examples from opposite parts of the technology world: > - History: way back in the 80''s there was a 3rd party hardware for C=1541 > (floppy drive for C=64) that read in the complete floppy and served all > incoming requests from the ram buffer. So your assumption can already be > wrong for a trivial floppy drive from ancient times.such assumption doesn''t make it work slower on such device> - Nowadays: being a linux installation today chances are that the matrix > has you. Quite a lot of installations are virtualized. So your storage is > a virtual one either, which means it is likely being a fs buffer from the > host system, i.e. RAM.Buffers use read_ahead and are smaller than the underlaying device, still, such assumption doesn''t make the FS perform worse in this situation.> And sorry to say: "if things would fit in memory" you probably still need a > fs simply because there is no actual way to organize data (be it > executable or not) in RAM without a fs layer. You can''t save data without > an abstract file data type. To have one accessible you need a fs.yes, that''s why there is tmpfs, btrfs isn''t meant to be all and end all as far as FSs go> Btw the other way round is as interesting: there is currently no fs for > linux that knows how to execute in place. Meaning if you really had only > RAM and you have a fs to organize your data it would be just logical to > have ways to _not_ load data (in other parts of the RAM), but to use it in > its original storage (RAM-)space.at least ext2 does support XIP on platform that support it...> > > Then it assumes that adjacent blocks are cheap to read and blocks that > > are far away are expensive to read. Given expensive raid controllers, > > cache, and everything else, you''re correct that sometimes this > > assumption is wrong. > > As already mentioned this assumption may be completely wrong even without a > raid controller, being within a virtual environment. 
Even far away blocks > can be one byte away in the next fs buffer of the underlying host fs > (assuming your device is in fact a file on the host;-).and again, such assumption doesn''t reduce the performance> > > But, on average seeking hurts. Really a lot. > > Yes, seeking hurts. But there is no way to know if there is seeking at all. > On the other hand, if your storage is a netblock device seeking on the > server is probably your smallest problem, compared to the network latency > in between.and because of that, there''s read ahead and support for big packets on the TCP level, so the assumption does make the FS perform better with it than without it. It''s one of the assumptions that you _have_ to make, just like the assumption that the computer counts in binary, or there''s more disk space than RAM. But those assumptions _don''t_ make the performance (much) worse when they don''t hold true for known devices that can impersonate rotating magnetic media.> > We try to organize files such that files that are likely to be read > > together are found together on disk. Btrfs is fairly good at this > > during file creation and not as good as ext*/xfs as files over > > overwritten and modified again and again (due to cow). > > You are basically saying that btrfs perfectly organizes write-once devices > ;-) > > > If you turn mount -o ssd on for your drive and do a test, you might not > > notice much difference right away. ssds tend to be pretty good right > > out of the box. Over time it tends to help, but it is a very hard thing > > to benchmark in general. > > Honestly, this sounds like "I give up" to me ;-) > You just said that generally it is "very hard to benchmark". Which means > "nobody can see or feel it in real world" in non-tech language.No, it''s not this. When a SSD is fresh, the undeling write leveling has many blocks to choose from, so it''s blaizing fast. The same holds true when the test uses small amount of data (relative to SSD size). "very hard to benchmark" means just that -- the benchmark is much more complicated, must take into account much more variables and takes much more time compared to rotating magnetic media benchmark. To test SSD performance you need to benchmark both the speed of flash memory _and_ the speed and performance of the write leveling algorithm (because it shows its ugly head only after specific workloads or when all blocks are allocated), and that''s non trivial to say the least. Add FS on top of it and you have a nice dissertation right there. -- Hubert Kario QBS - Quality Business Software ul. Ksawerów 30/85 02-656 Warszawa POLAND tel. +48 (22) 646-61-51, 646-74-24 fax +48 (22) 646-61-50 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Mar 13, 2010 at 05:43:59PM +0100, Stephan von Krawczynski wrote:> On Thu, 11 Mar 2010 13:00:17 -0500 > Chris Mason <chris.mason@oracle.com> wrote: > > > On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote: > > > On Thu, 11 Mar 2010 15:39:05 +0100 > > > Sander <sander@humilis.net> wrote: > > > > > > > Stephan von Krawczynski wrote (ao): > > > > > Honestly I would just drop the idea of an SSD option simply because the > > > > > vendors implement all kinds of neat strategies in their devices. So in the end > > > > > you cannot really tell if the option does something constructive and not > > > > > destructive in combination with a SSD controller. > > > > > > > > My understanding of the ssd mount option is also that the fs doens''t try > > > > to do all kinds of smart (and potential expensive) things which make > > > > sense for rotating media to reduce seeks and the like. > > > > > > > > Sander > > > > > > Such an optimization sounds valid on first sight. But re-think closely: how > > > does the fs really know about seeks needed during some operation? > > > > Well the FS makes a few assumptions (in the nonssd case). First it > > assumes the storage is not a memory device. If things would fit in > > memory we wouldn''t need filesytems in the first place. > > Ok, here is the bad news. This assumption everything from right to completely > wrong, and you cannot really tell the mainstream answer. > Two examples from opposite parts of the technology world: > - History: way back in the 80''s there was a 3rd party hardware for C=1541 > (floppy drive for C=64) that read in the complete floppy and served all > incoming requests from the ram buffer. So your assumption can already be wrong > for a trivial floppy drive from ancient times.Agreed, I''ll try my best not to tune btrfs for trivial floppies from ancient times ;)> > Then it assumes that adjacent blocks are cheap to read and blocks that > > are far away are expensive to read. Given expensive raid controllers, > > cache, and everything else, you''re correct that sometimes this > > assumption is wrong. > > As already mentioned this assumption may be completely wrong even without a > raid controller, being within a virtual environment. Even far away blocks can > be one byte away in the next fs buffer of the underlying host fs (assuming > your device is in fact a file on the host;-).Ok, there are roughly three environments at play here. 1) Seeking hurts, and you have no idea if adjacent block numbers are close together on the device. 2) Seeking doesn''t hurt and you have no idea if adjacent block numbers are close together on the device. (SSD). 3) Seeking hurts and you can assume adjacent block numbers are close together on the device (disks). Type one is impossible to tune, and so it isn''t interesting in this discussion. There are an infinite number of ways to actually store data you care about, and just because one of those ways can''t be tuned doesn''t mean we should stop trying to tune for the ones that most people actually use.> > > But, on average seeking hurts. Really a lot. > > Yes, seeking hurts. But there is no way to know if there is seeking at all. > On the other hand, if your storage is a netblock device seeking on the server > is probably your smallest problem, compared to the network latency in between. 
>Very true, and if I were using such a setup in performance critical applications, I would: 1) Tune the network so that seeks mattered again 2) Tune the seeks.> > We try to organize files such that files that are likely to be read > > together are found together on disk. Btrfs is fairly good at this > > during file creation and not as good as ext*/xfs as files over > > overwritten and modified again and again (due to cow). > > You are basically saying that btrfs perfectly organizes write-once devices ;-)Storage is all about trade offs, and optimizing read access for write once vs write many is a very different thing. It''s surprising how many of your files are written once and never read, let alone written and then never changed.> > > If you turn mount -o ssd on for your drive and do a test, you might not > > notice much difference right away. ssds tend to be pretty good right > > out of the box. Over time it tends to help, but it is a very hard thing > > to benchmark in general. > > Honestly, this sounds like "I give up" to me ;-) > You just said that generally it is "very hard to benchmark". Which means > "nobody can see or feel it in real world" in non-tech language.No, it just means it is hard to benchmark. SSDs, even really good ssds, are not deterministic. Sometimes they are faster than others and the history of how you''ve abused it in the past factors into how well it performs in the future. A simple graph that talks about the performance of one drive in one workload needs a lot of explanation.> > Please understand that I am the last one critizing your and others'' brillant > work and the time you spend for btrfs. Only I do believe that if you spent one > hour on some fs like glusterfs for every 10 hours you spend on btrfs you would > be both king and queen for the linux HA community :-) > (but probably unemployed, so I can''t really beat you for it)Grin, the list of things I wish I had time to work on is quite long. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/13/2010 08:43 AM, Stephan von Krawczynski wrote:
> - Nowadays: being a linux installation today chances are that the matrix has you. Quite a lot of installations are virtualized. So your storage is a virtual one either, which means it is likely being a fs buffer from the host system, i.e. RAM.

That would be a strictly amateur-hour implementation. It is very important for data integrity that at least all writes are synchronous, and ideally all IO should be uncached in the host. In that case the performance of the guest's virtual IO device will be broadly similar to a real hardware device.

J
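As one concrete way of getting the behaviour Jeremy describes, hypervisors generally expose a per-disk cache mode; with qemu/KVM, for instance, host page-cache bypass is selected on the -drive option. A sketch only, with the image path and the rest of the command line left out as assumptions:

    # bypass the host page cache so guest writes go to the backing storage,
    # not to host RAM (cache=writethrough keeps the host cache but still
    # reports writes only after they reach the device)
    qemu-kvm -drive file=/var/lib/vm/guest.img,cache=none ...

Which mode is appropriate depends on the storage stack underneath, but the point stands: benchmarking a filesystem against host RAM says little about the device.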