thr3ads.net - Btrfs devel - btrfs support for efficient SSD operation (data blocks alignment) [Feb 2012]

Martin

2012-Feb-08 19:24 UTC

btrfs support for efficient SSD operation (data blocks alignment)

My understanding is that for x86 architecture systems, btrfs only allows
a sector size of 4kB for a HDD/SSD. That is fine for the present HDDs
assuming the partitions are aligned to a 4kB boundary for that device.
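
The alignment condition can be checked with simple arithmetic. A minimal sketch, assuming the partition start is known in 512-byte sectors; the function name and the sample start sectors are illustrative and not taken from any particular tool:

```python
LOGICAL_SECTOR = 512  # bytes; the sector size the drive reports to Linux


def is_aligned(start_sector: int, boundary: int,
               sector_size: int = LOGICAL_SECTOR) -> bool:
    """True if a partition starting at start_sector begins on a multiple of boundary bytes."""
    return (start_sector * sector_size) % boundary == 0


# A partition starting at sector 63: 63 * 512 = 32256 bytes, not a multiple of 4096.
print(is_aligned(63, 4096))
# A partition starting at sector 2048: 1 MiB, a multiple of 4096 and of 16384.
print(is_aligned(2048, 4096), is_aligned(2048, 16 * 1024))
# The same start is not a multiple of a 2MB erase block.
print(is_aligned(2048, 2 * 1024 * 1024))
```

The same check works for any boundary of interest (4kB sector, 16kB mapping chunk, 2MB erase block) by changing the second argument.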

However for SSDs...

I'm using for example a 60GByte SSD that has:

    8kB page size;
    16kB logical to physical mapping chunk size;
    2MB erase block size;
    64MB cache.

And the sector size reported to Linux 3.0 is the default 512 bytes!


My first thought is to try formatting with a sector size of 16kB to
align with the SSD logical mapping chunk size. This is to avoid SSD
write amplification. Also, the data transfer performance for that device
is near maximum for writes with a blocksize of 16kB and above. Yet,
btrfs supports a 4kByte page/sector size only at present...
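
To put a number on the write amplification concern, here is a rough sketch. It assumes a deliberately simplistic model in which the firmware rewrites a whole 16kB mapping chunk for every aligned host write and never caches or combines requests; real firmware may behave quite differently, and the function name is made up for illustration:

```python
def rmw_amplification(write_size: int, mapping_chunk: int) -> float:
    """Bytes programmed to flash per byte the host asked to write.

    Assumes the write starts on a mapping-chunk boundary and that the
    firmware always rewrites whole mapping chunks (no caching, no merging).
    """
    chunks_touched = -(-write_size // mapping_chunk)  # ceiling division
    return chunks_touched * mapping_chunk / write_size


KIB = 1024
for size in (512, 4 * KIB, 16 * KIB, 32 * KIB):
    print(size, rmw_amplification(size, 16 * KIB))
```

Under this model a 4kB write costs four times its size in flash programming, while writes of 16kB and above cost nothing extra.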


Is there any control possible over the btrfs filesystem structure to map
metadata and data structures to the underlying device boundaries?

For example to maximise performance, can the data chunks and the data
chunk size be aligned to be sympathetic to the SSD logical mapping chunk
size and the erase block size?

What features other than the trim function does btrfs employ to optimise
for SSD operation?


Regards,
Martin


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo

2012-Feb-09 01:42 UTC

Re: btrfs support for efficient SSD operation (data blocks alignment)

On 02/09/2012 03:24 AM, Martin wrote:
> My understanding is that for x86 architecture systems, btrfs only allows
> a sector size of 4kB for a HDD/SSD. That is fine for the present HDDs
> assuming the partitions are aligned to a 4kB boundary for that device.
> 
> However for SSDs...
> 
> I'm using for example a 60GByte SSD that has:
> 
>     8kB page size;
>     16kB logical to physical mapping chunk size;
>     2MB erase block size;
>     64MB cache.
> 
> And the sector size reported to Linux 3.0 is the default 512 bytes!
> 
> 
> My first thought is to try formatting with a sector size of 16kB to
> align with the SSD logical mapping chunk size. This is to avoid SSD
> write amplification. Also, the data transfer performance for that device
> is near maximum for writes with a blocksize of 16kB and above. Yet,
> btrfs supports a 4kByte page/sector size only at present...
> 
> 
> Is there any control possible over the btrfs filesystem structure to map
> metadata and data structures to the underlying device boundaries?
> 
> For example to maximise performance, can the data chunks and the data
> chunk size be aligned to be sympathetic to the SSD logical mapping chunk
> size and the erase block size?
> 
At least the metadata buffer size will support sizes larger than 4K; that is
in development.

> What features other than the trim function does btrfs employ to optimise
> for SSD operation?
> 
For example, COW (which avoids writing to the same place multiple times) and
delayed allocation (which is intended to reduce the write frequency).

thanks,
liubo
> 
> Regards,
> Martin
> 
> 

Martin

2012-Feb-10 01:05 UTC

Re: btrfs support for efficient SSD operation (data blocks alignment)

On 09/02/12 01:42, Liu Bo wrote:
> On 02/09/2012 03:24 AM, Martin wrote:
[ No problem for 4kByte sector HDDs. However, for SSDs... ]
>> However for SSDs...
>>
>> I'm using for example a 60GByte SSD that has:
>>
>>     8kB page size;
>>     16kB logical to physical mapping chunk size;
>>     2MB erase block size;
>>     64MB cache.
>>
>> And the sector size reported to Linux 3.0 is the default 512 bytes!
[...]
>> Is there any control possible over the btrfs filesystem structure to map
>> metadata and data structures to the underlying device boundaries?
>>
>> For example to maximise performance, can the data chunks and the data
>> chunk size be aligned to be sympathetic to the SSD logical mapping chunk
>> size and the erase block size?
>>
> 
> The metadata buffer size will support size larger than 4K at least, it is
> on development.

And also for the data? Could smaller data chunks also be packed in with the
metadata, as is done already, but with all the present parameters
proportioned according to the "sector size"?

(For my example, the filesystem may as well use 16kByte sectors because
the SSD firmware will do a read-modify-write for anything smaller.)

>> What features other than the trim function does btrfs employ to optimise
>> for SSD operation?
>>
> 
> e.g COW(avoid writing to one place multi-times),
> delayed allocation(intend to reduce the write frequency)

I'm using ext4 on an SSD web server and have formatted it with (for ext4):

mke2fs -v -T ext4 -L fs_label_name -b 4096 \
    -E stride=4,stripe-width=4,lazy_itable_init=0 \
    -O none,dir_index,extent,filetype,flex_bg,has_journal,sparse_super,uninit_bg \
    /dev/sdX

and mounted with the mount options:
journal_checksum,barrier,stripe=4,delalloc,commit=300,max_batch_time=15000,min_batch_time=200,discard,noatime,nouser_xattr,noacl,errors=remount-ro

The main bits for the SSD are the:
"stripe=4,delalloc,commit=300,max_batch_time=15000,min_batch_time=200,discard,noatime"

The "-b 4096" is the maximum value allowed. The stride and stripe-width
then take that up to 16kBytes (hopefully...).
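
The stride value follows from dividing the target chunk size by the filesystem block size. A small sketch of that arithmetic, assuming the SSD's 16kB mapping chunk is treated like a RAID chunk on a single device (that treatment is the approach described here, not something mke2fs documents for SSDs; the function name is illustrative):

```python
def ext4_stride(chunk_bytes: int, block_bytes: int) -> int:
    """Number of filesystem blocks that make up one chunk."""
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    return chunk_bytes // block_bytes


block = 4096                 # the -b value used above
chunk = 16 * 1024            # the SSD logical mapping chunk size
data_devices = 1             # a single SSD, so stripe-width equals stride

stride = ext4_stride(chunk, block)
stripe_width = stride * data_devices
print(f"-b {block} -E stride={stride},stripe-width={stripe_width}")
```

For a 16kB chunk and 4kB blocks this gives the stride=4,stripe-width=4 used in the command above.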

(Make sure you're on a good UPS with a reliable shutdown mechanism for
power fail!)


A further thought is:

For my one SSD example, the erase state appears to be all "0xFF"... Can
the fs easily check the erase state value and leave any blank space
unchanged to minimise the bit flipping?

Reasonable to be included?


All unnecessary for HDDs but possibly of use for maintaining the
lifespan of SSDs...

Hope of interest,

Regards,
Martin


Martin Steigerwald

2012-Feb-10 18:18 UTC

Re: btrfs support for efficient SSD operation (data blocks alignment)

Hi Martin,

On Wednesday, 8 February 2012, Martin wrote:
> My understanding is that for x86 architecture systems, btrfs only
> allows a sector size of 4kB for a HDD/SSD. That is fine for the
> present HDDs assuming the partitions are aligned to a 4kB boundary for
> that device.
> 
> However for SSDs...
> 
> I'm using for example a 60GByte SSD that has:
> 
>     8kB page size;
>     16kB logical to physical mapping chunk size;
>     2MB erase block size;
>     64MB cache.
> 
> And the sector size reported to Linux 3.0 is the default 512 bytes!
> 
> 
> My first thought is to try formatting with a sector size of 16kB to
> align with the SSD logical mapping chunk size. This is to avoid SSD
> write amplification. Also, the data transfer performance for that
> device is near maximum for writes with a blocksize of 16kB and above.
> Yet, btrfs supports a 4kByte page/sector size only at present...

The thing is, as far as I know, the better SSDs and even the dumber ones have
quite some intelligence in the firmware. At least for me, it is not clear
what all the firmware of my Intel SSD 320 does on its own, and whether any
of my optimization attempts even matter.

So I am not sure whether thinking about a single write operation of, say,
4 KB or 2 KB in isolation even makes sense. I bet several processes often
write data at once, so there is a larger amount of data to write.

What is not clear to me now is whether the SSD will combine several write
requests into a single mapping chunk or erase block, or combine them into
the already erased space of an erase block. I would bet at least the
better SSDs do. So even when, from the OS point of view, in a simplistic
example, one write of 1 MB goes to LBA 40000 and one write of 1 MB goes to
LBA 80000, the SSD might still use just a single erase block and place the
writes next to each other. As far as I understand, SSDs do COW to spread
writes evenly across erase blocks. As far as I further understand, from a
seek time point of view the exact location of a write does not matter at
all. So to me it looks perfectly sane for SSD firmware to combine writes as
it sees fit. And SSDs that carry capacitors, like the above-mentioned Intel
SSD, may even cache writes for a while to wait for further requests.
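
To illustrate the LBA 40000 / LBA 80000 example, here is a toy model of such a mapping layer. It is purely hypothetical: it assumes an append-only firmware that places every written page at the next free physical page, and it borrows the 8kB page and 2MB erase block figures from the drive described earlier in the thread. No real firmware is claimed to work this way.

```python
PAGE = 8 * 1024                      # flash page size (from the example drive)
ERASE_BLOCK = 2 * 1024 * 1024        # erase block size (from the example drive)
PAGES_PER_BLOCK = ERASE_BLOCK // PAGE
SECTOR = 512                         # logical sector size reported to the OS


class ToyFTL:
    """Hypothetical append-only mapping layer, for illustration only."""

    def __init__(self):
        self.next_physical_page = 0
        self.mapping = {}            # logical page number -> physical page number

    def write(self, lba, length):
        """Write length bytes at sector lba; return the erase blocks that received data."""
        first = lba * SECTOR // PAGE
        last = (lba * SECTOR + length - 1) // PAGE
        touched = set()
        for logical_page in range(first, last + 1):
            # Each page lands at the next free physical page, wherever its LBA is.
            self.mapping[logical_page] = self.next_physical_page
            touched.add(self.next_physical_page // PAGES_PER_BLOCK)
            self.next_physical_page += 1
        return touched


ftl = ToyFTL()
print(ftl.write(40000, 1024 * 1024))   # erase blocks used by the first 1 MB write
print(ftl.write(80000, 1024 * 1024))   # erase blocks used by the second 1 MB write
```

In this model both 1 MB writes end up in the same 2MB erase block even though their logical addresses are far apart, which is the behaviour speculated about above.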

The article on write amplification on Wikipedia gives me a glimpse of the
complexity involved [1]. Yes, I set stripe-width as well on my Ext4
filesystem, but frankly I am not even sure whether this has any
positive effect, except maybe sparing the SSD controller firmware some
reshuffling work.

So from my current point of view, most of what you wrote is IMHO more
important for really dumb flash, which, as I understand it, some kernel
developers would really like to see, so that most of the logic could be put
into the kernel and be easily modifiable: JBOF, just a bunch of flash cells
with an interface to access them directly. But for now, AFAIK, most consumer
grade SSDs just provide a SATA interface and hide the internals. So an
optimization for one kind or one brand of SSD may not be suitable for
another.

There are PCI Express models, but these probably aren't dumb either. And
then there is the idea of auto commit memory (ACM) by Fusion-IO which just 
makes a part of the virtual address space persistent.

So it's a question of where to put the intelligence. For current SSDs, it
seems the intelligence is really near the storage medium, and then IMHO it
makes sense to even reduce the intelligence on the Linux side.

[1] http://en.wikipedia.org/wiki/Write_amplification

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

Martin

2012-May-01 17:04 UTC

Re: btrfs support for efficient SSD operation (data blocks alignment)

Looking at this again from some time ago...

Brief summary:

There is a LOT of nefarious cleverness being attempted by SSD
manufacturers to accommodate a 4kByte block size. Get that wrong, or
just be unsympathetic to that 'cleverness', and you suffer performance
degradation and/or premature device wear.

Is that significant? Very likely it will be for the new three-bit FLASH
devices that have a PE (program-erase) lifespan of only 1000 or so
cycles per cell.
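
A back-of-the-envelope estimate shows why this matters. The sketch below assumes perfect wear levelling and uses the 60GByte capacity from the start of this thread purely as an example figure; the write amplification values are illustrative guesses, not measurements:

```python
def host_writes_before_wearout(capacity_bytes, pe_cycles, write_amplification):
    """Approximate bytes the host can write before every cell reaches its PE limit.

    Assumes perfect wear levelling and a constant write amplification factor.
    """
    return capacity_bytes * pe_cycles / write_amplification


GB = 10**9
TB = 10**12
for wa in (1.0, 4.0, 10.0):
    total = host_writes_before_wearout(60 * GB, 1000, wa)
    print(f"write amplification {wa}: about {total / TB:.0f} TB of host writes")
```

With 1000 cycles per cell, every doubling of the write amplification factor halves the amount of data the host can write over the life of the device.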

A better question is whether the filesystem can be easily made to be
more sympathetic to all SSDs?

From my investigating, there appears to be a sweet spot for performance
for writing (aligned) 16kByte blocks.

TRIM and keeping the device non-full also help greatly.

I suspect that consecutive writes, as is the case for HDDs, also help
performance to a lesser degree.

The erased state for SSDs appears to be either all 0xFF or all 0x00
(I've got examples of both). Can that be automatically detected and used
by btrfs so as to minimise write cycling the bits for (unused) padded areas?
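
The detection itself would be cheap. A minimal sketch, assuming a block read back from a trimmed or never-written region is available as bytes; whether what the drive returns for such a region reflects the physical cell state is itself an assumption, and the function name is illustrative:

```python
def erased_state(block: bytes):
    """Return 0xFF or 0x00 if every byte of block has that value, otherwise None."""
    if not block:
        return None
    for fill in (0xFF, 0x00):
        if block == bytes([fill]) * len(block):
            return fill
    return None


print(erased_state(b"\xff" * 4096))          # a block of all 0xFF bytes
print(erased_state(b"\x00" * 4096))          # a block of all 0x00 bytes
print(erased_state(b"\xff" * 4095 + b"\x01"))  # a block holding other data
```

A filesystem could then use the detected value as the padding byte for unused areas instead of always writing zeros.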

Are 16kByte blocks/sectors useful to btrfs?

Or rather, can btrfs usefully use 16kByte blocks?

Can that be supported?

Further detail...

Some good comments:

On 10/02/12 18:18, Martin Steigerwald wrote:
> Hi Martin,
> 
> On Wednesday, 8 February 2012, Martin wrote:
>> My understanding is that for x86 architecture systems, btrfs only
>> allows a sector size of 4kB for a HDD/SSD. That is fine for the
>> present HDDs assuming the partitions are aligned to a 4kB boundary for
>> that device.
>>
>> However for SSDs...
>>
>> I'm using for example a 60GByte SSD that has:
>>
>>     8kB page size;
>>     16kB logical to physical mapping chunk size;
>>     2MB erase block size;
>>     64MB cache.
>>
>> And the sector size reported to Linux 3.0 is the default 512 bytes!
>>
>>
>> My first thought is to try formatting with a sector size of 16kB to
>> align with the SSD logical mapping chunk size. This is to avoid SSD
>> write amplification. Also, the data transfer performance for that
>> device is near maximum for writes with a blocksize of 16kB and above.
>> Yet, btrfs supports a 4kByte page/sector size only at present...
> 
> The thing is, as far as I know, the better SSDs and even the dumber ones have
> quite some intelligence in the firmware. At least for me, it is not clear
> what all the firmware of my Intel SSD 320 does on its own, and whether any
> of my optimization attempts even matter.
[...]
> The article on write amplification on Wikipedia gives me a glimpse of the
> complexity involved [1]. Yes, I set stripe-width as well on my Ext4
> filesystem, but frankly I am not even sure whether this has any
> positive effect, except maybe sparing the SSD controller firmware some
> reshuffling work.
> 
> So from my current point of view most of what you wrote IMHO is more 
> important for really dumb flash. ...
[...]
> grade SSDs just provide a SATA interface and hide the internals. So an 
> optimization for one kind or one brand of SSDs may not be suitable for 
> another one.
> 
> There are PCI Express models, but these probably aren't dumb either. And
> then there is the idea of auto commit memory (ACM) by Fusion-IO which just 
> makes a part of the virtual address space persistent.
> 
> So it's a question of where to put the intelligence. For current SSDs, it
> seems the intelligence is really near the storage medium, and then IMHO it
> makes sense to even reduce the intelligence on the Linux side.
> 
> [1] http://en.wikipedia.org/wiki/Write_amplification

As an engineer, I have a deep mistrust of the phrase "Trust me" or of
"Magic" or "Proprietary, secret" or "Proprietary, keep out!".

Anand at Anandtech has produced some good articles on some of what goes
on inside SSDs and some of the consequences. If you want a good long read:

The SSD Relapse: Understanding and Choosing the Best SSD
http://www.anandtech.com/print/2829

Covers block allocation and write amplification and the effect of free
space on the write amplification factor.

... The Fastest MLC SSD We've Ever Tested
http://www.anandtech.com/print/2899

Details the Sandforce controller at that time and its use of data
compression on the controller. The latest Sandforce controllers also
utilise data deduplication on the SSD!

OCZ Agility 3 (240GB) Review
http://www.anandtech.com/print/4346

Shows an example set of Performance vs Transfer Size graphs.

Flashy fists fly as OCZ and DDRdrive row over SSD performance
http://www.theregister.co.uk/2011/01/14/ocz_and_ddrdrive_performance_row/

Shows an old and unfair comparison highlighting SSD performance
degradation due to write amplification for 4kByte random writes on a
full device.

A bit of a "Joker" in the pack are the SSDs that implement their own
controller-level data compression and data deduplication (all
proprietary and secret...). Of course, that is all useless for encrypted
filesystems... Also, what does the controller based data compression do
for aligning to the underlying device blocks?

What is apparent from all that lot is that 4kBytes is a bit of a
headache for SSDs. Perhaps we should all move to a more sympathetic
aligned 16kBytes or 32kBytes?

What's the latest state of play with btrfs for selecting a sector size
of say 16kBytes?

Regards,
Martin

Hubert Kario

2012-May-01 17:20 UTC

Re: btrfs support for efficient SSD operation (data blocks alignment)

On Tuesday 01 of May 2012 18:04:25 Martin wrote:
> Are 16kByte blocks/sectors useful to btrfs?
> 
> Or rather, can btrfs usefully use 16kByte blocks?

Yes, and they are already supported for the metadata leaf and node sizes
using the -l and -n flags:

mkfs.btrfs -l $((4*4096)) -n $((4*4096)) /dev/sda1

You can also set the sector size to 16kB, but this requires 16kB memory pages.
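
The sizes in that command are just multiples of the 4096-byte page. A small sketch that builds the same argument list without running anything; the helper name is made up for illustration, and only the -l and -n flags shown above are used:

```python
def mkfs_btrfs_args(device: str, multiple: int, page_size: int = 4096) -> list:
    """Build (but do not run) an mkfs.btrfs command line with equal leaf and node sizes."""
    size = multiple * page_size
    return ["mkfs.btrfs", "-l", str(size), "-n", str(size), device]


# Equivalent to: mkfs.btrfs -l $((4*4096)) -n $((4*4096)) /dev/sda1
print(" ".join(mkfs_btrfs_args("/dev/sda1", 4)))
```

Printing the command first makes it easy to double-check the device name before formatting anything.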

Regards,
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
