Marc MERLIN
2012-Jan-30 00:37 UTC
brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
Howdy,

I'm considering using btrfs for my new laptop install.

Encryption is however a requirement, and ecryptfs doesn't quite cut it for
me, so that leaves me with dmcrypt, which is what I've been using with
ext3/ext4 for years.

https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html
still states that
'dm-crypt block devices require write-caching to be turned off on the
underlying HDD'
While the report was for 2.6.33, I'll assume it's still true.

I was considering migrating to a recent 256GB SSD and a 3.2.x kernel.

First, I'd like to check whether the 'turn off write cache' comment is still
accurate, and whether it applies to SSDs too.

Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
and what the performance is like with the write cache turned off (I'm
actually not too sure what the impact is for SSDs, considering that writing
to flash can actually be slower than writing to a hard drive).

Thanks,
Marc
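For reference, the write cache being discussed here is the drive's own volatile cache; on SATA devices it can usually be inspected and toggled with hdparm. A minimal sketch, assuming the SSD shows up as /dev/sda:

  # show the current write-caching setting
  hdparm -W /dev/sda
  # turn the drive's write cache off (what the old Gotchas entry recommended)
  hdparm -W 0 /dev/sda
  # turn it back on later
  hdparm -W 1 /dev/sda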
Chris Mason
2012-Feb-01 17:56 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Sun, Jan 29, 2012 at 04:37:54PM -0800, Marc MERLIN wrote:
> Howdy,
>
> I'm considering using btrfs for my new laptop install.
>
> Encryption is however a requirement, and ecryptfs doesn't quite cut it for
> me, so that leaves me with dmcrypt, which is what I've been using with
> ext3/ext4 for years.
>
> https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html
> still states that
> 'dm-crypt block devices require write-caching to be turned off on the
> underlying HDD'
> While the report was for 2.6.33, I'll assume it's still true.
>
> I was considering migrating to a recent 256GB SSD and a 3.2.x kernel.
>
> First, I'd like to check whether the 'turn off write cache' comment is still
> accurate, and whether it applies to SSDs too.
>
> Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> and what the performance is like with the write cache turned off (I'm
> actually not too sure what the impact is for SSDs, considering that writing
> to flash can actually be slower than writing to a hard drive).

Performance without the cache on is going to vary wildly from one SSD to
another. Some really need it to give them nice fat writes, while others
do better on smaller writes. It's best to just test yours and see.

With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
are doing the right thing for barriers.

-chris
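A rough way to follow Chris's "test yours and see" advice is to time the same synchronous write with the drive's cache on and then off; a minimal sketch, assuming the SSD is /dev/sda and the filesystem is mounted at /mnt (a real comparison would use a proper benchmark tool and several runs):

  hdparm -W 1 /dev/sda
  dd if=/dev/zero of=/mnt/testfile bs=4M count=256 oflag=sync   # note the MB/s figure
  hdparm -W 0 /dev/sda
  dd if=/dev/zero of=/mnt/testfile bs=4M count=256 oflag=sync   # compare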
Marc MERLIN
2012-Feb-02 03:23 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> > and what the performance is like with the write cache turned off (I'm
> > actually not too sure what the impact is for SSDs, considering that writing
> > to flash can actually be slower than writing to a hard drive).
>
> Performance without the cache on is going to vary wildly from one SSD to
> another. Some really need it to give them nice fat writes, while others
> do better on smaller writes. It's best to just test yours and see.
>
> With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> are doing the right thing for barriers.

Thanks for the answer.
Can you confirm that I still must disable the write cache on the SSD to avoid
corruption with btrfs on top of dmcrypt, or is there a chance that it just
works now?

Thanks,
Marc
Chris Mason
2012-Feb-02 12:42 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Wed, Feb 01, 2012 at 07:23:45PM -0800, Marc MERLIN wrote:
> On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > > Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> > > and what the performance is like with the write cache turned off (I'm
> > > actually not too sure what the impact is for SSDs, considering that
> > > writing to flash can actually be slower than writing to a hard drive).
> >
> > Performance without the cache on is going to vary wildly from one SSD to
> > another. Some really need it to give them nice fat writes, while others
> > do better on smaller writes. It's best to just test yours and see.
> >
> > With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> > are doing the right thing for barriers.
>
> Thanks for the answer.
> Can you confirm that I still must disable the write cache on the SSD to avoid
> corruption with btrfs on top of dmcrypt, or is there a chance that it just
> works now?

No, with 3.2 or higher it is expected to work. dm-crypt is doing the
barriers correctly and as of 3.2 btrfs is sending them down correctly.

-chris
Marc MERLIN
2012-Feb-12 22:32 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Thu, Feb 02, 2012 at 07:27:22AM -0800, Marc MERLIN wrote:
> On Thu, Feb 02, 2012 at 07:42:41AM -0500, Chris Mason wrote:
> > On Wed, Feb 01, 2012 at 07:23:45PM -0800, Marc MERLIN wrote:
> > > On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > > > > Second, I was wondering if anyone is running btrfs over dmcrypt on
> > > > > an SSD and what the performance is like with the write cache turned
> > > > > off (I'm actually not too sure what the impact is for SSDs,
> > > > > considering that writing to flash can actually be slower than
> > > > > writing to a hard drive).
> > > >
> > > > Performance without the cache on is going to vary wildly from one SSD
> > > > to another. Some really need it to give them nice fat writes, while
> > > > others do better on smaller writes. It's best to just test yours and see.
> > > >
> > > > With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> > > > are doing the right thing for barriers.
> > >
> > > Thanks for the answer.
> > > Can you confirm that I still must disable the write cache on the SSD to
> > > avoid corruption with btrfs on top of dmcrypt, or is there a chance that
> > > it just works now?
> >
> > No, with 3.2 or higher it is expected to work. dm-crypt is doing the
> > barriers correctly and as of 3.2 btrfs is sending them down correctly.
>
> Thanks for confirming, I'll give this a shot.
> (no warranty implied of course :) ).

Actually I had one more question.

I read this page:
http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html

I'm not super clear if, with a 3.2.5 kernel, I need to pass the special
allow_discards option for btrfs and dm-crypt to be safe together, or whether
they now talk through an API and everything "just works" :)

Thanks,
Marc
Milan Broz
2012-Feb-12 23:47 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> Actually I had one more question.
>
> I read this page:
> http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
>
> I'm not super clear if, with a 3.2.5 kernel, I need to pass the special
> allow_discards option for btrfs and dm-crypt to be safe together, or whether
> they now talk through an API and everything "just works" :)

If you want discards to be supported in dmcrypt, you have to enable it
manually.

The most comfortable way is just to use a recent cryptsetup and add the
--allow-discards option to the luksOpen command.

It will never be enabled by default in dmcrypt, for security reasons:
http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html

Milan
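A minimal sketch of what Milan describes, assuming a LUKS device on /dev/sda2 that should be mapped to 'cryptroot' (the crypttab line is an illustration only, since its exact spelling varies between distributions):

  # pass discards/TRIM through dm-crypt for this mapping
  cryptsetup luksOpen --allow-discards /dev/sda2 cryptroot

  # or, on distributions whose /etc/crypttab supports a discard flag:
  # cryptroot  /dev/sda2  none  luks,discard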
Marc MERLIN
2012-Feb-13 00:14 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > Actually I had one more question.
> >
> > I read this page:
> > http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> >
> > I'm not super clear if, with a 3.2.5 kernel, I need to pass the special
> > allow_discards option for btrfs and dm-crypt to be safe together, or
> > whether they now talk through an API and everything "just works" :)
>
> If you want discards to be supported in dmcrypt, you have to enable it
> manually.
>
> The most comfortable way is just to use a recent cryptsetup and add the
> --allow-discards option to the luksOpen command.
>
> It will never be enabled by default in dmcrypt, for security reasons:
> http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html

Thanks for the answer.

I knew that it created some security problems, but I had not yet found the
page you just gave, which effectively states that TRIM isn't actually that
big a win on recent SSDs (I thought it was actually pretty important to use
it on them until now).

Considering that I have a fairly new Crucial 256GB SSD, I'm going to assume
that this bit applies to me:
"On the other side, TRIM is usually overrated. Drive itself should keep good
performance even without TRIM, either by using internal garbage collecting
process or by some area reserved for optimal writes handling."

So it sounds like I should just not give the "ssd" mount option to btrfs,
and not worry about TRIM.

My main concern was to make sure I wasn't risking corruption with btrfs on
top of dm-crypt, which is now true with 3.2.x, and I now understand that it
is true regardless of whether I use --allow-discards in cryptsetup, so I'm
just not going to use it, given the warning page you just posted.

Thanks for the answer.
Marc
Calvin Walton
2012-Feb-15 15:42 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Sun, 2012-02-12 at 16:14 -0800, Marc MERLIN wrote:
> Considering that I have a fairly new Crucial 256GB SSD, I'm going to assume
> that this bit applies to me:
> "On the other side, TRIM is usually overrated. Drive itself should keep good
> performance even without TRIM, either by using internal garbage collecting
> process or by some area reserved for optimal writes handling."
>
> So it sounds like I should just not give the "ssd" mount option to btrfs,
> and not worry about TRIM.

The 'ssd' option on btrfs is actually completely unrelated to TRIM support.
Instead, it changes how blocks are allocated on the device, taking advantage
of the improved random read/write speed. The 'ssd' option should be
autodetected on most SSDs, but I don't know if this is handled correctly when
you're using dm-crypt. (Btrfs prints a message at mount time when it
autodetects this.) It shouldn't hurt to leave it on.

Discard is handled with a separate mount option on btrfs (called 'discard'),
and is disabled by default even if you have the 'ssd' option enabled, because
of the negative performance impact it has had on some SSDs.

Calvin Walton <calvin.walton@kepstin.ca>
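To make the distinction concrete, a small sketch of the two separate mount options Calvin mentions, assuming the dm-crypt mapping is /dev/mapper/cryptroot:

  # 'ssd' only changes allocator behaviour (usually autodetected anyway)
  mount -o ssd /dev/mapper/cryptroot /mnt

  # online TRIM is a separate, off-by-default option
  mount -o ssd,discard /dev/mapper/cryptroot /mnt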
Marc MERLIN
2012-Feb-15 16:55 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Wed, Feb 15, 2012 at 10:42:43AM -0500, Calvin Walton wrote:
> On Sun, 2012-02-12 at 16:14 -0800, Marc MERLIN wrote:
> > Considering that I have a fairly new Crucial 256GB SSD, I'm going to assume
> > that this bit applies to me:
> > "On the other side, TRIM is usually overrated. Drive itself should keep
> > good performance even without TRIM, either by using internal garbage
> > collecting process or by some area reserved for optimal writes handling."
> >
> > So it sounds like I should just not give the "ssd" mount option to btrfs,
> > and not worry about TRIM.
>
> The 'ssd' option on btrfs is actually completely unrelated to TRIM support.
> Instead, it changes how blocks are allocated on the device, taking advantage
> of the improved random read/write speed. The 'ssd' option should be
> autodetected on most SSDs, but I don't know if this is handled correctly
> when you're using dm-crypt. (Btrfs prints a message at mount time when it
> autodetects this.) It shouldn't hurt to leave it on.

Yes, I found out more after I got my laptop back up (I had limited search
while I was rebuilding it). Thanks for clearing up my improper guess at the
time :)

The good news is that ssd mode is autodetected through dmcrypt:
[   23.130486] device label btrfs_pool1 devid 1 transid 732 /dev/mapper/cryptroot
[   23.130854] btrfs: disk space caching is enabled
[   23.175547] Btrfs detected SSD devices, enabling SSD mode

> Discard is handled with a separate mount option on btrfs (called 'discard'),
> and is disabled by default even if you have the 'ssd' option enabled,
> because of the negative performance impact it has had on some SSDs.

That's what I read up on later. It's a bit counter-intuitive, after all the
work that went into TRIM, to then figure out that there are actually more
reasons not to bother with it than to use it :)
On the plus side, it means SSDs are getting better and don't need special
code that makes data recovery harder should you ever need it.

I tried updating the wiki pages, because:
https://btrfs.wiki.kernel.org/articles/f/a/q/FAQ_1fe9.html
says nothing about
- trim/discard
- dmcrypt
while
https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html
still states
'btrfs volumes on top of dm-crypt block devices (and possibly LVM) require
write-caching to be turned off on the underlying HDD. Failing to do so, in
the event of a power failure, may result in corruption not yet handled by
btrfs code. (2.6.33)'

I'm happy to fix both pages, but the login link of course doesn't work, and
I'm not sure where the canonical copy to edit actually is or whether I can
get access. That said, if someone else can fix it too, that's great :)

Thanks,
Marc
Hugo Mills
2012-Feb-15 16:59 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Wed, Feb 15, 2012 at 08:55:40AM -0800, Marc MERLIN wrote:
> https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html
> still states
> 'btrfs volumes on top of dm-crypt block devices (and possibly LVM) require
> write-caching to be turned off on the underlying HDD. Failing to do so, in
> the event of a power failure, may result in corruption not yet handled by
> btrfs code. (2.6.33)'
>
> I'm happy to fix both pages, but the login link of course doesn't work, and
> I'm not sure where the canonical copy to edit actually is or whether I can
> get access.

btrfs.ipv5.de

It's a "temporary" location (6 months and counting) while kernel.org is
static-pages-only.

Hugo.
Chris Samuel
2012-Feb-16 06:33 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Monday 13 February 2012 11:14:00 Marc MERLIN wrote:
> I knew that it created some security problems, but I had not yet
> found the page you just gave, which effectively states that TRIM
> isn't actually that big a win on recent SSDs (I thought it was
> actually pretty important to use it on them until now).

In addition, TRIM has always been problematic for SATA SSDs, as it was
initially specified as a non-queueable command:

https://lwn.net/Articles/347511/

However, it appears that a queueable TRIM command has been introduced
for SATA 3.1:

http://techreport.com/discussions.x/21311

cheers!
Chris
Martin Steigerwald
2012-Feb-18 12:33 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Monday, 13 February 2012, Marc MERLIN wrote:
> On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> > On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > > Actually I had one more question.
> > >
> > > I read this page:
> > > http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> > >
> > > I'm not super clear if, with a 3.2.5 kernel, I need to pass the special
> > > allow_discards option for btrfs and dm-crypt to be safe together, or
> > > whether they now talk through an API and everything "just works" :)
> >
> > If you want discards to be supported in dmcrypt, you have to enable
> > it manually.
> >
> > The most comfortable way is just to use a recent cryptsetup and add the
> > --allow-discards option to the luksOpen command.
> >
> > It will never be enabled by default in dmcrypt, for security reasons:
> > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
>
> Thanks for the answer.
> I knew that it created some security problems, but I had not yet found
> the page you just gave, which effectively states that TRIM isn't
> actually that big a win on recent SSDs (I thought it was actually
> pretty important to use it on them until now).

Well, I find

"On the other side, TRIM is usually overrated. Drive itself should keep good
performance even without TRIM, either by using internal garbage collecting
process or by some area reserved for optimal writes handling."

a very, very weak statement on the matter, because it lacks any links to any
evidence for the claim being made. What I thought I knew up to now is that
the wear leveling of an SSD can be more effective the more space the SSD
controller/firmware can use freely for wear leveling. Thus, when I give
space back to the SSD via fstrim, it has more space for wear leveling, which
should lead to more evenly distributed write patterns and more evenly
distributed write accesses to the flash cells, and thus a longer lifetime.

I do not see any other way for the SSD to learn that data has been freed,
except for it being overwritten; then the old write location can of course
be freed. But BTRFS does COW, so if I understand this correctly, the SSD
wouldn't even be told when a file is overwritten, because it actually isn't:
the data is written to a new location. Thus, especially for BTRFS, I see
even more reason to use fstrim.

I have no scientific backing either, but at least I tried to think of an
explanation that appears logical to me, instead of just saying it is so
without providing any explanation at all. Yes, I dislike bold statements
without any backing at all. (If I overread something in the text, please
point me to it, but I did not see any explanation or link to support the
statement.)

I use ecryptfs, and I use fstrim occasionally with BTRFS and Ext4, but I do
not use online discard.

Thanks,
Martin
Martin Steigerwald
2012-Feb-18 12:39 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
Sorry, I forgot to keep the Ccs; different lists, different habits.

To Milan Broz: Well, now I noticed that you linked to your own blog entry.
Please do not take my statements below personally; I might have written them
a bit harshly. Actually, I do not really know whether your statement that
TRIM is overrated is correct, but before believing that TRIM does not give
much of an advantage, I would like to see at least some evidence of any
sort, because my explanation below of why it should make a difference at
least seems logical to me.

On Monday, 13 February 2012, Marc MERLIN wrote:
> On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> > On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > > Actually I had one more question.
> > >
> > > I read this page:
> > > http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> > >
> > > I'm not super clear if, with a 3.2.5 kernel, I need to pass the special
> > > allow_discards option for btrfs and dm-crypt to be safe together, or
> > > whether they now talk through an API and everything "just works" :)
> >
> > If you want discards to be supported in dmcrypt, you have to enable
> > it manually.
> >
> > The most comfortable way is just to use a recent cryptsetup and add the
> > --allow-discards option to the luksOpen command.
> >
> > It will never be enabled by default in dmcrypt, for security reasons:
> > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
>
> Thanks for the answer.
> I knew that it created some security problems, but I had not yet found
> the page you just gave, which effectively states that TRIM isn't
> actually that big a win on recent SSDs (I thought it was actually
> pretty important to use it on them until now).

Well, I find

"On the other side, TRIM is usually overrated. Drive itself should keep good
performance even without TRIM, either by using internal garbage collecting
process or by some area reserved for optimal writes handling."

a very, very weak statement on the matter, because it lacks any links to any
evidence for the claim being made. What I thought I knew up to now is that
the wear leveling of an SSD can be more effective the more space the SSD
controller/firmware can use freely for wear leveling. Thus, when I give
space back to the SSD via fstrim, it has more space for wear leveling, which
should lead to more evenly distributed write patterns and more evenly
distributed write accesses to the flash cells, and thus a longer lifetime.

I do not see any other way for the SSD to learn that data has been freed,
except for it being overwritten; then the old write location can of course
be freed. But BTRFS does COW, so if I understand this correctly, the SSD
wouldn't even be told when a file is overwritten, because it actually isn't:
the data is written to a new location. Thus, especially for BTRFS, I see
even more reason to use fstrim.

I have no scientific backing either, but at least I tried to think of an
explanation that appears logical to me, instead of just saying it is so
without providing any explanation at all. Yes, I dislike bold statements
without any backing at all. (If I overread something in the text, please
point me to it, but I did not see any explanation or link to support the
statement.)

I use ecryptfs, and I use fstrim occasionally with BTRFS and Ext4, but I do
not use online discard.

Thanks,
Martin
Martin Steigerwald
2012-Feb-18 12:49 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Saturday, 18 February 2012, Martin Steigerwald wrote:
[…]
> On Monday, 13 February 2012, Marc MERLIN wrote:
> > On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> > > On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > > > Actually I had one more question.
> > > >
> > > > I read this page:
> > > > http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> > > >
> > > > I'm not super clear if, with a 3.2.5 kernel, I need to pass the
> > > > special allow_discards option for btrfs and dm-crypt to be safe
> > > > together, or whether they now talk through an API and everything
> > > > "just works" :)
> > >
> > > If you want discards to be supported in dmcrypt, you have to enable
> > > it manually.
> > >
> > > The most comfortable way is just to use a recent cryptsetup and add the
> > > --allow-discards option to the luksOpen command.
> > >
> > > It will never be enabled by default in dmcrypt, for security reasons:
> > > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
> >
> > Thanks for the answer.
> > I knew that it created some security problems, but I had not yet found
> > the page you just gave, which effectively states that TRIM isn't
> > actually that big a win on recent SSDs (I thought it was actually
> > pretty important to use it on them until now).
>
> Well, I find
>
> "On the other side, TRIM is usually overrated. Drive itself should keep
> good performance even without TRIM, either by using internal garbage
> collecting process or by some area reserved for optimal writes handling."
>
> a very, very weak statement on the matter, because it lacks any links to
> any evidence for the claim being made. What I thought I knew up to now is
> that the wear leveling of an SSD can be more effective the more space the
> SSD controller/firmware can use freely for wear leveling. Thus, when I
> give space back to the SSD via fstrim, it has more space for wear
> leveling, which should lead to more evenly distributed write patterns and
> more evenly distributed write accesses to the flash cells, and thus a
> longer lifetime. I do not see any other way for the SSD to learn that data
> has been freed, except for it being overwritten; then the old write
> location can of course be freed. But BTRFS does COW, so if I understand
> this correctly, the SSD wouldn't even be told when a file is overwritten,
> because it actually isn't: the data is written to a new location. Thus,
> especially for BTRFS, I see even more reason to use fstrim.

Hmmm, even with COW, old data is overwritten eventually, because especially
on SSDs free space might be sparse. So even when overwriting a file, or some
part of it, writes to a new location, at some time in the future BTRFS will
likely overwrite the old one. Seen in that light, an occasional fstrim might
just help to make free space available sooner.

Heck, I find it so difficult to know what's best when SSDs / flash are
involved.

Thanks,
Martin
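A minimal sketch of the batched trimming Martin describes, assuming util-linux's fstrim is available and the filesystems to trim are mounted at / and /home; a weekly cron script is one way to keep it occasional:

  # trim free space once, verbosely
  fstrim -v /
  fstrim -v /home

  # e.g. as a hypothetical /etc/cron.weekly/fstrim script:
  #!/bin/sh
  fstrim /
  fstrim /home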
Marc MERLIN
2012-Feb-18 16:07 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Sat, Feb 18, 2012 at 01:39:24PM +0100, Martin Steigerwald wrote:
> To Milan Broz: Well, now I noticed that you linked to your own blog entry.

He did not; I'm the one who did.
I asked a bunch of questions since the online docs didn't address them for
me. Some of you answered those for me, I asked for access to the wiki, and I
updated the wiki to have the information you gave me.

While I have no inherent bias one way or another, obviously I did put some
of your opinions on the wiki :)

> Please do not take my statements below personally; I might have written
> them a bit harshly. Actually, I do not really know whether your statement
> that TRIM is overrated is correct, but before believing that TRIM does not
> give much of an advantage, I would like to see at least some evidence of
> any sort, because my explanation below of why it should make a difference
> at least seems logical to me.

That sounds like a reasonable request to me.

In the meantime I changed the page to:

'Does Btrfs support TRIM/discard?

"-o discard" is supported, but can have some negative consequences on
performance on some SSDs, or at least whether it adds worthwhile performance
is up for debate depending on who you ask, and it makes undeletion/recovery
near impossible while being a security problem if you use dm-crypt
underneath (see
http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ), therefore
it is not enabled by default. You are welcome to run your own benchmarks and
post them here, with the caveat that they'll be very SSD firmware specific.'

I'll leave further edits to others ;)

Marc
Clemens Eisserer
2012-Feb-19 00:53 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
Hi,> "-o discard" is supported, but can have some negative consequences on > performance on some SSDs or at least whether it adds worthwhile performance > is up for debate ....The problem is, that TRIM can''t be benchmarked in a traditional way - by running one run with TRIM and another one without and later comparing numbers. Sure, enabling TRIM causes your benchmarks to run a bit slower in the first place (as the TRIM commands need to be generated and interpreted by the SSD, which is not free) - however without TRIM the SSD won''t have any clue what data is "dead" and what needs to be rescued until you overwrite it. On my SandForce SSD I had disabled TRIM (because those drives are known for the good recovery without TRIM), and for now the SSD is in the following state: dd oflag=sync if=/some_large_file_in_page_cache of=testfile bs=4M 567558236 bytes (568 MB) copied, 23.7798 s, 23.9 MB/s 233 SandForce_Internal 0x0000 000 000 000 Old_age Offline - 3968 241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always - 1280 So that drive is writing at 24mb/s and has a write amplification of 3.1 - usually you see values between one and two. Completly "normal" desktop useage. - Clemens -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Justin Ossevoort
2012-Feb-22 10:28 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On 15/02/12 17:59, Hugo Mills wrote:
> It's a "temporary" location (6 months and counting) while
> kernel.org is static-pages-only.
>
> Hugo.

Couldn't btrfs.wiki.kernel.org be made a CNAME for btrfs.ipv5.de for the
time being, until kernel.org is up and running?
Or alternatively, all requests could be proxied/frame-forwarded/redirected
(in order of preference, to make sure everybody keeps using the original URL
as much as possible) to btrfs.ipv5.de?

Kernel.org is a complex site, and I can easily foresee them having to spend
months still fixing/hardening all of its internals and backends before
getting around to these "extra" project facilities.

Regards,
justin....
Hugo Mills
2012-Feb-22 11:07 UTC
Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
On Wed, Feb 22, 2012 at 11:28:13AM +0100, Justin Ossevoort wrote:
> On 15/02/12 17:59, Hugo Mills wrote:
> > It's a "temporary" location (6 months and counting) while
> > kernel.org is static-pages-only.
>
> Couldn't btrfs.wiki.kernel.org be made a CNAME for btrfs.ipv5.de for the
> time being, until kernel.org is up and running?
> Or alternatively, all requests could be proxied/frame-forwarded/redirected
> (in order of preference, to make sure everybody keeps using the original
> URL as much as possible) to btrfs.ipv5.de?

That would require additional effort on the part of kernel.org, which
probably won't be forthcoming. I don't really have any authority to ask
kernel.org people to do anything anyway -- it'll probably have to come from
Chris.

> Kernel.org is a complex site, and I can easily foresee them having to spend
> months still fixing/hardening all of its internals and backends before
> getting around to these "extra" project facilities.

Indeed. It's not clear whether we'll get the kernel.org wiki back in
writable form at all.

Hugo.
On Sat, Feb 18, 2012 at 08:07:02AM -0800, Marc MERLIN wrote:
> On Sat, Feb 18, 2012 at 01:39:24PM +0100, Martin Steigerwald wrote:
> > To Milan Broz: Well, now I noticed that you linked to your own blog entry.
>
> He did not; I'm the one who did.
> I asked a bunch of questions since the online docs didn't address them for
> me. Some of you answered those for me, I asked for access to the wiki, and
> I updated the wiki to have the information you gave me.
>
> While I have no inherent bias one way or another, obviously I did put some
> of your opinions on the wiki :)
>
> > Please do not take my statements below personally; I might have written
> > them a bit harshly. Actually, I do not really know whether your statement
> > that TRIM is overrated is correct, but before believing that TRIM does
> > not give much of an advantage, I would like to see at least some evidence
> > of any sort, because my explanation below of why it should make a
> > difference at least seems logical to me.
>
> That sounds like a reasonable request to me.
>
> In the meantime I changed the page to:
>
> 'Does Btrfs support TRIM/discard?
>
> "-o discard" is supported, but can have some negative consequences on
> performance on some SSDs, or at least whether it adds worthwhile
> performance is up for debate depending on who you ask, and it makes
> undeletion/recovery near impossible while being a security problem if you
> use dm-crypt underneath (see
> http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ), therefore
> it is not enabled by default. You are welcome to run your own benchmarks
> and post them here, with the caveat that they'll be very SSD firmware
> specific.'
>
> I'll leave further edits to others ;)

TL;DR:
I'm going to change the FAQ to say people should use TRIM with dmcrypt,
because not doing so definitely causes some lesser SSDs to suck, or possibly
even fail and lose your data.


Longer version:
Ok, so several months later I can report back with useful info.

Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
its garbage collection algorithm to fail (killing the drive and all its
data), and it was also causing btrfs to hang badly when I was getting within
10GB of the drive being full.

I reported some problems I had with btrfs being very slow and hanging when I
only had 10GB free, and I'm now convinced that it was the SSD that was at
fault.

On the Crucial RealSSD C300 256GB, and from talking to their tech support
and other folks who happened to have gotten that 'drive' at work and also
got weird unexplained failures, I'm convinced that even with its latest 007
firmware (the firmware it shipped with would just hang the system for a few
seconds every so often, so I did upgrade to 007 early on), the drive does
very poorly without TRIM when it's getting close to full.

In my case, I hit a bug that is likely in the firmware and caused the drive
to die (not show up on the bus anymore). I went through special reset,
power-on-but-don't-talk-to-it procedures that eventually brought the drive
back after 4 hours of trying, except the drive is now 'blank': all the
partitions and data are gone as far as the controller is concerned.
(yes, I had backups, thanks for asking :) ).

From talking to their techs and other folks, it seems clear that TRIM is
greatly encouraged, and I'm pretty sure that had I used TRIM, I would not
have hit the problems that caused my drive to fail and suck so much when it
was getting full.

Now, I still think that the Crucial RealSSD C300 256GB has poor firmware
compared to some of its competition and there is no excuse for it just dying
like that, but I'm going to change the FAQ to recommend that people do use
TRIM if they can, even if they use dm-crypt (assuming their kernel is at
least 3.2.0).

Any objections and/or comments?

Thanks,
Marc
Fajar A. Nugraha
2012-Jul-18 20:04 UTC
Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
On Thu, Jul 19, 2012 at 1:13 AM, Marc MERLIN <marc@merlins.org> wrote:
> TL;DR:
> I'm going to change the FAQ to say people should use TRIM with dmcrypt,
> because not doing so definitely causes some lesser SSDs to suck, or
> possibly even fail and lose your data.
>
>
> Longer version:
> Ok, so several months later I can report back with useful info.
>
> Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
> its garbage collection algorithm to fail (killing the drive and all its
> data), and it was also causing btrfs to hang badly when I was getting
> within 10GB of the drive being full.
>
> I reported some problems I had with btrfs being very slow and hanging when
> I only had 10GB free, and I'm now convinced that it was the SSD that was at
> fault.
>
> On the Crucial RealSSD C300 256GB, and from talking to their tech support
> and other folks who happened to have gotten that 'drive' at work and also
> got weird unexplained failures, I'm convinced that even with its latest 007
> firmware (the firmware it shipped with would just hang the system for a few
> seconds every so often, so I did upgrade to 007 early on), the drive does
> very poorly without TRIM when it's getting close to full.

If you're going to edit the wiki, I'd suggest you say "SOME SSDs might need
to use TRIM with dmcrypt". That's because some SSD controllers (e.g.
SandForce) perform just fine without TRIM, and in my case TRIM made
performance worse.

Fajar
On Thu, Jul 19, 2012 at 03:04:29AM +0700, Fajar A. Nugraha wrote:
> > On the Crucial RealSSD C300 256GB, and from talking to their tech support
> > and other folks who happened to have gotten that 'drive' at work and also
> > got weird unexplained failures, I'm convinced that even with its latest
> > 007 firmware (the firmware it shipped with would just hang the system for
> > a few seconds every so often, so I did upgrade to 007 early on), the
> > drive does very poorly without TRIM when it's getting close to full.
>
> If you're going to edit the wiki, I'd suggest you say "SOME SSDs might
> need to use TRIM with dmcrypt". That's because some SSD controllers
> (e.g. SandForce) perform just fine without TRIM, and in my case TRIM
> made performance worse.

Right. I'm definitely planning on writing that it is controller dependent.

By the way, I read that SandForce performs worse with dmcrypt because:
http://superuser.com/questions/250626/sandforce-ssd-encryption-security-and-support
'Using software full-disk-encryption on an SSD seriously impacts the
performance of the drive - up to a point where it can be slower than a
regular hard disk.'
http://c089.wordpress.com/2011/02/28/encrypting-solid-state-disks/
http://www.madshrimps.be/articles/article/965#axzz1FGfzD15q

So, really, it seems that one should not buy a drive with a SandForce
controller if you plan to use dmcrypt. I haven't found clear data to say
which controllers other than SandForce have the same problem.

https://docs.google.com/a/google.com/viewer?a=v&q=cache:eyoe3gL10qcJ:www.samsung.com/global/business/semiconductor/file/product/ssd/SamsungSSD_Encryption_Benchmarks_201011.pdf+&hl=en&gl=us&pid=bl&srcid=ADGEEShHEj1jkxdwa7JREmubW6iAyc2RGqQt8MoGxMgQrRNDifoGqTYWjYdaUypnaRtkjKrjiOt1JCr4dDt4ycD2rjWO51PtAo67JvnZGe6Gx1s-9yjkRVYZWsUHZApkjm7dWVzkdQg6&sig=AHIEtbRaViqG4cYC_RYajAW04JRjNhQq6g
seems to imply that software encryption on Samsung drives (which is what I
just got as a replacement for my Crucial) does not have the same penalty as
you get with the SandForce controller.

But back to the point about "yes, it depends".

Marc
Clemens Eisserer
2012-Jul-18 21:34 UTC
Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
Hi,

> Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
> its garbage collection algorithm to fail (killing the drive and all its
> data), and it was also causing btrfs to hang badly when I was getting
> within 10GB of the drive being full.

Funny thing, as without TRIM the drive does not even know how "full" it is ;)

- Clemens
On Wed, Jul 18, 2012 at 11:34:32PM +0200, Clemens Eisserer wrote:
> Hi,
>
> > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
> > its garbage collection algorithm to fail (killing the drive and all its
> > data), and it was also causing btrfs to hang badly when I was getting
> > within 10GB of the drive being full.
>
> Funny thing, as without TRIM the drive does not even know how "full" it is ;)

That's very true. To be honest, I don't fully know what happened to that
drive, but I actually heard the following from Crucial tech support twice:
"without TRIM, the drive needs to do garbage collection (correct), and it
only happens when the drive has been fully idle for about 15mn (oh boy,
why?). We recommend you let your drive sit at the BIOS, or just with SATA
power, from time to time so that it can do garbage collection."

Then, there is also this gem:
http://forums.crucial.com/t5/Solid-State-Drives-SSD/Bios-is-not-recognize-crucail-m4-SSD-64GB/td-p/63835
If the drive cannot garbage collect, it will shut off and appear dead. When
that happens, it needs to be powered on for 20mn with no data, turned off,
and on again for 20mn, and off. Then it's supposed to come back and be
usable again.
(well, mine not so much. It came back eventually, but empty.)

This is not a btrfs problem per se, but since we had a discussion on TRIM
and SSDs and what to do with dm-crypt, I thought I'd report back, since it's
actually not a simple situation at all, and very SSD dependent after all.

Marc
Martin Steigerwald
2012-Jul-18 21:49 UTC
Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
On Wednesday, 18 July 2012, Marc MERLIN wrote:
> On Sat, Feb 18, 2012 at 08:07:02AM -0800, Marc MERLIN wrote:
> > On Sat, Feb 18, 2012 at 01:39:24PM +0100, Martin Steigerwald wrote:
> > > To Milan Broz: Well, now I noticed that you linked to your own blog
> > > entry.
> >
> > He did not; I'm the one who did.
> > I asked a bunch of questions since the online docs didn't address them
> > for me. Some of you answered those for me, I asked for access to the
> > wiki, and I updated the wiki to have the information you gave me.
> >
> > While I have no inherent bias one way or another, obviously I did put
> > some of your opinions on the wiki :)
> >
> > > Please do not take my statements below personally; I might have
> > > written them a bit harshly. Actually, I do not really know whether
> > > your statement that TRIM is overrated is correct, but before believing
> > > that TRIM does not give much of an advantage, I would like to see at
> > > least some evidence of any sort, because my explanation below of why
> > > it should make a difference at least seems logical to me.
> >
> > That sounds like a reasonable request to me.
> >
> > In the meantime I changed the page to:
> >
> > 'Does Btrfs support TRIM/discard?
> >
> > "-o discard" is supported, but can have some negative consequences on
> > performance on some SSDs, or at least whether it adds worthwhile
> > performance is up for debate depending on who you ask, and it makes
> > undeletion/recovery near impossible while being a security problem if
> > you use dm-crypt underneath (see
> > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ),
> > therefore it is not enabled by default. You are welcome to run your own
> > benchmarks and post them here, with the caveat that they'll be very SSD
> > firmware specific.'
> >
> > I'll leave further edits to others ;)
>
> TL;DR:
> I'm going to change the FAQ to say people should use TRIM with dmcrypt,
> because not doing so definitely causes some lesser SSDs to suck, or
> possibly even fail and lose your data.

I can't comment on dm-crypt since I do not use it.

I am still not convinced that dm-crypt is the best way to go about
encryption, especially for SSDs, but it's more of a gut feeling than
anything that I can explain easily.

I use ecryptfs, formerly encfs, but encfs is much slower. The advantage for
me is that I can back up my data without decrypting and re-encrypting it,
just with rsync, as long as I exclude the unencrypted view of the data. I
also always thought that not being block-device based would be an advantage
too, but I am not so sure about that anymore. The disadvantage would be some
information leakage, like the number of files, the directory layout, and the
approximate sizes of files. And the need to handle the swap partition
separately, which I do not even do right now.

> Longer version:
> Ok, so several months later I can report back with useful info.
>
> Not using TRIM on my Crucial RealSSD C300 256GB is most likely what
> caused its garbage collection algorithm to fail (killing the drive and
> all its data), and it was also causing btrfs to hang badly when I was
> getting within 10GB of the drive being full.

How did you know that it was its garbage collection algorithm?

> I reported some problems I had with btrfs being very slow and hanging
> when I only had 10GB free, and I'm now convinced that it was the SSD
> that was at fault.
>
> On the Crucial RealSSD C300 256GB, and from talking to their tech
> support and other folks who happened to have gotten that 'drive' at
> work and also got weird unexplained failures, I'm convinced that even
> with its latest 007 firmware (the firmware it shipped with would just
> hang the system for a few seconds every so often, so I did upgrade to
> 007 early on), the drive does very poorly without TRIM when it's getting
> close to full.
> In my case, I hit a bug that is likely in the firmware and caused the
> drive to die (not show up on the bus anymore). I went through special
> reset, power-on-but-don't-talk-to-it procedures that eventually brought
> the drive back after 4 hours of trying, except the drive is now 'blank':
> all the partitions and data are gone as far as the controller is
> concerned.
> (yes, I had backups, thanks for asking :) ).
>
> From talking to their techs and other folks, it seems clear that TRIM
> is greatly encouraged, and I'm pretty sure that had I used TRIM, I
> would not have hit the problems that caused my drive to fail and suck
> so much when it was getting full.

I still think that telling the SSD about free space helps it to balance its
wear leveling process. Say I have a 10 GB file that I never ever touch
again. The SSD firmware will likely begin to move it around to blocks with
more write accesses, in order to use the blocks that have already been
written to for further writes. For that wear leveling it is beneficial if
the SSD has free blocks to move stuff to; the more, the better, it seems to
me. Because if it only has little free space, it possibly has to do more
moving around to distribute the write accesses equally.

> Now, I still think that the Crucial RealSSD C300 256GB has poor
> firmware compared to some of its competition and there is no excuse
> for it just dying like that, but I'm going to change the FAQ to
> recommend that people do use TRIM if they can, even if they use
> dm-crypt (assuming their kernel is at least 3.2.0).
>
> Any objections and/or comments?

I still only use fstrim from time to time. About once a week, or after lots
of drive churning or removing lots of data. I also have a logical volume of
about 20 GiB that I keep free most of the time. And the other filesystems
are quite full, but there is also some little free space of about 20-30 GiB
together. So it should be about 40-50 GiB free most of the time.

The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after about 1
year and 2-3 months. I do not see any performance degradation whatsoever so
far. Last time I looked, the SMART data also looked fine, but I do not have
much experience with SMART on SSDs so far.

Regards,
Martin
On Wed, Jul 18, 2012 at 11:49:36PM +0200, Martin Steigerwald wrote:
> I am still not convinced that dm-crypt is the best way to go about
> encryption, especially for SSDs, but it's more of a gut feeling than
> anything that I can explain easily.

I agree that dmcrypt is not great, and it even makes some SSDs slower than
hard drives, as per some reports I just posted in another mail.

But:
> I use ecryptfs, formerly encfs, but encfs is much slower. The advantage

ecryptfs is:
1) painfully slow compared to dmcrypt in my tests. As in so slow that I
don't even need to benchmark it.
2) unable to encrypt very long filenames, so when I copy my archive onto an
ecryptfs volume, some files won't copy unless I rename them.

I would love for ecryptfs to have the performance of dmcrypt, because it'd
be easier for me to use it, but it didn't even come close.

> > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what
> > caused its garbage collection algorithm to fail (killing the drive and
> > all its data), and it was also causing btrfs to hang badly when I was
> > getting within 10GB of the drive being full.
>
> How did you know that it was its garbage collection algorithm?

I don't, hence "most likely". The failure of the drive I got was likely
garbage collection related, from what I got from the techs I talked to.

> > From talking to their techs and other folks, it seems clear that TRIM
> > is greatly encouraged, and I'm pretty sure that had I used TRIM, I
> > would not have hit the problems that caused my drive to fail and suck
> > so much when it was getting full.
>
> I still think that telling the SSD about free space helps it to balance
> its wear leveling process.

Yes.

> > Any objections and/or comments?
>
> I still only use fstrim from time to time. About once a week, or after
> lots of drive churning or removing lots of data. I also have a logical
> volume of about 20 GiB that I keep free most of the time. And the other
> filesystems are quite full, but there is also some little free space of
> about 20-30 GiB together. So it should be about 40-50 GiB free most of
> the time.

I'm curious. If your filesystem supports TRIM (i.e. ext4 and btrfs), is
there ever a reason to turn off trim in the FS and use fstrim instead?

> The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after about
> 1 year and 2-3 months. I do not see any performance degradation whatsoever
> so far. Last time I looked, the SMART data also looked fine, but I do not
> have much experience with SMART on SSDs so far.

My experience, and what I read online, is that SMART on SSDs doesn't seem to
help much in many cases. I've seen too many reports of SSDs dying very
suddenly with absolutely no warning. Hard drives, if you look at SMART data
over time, typically give you plenty of warning before they die (as long as
you don't drop them off a table without parking their heads).

If you're curious, here's the last dump of my SMART data on the SSD that died:

=== START OF INFORMATION SECTION ===
Model Family:     Crucial RealSSD C300
Device Model:     C300-CTFDDAC256MAG
Serial Number:    00000000111003044A9C
LU WWN Device Id: 5 00a075 103044a9c
Firmware Version: 0007
User Capacity:    256,060,514,304 bytes [256 GB]

  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2977
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       472
170 Grown_Failing_Block_Ct  0x0033   100   100   000    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Wear_Levelling_Count    0x0033   100   100   000    Pre-fail  Always       -       35
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       1
181 Non4k_Aligned_Access    0x0022   100   100   000    Old_age   Always       -       2757376737803
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   000    Old_age   Always       -       422
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   100   100   000    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       2958         -
# 2  Extended offline    Completed without error       00%       2937         -
# 3  Short offline       Completed without error       00%       2936         -
# 4  Short offline       Completed without error       00%       2915         -
# 5  Short offline       Completed without error       00%       2892         -
# 6  Short offline       Completed without error       00%       2881         -
# 7  Short offline       Completed without error       00%       2859         -
# 8  Vendor (0xff)       Completed without error       00%       2852         -

Marc
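A dump like the one above can be produced with smartctl from the smartmontools package; a minimal sketch, assuming the SSD is /dev/sda:

  # identity, attributes, error log and self-test history in one go
  smartctl -a /dev/sda
  # kick off a short self-test like the ones listed above
  smartctl -t short /dev/sda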
Martin Steigerwald
2012-Jul-19 10:40 UTC
Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
Am Donnerstag, 19. Juli 2012 schrieb Marc MERLIN:> On Wed, Jul 18, 2012 at 11:49:36PM +0200, Martin Steigerwald wrote: > > I am still not convinced that dm-crypt is the best way to go about > > encryption especially for SSDs. But its more of a gut feeling than > > anything that I can explain easily. > > I agree that dmcrypt is not great, and it even makes some SSDs slower > than hard drives as per some reports I just posted in another mail. > > But: > > I use ecryptfs, formerly encfs, but encfs is much slower. The > > advantage > > ecryptfs is: > 1) painfully slow compared to dmcrypt in my tests. As in so slow that I > don''t even need to benchmark it.Huh! Its an order of magnitude faster than encfs here – well no wonder considering it uses FUSE – and its definately fast enough for my work account on this machine.> 2) unable to encrypt very long filenames, so when I copy my archive on > an ecryptfs volume, some files won''t copy unless I rename them.Hmmm, okay, thats an issue. I do not seem to have that long names in that directory tough.> I would love for ecryptfs to have the performance of dmcrypt, because > it''d be easier for me to use it, but it didn''t even come close.I never benchmarked it except for the time it needs to build my training slides. It was about twice as fast as encfs and it was fast enough for my case. Might be that dm-crypt is even faster, but then I do not see any visible performance difference between my unencrypted private account and the encrypted work account. Might still be that there is a difference, its actually likely, but I did not noticed it so far. I difference is: My /home is still on Ext4 here. Only / is on BTRFS. So maybe ecryptfs is only that slow on BTRFS?> > > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what > > > caused its garbage collection algorithm to fail (killing the drive > > > and all its data), and it was also causing BRTFS to hang badly > > > when I was getting within 10GB of the drive getting full. > > > > How did you know that it was its garbage collection algorithm? > > I don''t, hence "most likely". The failure of the drive I got was likely > garbage collection related from what I got from the techs I talked to.Ah okay. I wouldn´t now it either. Its almost a black box for me. Like my car. If it has something I drive it to the dealer´s garage and be done with it. As long as the SSD works. I try to treat it somewhat gently. […]> > > Any objections and/or comments? > > > > I still only use fstrim from time to time. About once a week or after > > lots of drive churning or removing lots of data. I also have a > > logical volume of about 20 GiB that I keep free for most of the > > time. And other filesystem are quite full, but there is also some > > little free space of about 20-30 GiB together. So it should be about > > 40-50 GiB free most of the time. > > I''m curious. If your filesystem supports trim (i.e. ext4 and btrfs), is > there every a reason to turn off trim in the FS and use fstrim instead?Frankly, I do not know exactly. AFAIR It has been reported here and elsewhere that constant trimming will yield a performance penalty – maybe thats why your ecryptfs was so slow? some stacking effect combined with constant trimming – and might even harm some cheaper SSDs. Thus I do batched trimming. I am not sure whether some filesystem have been changed to some intermediate with the "discard" mount option as well. AFAIR XFS has been enhanced to provide performant trimming in batches with "discard" mount option. 
Do not know about BTRFS so far. Regarding SSDs much seems like guess work.> > The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after > > about 1 year and 2-3 months. I do not see any performance > > degradation whatsover so far. Last time I looked also SMART data > > looked fine, but I have not much experience with SMART on SSDs so > > far. > > My experience and what I read online is that SMART on SSDs doesn''t seem > to help much in many cases. I''ve seen too many reports of SSDs dying > very suddenly with absolutely no warning. > Hard drives, if you look at smart data over time, typically give you > plenty of warning before they die (as long as you don''t drop them off > a table without parking their heads).Hmmm, so as always its good to have backups. Plan to refresh my backup in the next days. ;)> If you''re curious, here''s the last dump of my SMART data on the SSD > that died:Thanks. Interesting. I do not see anything obvious that would tell me that it might fail. But then there is no guarentee for harddisks either. I got: merkaba:~> smartctl -a /dev/sda smartctl 5.43 2012-06-05 r3561 [x86_64-linux-3.5.0-rc7-tp520-toi-3.3-timekeeping+] (local build) Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION ==Model Family: Intel 320 Series SSDs Device Model: INTEL SSDSA2CW300G3 Serial Number: […] LU WWN Device Id: […] Firmware Version: 4PC10362 User Capacity: 300.069.052.416 bytes [300 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Thu Jul 19 12:28:43 2012 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION ==SMART overall-health self-assessment test result: PASSED […] SMART Attributes Data Structure revision number: 5 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 3 Spin_Up_Time 0x0020 100 100 000 Old_age Offline - 0 4 Start_Stop_Count 0x0030 100 100 000 Old_age Offline - 0 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2443 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1263 170 Reserve_Block_Count 0x0033 100 100 010 Pre-fail Always - 0 171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0 172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 169 183 Runtime_Bad_Block 0x0030 100 100 000 Old_age Offline - 0 184 End-to-End_Error 0x0032 100 100 090 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 120 199 UDMA_CRC_Error_Count 0x0030 100 100 000 Old_age Offline - 0 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 127984 226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 314 227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 61 228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 146593 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 127984 242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 201548 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Vendor (0xe0) Completed 
without error 00% 2443 - # 2 Vendor (0xd8) Completed without error 00% 271 - # 3 Vendor (0xd8) Completed without error 00% 271 - # 4 Vendor (0xa8) Completed without error 00% 324 - 0xd8 is a long test, 0xa8 is a short test. Don´t know about 0xe0. Seems that smartctl does not know this test numbers yet. Except for that unsafe shutdown count I do not see anything interesting in there. Oh, that Erase fail count – so some cells already got broken? Lets see how that was initially: martin@merkaba:~/Computer/Merkaba> diff -u smartctl-a-2011-05-19-nach-secure-erase.txt smartctl-a-2012-07-19.txt --- smartctl-a-2011-05-19-nach-secure-erase.txt 2011-05-19 16:20:50.000000000 +0200 +++ smartctl-a-2011-07-19.txt 2012-07-19 12:34:22.512228427 +0200 @@ -1,15 +1,18 @@ -smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build) -Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net +smartctl 5.43 2012-06-05 r3561 [x86_64-linux-3.5.0-rc7-tp520-toi-3.3-timekeeping+] (local build) +Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION ==+Model Family: Intel 320 Series SSDs Device Model: INTEL SSDSA2CW300G3 Serial Number: […] -Firmware Version: 4PC10302 -User Capacity: 300,069,052,416 bytes -Device is: Not in smartctl database [for details use: -P showall] +LU WWN Device Id: […] +Firmware Version: 4PC10362 +User Capacity: 300.069.052.416 bytes [300 GB] +Sector Size: 512 bytes logical/physical +Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 -Local Time is: Thu May 19 16:20:49 2011 CEST +Local Time is: Thu Jul 19 12:34:22 2012 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled […] @@ -56,29 +59,34 @@ 3 Spin_Up_Time 0x0020 100 100 000 Old_age Offline - 0 4 Start_Stop_Count 0x0030 100 100 000 Old_age Offline - 0 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0 - 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1 - 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4 -170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0 -171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0 -172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0 -184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0 + 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2443 + 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1263 +170 Reserve_Block_Count 0x0033 100 100 010 Pre-fail Always - 0 +171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0 +172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 169 +183 Runtime_Bad_Block 0x0030 100 100 000 Old_age Offline - 0 +184 End-to-End_Error 0x0032 100 100 090 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 -192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1 -225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 995 -226 Load-in_Time 0x0032 100 100 000 Old_age Always - 2203323 -227 Torq-amp_Count 0x0032 100 100 000 Old_age Always - 49 -228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 12587069 +192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 120 +199 UDMA_CRC_Error_Count 0x0030 100 100 000 Old_age Offline - 0 +225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 127984 +226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 314 +227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 61 +228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 146593 
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 -241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 995 -242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 466 +241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 127984 +242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 201549 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 -No self-tests have been logged. [To run self-tests, use: smartctl -t] - +Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error +# 1 Vendor (0xe0) Completed without error 00% 2443 - +# 2 Vendor (0xd8) Completed without error 00% 271 - +# 3 Vendor (0xd8) Completed without error 00% 271 - +# 4 Vendor (0xa8) Completed without error 00% 324 - Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run SMART Selective self-test log data structure revision number 0 Jup, erase fail count has raised. Whatever that means. I think I will keep an eye on it. Ciao, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
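For anyone who wants to copy this batched-trim approach instead of mounting with discard, a weekly cron job is enough. A minimal sketch; the script path and mount points are assumptions, and fstrim needs a kernel and filesystem that support FITRIM:

#!/bin/sh
# /etc/cron.weekly/fstrim  (make the script executable)
# batched TRIM instead of the "discard" mount option; trim every
# filesystem that lives on the SSD
fstrim -v /
fstrim -v /home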
Marc MERLIN
2012-Jul-22 18:58 UTC
Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
I''m still getting a bit more data before updating the btrfs wiki with my best recommendations for today. First, everything I''ve read so far says that the ssd btrfs mount option makes btrfs slower in benchmarks. What gives? Anyone using it or know of a reason not to mount my ssd with nossd? Next, I got a new Samsumg 830 512GB SSD which is supposed to be very high performance. The raw device seems fast enough on a quick hdparm test: But once I encrypt it, it drops to 5 times slower than my 1TB spinning disk in the same laptop: gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt /dev/mapper/ssdcrypt: Timing cached reads: 15412 MB in 2.00 seconds = 7715.37 MB/sec Timing buffered disk reads: 70 MB in 3.06 seconds = 22.91 MB/sec <<<< gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot (spinning disk) /dev/mapper/cryptroot: Timing cached reads: 16222 MB in 2.00 seconds = 8121.03 MB/sec Timing buffered disk reads: 308 MB in 3.01 seconds = 102.24 MB/sec <<<< The non encrypted SSD device gives me: /dev/sda4: Timing cached reads: 14258 MB in 2.00 seconds = 7136.70 MB/sec Timing buffered disk reads: 1392 MB in 3.00 seconds = 463.45 MB/sec which is 4x faster than my non encrypted spinning disk, as expected. I used aes-xts-plain as recommended on http://www.mayrhofer.eu.org/ssd-linux-benchmark gandalfthegreat:~# cryptsetup status /dev/mapper/ssdcrypt /dev/mapper/ssdcrypt is active. type: LUKS1 cipher: aes-xts-plain keysize: 256 bits device: /dev/sda4 offset: 4096 sectors size: 926308752 sectors mode: read/write gandalfthegreat:~# lsmod |grep -e aes aesni_intel 50443 66 cryptd 14517 18 ghash_clmulni_intel,aesni_intel aes_x86_64 16796 1 aesni_intel I realize this is not btrfs'' fault :) although I''m a bit stuck as to how to benchmark btrfs if the underlying encrypted device is so slow :) Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
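hdparm -tT only reads for two or three seconds, so before drawing conclusions it is worth repeating a longer raw read with the page cache dropped each time. A minimal sketch reusing the device names above, run as root; numbers that stay consistently low across runs point at the device or dm-crypt layer rather than at measurement noise:

# repeated 1 GiB sequential read from the dm-crypt mapping on the SSD
for i in 1 2 3; do
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/mapper/ssdcrypt of=/dev/null bs=1M count=1024
done

# same thing against the unencrypted partition, for comparison
for i in 1 2 3; do
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/sda4 of=/dev/null bs=1M count=1024
done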
Martin Steigerwald
2012-Jul-22 19:35 UTC
Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
Hi Marc, Am Sonntag, 22. Juli 2012 schrieb Marc MERLIN:> I''m still getting a bit more data before updating the btrfs wiki with > my best recommendations for today. > > First, everything I''ve read so far says that the ssd btrfs mount option > makes btrfs slower in benchmarks. > What gives? > Anyone using it or know of a reason not to mount my ssd with nossd? > > > Next, I got a new Samsumg 830 512GB SSD which is supposed to be very > high performance. > The raw device seems fast enough on a quick hdparm test: > > > But once I encrypt it, it drops to 5 times slower than my 1TB spinning > disk in the same laptop: > gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt > /dev/mapper/ssdcrypt: > Timing cached reads: 15412 MB in 2.00 seconds = 7715.37 MB/sec > Timing buffered disk reads: 70 MB in 3.06 seconds = 22.91 MB/sec > <<<< > > gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot (spinning disk) > /dev/mapper/cryptroot: > Timing cached reads: 16222 MB in 2.00 seconds = 8121.03 MB/sec > Timing buffered disk reads: 308 MB in 3.01 seconds = 102.24 MB/sec > <<<<Have you looked whether certain kernel threads are eating CPU? I would have a look at this. Or use atop to have a complete system overview during the hdparm run. You may want to use its default 10 seconds delay. Anyway, hdparm is only a very rough measurement. (Test time 2 / 3 seconds is really short.) Did you repeat tests three or five times and looked at the deviation? For what it is worth I can beat that with ecryptfs on top of Ext4 ontop of an Intel SSD 320 (SATA 300 based): martin@merkaba:~> su - ms Passwort: ms@merkaba:~> df -hT . Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf /home/.ms ecryptfs 224G 211G 11G 96% /home/ms ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync 1000+0 Datensätze ein 1000+0 Datensätze aus 1048576000 Bytes (1,0 GB) kopiert, 20,1466 s, 52,0 MB/s ms@merkaba:~> rm testfile ms@merkaba:~> sudo fstrim /home [sudo] password for ms: ms@merkaba:~> Thats way slower than a dd without encryption, but its way faster than your hdparm figures. The SSD was underutilized according to the harddisk LED of this ThinkPad T520 with Intel i5 Sandybridge 2.5 GHz dualcore. Did start atop to late to see whats going on. (I did not yet test ecryptfs on top of BTRFS, but you didn´t test a filesystem with hdparm anyway.) Thanks, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
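A low-tech way to answer the kernel-thread question above is to capture per-process CPU use while the slow read is running. A minimal sketch, assuming nothing fancier than top is installed and reusing the device name from earlier in the thread:

# terminal 1: the read being investigated
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/mapper/ssdcrypt of=/dev/null bs=1M count=1024

# terminal 2: ten one-second snapshots of per-process CPU use
top -b -d 1 -n 10 > /tmp/top-during-dd.log
# afterwards, look near the top of the %CPU column for kernel threads
# (dm-crypt workers, btrfs workers) that might be the bottleneck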
Martin Steigerwald
2012-Jul-22 19:43 UTC
Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
Am Sonntag, 22. Juli 2012 schrieb Martin Steigerwald:> Hi Marc, > > Am Sonntag, 22. Juli 2012 schrieb Marc MERLIN: > > I''m still getting a bit more data before updating the btrfs wiki with > > my best recommendations for today. > > > > First, everything I''ve read so far says that the ssd btrfs mount > > option makes btrfs slower in benchmarks. > > What gives? > > Anyone using it or know of a reason not to mount my ssd with nossd? > > > > > > Next, I got a new Samsumg 830 512GB SSD which is supposed to be very > > high performance. > > The raw device seems fast enough on a quick hdparm test: > > > > > > But once I encrypt it, it drops to 5 times slower than my 1TB > > spinning disk in the same laptop: > > gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt > > > > /dev/mapper/ssdcrypt: > > Timing cached reads: 15412 MB in 2.00 seconds = 7715.37 MB/sec > > Timing buffered disk reads: 70 MB in 3.06 seconds = 22.91 MB/sec > > > > <<<< > > > > gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot (spinning disk) > > > > /dev/mapper/cryptroot: > > Timing cached reads: 16222 MB in 2.00 seconds = 8121.03 MB/sec > > Timing buffered disk reads: 308 MB in 3.01 seconds = 102.24 MB/sec > > > > <<<< > > Have you looked whether certain kernel threads are eating CPU? > > I would have a look at this. > > Or use atop to have a complete system overview during the hdparm run. > You may want to use its default 10 seconds delay. > > Anyway, hdparm is only a very rough measurement. (Test time 2 / 3 > seconds is really short.) > > Did you repeat tests three or five times and looked at the deviation? > > For what it is worth I can beat that with ecryptfs on top of Ext4 ontop > of an Intel SSD 320 (SATA 300 based): > > martin@merkaba:~> su - ms > Passwort: > ms@merkaba:~> df -hT . > Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf > /home/.ms ecryptfs 224G 211G 11G 96% /home/ms > ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync > 1000+0 Datensätze ein > 1000+0 Datensätze aus > 1048576000 Bytes (1,0 GB) kopiert, 20,1466 s, 52,0 MB/s > ms@merkaba:~> rm testfile > ms@merkaba:~> sudo fstrim /home > [sudo] password for ms: > ms@merkaba:~> > > Thats way slower than a dd without encryption, but its way faster than > your hdparm figures. The SSD was underutilized according to the > harddisk LED of this ThinkPad T520 with Intel i5 Sandybridge 2.5 GHz > dualcore. Did start atop to late to see whats going on. > > (I did not yet test ecryptfs on top of BTRFS, but you didn´t test a > filesystem with hdparm anyway.)You measured read speed. So here read speed as well: martin@merkaba:~#1> su - ms Passwort: ms@merkaba:~> LANG=C df -hT . 
Filesystem Type Size Used Avail Use% Mounted on /home/.ms ecryptfs 224G 211G 11G 96% /home/ms ms@merkaba:~> LANG=C df -hT /home Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/merkaba-home ext4 224G 211G 11G 96% /home ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync 1000+0 Datensätze ein 1000+0 Datensätze aus 1048576000 Bytes (1,0 GB) kopiert, 19,4592 s, 53,9 MB/s ms@merkaba:~> sync ; su -c "echo 3 > /proc/sys/vm/drop_caches" ; dd if=testfile of=/dev/null bs=1M count=1000 Passwort: 1000+0 Datensätze ein 1000+0 Datensätze aus 1048576000 Bytes (1,0 GB) kopiert, 12,4633 s, 84,1 MB/s ms@merkaba:~> ms@merkaba:~> sync ; su -c "echo 3 > /proc/sys/vm/drop_caches" ; dd if=testfile of=/dev/null bs=1M count=1000 Passwort: 1000+0 Datensätze ein 1000+0 Datensätze aus 1048576000 Bytes (1,0 GB) kopiert, 12,0747 s, 86,8 MB/s ms@merkaba:~> sync ; su -c "echo 3 > /proc/sys/vm/drop_caches" ; dd if=testfile of=/dev/null bs=1M count=1000 Passwort: 1000+0 Datensätze ein 1000+0 Datensätze aus 1048576000 Bytes (1,0 GB) kopiert, 11,7131 s, 89,5 MB/s ms@merkaba:~> Figures got faster with each measurement. Maybe SSD internal caching? At leasts that way faster than 22.91 MB/sec ;) On reads dd used up one core. On writes dd and flush-ecryptfs used up a little more than one core, about 110-150%, but not all two cores completely. Kernel is 3.5. Thanks, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc MERLIN
2012-Jul-22 20:44 UTC
Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
On Sun, Jul 22, 2012 at 09:35:10PM +0200, Martin Steigerwald wrote:> Or use atop to have a complete system overview during the hdparm run. You > may want to use its default 10 seconds delay.atop looked totally fine and showed that a long dd took 6% CPU of one core.> Anyway, hdparm is only a very rough measurement. (Test time 2 / 3 seconds > is really short.) > > Did you repeat tests three or five times and looked at the deviation?Yes. I also ran dd with 1GB, and got the exact same number. Thanks for confirming that it is indeed way faster for you, and you''re going through the filesystem layer which makes it slower compared to my doing dd against the raw device. But anyway, while my question about using ''nossd'' is on topic here, speed of raw device falling dramatically once encrypted isn''t really relevant here, so I''ll continue on the dm-crypt mailing list. Thanks Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc MERLIN
2012-Jul-22 22:41 UTC
Re: brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk
On Sun, Jul 22, 2012 at 01:44:28PM -0700, Marc MERLIN wrote:> On Sun, Jul 22, 2012 at 09:35:10PM +0200, Martin Steigerwald wrote: > > Or use atop to have a complete system overview during the hdparm run. You > > may want to use its default 10 seconds delay. > > atop looked totally fine and showed that a long dd took 6% CPU of one core. > > > Anyway, hdparm is only a very rough measurement. (Test time 2 / 3 seconds > > is really short.) > > > > Did you repeat tests three or five times and looked at the deviation? > > Yes. I also ran dd with 1GB, and got the exact same number.Oh my, the plot thickens. I''m bringing this back on this list because 1) I''ve confirmed that I can get 500MB/s unencrypted reads (2GB file) with btfrs on the SSD 2) Despite getting a measly 23MB/s reading from /dev/mapper/ssdcrypt, once I mount it as a btrfs drive and I a read a file from it, I now get 267MB/s 3) Stating a bunch of files on the ssd is very slow (5-6x slower than stating the same files from disk). Using noatime did not help. On an _unencrypted_ partition on the SSD, running du -sh on a directory with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on encrypted spinning drive, both with a similar btrfs filesystem, and the same kernel (3.4.4). Since I''m getting some kind of "seek problem" on the ssd (yes, there are no real seeks on SSD), it seems that I do have a problem with btrfs that is at least not related to encryption. Unencrypted btrfs on SSD: gandalfthegreat:/mnt/mnt2# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2 gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m22.667s Encrypted btrfs on spinning drive: gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m3.881s I''ve run this many times and get the same numbers. I''ve tried deadline and noop on /dev/sda (the SSD) and du is just as slow. I also tried with: - space_cache and nospace_cache - ssd and nossd - noatime didn''t seem to help even though I was hopeful on this one. In all cases, I get: gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m22.537s 22 seconds for 15K files on an SSD while being 5 times slower than a spinning disk with the same data. What''s going on? More details below for anyone interested: ---------------------------------------------------------------------------- The test below shows that I can access the encrypted SSD 10x faster by talking to it through a btrfs filesystem than doing a dd read from the device node. So I suppose I could just ignore the dd device node thing, however not really. Let''s take a directory with some source files inside: Let''s start with the hard drive in my laptop (dmcrypt with btrtfs) gandalfthegreat:/var/local# find src | wc -l 15261 gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m4.068s So on an encrypted spinning disk, it takes 4 seconds. On my SSD in the same machine with the same encryption and the same btrfs filesystem, I get 5-6 times slower: gandalfthegreat:/mnt/btrfs_pool1/var/local# time du -sh src 514M src real 0m24.937s Incidently 5x is also the speed difference between my encrypted HD and encrypted SSD with dd. 
Now, why du is 5x slower and dd of a file from the filesystem is 2.5x faster, I have no idea :-/ See below: 1) drop caches gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=w2k-s001.vmdk of=/dev/null 2146631680 bytes (2.1 GB) copied, 8.03898 s, 267 MB/s -> 267MB/s reading from the file through the encrypted filesystem. That''s good. For comparison gandalfthegreat:/mnt/mnt2# dd if=w2k-s001.vmdk of=/dev/null 2146631680 bytes (2.1 GB) copied, 4.33393 s, 495 MB/s -> almost 500MB/s reading through another unencrypted filesystem on the same SSD gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=/dev/mapper/ssdcrypt of=/dev/null bs=1M count=1000 1048576000 bytes (1.0 GB) copied, 45.1234 s, 23.2 MB/s -> 23MB/s reading from the block device that my FS is mounted from. WTF? gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches; dd if=w2k-s001.vmdk of=test 2146631680 bytes (2.1 GB) copied, 17.9129 s, 120 MB/s -> 120MB/s copying a file from the SSD to itself. That''s not bad. gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches; dd if=test of=/dev/null 2146631680 bytes (2.1 GB) copied, 8.4907 s, 253 MB/s -> reading the new copied file still shows 253MB/s, good. gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=test of=/dev/null 2146631680 bytes (2.1 GB) copied, 2.11001 s, 1.0 GB/s -> reading without dropping the cache shows 1GB/s I''m very lost now, any idea what''s going on? - big file copies from encrypted SSD are fast as expected - raw device access is unexplainably slow (10x slower than btrfs on the raw device) - stating 15K files on the encrypted ssd with btrfs is super slow Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc MERLIN
2012-Jul-23 06:42 UTC
How can btrfs take 23sec to stat 23K files from an SSD?
I just realized that the older thread got a bit confusing, so I''ll keep problems separate and make things simpler :) On an _unencrypted_ partition on the SSD, running du -sh on a directory with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on encrypted spinning drive, both with a similar btrfs filesystem, and the same kernel (3.4.4). Unencrypted btrfs on SSD: gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2 gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m22.667s Encrypted btrfs on spinning drive of the same src directory: gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m3.881s I''ve run this many times and get the same numbers. I''ve tried deadline and noop on /dev/sda (the SSD) and du is just as slow. I also tried with: - space_cache and nospace_cache - ssd and nossd - noatime didn''t seem to help even though I was hopeful on this one. In all cases, I get: gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src 514M src real 0m22.537s I''m having the same slow speed on 2 btrfs filesystems on the same SSD. One is encrypted, the other one isnt: Label: ''btrfs_pool1'' uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d Total devices 1 FS bytes used 144.74GB devid 1 size 441.70GB used 195.04GB path /dev/dm-0 Label: ''boot'' uuid: 84199644-3542-430a-8f18-a5aa58959662 Total devices 1 FS bytes used 2.33GB devid 1 size 25.00GB used 5.04GB path /dev/sda2 If instead of stating a bunch of files, I try reading a big file, I do get speeds that are quite fast (253MB/s and 423MB/s). 22 seconds for 15K files on an SSD is super slow and being 5 times slower than a spinning disk with the same data. What''s going on? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
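For completeness, the scheduler experiments mentioned above are normally done through sysfs and only last until reboot; a minimal sketch, with sda being the SSD here:

# show the schedulers the kernel offers; the active one is in brackets
cat /sys/block/sda/queue/scheduler

# switch to deadline (or noop) for the duration of a test
echo deadline > /sys/block/sda/queue/scheduler

# the readahead setting is also worth writing down while testing
blockdev --getra /dev/sda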
Martin Steigerwald
2012-Jul-24 07:56 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Montag, 23. Juli 2012 schrieb Marc MERLIN:> I just realized that the older thread got a bit confusing, so I''ll keep > problems separate and make things simpler :) > > On an _unencrypted_ partition on the SSD, running du -sh on a directory > with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on > encrypted spinning drive, both with a similar btrfs filesystem, and > the same kernel (3.4.4). > > Unencrypted btrfs on SSD: > gandalfthegreat:~# mount -o > compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2 > gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du > -sh src 514M src > real 0m22.667s > > Encrypted btrfs on spinning drive of the same src directory: > gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du > -sh src 514M src > real 0m3.881sfind is fast, du is much slower: merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l ) 404166 ( find /usr | wc -l; ) 0,03s user 0,07s system 1% cpu 9,212 total merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr ) 11G /usr ( du -sh /usr; ) 1,00s user 19,07s system 41% cpu 48,886 total Now I try to find something with less files. merkaba:~> find /usr/share/doc | wc -l 50715 merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr/share/doc | wc -l ) 50715 ( find /usr/share/doc | wc -l; ) 0,00s user 0,02s system 1% cpu 1,398 total merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr/share/doc ) 606M /usr/share/doc ( du -sh /usr/share/doc; ) 0,20s user 3,63s system 35% cpu 10,691 total merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time du -sh /usr/share/doc 606M /usr/share/doc du -sh /usr/share/doc 0,19s user 3,54s system 35% cpu 10,386 total Anyway thats still much faster than your measurements. merkaba:~> df -hT /usr Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf /dev/dm-0 btrfs 19G 11G 5,6G 67% / merkaba:~> btrfs fi sh failed to read /dev/sr0 Label: ''debian'' uuid: […] Total devices 1 FS bytes used 10.25GB devid 1 size 18.62GB used 18.62GB path /dev/dm-0 Btrfs Btrfs v0.19 merkaba:~> btrfs fi df / Data: total=15.10GB, used=9.59GB System, DUP: total=8.00MB, used=4.00KB System: total=4.00MB, used=0.00 Metadata, DUP: total=1.75GB, used=670.43MB Metadata: total=8.00MB, used=0.00 merkaba:~> grep btrfs /proc/mounts /dev/dm-0 / btrfs rw,noatime,compress=lzo,ssd,space_cache,inode_cache 0 0 Somewhat aged BTRFS filesystem on ThinkPad T520, Intel SSD 320, kernel 3.5. Ciao, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
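The find/du gap above is mostly extra stat() traffic: find can get by with reading directory entries, while du has to stat every file to sum up sizes. A rough way to see that on any tree, as a sketch using plain strace syscall counting:

# count the syscalls each tool issues over the same tree
strace -c -f du -sh src  2> /tmp/du-syscalls.txt   > /dev/null
strace -c -f find src    2> /tmp/find-syscalls.txt > /dev/null

# the du summary is dominated by per-file stat-family calls, which is
# exactly the metadata traffic that is slow here
grep -E 'stat|getdents' /tmp/du-syscalls.txt /tmp/find-syscalls.txt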
Marc MERLIN
2012-Jul-27 04:40 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Tue, Jul 24, 2012 at 09:56:26AM +0200, Martin Steigerwald wrote:
> find is fast, du is much slower: 
> 
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l ) 
> 404166 
> ( find /usr | wc -l; )  0,03s user 0,07s system 1% cpu 9,212 total 
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr ) 
> 11G     /usr 
> ( du -sh /usr; )  1,00s user 19,07s system 41% cpu 48,886 total

You''re right. 

gandalfthegreat [mc]# time du -sh src 
514M    src 
real    0m25.159s 
gandalfthegreat [mc]# reset_cache 
gandalfthegreat [mc]# time bash -c "find src | wc -l" 
15261 
real    0m7.614s 

But find is still slower than du on my spinning disk for the same tree. 

> Anyway thats still much faster than your measurements.

Right. Your numbers look reasonable for an SSD. 

Since then, I did some more tests and I''m also getting slower than normal 
speeds with ext4, an indication that it''s a problem with the block layer. 
I''m working with some folks to try to pin down the core problem, but it 
looks like it''s not an easy one. 

Thanks for your data. 
Marc 
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R. 
Microsoft is to operating systems .... 
                                      .... what McDonalds is to gourmet cooking 
Home page: http://marc.merlins.org/ 
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason
2012-Jul-27 11:08 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote:
> 
> 22 seconds for 15K files on an SSD is super slow and being 5 times 
> slower than a spinning disk with the same data. 
> What''s going on?

Hi Marc, 

The easiest way to figure it out is with latencytop. I''d either run the 
latencytop gui or use the latencytop -c patch which sends a text dump to 
the console. 

This is assuming that you''re not pegged at 100% CPU... 

https://oss.oracle.com/~mason/latencytop.patch 

-chris 
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at  http://vger.kernel.org/majordomo-info.html
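For anyone reproducing this, one way to drive it is to run the workload in one terminal and the patched latencytop in another; a rough sketch (the -c flag comes from the patch linked above, and the kernel needs CONFIG_LATENCYTOP enabled):

# terminal 1: the slow metadata workload
echo 3 > /proc/sys/vm/drop_caches
time du -sh src

# terminal 2: text-mode capture while du runs (needs the patched latencytop)
latencytop -c > /tmp/latencytop.txt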
Marc MERLIN
2012-Jul-27 18:42 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Fri, Jul 27, 2012 at 07:08:35AM -0400, Chris Mason wrote:> On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote: > > > > 22 seconds for 15K files on an SSD is super slow and being 5 times > > slower than a spinning disk with the same data. > > What''s going on? > > Hi Marc, > > The easiest way to figure out is with latencytop. I''d either run the > latencytop gui or use the latencytop -c patch which sends a text dump to > the console. > > This is assuming that you''re not pegged at 100% CPU... > > https://oss.oracle.com/~mason/latencytop.patchThanks for the patch, and yes I can confirm I''m definitely not pegged on CPU (not even close and I get the same problem with unencrypted filesystem, actually du -sh is exactly the same speed on encrypted and unecrypted). Here''s the result I think you were looking for. I''m not good at reading this, but hopefully it tells you something useful :) The full run is here if that helps: http://marc.merlins.org/tmp/latencytop.txt Process du (6748) Total: 4280.5 msec Reading directory content 15.2 msec 11.0 % sleep_on_page wait_on_page_bit read_extent_buffer_pages btree_read_extent_buffer_pages.constprop.110 read_tree_block read_block_for_search.isra.32 btrfs_next_leaf btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath [sleep_on_page] 13.5 msec 88.2 % sleep_on_page wait_on_page_bit read_extent_buffer_pages btree_read_extent_buffer_pages.constprop.110 read_tree_block read_block_for_search.isra.32 btrfs_search_slot btrfs_lookup_csum __btrfs_lookup_bio_sums btrfs_lookup_bio_sums btrfs_submit_compressed_read btrfs_submit_bio_hook Page fault 12.9 msec 0.6 % sleep_on_page_killable wait_on_page_bit_killable __lock_page_or_retry filemap_fault __do_fault handle_pte_fault handle_mm_fault do_page_fault page_fault Executing a program 7.1 msec 0.2 % sleep_on_page_killable __lock_page_killable generic_file_aio_read do_sync_read vfs_read kernel_read prepare_binprm do_execve_common.isra.27 do_execve sys_execve stub_execve Process du (6748) Total: 9517.4 msec [sleep_on_page] 23.0 msec 82.8 % sleep_on_page wait_on_page_bit read_extent_buffer_pages btree_read_extent_buffer_pages.constprop.110 read_tree_block read_block_for_search.isra.32 btrfs_search_slot btrfs_lookup_inode btrfs_iget btrfs_lookup_dentry btrfs_lookup __lookup_hash Reading directory content 13.2 msec 17.2 % sleep_on_page wait_on_page_bit read_extent_buffer_pages btree_read_extent_buffer_pages.constprop.110 read_tree_block read_block_for_search.isra.32 btrfs_search_slot btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath Process du (6748) Total: 9524.0 msec [sleep_on_page] 17.1 msec 88.5 % sleep_on_page wait_on_page_bit read_extent_buffer_pages btree_read_extent_buffer_pages.constprop.110 read_tree_block read_block_for_search.isra.32 btrfs_search_slot btrfs_lookup_inode btrfs_iget btrfs_lookup_dentry btrfs_lookup __lookup_hash Reading directory content 16.0 msec 11.5 % sleep_on_page wait_on_page_bit read_extent_buffer_pages btree_read_extent_buffer_pages.constprop.110 read_tree_block read_block_for_search.isra.32 btrfs_search_slot btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... 
what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc MERLIN
2012-Aug-01 06:01 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Fri, Jul 27, 2012 at 11:42:39AM -0700, Marc MERLIN wrote:> > https://oss.oracle.com/~mason/latencytop.patch > > Thanks for the patch, and yes I can confirm I''m definitely not pegged on CPU > (not even close and I get the same problem with unencrypted filesystem, actually > du -sh is exactly the same speed on encrypted and unecrypted). > > Here''s the result I think you were looking for. I''m not good at reading this, > but hopefully it tells you something useful :) > > The full run is here if that helps: > http://marc.merlins.org/tmp/latencytop.txtI did some other tests since last week since my laptop is hard to use considering how slow the SSD is. (TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x faster than btrfs, but 3x slower than the same filesystem on spinning disk :( ) Ok, just to help with debuggging this, 1) I put my samsung 830 SSD into another thinkpad and it wasn''t faster or slower. 2) Then I put a crucial 256 C300 SSD (the replacement for the one I had that just died and killed all my data), and du took 0.3 seconds on both my old and new thinkpads. The old thinkpad is running ubuntu 32bit the new one debian testing 64bit both with kernel 3.4.4. So, clearly, there is something wrong with the samsung 830 SSD with linux but I have no clue what :( In raw speed (dd) the samsung is faster than the crucial (350MB/s vs 500MB/s). It it were a random crappy SSD from a random vendor, I''d blame the SSD, but I have a hard time believing that samsung is selling SSDs that are slower than hard drives at random IO and ''seeks''. 3) I just got a 2nd ssd from samsung (same kind), just to make sure the one I had wasn''t bad. It''s brand new, and I formatted it carefully on 512 boundaries: /dev/sda1 2048 502271 250112 83 Linux /dev/sda2 502272 52930559 26214144 7 HPFS/NTFS/exFAT /dev/sda3 52930560 73902079 10485760 82 Linux swap / Solaris /dev/sda4 73902080 1000215215 463156568 83 Linux I also upgraded to 3.5.0 in the meantime but unfortunately the results are similar. First: btrfs is the slowest: gandalfthegreat:/mnt/ssd/var/local# time du -sh src/ 514M src/ real 0m25.741s gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0 Second: ext4 is 2x faster than btrfs with mkfs.ext4 -O extent -b 4096 /dev/sda3 gandalfthegreat:/mnt/mnt3# reset_cache gandalfthegreat:/mnt/mnt3# time du -sh src/ 519M src/ real 0m12.459s gandalfthegreat:~# grep mnt3 /proc/mounts /dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0 Third, A freshly made ntfs filesystem through fuse is actually FASTER! gandalfthegreat:/mnt/mnt2# reset_cache gandalfthegreat:/mnt/mnt2# time du -sh src/ 506M src/ real 0m8.928s gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts /dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0 How can ntfs via fuse be the fastest and btrfs so slow? Of course, all 3 are slower than the same filesystem on spinning too, but I''m wondering if there is a scheduling issue that is somehow causing the extreme slowness I''m seeing. Did the latencytop trace I got help in any way? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... 
what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
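Since partition misalignment is a classic cause of poor small-I/O behaviour on SSDs, the start sectors listed above are worth a quick sanity check; a small sketch that prints whether each start falls on a 1 MiB (2048-sector) boundary:

# print the start sector of each partition and whether it is 2048-sector aligned
for p in /sys/block/sda/sda[0-9]*; do
    start=$(cat "$p/start")
    echo "$(basename "$p"): start=$start 1MiB-aligned=$(( start % 2048 == 0 ))"
done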
Fajar A. Nugraha
2012-Aug-01 06:08 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 1, 2012 at 1:01 PM, Marc MERLIN <marc@merlins.org> wrote:
> So, clearly, there is something wrong with the samsung 830 SSD with linux

> It it were a random crappy SSD from a random vendor, I''d blame the SSD, but 
> I have a hard time believing that samsung is selling SSDs that are slower 
> than hard drives at random IO and ''seeks''.

You''d be surprised on how badly some vendors can screw up :) 

> First: btrfs is the slowest:

> gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts 
> /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0 

Just checking, did you explicitly activate "discard"? Cause on my 
setup (with corsair SSD) it made things MUCH slower. Also, try adding 
"noatime" (just in case the slow down was because "du" causes many 
access time updates) 

-- 
Fajar 
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at  http://vger.kernel.org/majordomo-info.html
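The cheapest way to rule the discard option in or out is to mount the same filesystem once more without it and repeat the du; a sketch reusing the mount line from earlier in the thread:

umount /mnt/mnt2
mount -o compress=lzo,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2

echo 3 > /proc/sys/vm/drop_caches
time du -sh /mnt/mnt2/src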
Marc MERLIN
2012-Aug-01 06:21 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > It it were a random crappy SSD from a random vendor, I''d blame the SSD, but 
> > I have a hard time believing that samsung is selling SSDs that are slower 
> > than hard drives at random IO and ''seeks''. 
> 
> You''d be surprised on how badly some vendors can screw up :)

At some point, it may come down to that indeed :-/ 
I''m still hopeful that Samsung didn''t, but we''ll see. 

> > First: btrfs is the slowest: 
> 
> > gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts 
> > /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0 
> 
> Just checking, did you explicitly activate "discard"? Cause on my 

Yes. Note that it should be a no-op when all you''re doing is stating inodes 
and not writing (I''m using noatime). 

> setup (with corsair SSD) it made things MUCH slower. Also, try adding 
> "noatime" (just in case the slow down was because "du" cause many 
> access time updates)

I have noatime in there already :) 

Marc 
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R. 
Microsoft is to operating systems .... 
                                      .... what McDonalds is to gourmet cooking 
Home page: http://marc.merlins.org/ 
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Samuel
2012-Aug-01 06:36 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On 01/08/12 16:01, Marc MERLIN wrote:
> Third, A freshly made ntfs filesystem through fuse is actually FASTER!

Could it be that Samsung''s FTL has optimisations in it for NTFS ? 

A horrible thought, but not impossible.. 

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC 

-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marc MERLIN
2012-Aug-01 06:40 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 01, 2012 at 04:36:22PM +1000, Chris Samuel wrote:
> On 01/08/12 16:01, Marc MERLIN wrote: 
> 
> > Third, A freshly made ntfs filesystem through fuse is actually FASTER! 
> 
> Could it be that Samsungs FTL has optimisations in it for NTFS ? 
> 
> A horrible thought, but not impossible..

Not impossible, but it can''t be the main reason since the SSD still clocks 
2x slower with ntfs than a spinning disk with encrypted btrfs. 
Since SSDs should "seek" 10-100x faster than spinning disks, that can''t be 
the only reason. 

Marc 
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R. 
Microsoft is to operating systems .... 
                                      .... what McDonalds is to gourmet cooking 
Home page: http://marc.merlins.org/ 
-- 
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin Steigerwald
2012-Aug-01 21:57 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Hi Marc, Am Mittwoch, 1. August 2012 schrieb Marc MERLIN:> On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote: > > > It it were a random crappy SSD from a random vendor, I''d blame the > > > SSD, but I have a hard time believing that samsung is selling SSDs > > > that are slower than hard drives at random IO and ''seeks''. > > > > You''d be surprised on how badly some vendors can screw up :) > > At some point, it may come down to that indeed :-/ > I''m still hopefully that Samsung didn''t, but we''ll see.Its getting quite strange. I lost track of whether you did that already or not, but if you didn´t please post some vmstat 1 iostat -xd 1 on the device while it is being slow. I am interested in wait I/O and latencies and disk utilization. Comparison data of Intel SSD 320 in ThinkPad T520 during merkaba:~> echo 3 > /proc/sys/drop_caches ; du -sch /usr on BTRFS with Kernel 3.5: martin@merkaba:~> vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 1 21556 4442668 2056 502352 0 0 194 85 247 120 11 2 87 0 1 2 21556 4408888 2448 514884 0 0 11684 328 4975 24585 5 16 65 14 1 0 21556 4389880 2448 528060 0 0 13400 0 4574 23452 2 16 68 14 3 1 21556 4370068 2448 545052 0 0 18132 0 5499 27220 1 18 64 16 1 0 21556 4350228 2448 555580 0 0 10856 0 4122 25339 3 16 67 14 1 1 21556 4315604 2448 569756 0 0 12648 0 4647 31153 5 14 66 15 0 1 21556 4295652 2456 581480 0 0 11548 56 4093 24618 2 13 69 16 0 1 21556 4286720 2456 591580 0 0 10824 0 3750 21445 1 12 71 16 0 1 21556 4266308 2456 603620 0 0 12932 0 4841 26447 4 12 68 17 1 0 21556 4248228 2456 613808 0 0 10264 4 3703 22108 1 13 71 15 5 1 21556 4231976 2456 624356 0 0 10540 0 3581 20436 1 10 72 17 0 1 21556 4197168 2456 639108 0 0 12952 0 4738 28223 4 15 66 15 4 1 21556 4178456 2456 650552 0 0 11656 0 4234 23480 2 14 68 16 0 1 21556 4163616 2456 662992 0 0 13652 0 4619 26580 1 16 70 13 4 1 21556 4138288 2456 675696 0 0 13352 0 4422 22254 1 16 70 13 1 0 21556 4113204 2456 689060 0 0 13232 0 4312 21936 1 15 70 14 0 1 21556 4085532 2456 704160 0 0 14972 0 4820 24238 1 16 69 14 2 0 21556 4055740 2456 719644 0 0 15736 0 5099 25513 3 17 66 14 0 1 21556 4028612 2456 734380 0 0 14504 0 4795 25052 3 15 68 14 2 0 21556 3999108 2456 749040 0 0 14656 16 4672 21878 1 17 69 13 1 1 21556 3972732 2456 762108 0 0 12972 0 4717 22411 1 17 70 13 5 0 21556 3949684 2584 773484 0 0 11528 52 4837 24107 3 15 67 15 1 0 21556 3912504 2584 787420 0 0 12156 0 4883 25201 4 15 67 14 martin@merkaba:~> iostat -xd 1 /dev/sda Linux 3.5.0-tp520 (merkaba) 01.08.2012 _x86_64_ (4 CPU) Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 1,29 1,44 11,58 12,78 684,74 299,75 80,81 0,24 9,86 0,95 17,93 0,29 0,71 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2808,00 0,00 11232,00 0,00 8,00 0,57 0,21 0,21 0,00 0,19 54,50 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2967,00 0,00 11868,00 0,00 8,00 0,63 0,21 0,21 0,00 0,21 60,90 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 11,00 2992,00 4,00 11968,00 56,00 8,03 0,64 0,22 0,22 0,25 0,21 62,00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2680,00 0,00 10720,00 0,00 8,00 0,70 0,26 0,26 0,00 0,25 66,70 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await 
r_await w_await svctm %util sda 0,00 0,00 3153,00 0,00 12612,00 0,00 8,00 0,72 0,23 0,23 0,00 0,22 69,30 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2769,00 0,00 11076,00 0,00 8,00 0,63 0,23 0,23 0,00 0,21 58,00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2523,00 1,00 10092,00 4,00 8,00 0,74 0,29 0,29 0,00 0,28 71,30 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3026,00 0,00 12104,00 0,00 8,00 0,73 0,24 0,24 0,00 0,21 64,00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3069,00 0,00 12276,00 0,00 8,00 0,67 0,22 0,22 0,00 0,20 62,00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3346,00 0,00 13384,00 0,00 8,00 0,64 0,19 0,19 0,00 0,18 59,90 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3188,00 0,00 12752,00 0,00 8,00 0,80 0,25 0,25 0,00 0,17 54,00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3433,00 0,00 13732,00 0,00 8,00 1,03 0,30 0,30 0,00 0,17 57,00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3565,00 0,00 14260,00 0,00 8,00 0,92 0,26 0,26 0,00 0,16 57,30 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3972,00 0,00 15888,00 0,00 8,00 1,13 0,29 0,29 0,00 0,16 62,90 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3743,00 0,00 14972,00 0,00 8,00 1,03 0,28 0,28 0,00 0,16 59,40 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3408,00 0,00 13632,00 0,00 8,00 1,08 0,32 0,32 0,00 0,17 56,70 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3730,00 3,00 14920,00 16,00 8,00 1,14 0,31 0,31 0,00 0,15 56,30 I also suggest to use fio with with the ssd-test example on the SSD. I have some comparison data available for my setup. Heck it should be publicly available in my ADMIN magazine article about fio. I used a bit different fio jobs with block sizes of 2k to 16k, but its similar enough and I might even have some 4k examples at hand or can easily create one. I also raised size and duration a bit. An example based on whats in my article: [global] ioengine=libaio direct=1 iodepth=64 runtime=60 filename=testfile size=2G bsrange=2k-16k refill_buffers=1 [randomwrite] stonewall rw=randwrite [sequentialwrite] stonewall rw=write [randomread] stonewall rw=randread [sequentialread] stonewall rw=read Remove bsrange if you want 4k blocks only. I put the writes above the reads cause the writes with refill_buffers=1 initialize the testfile with random data. It would contain only zeros which will be compressible nicely by modern SandForce controllers otherwise. Above job is untested, but it should do. Please remove the test file and fstrim your disk unless you have discard mount option on after having run any write tests. (Not necessarily after each one ;). Also I am interested in merkaba:~> hdparm -I /dev/sda | grep -i queue Queue depth: 32 * Native Command Queueing (NCQ) output for your SSD. Try to load the ssd with different iodepths up to twice the amount displayed by hdparm. 
But note: a du -sch will not reach that high iodepth. I expect it to get as low as an iodepth of one in that case. You see in my comparison output that a single du -sch is not able to load the Intel SSD 320 fully, but load is just about 50-70%. And iowait is just about 10-20%. Thanks, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
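To get closer to what a metadata-heavy du actually issues than the 64-deep job above, a cut-down variant of the same job file restricted to 4k random reads at queue depth 1 can be used. This is an untested sketch built only from options already shown above; note that fio lays the test file out with zeros for a read-only job, so compressing controllers will look better than they are:

[global]
ioengine=libaio
direct=1
iodepth=1
runtime=60
filename=testfile
size=2G
bs=4k

[randomread-qd1]
rw=randread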
Marc MERLIN
2012-Aug-02 05:07 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:> Its getting quite strange.I would agree :) Before I paste a bunch of thing, I wanted to thank you for not giving up on me and offering your time to help me figure this out :)> I lost track of whether you did that already or not, but if you didn´t > please post some > > vmstat 1 > iostat -xd 1 > on the device while it is being slow.Sure thing, here''s the 24 second du -s run: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 1 0 2747264 44 348388 0 0 28 50 242 184 19 6 74 1 1 0 0 2744128 44 351700 0 0 144 0 2758 32115 30 5 61 4 2 1 0 2743100 44 351992 0 0 792 0 2616 30613 28 4 50 18 1 1 0 2741592 44 352668 0 0 776 0 2574 31551 29 4 45 21 1 1 0 2740720 44 353432 0 0 692 0 2734 32891 30 4 45 22 1 1 0 2740104 44 354284 0 0 460 0 2639 31585 30 4 45 21 3 1 0 2738520 44 354692 0 0 544 264 2834 30302 32 5 42 21 1 1 0 2736936 44 355476 0 0 1064 2012 2867 31172 28 4 45 23 3 1 0 2734892 44 356840 0 0 1236 0 2865 31100 30 3 46 21 3 1 0 2733840 44 357432 0 0 636 12 2930 31362 30 4 45 21 3 1 0 2735152 44 357892 0 0 612 0 2546 31641 28 4 46 22 3 1 0 2733608 44 358680 0 0 668 0 2846 29747 30 3 46 22 2 1 0 2731480 44 359556 0 0 712 248 2744 27939 31 3 44 22 2 1 0 2732688 44 360736 0 0 1204 0 2774 30979 30 3 46 21 4 1 0 2730276 44 361872 0 0 872 256 3174 31657 30 5 45 20 1 1 0 2728372 44 363008 0 0 912 132 2752 31811 30 3 46 22 1 1 0 2727060 44 363240 0 0 476 0 2595 30313 33 4 43 21 1 1 0 2725856 44 364000 0 0 580 0 2547 30803 29 4 46 21 1 1 0 2723628 44 364724 0 0 756 272 2766 30652 31 4 44 22 2 1 0 2722128 44 365616 0 0 744 8 2771 31760 30 4 45 21 2 1 0 2721024 44 366284 0 0 788 196 2695 31819 31 4 44 22 3 1 0 2719312 44 367084 0 0 732 0 2737 32038 30 5 45 20 1 1 0 2717436 44 367880 0 0 728 0 2570 30810 29 3 46 22 3 1 0 2716820 44 368500 0 0 608 0 2614 30949 29 4 45 22 2 1 0 2715576 44 369240 0 0 628 264 2715 31180 34 4 40 22 1 1 0 2714308 44 369952 0 0 624 0 2533 31155 28 4 46 22 Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 08/01/2012 _x86_64_ (4 CPU) rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util 2.18 0.68 6.45 17.77 78.12 153.39 19.12 0.18 7.51 8.52 7.15 4.46 10.81 0.00 0.00 118.00 0.00 540.00 0.00 9.15 1.18 9.93 9.93 0.00 4.98 58.80 0.00 0.00 217.00 0.00 868.00 0.00 8.00 1.90 8.77 8.77 0.00 4.44 96.40 0.00 0.00 192.00 0.00 768.00 0.00 8.00 1.63 8.44 8.44 0.00 5.10 98.00 0.00 0.00 119.00 0.00 476.00 0.00 8.00 1.06 9.01 9.01 0.00 8.20 97.60 0.00 0.00 125.00 0.00 500.00 0.00 8.00 1.08 8.67 8.67 0.00 7.55 94.40 0.00 0.00 165.00 0.00 660.00 0.00 8.00 1.50 9.12 9.12 0.00 5.87 96.80 0.00 0.00 195.00 13.00 780.00 272.00 10.12 1.68 8.10 7.94 10.46 4.65 96.80 0.00 0.00 173.00 0.00 692.00 0.00 8.00 1.72 9.87 9.87 0.00 5.71 98.80 0.00 0.00 171.00 0.00 684.00 0.00 8.00 1.62 9.33 9.33 0.00 5.75 98.40 0.00 0.00 161.00 0.00 644.00 0.00 8.00 1.52 9.57 9.57 0.00 6.14 98.80 0.00 0.00 136.00 0.00 544.00 0.00 8.00 1.26 9.29 9.29 0.00 7.24 98.40 0.00 0.00 199.00 0.00 796.00 0.00 8.00 1.94 9.73 9.73 0.00 4.94 98.40 0.00 0.00 201.00 0.00 804.00 0.00 8.00 1.70 8.54 8.54 0.00 4.80 96.40 0.00 0.00 272.00 15.00 1088.00 272.00 9.48 2.35 8.21 8.46 3.73 3.39 97.20 0.00 0.00 208.00 1.00 832.00 4.00 8.00 1.90 9.01 9.04 4.00 4.67 97.60 0.00 0.00 228.00 0.00 912.00 0.00 8.00 1.94 8.56 8.56 0.00 4.23 96.40 0.00 0.00 107.00 0.00 428.00 0.00 8.00 0.98 9.05 9.05 0.00 9.12 97.60 0.00 0.00 163.00 0.00 652.00 0.00 8.00 1.60 9.89 9.89 0.00 
6.01 98.00 0.00 0.00 162.00 0.00 648.00 0.00 8.00 1.83 10.20 10.20 0.00 6.07 98.40 1.00 10.00 224.00 71.00 900.00 2157.50 20.73 2.82 10.14 9.50 12.17 3.35 98.80 41.00 8.00 166.00 176.00 832.00 333.50 6.82 2.48 7.24 8.48 6.07 2.92 100.00 18.00 0.00 78.00 72.00 388.00 36.00 5.65 2.30 15.28 16.56 13.89 6.67 100.00 13.00 0.00 190.00 25.00 824.00 12.50 7.78 1.95 9.17 9.43 7.20 4.58 98.40 0.00 0.00 156.00 0.00 624.00 0.00 8.00 1.36 8.72 8.72 0.00 6.26 97.60 0.00 0.00 158.00 19.00 632.00 64.00 7.86 1.50 8.45 8.61 7.16 5.60 99.20 0.00 0.00 176.00 0.00 704.00 0.00 8.00 1.65 9.39 9.39 0.00 5.61 98.80 0.00 0.00 44.00 0.00 176.00 0.00 8.00 0.38 8.55 8.55 0.00 8.27 36.40> I am interested in wait I/O and latencies and disk utilization.Cool tool, I didn't know about iostat. My r_await numbers obviously don't look good, and yet %util is pretty much 100% the entire time. Does that show that it's indeed the device that is unable to deliver the requests any quicker, despite being an SSD, or are you reading this differently?> Also I am interested in > merkaba:~> hdparm -I /dev/sda | grep -i queue > Queue depth: 32 > * Native Command Queueing (NCQ) > output for your SSD.gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue Queue depth: 32 * Native Command Queueing (NCQ) gandalfthegreat:/var/local# I've run the fio tests in: /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 (discard is there, so fstrim shouldn't be needed)> I also suggest to use fio with with the ssd-test example on the SSD. I > have some comparison data available for my setup. Heck it should be > publicly available in my ADMIN magazine article about fio. I used a bit > different fio jobs with block sizes of 2k to 16k, but its similar enough > and I might even have some 4k examples at hand or can easily create one. > I also raised size and duration a bit. > > An example based on whats in my article:Thanks, here's the output. I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending on the workload. That's "interesting" to say the least. So, doctor, is it bad? 
:) randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 4 processes randomwrite: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s] randomwrite: (groupid=0, jobs=1): err= 0: pid=7193 write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55 clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63 lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57 clat percentiles (msec): | 1.00th=[ 225], 5.00th=[ 241], 10.00th=[ 247], 20.00th=[ 260], | 30.00th=[ 269], 40.00th=[ 281], 50.00th=[ 293], 60.00th=[ 306], | 70.00th=[ 322], 80.00th=[ 351], 90.00th=[ 545], 95.00th=[ 570], | 99.00th=[ 627], 99.50th=[ 644], 99.90th=[ 709], 99.95th=[ 725], | 99.99th=[ 742] bw (KB/s) : min= 11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48 lat (usec) : 50=0.01% lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89% lat (msec) : 500=72.44%, 750=14.43% cpu : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=11396/d=0, short=r=0/w=0/d=0 sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218 write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10 clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12 lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36 clat percentiles (msec): | 1.00th=[ 269], 5.00th=[ 306], 10.00th=[ 322], 20.00th=[ 338], | 30.00th=[ 355], 40.00th=[ 371], 50.00th=[ 388], 60.00th=[ 404], | 70.00th=[ 420], 80.00th=[ 445], 90.00th=[ 603], 95.00th=[ 766], | 99.00th=[ 881], 99.50th=[ 906], 99.90th=[ 938], 99.95th=[ 947], | 99.99th=[ 963] bw (KB/s) : min= 418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39 lat (usec) : 20=0.01% lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93% lat (msec) : 750=6.87%, 1000=5.51% cpu : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=8991/d=0, short=r=0/w=0/d=0 randomread: (groupid=2, jobs=1): err= 0: pid=7234 read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63 clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21 lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81 clat percentiles (msec): | 1.00th=[ 15], 5.00th=[ 25], 10.00th=[ 33], 20.00th=[ 39], | 30.00th=[ 50], 40.00th=[ 57], 50.00th=[ 61], 60.00th=[ 70], | 70.00th=[ 76], 80.00th=[ 89], 90.00th=[ 111], 95.00th=[ 139], | 99.00th=[ 176], 99.50th=[ 190], 99.90th=[ 217], 99.95th=[ 229], | 99.99th=[ 247] bw (KB/s) : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68 lat (usec) : 50=0.01% lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44% lat (msec) : 100=55.05%, 250=14.34% cpu : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276 IO 
depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=56073/w=0/d=0, short=r=0/w=0/d=0 sequentialread: (groupid=3, jobs=1): err= 0: pid=7249 read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt= 4418msec slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96 clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25 lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06 clat percentiles (usec): | 1.00th=[ 10], 5.00th=[ 187], 10.00th=[ 217], 20.00th=[ 249], | 30.00th=[ 278], 40.00th=[ 302], 50.00th=[ 330], 60.00th=[ 362], | 70.00th=[ 398], 80.00th=[ 450], 90.00th=[ 716], 95.00th=[ 6624], | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696], | 99.99th=[67072] bw (KB/s) : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18 lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36% lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52% lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45% lat (msec) : 100=0.05% cpu : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=233071/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec Run status group 1 (all jobs): WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec Run status group 2 (all jobs): READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec Run status group 3 (all jobs): READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
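The capture above boils down to something like the following; this is only a rough sketch, and the device node (/dev/sda) and the directory being scanned (/var/local) are assumptions rather than something spelled out in the thread:

sync
echo 3 > /proc/sys/vm/drop_caches      # start from a cold cache so du really hits the disk
vmstat 1 > vmstat.log &                # run queue, context switches, wait I/O, once per second
iostat -xd 1 /dev/sda > iostat.log &   # per-device await/r_await and %util
time du -s /var/local                  # the workload being measured
kill %1 %2                             # stop both samplers once du has finished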
Martin Steigerwald
2012-Aug-02 11:18 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:> On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote: > > Its getting quite strange. > > I would agree :) > > Before I paste a bunch of thing, I wanted to thank you for not giving up on me > and offering your time to help me figure this out :)You are welcome. Well I am holding Linux performance analysis & tuning trainings and I am really interested into issues like this ;) I will take care of myself and I take my time to respond or even do not respond at all anymore if I run out of ideas ;).> > I lost track of whether you did that already or not, but if you didn´t > > please post some > > > > vmstat 1 > > iostat -xd 1 > > on the device while it is being slow. > > Sure thing, here''s the 24 second du -s run: > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 2 1 0 2747264 44 348388 0 0 28 50 242 184 19 6 74 1 > 1 0 0 2744128 44 351700 0 0 144 0 2758 32115 30 5 61 4 > 2 1 0 2743100 44 351992 0 0 792 0 2616 30613 28 4 50 18 > 1 1 0 2741592 44 352668 0 0 776 0 2574 31551 29 4 45 21 > 1 1 0 2740720 44 353432 0 0 692 0 2734 32891 30 4 45 22 > 1 1 0 2740104 44 354284 0 0 460 0 2639 31585 30 4 45 21 > 3 1 0 2738520 44 354692 0 0 544 264 2834 30302 32 5 42 21 > 1 1 0 2736936 44 355476 0 0 1064 2012 2867 31172 28 4 45 23A bit more wait I/O with not even 10% of the throughput as compared to the Intel SSD 320 figures. Seems that Intel SSD is running circles around your Samsung SSD while not – as expected for that use case – being fully utilized.> Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 08/01/2012 _x86_64_ (4 CPU) > rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util > 2.18 0.68 6.45 17.77 78.12 153.39 19.12 0.18 7.51 8.52 7.15 4.46 10.81 > 0.00 0.00 118.00 0.00 540.00 0.00 9.15 1.18 9.93 9.93 0.00 4.98 58.80 > 0.00 0.00 217.00 0.00 868.00 0.00 8.00 1.90 8.77 8.77 0.00 4.44 96.40 > 0.00 0.00 192.00 0.00 768.00 0.00 8.00 1.63 8.44 8.44 0.00 5.10 98.00 > 0.00 0.00 119.00 0.00 476.00 0.00 8.00 1.06 9.01 9.01 0.00 8.20 97.60 > 0.00 0.00 125.00 0.00 500.00 0.00 8.00 1.08 8.67 8.67 0.00 7.55 94.40 > 0.00 0.00 165.00 0.00 660.00 0.00 8.00 1.50 9.12 9.12 0.00 5.87 96.80 > 0.00 0.00 195.00 13.00 780.00 272.00 10.12 1.68 8.10 7.94 10.46 4.65 96.80 > 0.00 0.00 173.00 0.00 692.00 0.00 8.00 1.72 9.87 9.87 0.00 5.71 98.80 > 0.00 0.00 171.00 0.00 684.00 0.00 8.00 1.62 9.33 9.33 0.00 5.75 98.40 > 0.00 0.00 161.00 0.00 644.00 0.00 8.00 1.52 9.57 9.57 0.00 6.14 98.80 > 0.00 0.00 136.00 0.00 544.00 0.00 8.00 1.26 9.29 9.29 0.00 7.24 98.40 > 0.00 0.00 199.00 0.00 796.00 0.00 8.00 1.94 9.73 9.73 0.00 4.94 98.40 > 0.00 0.00 201.00 0.00 804.00 0.00 8.00 1.70 8.54 8.54 0.00 4.80 96.40 > 0.00 0.00 272.00 15.00 1088.00 272.00 9.48 2.35 8.21 8.46 3.73 3.39 97.20[…]> > I am interested in wait I/O and latencies and disk utilization. > > Cool tool, I didn''t know about iostat. > My r_await numbers don''t look good obviously and yet %util is pretty much > 100% the entire time. > > Does that show that it''s indeed the device that is unable to deliver the requests any quicker, despite > being an ssd, or are you reading this differently?That, or…> > Also I am interested in > > merkaba:~> hdparm -I /dev/sda | grep -i queue > > Queue depth: 32 > > * Native Command Queueing (NCQ) > > output for your SSD. 
> > gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue > Queue depth: 32 > * Native Command Queueing (NCQ) > gandalfthegreat:/var/local# > > I''ve the the fio tests in: > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0… you are still using dm_crypt? Please test without dm_crypt. My figures are from within LVM, but no dm_crypt. Its good to have a comparable base for the measurements.> (discard is there, so fstrim shouldn''t be needed)I can´t imagine why it should matter, but maybe its worth having some tests without „discard“.> > I also suggest to use fio with with the ssd-test example on the SSD. I > > have some comparison data available for my setup. Heck it should be > > publicly available in my ADMIN magazine article about fio. I used a bit > > different fio jobs with block sizes of 2k to 16k, but its similar enough > > and I might even have some 4k examples at hand or can easily create one. > > I also raised size and duration a bit. > > > > An example based on whats in my article: > > Thanks, here''s the output. > I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending > on the io size. That''s "intereesting" to say the least. > > So, doctor, is it bad? :)Well I do not engineer those SSDs, but to me it looks like either dm_crypt is wildly hogging performance or the SSD is way slower than what I would expect. But then you did some tests without dm_crypt AFAIR, but then it would be interesting to repeat these tests without dm_crypt as well.> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 > Starting 4 processes > randomwrite: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s] > randomwrite: (groupid=0, jobs=1): err= 0: pid=7193 > write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec > slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55 > clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63 > lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57 > clat percentiles (msec): > | 1.00th=[ 225], 5.00th=[ 241], 10.00th=[ 247], 20.00th=[ 260], > | 30.00th=[ 269], 40.00th=[ 281], 50.00th=[ 293], 60.00th=[ 306], > | 70.00th=[ 322], 80.00th=[ 351], 90.00th=[ 545], 95.00th=[ 570], > | 99.00th=[ 627], 99.50th=[ 644], 99.90th=[ 709], 99.95th=[ 725], > | 99.99th=[ 742] > bw (KB/s) : min= 11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48 > lat (usec) : 50=0.01% > lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89% > lat (msec) : 500=72.44%, 750=14.43%Gosh, look at these latencies! 72,44% of all requests above 500 (in words: five hundred) milliseconds! And 14,43% above 750 msecs. The percentage of requests served at 100 msecs or less is was below one percent! Hey, is this an SSD or what? Please really test without dm_crypt so that we can see whether it introduces any kind of latency and if so in what amount. But then let me compare. You are using iodepth=64 which fills the device with request at maximum speed. So this will likely increase latencies. 
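A quick sanity check on these figures: with the queue kept constantly full, the average completion latency is roughly the queue depth divided by the IOPS, and 64 / 189 IOPS ≈ 0.34 s, which matches the ~330 msec average clat above almost exactly. So the deep queue explains how large the latencies are, but not why the device only completes about 190 requests per second in the first place.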
martin@merkaba:~[…]> fio iops-iodepth64.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 fio 1.57 Starting 4 processes […] zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=13103 write: io=2048.0MB, bw=76940KB/s, iops=14032 , runt= 27257msec slat (usec): min=3 , max=6426 , avg=14.35, stdev=22.33 clat (usec): min=82 , max=414190 , avg=4540.81, stdev=4996.47 lat (usec): min=97 , max=414205 , avg=4555.53, stdev=4996.42 bw (KB/s) : min=12890, max=112728, per=100.39%, avg=77240.48, stdev=20310.79 cpu : usr=9.67%, sys=29.47%, ctx=72053, majf=0, minf=532 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=0/382480/0, short=0/0/0 lat (usec): 100=0.01%, 250=0.14%, 500=0.66%, 750=0.82%, 1000=1.52% lat (msec): 2=10.86%, 4=27.88%, 10=57.18%, 20=0.79%, 50=0.03% lat (msec): 100=0.07%, 250=0.04%, 500=0.01% Still even with iodepth 64 totally different picture. And look at the IOPS and throughput. For reference, this refers to [global] ioengine=libaio direct=1 iodepth=64 # Für zufällige Daten über die komplette Länge # der Datei vorher Job sequentiell laufen lassen filename=testdatei size=2G bsrange=2k-16k refill_buffers=1 [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 [zufälligschreiben] stonewall rw=randwrite runtime=60 [sequentiellschreiben] stonewall rw=write runtime=60 but on Ext4 instead of BTRFS. This could be another good test. Text with Ext4 on plain logical volume without dm_crypt.> cpu : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=0/w=11396/d=0, short=r=0/w=0/d=0 > sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218 > write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec > slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10 > clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12 > lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36 > clat percentiles (msec): > | 1.00th=[ 269], 5.00th=[ 306], 10.00th=[ 322], 20.00th=[ 338], > | 30.00th=[ 355], 40.00th=[ 371], 50.00th=[ 388], 60.00th=[ 404], > | 70.00th=[ 420], 80.00th=[ 445], 90.00th=[ 603], 95.00th=[ 766], > | 99.00th=[ 881], 99.50th=[ 906], 99.90th=[ 938], 99.95th=[ 947], > | 99.99th=[ 963] > bw (KB/s) : min= 418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39 > lat (usec) : 20=0.01% > lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93% > lat (msec) : 750=6.87%, 1000=5.51%Sequential write latencies are abysmal as well. 
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=13105 write: io=2048.0MB, bw=155333KB/s, iops=17290 , runt= 13501msec slat (usec): min=2 , max=4299 , avg=13.51, stdev=18.91 clat (usec): min=334 , max=201706 , avg=3680.94, stdev=2088.53 lat (usec): min=340 , max=201718 , avg=3694.83, stdev=2088.28 bw (KB/s) : min=144952, max=162856, per=100.04%, avg=155394.92, stdev=5720.43 cpu : usr=16.77%, sys=31.17%, ctx=47790, majf=0, minf=535 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=0/233444/0, short=0/0/0 lat (usec): 500=0.01%, 750=0.02%, 1000=0.03% lat (msec): 2=0.55%, 4=77.52%, 10=21.86%, 20=0.01%, 100=0.01% lat (msec): 250=0.01%> cpu : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=0/w=8991/d=0, short=r=0/w=0/d=0 > randomread: (groupid=2, jobs=1): err= 0: pid=7234 > read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec > slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63 > clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21 > lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81 > clat percentiles (msec): > | 1.00th=[ 15], 5.00th=[ 25], 10.00th=[ 33], 20.00th=[ 39], > | 30.00th=[ 50], 40.00th=[ 57], 50.00th=[ 61], 60.00th=[ 70], > | 70.00th=[ 76], 80.00th=[ 89], 90.00th=[ 111], 95.00th=[ 139], > | 99.00th=[ 176], 99.50th=[ 190], 99.90th=[ 217], 99.95th=[ 229], > | 99.99th=[ 247] > bw (KB/s) : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68 > lat (usec) : 50=0.01% > lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44% > lat (msec) : 100=55.05%, 250=14.34%Even read latencies are really high. 
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=13101 read : io=2048.0MB, bw=162545KB/s, iops=29719 , runt= 12902msec slat (usec): min=2 , max=1485 , avg=10.58, stdev=10.81 clat (usec): min=246 , max=18706 , avg=2140.05, stdev=1073.28 lat (usec): min=267 , max=18714 , avg=2150.97, stdev=1074.42 bw (KB/s) : min=108000, max=205060, per=101.08%, avg=164305.12, stdev=32494.26 cpu : usr=12.74%, sys=49.83%, ctx=82018, majf=0, minf=276 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=383439/0/0, short=0/0/0 lat (usec): 250=0.01%, 500=0.03%, 750=0.30%, 1000=3.21% lat (msec): 2=48.26%, 4=44.68%, 10=3.35%, 20=0.18% Here for comparison that 500 GB 2.5 inch eSATA Hitachi disk (5400rpm): merkaba:[…]> fio --readonly iops-gerät-lesend-iodepth64.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 fio 1.57 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [79411K/0K /s] [8617 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=32578 read : io=58290KB, bw=984592 B/s, iops=106 , runt= 60623msec slat (usec): min=3 , max=82 , avg=24.63, stdev= 6.64 clat (msec): min=40 , max=2825 , avg=602.78, stdev=374.10 lat (msec): min=40 , max=2825 , avg=602.80, stdev=374.10 bw (KB/s) : min= 0, max= 1172, per=65.52%, avg=629.66, stdev=466.50 cpu : usr=0.24%, sys=0.64%, ctx=6443, majf=0, minf=275 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.5%, >=64=99.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=6435/0/0, short=0/0/0 lat (msec): 50=0.20%, 100=2.61%, 250=15.37%, 500=26.64%, 750=25.07% lat (msec): 1000=16.13%, 2000=13.44%, >=2000=0.54% Okay, so at least your SSD has shorter random read latencies than this slow harddisk. 
;) Has a queue depth of 32 as well.> cpu : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=56073/w=0/d=0, short=r=0/w=0/d=0 > sequentialread: (groupid=3, jobs=1): err= 0: pid=7249 > read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt= 4418msec > slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96 > clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25 > lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06 > clat percentiles (usec): > | 1.00th=[ 10], 5.00th=[ 187], 10.00th=[ 217], 20.00th=[ 249], > | 30.00th=[ 278], 40.00th=[ 302], 50.00th=[ 330], 60.00th=[ 362], > | 70.00th=[ 398], 80.00th=[ 450], 90.00th=[ 716], 95.00th=[ 6624], > | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696], > | 99.99th=[67072] > bw (KB/s) : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18 > lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36% > lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52% > lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45% > lat (msec) : 100=0.05% > cpu : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=233071/w=0/d=0, short=r=0/w=0/d=0Intel SSD 320: sequentielllesen: (groupid=1, jobs=1): err= 0: pid=13102 read : io=2048.0MB, bw=252699KB/s, iops=28043 , runt= 8299msec slat (usec): min=2 , max=1416 , avg=10.75, stdev=10.95 clat (usec): min=305 , max=201105 , avg=2268.82, stdev=2066.69 lat (usec): min=319 , max=201114 , avg=2279.94, stdev=2066.61 bw (KB/s) : min=249424, max=254472, per=100.03%, avg=252776.50, stdev=1416.63 cpu : usr=12.29%, sys=43.62%, ctx=43913, majf=0, minf=278 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=232729/0/0, short=0/0/0 lat (usec): 500=0.01%, 750=0.06%, 1000=0.10% lat (msec): 2=11.00%, 4=88.32%, 10=0.50%, 20=0.01%, 50=0.01% lat (msec): 100=0.01%, 250=0.01% Harddisk again: sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32580 read : io=4510.3MB, bw=76964KB/s, iops=8551 , runt= 60008msec slat (usec): min=1 , max=3373 , avg=10.03, stdev= 8.96 clat (msec): min=1 , max=49 , avg= 7.47, stdev= 1.14 lat (msec): min=1 , max=49 , avg= 7.48, stdev= 1.14 bw (KB/s) : min=72608, max=79588, per=100.05%, avg=77003.96, stdev=1626.66 cpu : usr=6.89%, sys=16.96%, ctx=229486, majf=0, minf=277 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=513184/0/0, short=0/0/0 lat (msec): 2=0.01%, 4=0.02%, 10=99.67%, 20=0.29%, 50=0.01% Maximum latencies on our SSD are longer. Minimum latencies shorter. 
Well, sequential stuff might not be located sequentially on the SSD; the SSD controller might put the pieces of a big file into totally different places.> Run status group 0 (all jobs): > WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec > > Run status group 1 (all jobs): > WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec > > Run status group 2 (all jobs): > READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec > > Run status group 3 (all jobs): > READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msecThat's pathetic except for the last job group (each group has one job here because of the stonewall commands), the sequential read. Strangely, sequential write is also abysmally slow. This is how I expect it to look with iodepth 64: Run status group 0 (all jobs): READ: io=2048.0MB, aggrb=162544KB/s, minb=166445KB/s, maxb=166445KB/s, mint=12902msec, maxt=12902msec Run status group 1 (all jobs): READ: io=2048.0MB, aggrb=252699KB/s, minb=258764KB/s, maxb=258764KB/s, mint=8299msec, maxt=8299msec Run status group 2 (all jobs): WRITE: io=2048.0MB, aggrb=76939KB/s, minb=78786KB/s, maxb=78786KB/s, mint=27257msec, maxt=27257msec Run status group 3 (all jobs): WRITE: io=2048.0MB, aggrb=155333KB/s, minb=159061KB/s, maxb=159061KB/s, mint=13501msec, maxt=13501msec Can you also post the last lines: Disk stats (read/write): dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78% sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78% martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba> It gives information on how well the I/O scheduler was able to merge requests. I didn't see much of a difference between CFQ and noop, so it may not matter much, but since it also gives a number on total disk utilization it's still quite nice to have. So my recommendation for now: Remove as many factors as possible and, in order to compare results with what I posted, try a plain logical volume with Ext4. If the values are still quite slow, I think it's good to ask Linux block layer experts – for example by posting on the fio mailing list, where there are people subscribed who may be able to provide other test results – and SSD experts. There might be a Linux block layer mailing list, or use libata or fsdevel, I don't know. If your Samsung SSD turns out to be this slow or almost this slow on such a basic level, I am out of ideas. Except for one thing: have the IOPS jobs run on the device itself. That will remove any filesystem layer. But only the read-only tests; to make sure, I suggest using fio with the --readonly option as a safety guard. Unless you have a spare SSD that you can afford to use for write testing, which will likely destroy every filesystem on it. Or let it run on just one logical volume. If it is then still slow, I'd tend to use a different SSD and test from there. If it is then faster, I'd tend to believe it's either a hardware or, probably more likely, a compatibility issue. What does your SSD report as discard alignment – well, even that should not matter for reads: merkaba:/sys/block/sda> cat discard_alignment 0 Seems Intel tells us to not care at all. Or the value for some reason cannot be read. 
merkaba:/sys/block/sda> cd queue merkaba:/sys/block/sda/queue> grep . * add_random:1 discard_granularity:512 discard_max_bytes:2147450880 discard_zeroes_data:1 hw_sector_size:512 I would be interested in whether these values differ on your SSD. grep: iosched: Ist ein Verzeichnis iostats:1 logical_block_size:512 max_hw_sectors_kb:32767 max_integrity_segments:0 max_sectors_kb:512 max_segments:168 max_segment_size:65536 minimum_io_size:512 nomerges:0 nr_requests:128 optimal_io_size:0 physical_block_size:512 read_ahead_kb:128 rotational:0 rq_affinity:1 scheduler:noop deadline [cfq] And whether there is any optimal I/O size. One note regarding this: the Ext4 filesystem I ran the above tests on last year was created aligned to what I thought could be good for the SSD. I do not know whether this matters much. merkaba:~> tune2fs -l /dev/merkaba/home RAID stripe width: 128 Which should be in blocks and thus align to 512 KiB. Hmmm, I am a bit puzzled about this value at the moment. Either 1 MiB or 128 KiB would make sense to me. But then I do not know the erase block size of that SSD. Then add BTRFS and then dm_crypt and look at how the numbers change. I think this is a plan to find out whether it's really the hardware or something weird happening in the low-level parts of Linux, e.g. the block layer, dm_crypt or the filesystem. Reduce it to the most basic level and then work from there. Thanks, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
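Put into commands, that plan looks roughly like the following. It is only a sketch: the device and partition names (/dev/sda, /dev/sda2), the mount point and the job file name are placeholders, and the write tests must only ever run against a scratch partition whose contents may be destroyed.

# 1) Raw device, read-only, no filesystem and no dm_crypt involved
fio --readonly --name=raw-randread --filename=/dev/sda --direct=1 \
    --ioengine=libaio --iodepth=64 --rw=randread --bsrange=2k-16k --runtime=60

# 2) Ext4 on a plain partition or logical volume, still no dm_crypt
mkfs.ext4 /dev/sda2
mkdir -p /mnt/test && mount /dev/sda2 /mnt/test && cd /mnt/test
fio ssd-test.fio            # the 2k-16k job file from this thread, whatever it is saved as

# 3) Repeat the same job on btrfs, then on btrfs on top of dm_crypt,
#    and note at which step the numbers collapse.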
Martin Steigerwald
2012-Aug-02 11:25 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:> So, doctor, is it bad? :) > > randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 > Starting 4 processes > randomwrite: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s] > randomwrite: (groupid=0, jobs=1): err= 0: pid=7193 > write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec > slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55 > clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63 > lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57 > clat percentiles (msec):Heck, I didn´t look at the IOPS figure! 189 IOPS for a SATA-600 SSD. Thats pathetic. So again, please test this without dm_crypt. I can´t believe that this is the maximum the hardware is able to achieve. A really fast 15000 rpm SAS harddisk might top that. -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
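As a rough yardstick for that comparison: a 15000 rpm drive pays about 2 ms of rotational latency plus a few milliseconds of seek per random request, so it lands somewhere around 150 to 200 IOPS at queue depth 1. That really is the same ballpark as the 189 IOPS the SSD managed above, and the SSD even had 64 requests in flight to work with.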
Marc MERLIN
2012-Aug-02 17:39 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:> > I''ve the the fio tests in: > > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 > > … you are still using dm_crypt?That was my biggest partition and so far I''ve found no performance impact on file access between unencrypted and dm_crypt. I just took out my swap partition and made a smaller btrfs there: /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0 I mounted without discard.> > lat (usec) : 50=0.01% > > lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89% > > lat (msec) : 500=72.44%, 750=14.43% > > Gosh, look at these latencies! > > 72,44% of all requests above 500 (in words: five hundred) milliseconds! > And 14,43% above 750 msecs. The percentage of requests served at 100 msecs > or less is was below one percent! Hey, is this an SSD or what?Yeah, that''s kind of what I''ve been complaining about since the beginning :) Once I''m reading sequentially, it goes fast, but random access/latency is indeed abysmal.> Still even with iodepth 64 totally different picture. And look at the IOPS > and throughput.Yep. I know mine are bad :(> For reference, this refers to > > [global] > ioengine=libaio > direct=1 > iodepth=64Since it''s slightly different than the first job file you gave me, I re-ran with this one this time. gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2 gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2 gandalfthegreat:~# grep sda2 /proc/mounts /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0 here''s the btrfs test (ext4 is lower down): gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 4 processes zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318 read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec slat (usec): min=3 , max=37432 , avg=7252.52, stdev=5717.70 clat (usec): min=13 , max=981927 , avg=454046.13, stdev=110527.92 lat (msec): min=5 , max=999 , avg=461.30, stdev=112.00 clat percentiles (msec): | 1.00th=[ 145], 5.00th=[ 269], 10.00th=[ 371], 20.00th=[ 408], | 30.00th=[ 424], 40.00th=[ 437], 50.00th=[ 449], 60.00th=[ 457], | 70.00th=[ 474], 80.00th=[ 490], 90.00th=[ 570], 95.00th=[ 644], | 99.00th=[ 865], 99.50th=[ 922], 99.90th=[ 963], 99.95th=[ 979], | 99.99th=[ 979] bw (KB/s) : min= 8, max= 2807, per=100.00%, avg=1227.75, stdev=317.57 lat (usec) : 20=0.01% lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82% lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58% cpu : usr=0.12%, sys=1.13%, ctx=12186, majf=0, minf=276 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=8262/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30340 read : io=2048.0MB, 
bw=211257KB/s, iops=23473 , runt= 9927msec slat (usec): min=1 , max=56321 , avg=20.51, stdev=424.44 clat (usec): min=0 , max=57987 , avg=2695.98, stdev=6624.00 lat (usec): min=1 , max=58015 , avg=2716.75, stdev=6642.09 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 10], 10.00th=[ 30], 20.00th=[ 100], | 30.00th=[ 217], 40.00th=[ 362], 50.00th=[ 494], 60.00th=[ 636], | 70.00th=[ 892], 80.00th=[ 1656], 90.00th=[ 7392], 95.00th=[21632], | 99.00th=[29056], 99.50th=[29568], 99.90th=[43776], 99.95th=[46848], | 99.99th=[57600] bw (KB/s) : min=166675, max=260984, per=99.83%, avg=210892.26, stdev=22433.65 lat (usec) : 2=2.16%, 4=0.43%, 10=2.33%, 20=2.80%, 50=5.72% lat (usec) : 100=6.47%, 250=12.35%, 500=18.29%, 750=15.52%, 1000=5.44% lat (msec) : 2=13.04%, 4=3.59%, 10=3.83%, 20=1.88%, 50=6.11% lat (msec) : 100=0.04% cpu : usr=4.51%, sys=35.70%, ctx=11480, majf=0, minf=278 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=233025/w=0/d=0, short=r=0/w=0/d=0 zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30348 write: io=110768KB, bw=1845.1KB/s, iops=208 , runt= 60007msec slat (usec): min=26 , max=50160 , avg=4789.12, stdev=4968.75 clat (usec): min=29 , max=752494 , avg=301858.86, stdev=80422.24 lat (msec): min=12 , max=757 , avg=306.65, stdev=81.52 clat percentiles (msec): | 1.00th=[ 208], 5.00th=[ 241], 10.00th=[ 245], 20.00th=[ 255], | 30.00th=[ 262], 40.00th=[ 269], 50.00th=[ 277], 60.00th=[ 289], | 70.00th=[ 306], 80.00th=[ 326], 90.00th=[ 363], 95.00th=[ 519], | 99.00th=[ 627], 99.50th=[ 652], 99.90th=[ 734], 99.95th=[ 742], | 99.99th=[ 750] bw (KB/s) : min= 616, max= 2688, per=99.72%, avg=1839.80, stdev=399.60 lat (usec) : 50=0.01% lat (msec) : 20=0.02%, 50=0.04%, 100=0.07%, 250=14.31%, 500=79.62% lat (msec) : 750=5.91%, 1000=0.01% cpu : usr=0.28%, sys=2.60%, ctx=10167, majf=0, minf=20 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=12495/d=0, short=r=0/w=0/d=0 sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30364 write: io=84902KB, bw=1414.1KB/s, iops=156 , runt= 60005msec slat (usec): min=18 , max=40825 , avg=6389.99, stdev=6072.56 clat (usec): min=22 , max=887097 , avg=401897.77, stdev=108015.23 lat (msec): min=11 , max=899 , avg=408.29, stdev=109.47 clat percentiles (msec): | 1.00th=[ 262], 5.00th=[ 302], 10.00th=[ 318], 20.00th=[ 338], | 30.00th=[ 351], 40.00th=[ 363], 50.00th=[ 375], 60.00th=[ 388], | 70.00th=[ 404], 80.00th=[ 433], 90.00th=[ 502], 95.00th=[ 693], | 99.00th=[ 783], 99.50th=[ 824], 99.90th=[ 873], 99.95th=[ 881], | 99.99th=[ 889] bw (KB/s) : min= 346, max= 2115, per=99.44%, avg=1406.11, stdev=333.58 lat (usec) : 50=0.01% lat (msec) : 20=0.02%, 50=0.03%, 100=0.07%, 250=0.59%, 500=89.12% lat (msec) : 750=7.91%, 1000=2.24% cpu : usr=0.28%, sys=2.21%, ctx=13224, majf=0, minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.3% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=9369/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=73682KB, aggrb=1227KB/s, minb=1227KB/s, maxb=1227KB/s, mint=60004msec, maxt=60004msec Run status group 1 (all jobs): 
READ: io=2048.0MB, aggrb=211257KB/s, minb=211257KB/s, maxb=211257KB/s, mint=9927msec, maxt=9927msec Run status group 2 (all jobs): WRITE: io=110768KB, aggrb=1845KB/s, minb=1845KB/s, maxb=1845KB/s, mint=60007msec, maxt=60007msec Run status group 3 (all jobs): WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec gandalfthegreat:/mnt/mnt2# (no lines beyond this, fio 2.0.8)> This could be another good test. Text with Ext4 on plain logical volume > without dm_crypt. > Can you also post the last lines: > > Disk stats (read/write): > dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78% > sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78% > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>I didn''t get these lines.> Its gives information on who good the I/O scheduler was able to merge requests. > > I didn´t see much of a difference between CFQ and noop, so it may not > matter much, but since it gives also a number on total disk utilization > its still quite nice to have.I tried deadline and noop, and indeed I''m not see much of a difference for my basic tests. Fro now I have deadline.> So my recommendation of now: > > Remove as much factors as possible and in order to compare results with > what I posted try with plain logical volume with Ext4.gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2 /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0 gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 4 processes zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077 read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01 clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63 lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36 clat percentiles (usec): | 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724], | 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096], | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040], | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168], | 99.99th=[33024] bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29 lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70% lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17% lat (msec) : 100=0.01% cpu : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=383190/w=0/d=0, short=r=0/w=0/d=0 
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081 read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83 clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26 lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77 clat percentiles (usec): | 1.00th=[ 2608], 5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376], | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920], | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704], | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296], | 99.99th=[16512] bw (KB/s) : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35 lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25% cpu : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=232601/w=0/d=0, short=r=0/w=0/d=0 zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088 write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec slat (usec): min=7 , max=51044 , avg=4757.60, stdev=4816.78 clat (usec): min=958 , max=582205 , avg=299854.73, stdev=56663.66 lat (msec): min=12 , max=588 , avg=304.61, stdev=57.22 clat percentiles (msec): | 1.00th=[ 73], 5.00th=[ 225], 10.00th=[ 251], 20.00th=[ 273], | 30.00th=[ 285], 40.00th=[ 297], 50.00th=[ 306], 60.00th=[ 314], | 70.00th=[ 318], 80.00th=[ 330], 90.00th=[ 343], 95.00th=[ 359], | 99.00th=[ 482], 99.50th=[ 519], 99.90th=[ 553], 99.95th=[ 570], | 99.99th=[ 570] bw (KB/s) : min= 1033, max= 3430, per=99.67%, avg=1856.90, stdev=307.24 lat (usec) : 1000=0.01% lat (msec) : 20=0.06%, 50=0.72%, 100=0.44%, 250=8.50%, 500=89.52% lat (msec) : 750=0.75% cpu : usr=0.27%, sys=1.27%, ctx=9505, majf=0, minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=12580/d=0, short=r=0/w=0/d=0 sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102 write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec slat (usec): min=5 , max=40817 , avg=3911.57, stdev=4785.23 clat (msec): min=4 , max=682 , avg=246.61, stdev=77.35 lat (msec): min=5 , max=688 , avg=250.53, stdev=78.39 clat percentiles (msec): | 1.00th=[ 73], 5.00th=[ 143], 10.00th=[ 186], 20.00th=[ 206], | 30.00th=[ 221], 40.00th=[ 229], 50.00th=[ 239], 60.00th=[ 245], | 70.00th=[ 258], 80.00th=[ 273], 90.00th=[ 318], 95.00th=[ 416], | 99.00th=[ 529], 99.50th=[ 562], 99.90th=[ 660], 99.95th=[ 668], | 99.99th=[ 676] bw (KB/s) : min= 811, max= 4243, per=99.65%, avg=2279.92, stdev=610.32 lat (msec) : 10=0.26%, 20=0.03%, 50=0.22%, 100=1.97%, 250=62.01% lat (msec) : 500=33.43%, 750=2.07% cpu : usr=0.35%, sys=1.46%, ctx=10993, majf=0, minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=15293/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec Run status group 1 (all jobs): READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, 
maxb=150043KB/s, mint=13977msec, maxt=13977msec Run status group 2 (all jobs): WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec Run status group 3 (all jobs): WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec Disk stats (read/write): sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50% gandalfthegreat:/mnt/mnt2#> If the values are still quite slow, I think its good to ask Linux > block layer experts – for example by posting on fio mailing list where > there are people subscribed that may be able to provide other test > results – and SSD experts. There might be a Linux block layer mailing > list or use libata or fsdevel, I don´t know.I''ll try there next. I think after this mail, and your help which is very much appreciated, it''s fair to take things out of this list to stop spamming the wrong people :)> Have the IOPS run on the device it self. That will remove any filesystem > layer. But only the read only tests, to make sure I suggest to use fio > with the --readonly option as safety guard. Unless you have a spare SSD > that you can afford to use for write testing which will likely destroy > every filesystem on it. Or let it run on just one logical volume.Can you send me a recommended job config you''d like me to run if the runs above haven''t already answered your questions?> What does your SSD report discard alignment – well but even that should > not matter for reads: > > merkaba:/sys/block/sda> cat discard_alignment > 0gandalfthegreat:/sys/block/sda# cat discard_alignment 0> Seems Intel tells us to not care at all. Or the value for some reason > cannot be read. > > merkaba:/sys/block/sda> cd queue > merkaba:/sys/block/sda/queue> grep . * > add_random:1 > discard_granularity:512 > discard_max_bytes:2147450880 > discard_zeroes_data:1 > hw_sector_size:512 > > I would be interested whether these above differ.I have the exact same.> And whether there is any optimal io size.I also have optimal_io_size:0> One note regarding this: Back then I made the Ext4 I carried above tests > in the last year to be aligned to what I think could be good for the SSD. > I do not know whether this matters much. > > merkaba:~> tune2fs -l /dev/merkaba/home > RAID stripe width: 128> I think this is a plan to find out whether its either really the hardware > or some wierd happening in the low level parts of Linux, e.g. the block > layer or dm_crypt or filing system. > > Reduce it to the most basic level and then work from there.So, I went back to basics. On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote:> Heck, I didn´t look at the IOPS figure! > > 189 IOPS for a SATA-600 SSD. Thats pathetic.Yeah, and the new tests above without dmcrypt are no better.> A really fast 15000 rpm SAS harddisk might top that.My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than the SSD right now :) I read your comments about the multiple things you had in mind. Given the new data, should I just go to an io list now, or even save myself some time and just return the damned SSDs and buy from another vendor? :) Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... 
what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
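For reference, two cheap things to rule out on this kind of setup are the scheduler actually in effect and the partition alignment. A sketch only; the device and partition names are placeholders:

cat /sys/block/sda/queue/scheduler         # the entry in brackets is the active one
echo deadline > /sys/block/sda/queue/scheduler
fdisk -l -u /dev/sda                       # start sectors divisible by 2048 are 1 MiB aligned
cat /sys/block/sda/sda2/alignment_offset   # 0 means the kernel sees no misalignment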
Martin Steigerwald
2012-Aug-02 20:20 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:> On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote: > > > I''ve the the fio tests in: > > > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 > > > > … you are still using dm_crypt?[…]> I just took out my swap partition and made a smaller btrfs there: > /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0 > > I mounted without discard.[…]> > For reference, this refers to > > > > [global] > > ioengine=libaio > > direct=1 > > iodepth=64 > > Since it''s slightly different than the first job file you gave me, I re-ran > with this one this time. > > gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2 > gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2 > gandalfthegreat:~# grep sda2 /proc/mounts > /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0 > > here''s the btrfs test (ext4 is lower down): > gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8Still abysmal except for sequential reads.> Starting 4 processes > zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) > sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) > zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318 > read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec[…]> lat (usec) : 20=0.01% > lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82% > lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%1 second latency? Oh well…> Run status group 3 (all jobs): > WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec > gandalfthegreat:/mnt/mnt2# > (no lines beyond this, fio 2.0.8)[…]> > Can you also post the last lines: > > > > Disk stats (read/write): > > dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78% > > sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78% > > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba> > > I didn''t get these lines.Hmmm, back then I guess this was fio 1.5.9 or so.> I tried deadline and noop, and indeed I''m not see much of a difference for my basic tests. > Fro now I have deadline. > > > So my recommendation of now: > > > > Remove as much factors as possible and in order to compare results with > > what I posted try with plain logical volume with Ext4. 
> > gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2 > /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0 > > gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 > Starting 4 processes > zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) > sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) > zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077 > read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec > slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01 > clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63 > lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36 > clat percentiles (usec): > | 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724], > | 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096], > | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040], > | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168], > | 99.99th=[33024] > bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29 > lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70% > lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17% > lat (msec) : 100=0.01% > cpu : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=383190/w=0/d=0, short=r=0/w=0/d=0Hey, whats this? With Ext4 you have really good random read performance now! Way better than the Intel SSD 320 and…> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081 > read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec > slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83 > clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26 > lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77 > clat percentiles (usec): > | 1.00th=[ 2608], 5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376], > | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920], > | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704], > | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296], > | 99.99th=[16512] > bw (KB/s) : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35 > lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% > lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25% > cpu : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=232601/w=0/d=0, short=r=0/w=0/d=0… even faster than sequential reads. That just doesn´t make any sense to me. The performance of your Samsung SSD seems to be quite erratic. 
It seems that the device is capable of being fast, but only sometimes shows this capability.> zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088 > write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec[…]> sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102 > write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec[…]> Run status group 0 (all jobs): > READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec > > Run status group 1 (all jobs): > READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec > > Run status group 2 (all jobs): > WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec > > Run status group 3 (all jobs): > WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msecWrites are slow again.> Disk stats (read/write): > sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50%Well there we have that line. You don´t seem to have the other line possibly due to not using LVM.> gandalfthegreat:/mnt/mnt2# > > > If the values are still quite slow, I think its good to ask Linux > > block layer experts – for example by posting on fio mailing list where > > there are people subscribed that may be able to provide other test > > results – and SSD experts. There might be a Linux block layer mailing > > list or use libata or fsdevel, I don´t know. > > I''ll try there next. I think after this mail, and your help which is very > much appreciated, it''s fair to take things out of this list to stop spamming > the wrong people :)Well the only thing interesting regarding BTRFS is the whopping difference in random read performance between BTRFS and Ext4 on this SSD.> > Have the IOPS run on the device it self. That will remove any filesystem > > layer. But only the read only tests, to make sure I suggest to use fio > > with the --readonly option as safety guard. Unless you have a spare SSD > > that you can afford to use for write testing which will likely destroy > > every filesystem on it. Or let it run on just one logical volume. > > Can you send me a recommended job config you''d like me to run if the runs > above haven''t already answered your questions?[global] ioengine=libaio direct=1 filename=/dev/sdb size=476940m bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 I won´t expect much of a difference, but then the random read performance is quite different between Ext4 and BTRFS on this disk. That would make it interesting to test without any filesystem in between and over the whole device. It might make sense to test around with different alignment, I think the option is called bsalign or so…> On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote: > > Heck, I didn´t look at the IOPS figure! > > > > 189 IOPS for a SATA-600 SSD. Thats pathetic. > > Yeah, and the new tests above without dmcrypt are no better. > > > A really fast 15000 rpm SAS harddisk might top that. > > My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than > the SSD right now :) > > I read your comments about the multiple things you had in mind. Given the > new data, should I just go to an io list now, or even save myself some time > and just return the damned SSDs and buy from another vendor? :)… or get yourself another SSD. Its your decision. I admire your endurance. 
;) Thanks, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
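Since partition alignment keeps coming up as a possible explanation for the Ext4-versus-btrfs gap (the Ext4 filesystem above was created with stride/stripe-width, and Martin suggests testing different alignments), here is a minimal sketch, not taken from the thread, of how one could rule out a misaligned partition; the device names simply mirror the /dev/sda2 used above.

cat /sys/block/sda/sda2/start    # start sector of /dev/sda2, in 512-byte units
parted /dev/sda unit s print     # start/end sectors of all partitions on the drive
# A start sector divisible by 2048 means the partition begins on a 1 MiB
# boundary, which covers the usual SSD page and erase-block sizes; an odd
# offset can make every 4k filesystem block straddle two flash pages.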
Marc MERLIN
2012-Aug-02 20:44 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:> Hey, whats this? With Ext4 you have really good random read performance > now! Way better than the Intel SSD 320 and…Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. Obviously it''s sending IO in a way that either the IO subsystem, linux driver, or drive prefer.> The performance of your Samsung SSD seems to be quite erratic. It seems > that the device is capable of being fast, but only sometimes shows this > capability.That''s indeed exactly what I''m seeing in real life :)> > > Have the IOPS run on the device it self. That will remove any filesystem > > > layer. But only the read only tests, to make sure I suggest to use fio > > > with the --readonly option as safety guard. Unless you have a spare SSD > > > that you can afford to use for write testing which will likely destroy > > > every filesystem on it. Or let it run on just one logical volume. > > > > Can you send me a recommended job config you''d like me to run if the runs > > above haven''t already answered your questions?> [global](...) I used this and just changed filename to /dev/sda. Since I''m reading from the beginning of the drive, reads have to be aligned.> I won´t expect much of a difference, but then the random read performance > is quite different between Ext4 and BTRFS on this disk. That would make > it interesting to test without any filesystem in between and over the > whole device.Here is the output: gandalfthegreat:~# fio --readonly ./fio.job3 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172 read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64 clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66 lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88 clat percentiles (usec): | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920], | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480], | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768], | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656], | 99.99th=[27264] bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68 lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02% lat (usec) : 1000=0.02% lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18% cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=6532/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199 read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91 clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32 lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06 clat percentiles (usec): | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920], | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608], | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536], | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568], | 99.99th=[34048] bw 
(KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62 lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02% lat (usec) : 1000=0.03% lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36% cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=6101/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec Run status group 1 (all jobs): READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec Disk stats (read/write): sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30%> … or get yourself another SSD. Its your decision. > > I admire your endurance. ;)Since I''ve gotten 2 SSDs to make sure I didn''t get one bad one, and that the company is getting great reviews for them, I''m now pretty sure that it''s either a problem with a linux driver, which is interesting for us all to debug :) If I go buy another brand, the next guy will have the same problems as me. But yes, if we figure this out, Samsung owes you and me some money :) I''ll try plugging this SSD in a totally different PC and see what happens. This may say if it''s an AHCI/intel sata driver problem. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Martin Steigerwald
2012-Aug-02 21:21 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:> On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote: > > Hey, whats this? With Ext4 you have really good random read performance > > now! Way better than the Intel SSD 320 and… > > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > Obviously it''s sending IO in a way that either the IO subsystem, linux > driver, or drive prefer.But only on reads.> > > > Have the IOPS run on the device it self. That will remove any filesystem > > > > layer. But only the read only tests, to make sure I suggest to use fio > > > > with the --readonly option as safety guard. Unless you have a spare SSD > > > > that you can afford to use for write testing which will likely destroy > > > > every filesystem on it. Or let it run on just one logical volume. > > > > > > Can you send me a recommended job config you''d like me to run if the runs > > > above haven''t already answered your questions? > > > [global] > (...) > > I used this and just changed filename to /dev/sda. Since I''m reading > from the beginning of the drive, reads have to be aligned. > > > I won´t expect much of a difference, but then the random read performance > > is quite different between Ext4 and BTRFS on this disk. That would make > > it interesting to test without any filesystem in between and over the > > whole device. > > Here is the output: > gandalfthegreat:~# fio --readonly ./fio.job3 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 > 2.0.8 > Starting 2 processes > Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172 > read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msecWTF? Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely without it. Also I noticed that I had iodepth 1 in there to circumvent any in drive cache / optimization.> slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64 > clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66 > lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88 > clat percentiles (usec): > | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920], > | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480], > | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768], > | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656], > | 99.99th=[27264] > bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68 > lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02% > lat (usec) : 1000=0.02% > lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18% > cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=6532/w=0/d=0, short=r=0/w=0/d=0Latency is still way to high even with iodepth 1, 10 milliseconds for 43% of requests. And the throughput and IOPS is still abysmal even for iodepth 1 (see below for Intel SSD 320 values). Okay, one further idea: Remove the bsrange to just test with 4k blocks. Additionally test these are aligned with blockalign=int[,int], ba=int[,int] At what boundary to align random IO offsets. Defaults to the same as ''blocksize'' the minimum blocksize given. 
Minimum alignment is typically 512b for using direct IO, though it usually depends on the hardware block size. This option is mutually exclusive with using a random map for files, so it will turn off that option. I would first test with 4k blocks as is. And then do something like: blocksize=4k blockalign=4k And then raise blockalign to some values that may matter like 8k, 128k, 512k, 1m or so. But thats just guess work. I do not even exactly now if it works this way in fio. There is something pretty wierd going on. But I am not sure what it is. Maybe an alignment issue, since Ext4 with stripe alignment was able to do so much faster on reads.> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199 > read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msecEy, whats this?> slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91 > clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32 > lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06 > clat percentiles (usec): > | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920], > | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608], > | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536], > | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568], > | 99.99th=[34048] > bw (KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62 > lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02% > lat (usec) : 1000=0.03% > lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36% > cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=6101/w=0/d=0, short=r=0/w=0/d=0 > > Run status group 0 (all jobs): > READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec > > Run status group 1 (all jobs): > READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec > > Disk stats (read/write): > sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30%What on earth is this? You are testing the raw device and are getting these fun values out of it? 
Let me repeat that here: merkaba:/tmp> cat iops-read.job [global] ioengine=libaio direct=1 filename=/dev/sda bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 merkaba:/tmp> fio iops-read.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [57784K/0K /s] [6471 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31915 read : io=1681.2MB, bw=28692KB/s, iops=3193 , runt= 60001msec slat (usec): min=5 , max=991 , avg=31.18, stdev=10.07 clat (usec): min=2 , max=3312 , avg=276.67, stdev=124.41 lat (usec): min=48 , max=3337 , avg=308.54, stdev=125.51 clat percentiles (usec): | 1.00th=[ 61], 5.00th=[ 82], 10.00th=[ 103], 20.00th=[ 159], | 30.00th=[ 201], 40.00th=[ 237], 50.00th=[ 270], 60.00th=[ 306], | 70.00th=[ 354], 80.00th=[ 402], 90.00th=[ 446], 95.00th=[ 478], | 99.00th=[ 524], 99.50th=[ 548], 99.90th=[ 644], 99.95th=[ 716], | 99.99th=[ 1020] bw (KB/s) : min=27760, max=31784, per=100.00%, avg=28694.49, stdev=625.77 lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.15%, 100=8.76% lat (usec) : 250=34.69%, 500=54.13%, 750=2.23%, 1000=0.03% lat (msec) : 2=0.01%, 4=0.01% cpu : usr=2.91%, sys=12.49%, ctx=199678, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=191587/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=31921 read : io=3895.9MB, bw=66487KB/s, iops=7384 , runt= 60001msec slat (usec): min=5 , max=518 , avg=30.18, stdev= 8.16 clat (usec): min=1 , max=3394 , avg=100.21, stdev=83.51 lat (usec): min=44 , max=3429 , avg=131.69, stdev=84.44 clat percentiles (usec): | 1.00th=[ 54], 5.00th=[ 60], 10.00th=[ 61], 20.00th=[ 68], | 30.00th=[ 72], 40.00th=[ 79], 50.00th=[ 86], 60.00th=[ 92], | 70.00th=[ 99], 80.00th=[ 105], 90.00th=[ 112], 95.00th=[ 163], | 99.00th=[ 556], 99.50th=[ 716], 99.90th=[ 900], 99.95th=[ 940], | 99.99th=[ 1080] bw (KB/s) : min=30276, max=81176, per=99.95%, avg=66453.19, stdev=10893.90 lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.57% lat (usec) : 100=69.52%, 250=26.59%, 500=2.01%, 750=0.89%, 1000=0.39% lat (msec) : 2=0.02%, 4=0.01% cpu : usr=7.12%, sys=27.34%, ctx=461537, majf=0, minf=27 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=443106/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=1681.2MB, aggrb=28691KB/s, minb=28691KB/s, maxb=28691KB/s, mint=60001msec, maxt=60001msec Run status group 1 (all jobs): READ: io=3895.9MB, aggrb=66487KB/s, minb=66487KB/s, maxb=66487KB/s, mint=60001msec, maxt=60001msec Disk stats (read/write): sda: ios=632851/228, merge=0/140, ticks=106392/351, in_queue=105682, util=87.78% I have seen about 4000 IOPS with pure 4k blocks from that drive with 4k blocks, so that seems to be fine. And look at latencies, barely some requests in low millisecond range. 
Now just for the sake of it with iodepth 64: merkaba:/tmp> fio iops-read.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [217.8M/0K /s] [24.1K/0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31945 read : io=12412MB, bw=211819KB/s, iops=23795 , runt= 60003msec slat (usec): min=2 , max=2158 , avg=15.69, stdev= 8.96 clat (usec): min=138 , max=19349 , avg=2670.01, stdev=861.85 lat (usec): min=189 , max=19366 , avg=2686.29, stdev=861.67 clat percentiles (usec): | 1.00th=[ 1256], 5.00th=[ 1448], 10.00th=[ 1592], 20.00th=[ 1864], | 30.00th=[ 2128], 40.00th=[ 2384], 50.00th=[ 2640], 60.00th=[ 2896], | 70.00th=[ 3152], 80.00th=[ 3408], 90.00th=[ 3728], 95.00th=[ 3952], | 99.00th=[ 4448], 99.50th=[ 4896], 99.90th=[ 8768], 99.95th=[10048], | 99.99th=[12096] bw (KB/s) : min=116464, max=216400, per=100.00%, avg=211823.16, stdev=9787.89 lat (usec) : 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.05% lat (msec) : 2=25.18%, 4=70.40%, 10=4.28%, 20=0.05% cpu : usr=14.39%, sys=44.38%, ctx=624900, majf=0, minf=278 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=1427811/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32301 read : io=12499MB, bw=213307KB/s, iops=23691 , runt= 60002msec slat (usec): min=1 , max=1509 , avg=12.42, stdev= 8.12 clat (usec): min=306 , max=201523 , avg=2685.75, stdev=1220.14 lat (usec): min=316 , max=201536 , avg=2698.70, stdev=1220.06 clat percentiles (usec): | 1.00th=[ 1816], 5.00th=[ 2024], 10.00th=[ 2096], 20.00th=[ 2192], | 30.00th=[ 2256], 40.00th=[ 2352], 50.00th=[ 2416], 60.00th=[ 2544], | 70.00th=[ 2736], 80.00th=[ 2992], 90.00th=[ 3632], 95.00th=[ 4320], | 99.00th=[ 5664], 99.50th=[ 6112], 99.90th=[ 7136], 99.95th=[ 7712], | 99.99th=[26240] bw (KB/s) : min=144720, max=256568, per=99.96%, avg=213210.08, stdev=23175.58 lat (usec) : 500=0.01%, 750=0.02%, 1000=0.06% lat (msec) : 2=4.10%, 4=88.72%, 10=7.09%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=11.77%, sys=34.79%, ctx=440558, majf=0, minf=280 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=1421534/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=12412MB, aggrb=211818KB/s, minb=211818KB/s, maxb=211818KB/s, mint=60003msec, maxt=60003msec Run status group 1 (all jobs): READ: io=12499MB, aggrb=213306KB/s, minb=213306KB/s, maxb=213306KB/s, mint=60002msec, maxt=60002msec Disk stats (read/write): sda: ios=2151624/243, merge=692022/115, ticks=5719203/651132, in_queue=6372962, util=99.87% (Seems that extra line is gone from fio 2, I darkly remember that Jens Axboe removed some output he found to be superfluuos, but the disk stats are still there.) In that case random versus sequential do not even seem to make much of a difference.> > … or get yourself another SSD. Its your decision. > > > > I admire your endurance. 
;) > > Since I''ve gotten 2 SSDs to make sure I didn''t get one bad one, and that the > company is getting great reviews for them, I''m now pretty sure that > it''s either a problem with a linux driver, which is interesting for us > all to debug :)Either that, or possibly there is some firmware update? But then those reviews would mention it, I bet. I´d still look whether there is some firmware update available labeled: "This will raise performance on lots of Linux based workloads by factors of 10x to 250x" ;) Anyway, I think we are pretty close to the hardware now. So the issue has to be somewhere in the block layer, the controller, the SSD firmware or the SSD itself IMHO.> If I go buy another brand, the next guy will have the same problems as > me. > > But yes, if we figure this out, Samsung owes you and me some money :);)> I''ll try plugging this SSD in a totally different PC and see what happens. > This may say if it''s an AHCI/intel sata driver problem.Seems we will continue until someone starts to complain here. Maybe another list will be more appropriate? But then this thread has it all in one ;). Adding a CC with some introductory note might be appropriate. It´s your problem, so you decide ;). I´d suggest the fio mailing list, there are other performance people who may want to chime in. Another idea: Is there anything funny in SMART values (smartctl -a and possibly even -x)? Well I gather these SSDs are new, but still. -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
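The alignment sweep suggested earlier in this message could be driven from the shell roughly as follows. This is only a sketch, assuming fio 2.x accepts job options on the command line; the job name, output file names and the exact list of alignments are made up for illustration.

for ba in 4k 8k 128k 512k 1m; do
    fio --readonly --name=randread-ba-$ba \
        --ioengine=libaio --direct=1 --filename=/dev/sda \
        --rw=randread --blocksize=4k --blockalign=$ba \
        --runtime=60 --output=randread-ba-$ba.log
done
# Comparing iops= and the clat percentiles across the runs would show whether
# any particular alignment makes the random reads dramatically faster.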
Marc MERLIN
2012-Aug-02 21:49 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote:> > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > > Obviously it''s sending IO in a way that either the IO subsystem, linux > > driver, or drive prefer. > > But only on reads.Yes. I was focussing on the simple case. SSDs can be slower than hard drives on writes, and now you''re getting into deep firmware magic for corner cases, so I didn''t want to touch that here :)> WTF? > > Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely > without it.I left it as is: [global] ioengine=libaio direct=1 filename=/dev/sda size=476940m bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 I just removed it now.> Okay, one further idea: Remove the bsrange to just test with 4k blocks.Ok, there you go: [global] ioengine=libaio direct=1 filename=/dev/sda blocksize=4k blockalign=4k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 gandalfthegreat:~# fio --readonly ./fio.job4 zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614 read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05 clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69 lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99 clat percentiles (usec): | 1.00th=[ 41], 5.00th=[ 46], 10.00th=[ 51], 20.00th=[ 66], | 30.00th=[ 113], 40.00th=[ 163], 50.00th=[ 205], 60.00th=[ 4960], | 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120], | 99.99th=[22912] bw (KB/s) : min= 253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74 lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=19.94% lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09% lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05% cpu : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=15208/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636 read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23 clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76 lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31 clat percentiles (usec): | 1.00th=[ 27], 5.00th=[ 29], 10.00th=[ 31], 20.00th=[ 35], | 30.00th=[ 47], 40.00th=[ 61], 50.00th=[ 78], 60.00th=[ 107], | 70.00th=[ 270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608], | 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584], | 99.99th=[21120] bw (KB/s) : min= 282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36 lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61% lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03% lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03% cpu : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=22117/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, mint=60012msec, maxt=60012msec Run status group 1 (all jobs): READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, mint=60009msec, maxt=60009msec Disk stats (read/write): sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, util=99.51% and the same test with blockalign=512k: gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k fio: Any use of blockalign= turns off randommap zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679 read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51 clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32 lat (usec): min=44 , max=38980 , avg=4272.70, stdev=5154.68 clat percentiles (usec): | 1.00th=[ 42], 5.00th=[ 46], 10.00th=[ 53], 20.00th=[ 69], | 30.00th=[ 120], 40.00th=[ 169], 50.00th=[ 211], 60.00th=[ 5024], | 70.00th=[ 5920], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[16768], 99.50th=[17536], 99.90th=[18560], 99.95th=[20096], | 99.99th=[37632] bw (KB/s) : min= 263, max= 3133, per=100.00%, avg=935.09, stdev=392.66 lat (usec) : 4=0.05%, 10=0.03%, 20=0.01%, 50=7.84%, 100=19.12% lat (usec) : 250=25.84%, 500=1.61%, 750=0.08%, 1000=0.05% lat (msec) : 2=0.06%, 4=0.08%, 10=21.99%, 20=23.17%, 50=0.06% cpu : usr=0.29%, sys=0.94%, ctx=14069, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=14024/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3703 read : io=99592KB, bw=1659.7KB/s, iops=414 , runt= 60008msec slat (usec): min=4 , max=186 , avg=15.83, stdev= 9.95 clat (usec): min=1 , max=25530 , avg=2390.79, stdev=4340.79 lat (usec): min=28 , max=25539 , avg=2407.01, stdev=4342.90 clat percentiles (usec): | 1.00th=[ 26], 5.00th=[ 28], 10.00th=[ 30], 20.00th=[ 34], | 30.00th=[ 43], 40.00th=[ 57], 50.00th=[ 72], 60.00th=[ 91], | 70.00th=[ 155], 80.00th=[ 5152], 90.00th=[11712], 95.00th=[12480], | 99.00th=[12992], 99.50th=[16768], 99.90th=[18560], 99.95th=[18816], | 99.99th=[21632] bw (KB/s) : min= 269, max= 8465, per=100.00%, avg=1663.84, stdev=1367.82 lat (usec) : 2=0.01%, 4=0.02%, 20=0.01%, 50=35.42%, 100=26.00% lat (usec) : 250=11.29%, 500=1.31%, 750=0.05%, 1000=0.01% lat (msec) : 2=0.04%, 4=0.02%, 10=12.55%, 20=13.25%, 50=0.01% cpu : usr=0.31%, sys=0.99%, ctx=24963, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=24898/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=56096KB, aggrb=934KB/s, minb=934KB/s, maxb=934KB/s, mint=60001msec, maxt=60001msec Run status group 1 (all jobs): READ: io=99592KB, aggrb=1659KB/s, minb=1659KB/s, maxb=1659KB/s, mint=60008msec, maxt=60008msec Disk stats (read/write): sda: ios=55021/1637, 
merge=72/39, ticks=240896/12080, in_queue=252912, util=99.59%> > Since I''ve gotten 2 SSDs to make sure I didn''t get one bad one, and that the > > company is greating great reviews for them, I''m now pretty sure that > > it''s either a problem with a linux driver, which is interesting for us > > all to debug :) > > Either that or there possible is some firmware update? But then those > reviews would mention it I bet.So far, there are no firmware updates from mine, I checked right after unboxing it :)> > I''ll try plugging this SSD in a totally different PC and see what happens. > > This may say if it''s an AHCI/intel sata driver problem. > > Seems we will continue until someone starts to complain here. Maybe > another list will be more approbiate? But then this thread has it all > in one ;). Adding a CC with some introductory note might be approbiate. > Its your problem, so you decide ;). I´d suggest the fio mailing list, > there are other performance people how may want to chime in.Actually you know the lists and people involved more than me. I''d be happy if you added a Cc to another list you think is best, and we can move there.> Another idea: Is there anything funny in SMART values (smartctl -a > and possibly even -x)? Well I gather these SSDs are new, but still.smartctl -a /dev/sda (snipped) SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 36 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 7 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 9 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0032 058 040 000 Old_age Always - 42 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x003e 253 253 000 Old_age Always - 0 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 4 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 433339112 smartctl -x /dev/sda smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.5.0-amd64-preempt-noide-20120410] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION ==Device Model: SAMSUNG SSD 830 Series Serial Number: S0WGNEAC400484 LU WWN Device Id: 5 002538 043584d30 Firmware Version: CXM03B1Q User Capacity: 512,110,190,592 bytes [512 GB] Sector Size: 512 bytes logical/physical Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ACS-2 revision 2 Local Time is: Thu Aug 2 14:42:08 2012 PDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION ==SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x83) Offline data collection activity is in a Reserved state. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 1980) seconds. 
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 33) minutes. SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0 9 Power_On_Hours -O--CK 099 099 000 - 36 12 Power_Cycle_Count -O--CK 099 099 000 - 7 177 Wear_Leveling_Count PO--C- 099 099 000 - 9 179 Used_Rsvd_Blk_Cnt_Tot PO--C- 100 100 010 - 0 181 Program_Fail_Cnt_Total -O--CK 100 100 010 - 0 182 Erase_Fail_Count_Total -O--CK 100 100 010 - 0 183 Runtime_Bad_Block PO--C- 100 100 010 - 0 187 Reported_Uncorrect -O--CK 100 100 000 - 0 190 Airflow_Temperature_Cel -O--CK 058 040 000 - 42 195 Hardware_ECC_Recovered -O-RC- 200 200 000 - 0 199 UDMA_CRC_Error_Count -OSRCK 253 253 000 - 0 235 Unknown_Attribute -O--C- 099 099 000 - 4 241 Total_LBAs_Written -O--CK 099 099 000 - 433352008 ||||||_ K auto-keep |||||__ C event count ||||___ R error rate |||____ S speed/performance ||_____ O updated online |______ P prefailure warning General Purpose Log Directory Version 1 SMART Log Directory Version 1 [multi-sector log support] GP/S Log at address 0x00 has 1 sectors [Log Directory] GP/S Log at address 0x01 has 1 sectors [Summary SMART error log] GP/S Log at address 0x02 has 2 sectors [Comprehensive SMART error log] GP/S Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log] GP/S Log at address 0x06 has 1 sectors [SMART self-test log] GP/S Log at address 0x07 has 2 sectors [Extended self-test log] GP/S Log at address 0x09 has 1 sectors [Selective self-test log] GP/S Log at address 0x10 has 1 sectors [NCQ Command Error] GP/S Log at address 0x11 has 1 sectors [SATA Phy Event Counters] GP/S Log at address 0x80 has 16 sectors [Host vendor specific log] GP/S Log at address 0x81 has 16 sectors [Host vendor specific log] GP/S Log at address 0x82 has 16 sectors [Host vendor specific log] GP/S Log at address 0x83 has 16 sectors [Host vendor specific log] (snip) SMART Extended Comprehensive Error Log Version: 1 (2 sectors) No Errors Logged SMART Extended Self-test Log Version: 1 (2 sectors) Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 27 - # 2 Short offline Completed without error 00% 8 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
Warning: device does not support SCT Commands SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x0001 2 0 Command failed due to ICRC error 0x0002 2 49152 R_ERR response for data FIS 0x0003 2 53248 R_ERR response for device-to-host data FIS 0x0004 2 57344 R_ERR response for host-to-device data FIS 0x0005 2 7952 R_ERR response for non-data FIS 0x0006 2 2562 R_ERR response for device-to-host non-data FIS 0x0007 2 4096 R_ERR response for host-to-device non-data FIS 0x0008 2 7952 Device-to-host non-data FIS retries 0x0009 2 7953 Transition from drive PhyRdy to drive PhyNRdy 0x000a 2 25 Device-to-host register FISes sent due to a COMRESET 0x000b 2 6145 CRC errors within host-to-device FIS 0x000d 2 7953 Non-CRC errors within host-to-device FIS 0x000f 2 8176 R_ERR response for host-to-device data FIS, CRC 0x0010 2 15 R_ERR response for host-to-device data FIS, non-CRC 0x0012 2 49153 R_ERR response for host-to-device non-data FIS, CRC 0x0013 2 12112 R_ERR response for host-to-device non-data FIS, non-CRC -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
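Since an AHCI / Intel SATA driver problem is one of the remaining suspects, a few generic low-level checks might help narrow things down. This is only a sketch of commands one could run, not something done in the thread; the paths assume the drive is still /dev/sda.

lspci -k | grep -A2 -i sata               # controller model and which kernel driver is bound (ahci?)
dmesg | grep -i -e ahci -e 'ata[0-9]'     # negotiated link speed (1.5/3.0/6.0 Gbps) and any link errors
cat /sys/block/sda/queue/scheduler        # active I/O scheduler
cat /sys/block/sda/device/queue_depth     # NCQ queue depth actually granted to the device
smartctl -l sataphy /dev/sda              # re-read the SATA Phy event counters after a slow run
# If the R_ERR/CRC counters above keep rising between runs, the link or cable
# is suspect rather than the drive's flash management.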
Martin Steigerwald
2012-Aug-03 18:45 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:> > > I''ll try plugging this SSD in a totally different PC and see what > > > happens. This may say if it''s an AHCI/intel sata driver problem. > > > > > > > > Seems we will continue until someone starts to complain here. Maybe > > another list will be more appropriate? But then this thread has it all > > in one ;). Adding a CC with some introductory note might be > > appropriate. It´s your problem, so you decide ;). I´d suggest the fio > > mailing list, there are other performance people who may want to > > chime in. > > > Actually you know the lists and people involved more than me. I''d be > happy if you added a Cc to another list you think is best, and we can > move there.I have no more ideas for the moment. I suggest you try the fio mailing list at fio lalala vger.kernel.org. I currently have no ideas for other lists. I bet there is some block layer / libata / scsi list that might match. Just browse the lists from kernel.org and pick the one that sounds most suitable¹. Probably fsdevel, but the thing does not seem to be filesystem related besides some alignment effect (stripe in Ext4). Hmmm, probably linux-scsi, but look there first, whether block layer and libata related things are discussed there. I would start a new thread. Describe the problem, summarize what you have tried and what the outcome was, and ask for advice. Add a link to the original thread here. (It might be a good idea to have one more post here with a link to the new thread so that other curious people can follow there – in case there are any. Otherwise I would end it on this thread. The thread view already looks ridiculous in KMail here ;-) And put me on Cc ;). I´d really like to know what´s going on here. As written I have no more ideas right now. It seems to be getting too low level for me. [1] http://vger.kernel.org/vger-lists.html Ciao, -- Martin ''Helios'' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc MERLIN
2012-Aug-16 07:45 UTC
Re: How can btrfs take 23sec to stat 23K files from an SSD?
To close this thread: I gave up on the drive after it showed abysmal benchmarking results in Windows too. I owed everyone an update, which I just finished typing: http://marc.merlins.org/perso/linux/post_2012-08-15_The-tale-of-SSDs_-Crucial-C300-early-Death_-Samsung-830-extreme-random-IO-slowness_-and-settling-with-OCZ-Vertex-4.html I''m not sure how I could have gotten 2 bad drives from Samsung in 2 different shipments, so I''m afraid the entire line may be bad. At least, it was for me after extensive benchmarking, and even using their own Windows benchmarking tool. In the end, I got an OCZ Vertex 4 and it''s superfast as per the benchmarks I posted in the link above. What is interesting is that ext4 is somewhat faster than btrfs in the tests I posted above, but not in ways that should be truly worrisome. I''m still happy to have COW and snapshots, even if that costs me performance. Sorry for spamming the list with what apparently just ended up being a crappy drive (well, 2 since I got two just to make sure) from a vendor who shouldn''t have been selling crappy SSDs. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html