Hi all,

we got a new test system here and I just also tested btrfs raid6 on
that. Write performance is slightly lower than hw-raid (LSI megasas) and
md-raid6, but it probably would be much better than either of these two
if it wouldn't read all the data during the writes. Is this a known
issue? This is with linux-3.9.2.

Thanks,
Bernd
Quoting Bernd Schubert (2013-05-23 08:55:47)
> Hi all,
>
> we got a new test system here and I just also tested btrfs raid6 on
> that. Write performance is slightly lower than hw-raid (LSI megasas) and
> md-raid6, but it probably would be much better than either of these two
> if it wouldn't read all the data during the writes. Is this a known
> issue? This is with linux-3.9.2.

Hi Bernd,

Any time you do a write smaller than a full stripe, we'll have to do a
read/modify/write cycle to satisfy it. This is true of md raid6 and the
hw-raid as well, but their reads don't show up in vmstat (try iostat
instead).

So the bigger question is where are your small writes coming from. If
they are metadata, you can use raid1 for the metadata.

-chris
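A quick way to see the reads Chris describes is to watch per-device
statistics while the benchmark runs. This is only a sketch; the device
names are the ones from the mkfs command later in the thread:

  # Extended per-device stats in MB/s, refreshed every second. During a
  # pure streaming write, sustained non-zero read columns (r/s, rMB/s)
  # on the member disks point at read/modify/write cycles.
  iostat -xm 1 /dev/sd[m-x]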
On 05/23/2013 03:11 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 08:55:47)
>> [...]
>
> Hi Bernd,
>
> Any time you do a write smaller than a full stripe, we'll have to do a
> read/modify/write cycle to satisfy it. This is true of md raid6 and the
> hw-raid as well, but their reads don't show up in vmstat (try iostat
> instead).

Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
it does not fill the device queue; afaik it flushes the underlying
devices quickly as it does not have barrier support - that is another
topic, but it was the reason why I started to test btrfs.

> So the bigger question is where are your small writes coming from. If
> they are metadata, you can use raid1 for the metadata.

I used this command

/tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

so meta-data should be raid10. And I'm using this iozone command:

> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>    -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>       /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3

Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
the page cache anyway.
I'm not familiar with the btrfs code at all, but maybe writepages()
submits too small IOs?

Hrmm, I just wanted to try direct IO, but then noticed it had already
gone read-only before that:

> May 23 14:59:33 c8220a kernel: WARNING: at fs/btrfs/super.c:255 __btrfs_abort_transaction+0xdf/0x100 [btrfs]()
> May 23 14:59:33 c8220a kernel: [<ffffffff8105db76>] warn_slowpath_fmt+0x46/0x50
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b5428a>] ? btrfs_free_path+0x2a/0x40 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b4e18f>] __btrfs_abort_transaction+0xdf/0x100 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b70b2f>] btrfs_save_ino_cache+0x22f/0x310 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b793e2>] commit_fs_roots+0xd2/0x1c0 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffff815eb3fe>] ? mutex_lock+0x1e/0x50
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b7a555>] btrfs_commit_transaction+0x495/0xa40 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b7af7b>] ? start_transaction+0xab/0x4d0 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffff81082f30>] ? wake_up_bit+0x40/0x40
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b72b96>] transaction_kthread+0x1a6/0x220 [btrfs]
> May 23 14:59:33 c8220a kernel: ---[ end trace 3d91874abeab5984 ]---
> May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in btrfs_save_ino_cache:471: error 28
> May 23 14:59:33 c8220a kernel: btrfs is forced readonly
> May 23 14:59:33 c8220a kernel: BTRFS warning (device sdx): Skipping commit of aborted transaction.
> May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in cleanup_transaction:1455: error 28

errno 28 - out of disk space? Going to recreate it and will play with it
later on again.

Thanks,
Bernd
Quoting Bernd Schubert (2013-05-23 09:22:41)
> On 05/23/2013 03:11 PM, Chris Mason wrote:
> > Any time you do a write smaller than a full stripe, we'll have to do a
> > read/modify/write cycle to satisfy it. This is true of md raid6 and the
> > hw-raid as well, but their reads don't show up in vmstat (try iostat
> > instead).
>
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
> it does not fill the device queue; afaik it flushes the underlying
> devices quickly as it does not have barrier support - that is another
> topic, but it was the reason why I started to test btrfs.

md should support barriers with recent kernels. You might want to
verify with blktrace that md raid6 isn't doing r/m/w.

> > So the bigger question is where are your small writes coming from. If
> > they are metadata, you can use raid1 for the metadata.
>
> I used this command
>
> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
times the number of devices on the FS. If you have 13 devices, that's
832K.

Using buffered writes makes it much more likely the VM will break up the
IOs as they go down. The btrfs writepages code does try to do full
stripe IO, and it also caches stripes as the IO goes down. But for
buffered IO it is surprisingly hard to get a 100% hit rate on full
stripe IO at larger stripe sizes.

> so meta-data should be raid10. And I'm using this iozone command:
>
> [...]
>
> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
> the page cache anyway.
> I'm not familiar with the btrfs code at all, but maybe writepages()
> submits too small IOs?
>
> Hrmm, I just wanted to try direct IO, but then noticed it had already
> gone read-only before that:

Direct IO will make it easier to get full stripe writes. I thought I
had fixed this abort, but it is just running out of space to write the
inode cache. For now, please just don't mount with the inode cache
enabled, I'll send in a fix for the next rc.

-chris
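A rough illustration of the alignment Chris describes, not from the
original thread: with 12 drives in raid6 there are 10 data drives, so a
full stripe is 10 x 64KiB = 640KiB, and O_DIRECT writes issued in that
granularity (or multiples of it) should stay on the full-stripe path.
The mount point and sizes below are made up for the example:

  # Stream 20GiB with O_DIRECT, each write sized to one full stripe
  # (10 data disks x 64KiB chunk = 640KiB), so complete stripes are
  # written and no read/modify/write is needed.
  dd if=/dev/zero of=/mnt/btrfs-test/testfile bs=640K count=32768 \
     oflag=direct conv=fsync

The inode cache Chris mentions is, assuming the 3.9-era defaults, the
opt-in inode_cache mount option, so simply leaving that option off
avoids the abort shown above.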
On 23/05/2013 15:22, Bernd Schubert wrote:
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
> it does not fill the device queue; afaik it flushes the underlying
> devices quickly as it does not have barrier support - that is another
> topic, but it was the reason why I started to test btrfs.

MD raid6 DOES have barrier support!
On 05/23/2013 03:41 PM, Bob Marley wrote:
> On 23/05/2013 15:22, Bernd Schubert wrote:
>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
>> it does not fill the device queue; afaik it flushes the underlying
>> devices quickly as it does not have barrier support - that is another
>> topic, but it was the reason why I started to test btrfs.
>
> MD raid6 DOES have barrier support!

For the underlying devices, yes, but it does not further use it for
additional buffering.
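For reference, Chris's blktrace suggestion earlier in the thread can be
tried with something like the following; the member device name is only
an example, and blktrace needs root plus a mounted debugfs:

  # Trace one raid member while the write benchmark runs and print only
  # requests issued to the driver (action 'D'); read requests (RWBS
  # containing 'R') during a streaming write indicate read/modify/write.
  blktrace -d /dev/sdm -o - | blkparse -i - | awk '$6 == "D" && $7 ~ /R/'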
On 05/23/2013 03:34 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 09:22:41)
>> [...]
>
> md should support barriers with recent kernels. You might want to
> verify with blktrace that md raid6 isn't doing r/m/w.
>
>> I used this command
>>
>> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
>
> Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> times the number of devices on the FS. If you have 13 devices, that's
> 832K.

Actually I have 12 devices, but we have to subtract the 2 parity disks.
In the meantime I also patched btrfsprogs to use a chunksize of 256K, so
that should be 2560KiB now, if I found the right places.

Btw, any chance to generally use chunksize/chunklen instead of stripe,
as the md layer does? IMHO it is less confusing to use
n-datadisks * chunksize = stripesize.

> Using buffered writes makes it much more likely the VM will break up the
> IOs as they go down. The btrfs writepages code does try to do full
> stripe IO, and it also caches stripes as the IO goes down. But for
> buffered IO it is surprisingly hard to get a 100% hit rate on full
> stripe IO at larger stripe sizes.

I have not found that part yet; somehow it looks as if writepages would
submit single pages to another layer. I'm going to look into it again
during the weekend. I can reserve the hardware that long, but I think we
first need to fix striped writes in general.

> Direct IO will make it easier to get full stripe writes. I thought I
> had fixed this abort, but it is just running out of space to write the
> inode cache. For now, please just don't mount with the inode cache
> enabled, I'll send in a fix for the next rc.

Thanks, I already noticed and disabled the inode cache.
Direct-io works as expected and without any RMW cycles. And that
provides more than 40% better performance than the Megasas controller or
buffered MD writes (I didn't compare with direct-io MD, as that is very
slow).

Cheers,
Bernd
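The direct-IO run was presumably something along the lines of the
earlier iozone command with its -I (O_DIRECT) flag; the record size and
file path here are only guesses, chosen to match the 2560KiB full stripe
mentioned above:

  # Hypothetical direct-IO variant: -I requests O_DIRECT, and the 2560k
  # record size equals one full stripe (10 data disks x 256KiB chunk),
  # so writes reach the raid6 layer already stripe-sized.
  iozone -e -I -i0 -i1 -r2560k -s20g -+n -f /mnt/btrfs-test/testfile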
Quoting Bernd Schubert (2013-05-23 15:33:24)
> On 05/23/2013 03:34 PM, Chris Mason wrote:
> > Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> > times the number of devices on the FS. If you have 13 devices, that's
> > 832K.
>
> Actually I have 12 devices, but we have to subtract the 2 parity disks.
> In the meantime I also patched btrfsprogs to use a chunksize of 256K, so
> that should be 2560KiB now, if I found the right places.

Sorry, thanks for filling in for my pre-coffee email.

> Btw, any chance to generally use chunksize/chunklen instead of stripe,
> as the md layer does? IMHO it is less confusing to use
> n-datadisks * chunksize = stripesize.

Definitely, it will become much more configurable.

> > Using buffered writes makes it much more likely the VM will break up the
> > IOs as they go down. The btrfs writepages code does try to do full
> > stripe IO, and it also caches stripes as the IO goes down. But for
> > buffered IO it is surprisingly hard to get a 100% hit rate on full
> > stripe IO at larger stripe sizes.
>
> I have not found that part yet; somehow it looks as if writepages would
> submit single pages to another layer. I'm going to look into it again
> during the weekend. I can reserve the hardware that long, but I think we
> first need to fix striped writes in general.

The VM calls writepages and btrfs tries to suck down all the pages that
belong to the same extent. And we try to allocate the extents on
boundaries. There is definitely some bleeding into rmw when I do it
here, but overall it does well.

But I was using 8 drives. I'll try with 12.

> Thanks, I already noticed and disabled the inode cache.
>
> Direct-io works as expected and without any RMW cycles. And that
> provides more than 40% better performance than the Megasas controller or
> buffered MD writes (I didn't compare with direct-io MD, as that is very
> slow).

You can improve MD performance quite a lot by increasing the size of the
stripe cache.

-chris
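For reference, the md stripe cache Chris refers to is tunable through
sysfs; the array name and value below are just examples:

  # Default is 256 entries; each entry caches one page per member device.
  cat /sys/block/md127/md/stripe_cache_size
  # Larger values (set as root) let md assemble more full stripes before
  # falling back to read/modify/write, at the cost of extra memory.
  echo 8192 > /sys/block/md127/md/stripe_cache_size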
On 05/23/2013 09:37 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 15:33:24)
>> Btw, any chance to generally use chunksize/chunklen instead of stripe,
>> as the md layer does? IMHO it is less confusing to use
>> n-datadisks * chunksize = stripesize.
>
> Definitely, it will become much more configurable.

Actually I meant in the code. I'm going to write a patch during the
weekend.

>>> Using buffered writes makes it much more likely the VM will break up the
>>> IOs as they go down. The btrfs writepages code does try to do full
>>> stripe IO, and it also caches stripes as the IO goes down. But for
>>> buffered IO it is surprisingly hard to get a 100% hit rate on full
>>> stripe IO at larger stripe sizes.
>>
>> I have not found that part yet; somehow it looks as if writepages would
>> submit single pages to another layer. I'm going to look into it again
>> during the weekend. I can reserve the hardware that long, but I think we
>> first need to fix striped writes in general.
>
> The VM calls writepages and btrfs tries to suck down all the pages that
> belong to the same extent. And we try to allocate the extents on
> boundaries. There is definitely some bleeding into rmw when I do it
> here, but overall it does well.
>
> But I was using 8 drives. I'll try with 12.

Hmm, I already tried with 10 drives (8+2); that doesn't make a difference
for RMW.

>> Direct-io works as expected and without any RMW cycles. And that
>> provides more than 40% better performance than the Megasas controller or
>> buffered MD writes (I didn't compare with direct-io MD, as that is very
>> slow).
>
> You can improve MD performance quite a lot by increasing the size of the
> stripe cache.

I'm already doing that; without a larger stripe cache the performance is
much lower.
Quoting Bernd Schubert (2013-05-23 15:45:36)
> On 05/23/2013 09:37 PM, Chris Mason wrote:
> > Quoting Bernd Schubert (2013-05-23 15:33:24)
> >> Btw, any chance to generally use chunksize/chunklen instead of stripe,
> >> as the md layer does? IMHO it is less confusing to use
> >> n-datadisks * chunksize = stripesize.
> >
> > Definitely, it will become much more configurable.
>
> Actually I meant in the code. I'm going to write a patch during the
> weekend.

The btrfs raid code refers to stripes because a chunk is a very large
(~1GB) slice of a set of drives that we allocate into raid levels. We
have full stripes and device stripes; I'm afraid there are so many
different terms in other projects that it is hard to pick something
clear.

> > But I was using 8 drives. I'll try with 12.
>
> Hmm, I already tried with 10 drives (8+2); that doesn't make a difference
> for RMW.

My benchmarks were on flash, so the rmw I was seeing may not have had as
big an impact.

-chris
Hello Chris,

On 05/23/2013 10:33 PM, Chris Mason wrote:
>> But I was using 8 drives. I'll try with 12.
>
> My benchmarks were on flash, so the rmw I was seeing may not have had as
> big an impact.

I just played with it a bit further and simply introduced a requeue in
raid56_rmw_stripe() if the rbio is 'younger' than 50 jiffies. I can
still see reads, but lower by a factor of 10 than before, and that is
sufficient to bring performance almost to that of direct-io.
This is certainly no upstream code; I hope I find some time over the
weekend to come up with something better.

Btw, I also noticed the cache logic copies pages from those rmw threads.
Well, this is a numa system and memory bandwidth is terribly bad from
the remote cpu. These worker threads probably should be numa aware and
only handle rbios from their own cpu.

Cheers,
Bernd
Quoting Bernd Schubert (2013-05-24 04:35:37)
> Hello Chris,
>
> On 05/23/2013 10:33 PM, Chris Mason wrote:
> >> But I was using 8 drives. I'll try with 12.
> >
> > My benchmarks were on flash, so the rmw I was seeing may not have had as
> > big an impact.
>
> I just played with it a bit further and simply introduced a requeue in
> raid56_rmw_stripe() if the rbio is 'younger' than 50 jiffies. I can
> still see reads, but lower by a factor of 10 than before, and that is
> sufficient to bring performance almost to that of direct-io.
> This is certainly no upstream code; I hope I find some time over the
> weekend to come up with something better.

Interesting. This probably shows that we need to do a better job of
maintaining a plug across the writepages calls, or that we need to be
much more aggressive in writepages to add more pages once we've started.

> Btw, I also noticed the cache logic copies pages from those rmw threads.
> Well, this is a numa system and memory bandwidth is terribly bad from
> the remote cpu. These worker threads probably should be numa aware and
> only handle rbios from their own cpu.

Yes, all of the helpers (especially crc and parity) should be made numa
aware.

-chris