Hi all,

we got a new test system here and I just also tested btrfs raid6 on
that. Write performance is slightly lower than hw-raid (LSI megasas) and
md-raid6, but it probably would be much better than either of these two
if it wouldn't read all the data during the writes. Is this a known
issue? This is with linux-3.9.2.

Thanks,
Bernd
Quoting Bernd Schubert (2013-05-23 08:55:47)
> Hi all,
>
> we got a new test system here and I just also tested btrfs raid6 on
> that. Write performance is slightly lower than hw-raid (LSI megasas) and
> md-raid6, but it probably would be much better than either of these two
> if it wouldn't read all the data during the writes. Is this a known
> issue? This is with linux-3.9.2.

Hi Bernd,

Any time you do a write smaller than a full stripe, we'll have to do a
read/modify/write cycle to satisfy it. This is true of md raid6 and the
hw-raid as well, but their reads don't show up in vmstat (try iostat
instead).

So the bigger question is where are your small writes coming from. If
they are metadata, you can use raid1 for the metadata.

-chris
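A quick way to see the reads Chris describes is to watch per-device
statistics while the benchmark runs. This is only a sketch; the device
names are the ones from the mkfs command later in the thread:

  # Extended per-device stats in MB/s, refreshed every second. During a
  # pure streaming write, sustained non-zero read columns (r/s, rMB/s)
  # on the member disks point at read/modify/write cycles.
  iostat -xm 1 /dev/sd[m-x]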
On 05/23/2013 03:11 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 08:55:47)
>> [...]
>
> Hi Bernd,
>
> Any time you do a write smaller than a full stripe, we'll have to do a
> read/modify/write cycle to satisfy it. This is true of md raid6 and the
> hw-raid as well, but their reads don't show up in vmstat (try iostat
> instead).

Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
it does not fill the device queue; afaik it flushes the underlying
devices quickly as it does not have barrier support - that is another
topic, but it was the reason why I started to test btrfs.

> So the bigger question is where are your small writes coming from. If
> they are metadata, you can use raid1 for the metadata.

I used this command

/tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

so meta-data should be raid10. And I'm using this iozone command:

> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>    -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>       /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3

Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
the page cache anyway.
I'm not familiar with the btrfs code at all, but maybe writepages()
submits too small IOs?

Hrmm, I just wanted to try direct IO, but then noticed it had already
gone read-only before that:

> May 23 14:59:33 c8220a kernel: WARNING: at fs/btrfs/super.c:255 __btrfs_abort_transaction+0xdf/0x100 [btrfs]()
> May 23 14:59:33 c8220a kernel: [<ffffffff8105db76>] warn_slowpath_fmt+0x46/0x50
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b5428a>] ? btrfs_free_path+0x2a/0x40 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b4e18f>] __btrfs_abort_transaction+0xdf/0x100 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b70b2f>] btrfs_save_ino_cache+0x22f/0x310 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b793e2>] commit_fs_roots+0xd2/0x1c0 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffff815eb3fe>] ? mutex_lock+0x1e/0x50
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b7a555>] btrfs_commit_transaction+0x495/0xa40 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b7af7b>] ? start_transaction+0xab/0x4d0 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffff81082f30>] ? wake_up_bit+0x40/0x40
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b72b96>] transaction_kthread+0x1a6/0x220 [btrfs]
> May 23 14:59:33 c8220a kernel: ---[ end trace 3d91874abeab5984 ]---
> May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in btrfs_save_ino_cache:471: error 28
> May 23 14:59:33 c8220a kernel: btrfs is forced readonly
> May 23 14:59:33 c8220a kernel: BTRFS warning (device sdx): Skipping commit of aborted transaction.
> May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in cleanup_transaction:1455: error 28

errno 28 - out of disk space? Going to recreate it and will play with it
later on again.

Thanks,
Bernd
Quoting Bernd Schubert (2013-05-23 09:22:41)
> On 05/23/2013 03:11 PM, Chris Mason wrote:
> > Any time you do a write smaller than a full stripe, we'll have to do a
> > read/modify/write cycle to satisfy it. This is true of md raid6 and the
> > hw-raid as well, but their reads don't show up in vmstat (try iostat
> > instead).
>
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
> it does not fill the device queue; afaik it flushes the underlying
> devices quickly as it does not have barrier support - that is another
> topic, but it was the reason why I started to test btrfs.

md should support barriers with recent kernels. You might want to
verify with blktrace that md raid6 isn't doing r/m/w.

> > So the bigger question is where are your small writes coming from. If
> > they are metadata, you can use raid1 for the metadata.
>
> I used this command
>
> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
times the number of devices on the FS. If you have 13 devices, that's
832K.

Using buffered writes makes it much more likely the VM will break up the
IOs as they go down. The btrfs writepages code does try to do full
stripe IO, and it also caches stripes as the IO goes down. But for
buffered IO it is surprisingly hard to get a 100% hit rate on full
stripe IO at larger stripe sizes.

> so meta-data should be raid10. And I'm using this iozone command:
>
> [...]
>
> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
> the page cache anyway.
> I'm not familiar with the btrfs code at all, but maybe writepages()
> submits too small IOs?
>
> Hrmm, I just wanted to try direct IO, but then noticed it had already
> gone read-only before that:

Direct IO will make it easier to get full stripe writes. I thought I
had fixed this abort, but it is just running out of space to write the
inode cache. For now, please just don't mount with the inode cache
enabled, I'll send in a fix for the next rc.

-chris
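A rough illustration of the alignment Chris describes, not from the
original thread: with 12 drives in raid6 there are 10 data drives, so a
full stripe is 10 x 64KiB = 640KiB, and O_DIRECT writes issued in that
granularity (or multiples of it) should stay on the full-stripe path.
The mount point and sizes below are made up for the example:

  # Stream 20GiB with O_DIRECT, each write sized to one full stripe
  # (10 data disks x 64KiB chunk = 640KiB), so complete stripes are
  # written and no read/modify/write is needed.
  dd if=/dev/zero of=/mnt/btrfs-test/testfile bs=640K count=32768 \
     oflag=direct conv=fsync

The inode cache Chris mentions is, assuming the 3.9-era defaults, the
opt-in inode_cache mount option, so simply leaving that option off
avoids the abort shown above.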
On 23/05/2013 15:22, Bernd Schubert wrote:
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
> it does not fill the device queue; afaik it flushes the underlying
> devices quickly as it does not have barrier support - that is another
> topic, but it was the reason why I started to test btrfs.

MD raid6 DOES have barrier support!
On 05/23/2013 03:41 PM, Bob Marley wrote:
> On 23/05/2013 15:22, Bernd Schubert wrote:
>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
>> it does not fill the device queue; afaik it flushes the underlying
>> devices quickly as it does not have barrier support - that is another
>> topic, but it was the reason why I started to test btrfs.
>
> MD raid6 DOES have barrier support!

For the underlying devices, yes, but it does not further use it for
additional buffering.
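For reference, Chris's blktrace suggestion earlier in the thread can be
tried with something like the following; the member device name is only
an example, and blktrace needs root plus a mounted debugfs:

  # Trace one raid member while the write benchmark runs and print only
  # requests issued to the driver (action 'D'); read requests (RWBS
  # containing 'R') during a streaming write indicate read/modify/write.
  blktrace -d /dev/sdm -o - | blkparse -i - | awk '$6 == "D" && $7 ~ /R/'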
On 05/23/2013 03:34 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 09:22:41)
>> [...]
>
> md should support barriers with recent kernels. You might want to
> verify with blktrace that md raid6 isn't doing r/m/w.
>
>> I used this command
>>
>> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
>
> Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> times the number of devices on the FS. If you have 13 devices, that's
> 832K.

Actually I have 12 devices, but we have to subtract the 2 parity disks.
In the meantime I also patched btrfsprogs to use a chunksize of 256K, so
that should be 2560KiB now, if I found the right places.

Btw, any chance to generally use chunksize/chunklen instead of stripe,
as the md layer does? IMHO it is less confusing to use
n-datadisks * chunksize = stripesize.

> Using buffered writes makes it much more likely the VM will break up the
> IOs as they go down. The btrfs writepages code does try to do full
> stripe IO, and it also caches stripes as the IO goes down. But for
> buffered IO it is surprisingly hard to get a 100% hit rate on full
> stripe IO at larger stripe sizes.

I have not found that part yet; somehow it looks as if writepages would
submit single pages to another layer. I'm going to look into it again
during the weekend. I can reserve the hardware that long, but I think we
first need to fix striped writes in general.

> Direct IO will make it easier to get full stripe writes. I thought I
> had fixed this abort, but it is just running out of space to write the
> inode cache. For now, please just don't mount with the inode cache
> enabled, I'll send in a fix for the next rc.

Thanks, I already noticed and disabled the inode cache.
Direct-io works as expected and without any RMW cycles. And that
provides more than 40% better performance than the Megasas controller or
buffered MD writes (I didn't compare with direct-io MD, as that is very
slow).

Cheers,
Bernd
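The direct-IO run was presumably something along the lines of the
earlier iozone command with its -I (O_DIRECT) flag; the record size and
file path here are only guesses, chosen to match the 2560KiB full stripe
mentioned above:

  # Hypothetical direct-IO variant: -I requests O_DIRECT, and the 2560k
  # record size equals one full stripe (10 data disks x 256KiB chunk),
  # so writes reach the raid6 layer already stripe-sized.
  iozone -e -I -i0 -i1 -r2560k -s20g -+n -f /mnt/btrfs-test/testfile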
Quoting Bernd Schubert (2013-05-23 15:33:24)
> On 05/23/2013 03:34 PM, Chris Mason wrote:
> > Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> > times the number of devices on the FS. If you have 13 devices, that's
> > 832K.
>
> Actually I have 12 devices, but we have to subtract the 2 parity disks.
> In the meantime I also patched btrfsprogs to use a chunksize of 256K, so
> that should be 2560KiB now, if I found the right places.

Sorry, thanks for filling in for my pre-coffee email.

> Btw, any chance to generally use chunksize/chunklen instead of stripe,
> as the md layer does? IMHO it is less confusing to use
> n-datadisks * chunksize = stripesize.

Definitely, it will become much more configurable.

> > Using buffered writes makes it much more likely the VM will break up the
> > IOs as they go down. The btrfs writepages code does try to do full
> > stripe IO, and it also caches stripes as the IO goes down. But for
> > buffered IO it is surprisingly hard to get a 100% hit rate on full
> > stripe IO at larger stripe sizes.
>
> I have not found that part yet; somehow it looks as if writepages would
> submit single pages to another layer. I'm going to look into it again
> during the weekend. I can reserve the hardware that long, but I think we
> first need to fix striped writes in general.

The VM calls writepages and btrfs tries to suck down all the pages that
belong to the same extent. And we try to allocate the extents on
boundaries. There is definitely some bleeding into rmw when I do it
here, but overall it does well.

But I was using 8 drives. I'll try with 12.

> Thanks, I already noticed and disabled the inode cache.
>
> Direct-io works as expected and without any RMW cycles. And that
> provides more than 40% better performance than the Megasas controller or
> buffered MD writes (I didn't compare with direct-io MD, as that is very
> slow).

You can improve MD performance quite a lot by increasing the size of the
stripe cache.

-chris
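For reference, the md stripe cache Chris refers to is tunable through
sysfs; the array name and value below are just examples:

  # Default is 256 entries; each entry caches one page per member device.
  cat /sys/block/md127/md/stripe_cache_size
  # Larger values (set as root) let md assemble more full stripes before
  # falling back to read/modify/write, at the cost of extra memory.
  echo 8192 > /sys/block/md127/md/stripe_cache_size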
On 05/23/2013 09:37 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 15:33:24)
>> Btw, any chance to generally use chunksize/chunklen instead of stripe,
>> as the md layer does? IMHO it is less confusing to use
>> n-datadisks * chunksize = stripesize.
>
> Definitely, it will become much more configurable.

Actually I meant in the code. I'm going to write a patch during the
weekend.

>>> Using buffered writes makes it much more likely the VM will break up the
>>> IOs as they go down. The btrfs writepages code does try to do full
>>> stripe IO, and it also caches stripes as the IO goes down. But for
>>> buffered IO it is surprisingly hard to get a 100% hit rate on full
>>> stripe IO at larger stripe sizes.
>>
>> I have not found that part yet; somehow it looks as if writepages would
>> submit single pages to another layer. I'm going to look into it again
>> during the weekend. I can reserve the hardware that long, but I think we
>> first need to fix striped writes in general.
>
> The VM calls writepages and btrfs tries to suck down all the pages that
> belong to the same extent. And we try to allocate the extents on
> boundaries. There is definitely some bleeding into rmw when I do it
> here, but overall it does well.
>
> But I was using 8 drives. I'll try with 12.

Hmm, I already tried with 10 drives (8+2); that doesn't make a difference
for RMW.

>> Direct-io works as expected and without any RMW cycles. And that
>> provides more than 40% better performance than the Megasas controller or
>> buffered MD writes (I didn't compare with direct-io MD, as that is very
>> slow).
>
> You can improve MD performance quite a lot by increasing the size of the
> stripe cache.

I'm already doing that; without a larger stripe cache the performance is
much lower.
Quoting Bernd Schubert (2013-05-23 15:45:36)
> On 05/23/2013 09:37 PM, Chris Mason wrote:
> > Quoting Bernd Schubert (2013-05-23 15:33:24)
> >> Btw, any chance to generally use chunksize/chunklen instead of stripe,
> >> as the md layer does? IMHO it is less confusing to use
> >> n-datadisks * chunksize = stripesize.
> >
> > Definitely, it will become much more configurable.
>
> Actually I meant in the code. I'm going to write a patch during the
> weekend.

The btrfs raid code refers to stripes because a chunk is a very large
(~1GB) slice of a set of drives that we allocate into raid levels. We
have full stripes and device stripes; I'm afraid there are so many
different terms in other projects that it is hard to pick something
clear.

> > But I was using 8 drives. I'll try with 12.
>
> Hmm, I already tried with 10 drives (8+2); that doesn't make a difference
> for RMW.

My benchmarks were on flash, so the rmw I was seeing may not have had as
big an impact.

-chris
Hello Chris,

On 05/23/2013 10:33 PM, Chris Mason wrote:
>> But I was using 8 drives. I'll try with 12.
>
> My benchmarks were on flash, so the rmw I was seeing may not have had as
> big an impact.

I just played with it a bit further and simply introduced a requeue in
raid56_rmw_stripe() if the rbio is 'younger' than 50 jiffies. I can
still see reads, but lower by a factor of 10 than before, and that is
sufficient to bring performance almost to that of direct-io.
This is certainly no upstream code; I hope I find some time over the
weekend to come up with something better.

Btw, I also noticed the cache logic copies pages from those rmw threads.
Well, this is a numa system and memory bandwidth is terribly bad from
the remote cpu. These worker threads probably should be numa aware and
only handle rbios from their own cpu.

Cheers,
Bernd
Quoting Bernd Schubert (2013-05-24 04:35:37)
> Hello Chris,
>
> On 05/23/2013 10:33 PM, Chris Mason wrote:
> >> But I was using 8 drives. I'll try with 12.
> >
> > My benchmarks were on flash, so the rmw I was seeing may not have had as
> > big an impact.
>
> I just played with it a bit further and simply introduced a requeue in
> raid56_rmw_stripe() if the rbio is 'younger' than 50 jiffies. I can
> still see reads, but lower by a factor of 10 than before, and that is
> sufficient to bring performance almost to that of direct-io.
> This is certainly no upstream code; I hope I find some time over the
> weekend to come up with something better.

Interesting. This probably shows that we need to do a better job of
maintaining a plug across the writepages calls, or that we need to be
much more aggressive in writepages to add more pages once we've started.

> Btw, I also noticed the cache logic copies pages from those rmw threads.
> Well, this is a numa system and memory bandwidth is terribly bad from
> the remote cpu. These worker threads probably should be numa aware and
> only handle rbios from their own cpu.

Yes, all of the helpers (especially crc and parity) should be made numa
aware.

-chris