Hi all,

I'm still trying to track down the cause of disk write slowness when passing through disks to a DomU.

I've restructured things a little and now pass the raid array (/dev/md3) directly to the DomU. When running an xfs_fsr on the filesystem from within the Dom0, I get the following:

# iostat -m 5
(....)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.11    0.00    9.23   44.64    0.21   45.81

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdc             286.20        17.22        34.15         86        170
sdf             284.20        17.45        34.43         87        172
sdd             217.40        17.25        34.15         86        170
sde             211.00        17.25        34.38         86        171
md3            1095.40        69.20        67.18        346        335

This is with the RAID6 mounted on /mnt/fileshare from within the Dom0. Speeds are about what I would expect for the task that is going on.

With no changes at all to the setup or the RAID, I attach the same RAID6 array to a DomU:

# xm block-attach zeus.vm phy:/dev/md3 xvdb w

Now, running xfs_fsr from within the DomU, I look at the same output from iostat on the Dom0:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    2.35    0.00    0.34   97.31

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdc             702.40        11.28        16.09         56         80
sdf             701.00        11.26        16.15         56         80
sdd             698.00        11.18        15.95         55         79
sde             700.60        11.27        16.19         56         80
md3            1641.00        30.30        29.87        151        149

I'm seeing this consistently across all methods of speed testing (dd, bonnie++, etc).

If I remove a single disk from the array, attach it, and run tests on that, I obtain full speed for the single drive. As soon as the array is passed through, the speed drops significantly (as seen above).

I have copied the linux-raid list in on this - as it seems to only affect md arrays passed to Xen DomU guests.

Where do we start debugging this?

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
On 03/04/13 13:05, Steven Haigh wrote:
> Hi all,
>
> I'm still trying to track down the cause of disk write slowness when
> passing through disks to a DomU.
<...>
> Where do we start debugging this?

Whoops - forgot to add the details of the array:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : active raid6 sdd[5] sdc[4] sdf[1] sde[0]
      3906766592 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]

# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Mon Apr  1 01:49:18 2013
     Raid Level : raid6
     Array Size : 3906766592 (3725.78 GiB 4000.53 GB)
  Used Dev Size : 1953383296 (1862.89 GiB 2000.26 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Apr  3 13:13:45 2013
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           Name : xenhost.lan.crc.id.au:3  (local to host xenhost.lan.crc.id.au)
           UUID : 69cd7c1c:2ffc2df2:0a8afbb3:a2f32dab
         Events : 310

    Number   Major   Minor   RaidDevice State
       0       8       64        0      active sync   /dev/sde
       1       8       80        1      active sync   /dev/sdf
       5       8       48        2      active sync   /dev/sdd
       4       8       32        3      active sync   /dev/sdc

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
Based on what you said later, your raid stripe size is 128KB. When you plug /dev/md3 into your guest, you will be going through blkback/blkfront, which only supports 44KB per request.

So can I ask what is the "task that is going on"? I mean, what is your workload and what does it look like (in terms of reads/writes, (a)sync and size)?

Felipe

-----Original Message-----
From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Steven Haigh
Sent: 03 April 2013 03:06
To: Xen-devel@lists.xensource.com; linux-raid@vger.kernel.org
Subject: [Xen-devel] Xen disk write slowness in kernel 3.8.x

<...>

# iostat -m 5
(....)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.11    0.00    9.23   44.64    0.21   45.81

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdc             286.20        17.22        34.15         86        170
sdf             284.20        17.45        34.43         87        172
sdd             217.40        17.25        34.15         86        170
sde             211.00        17.25        34.38         86        171
md3            1095.40        69.20        67.18        346        335

This is with the RAID6 mounted on /mnt/fileshare from within the Dom0. Speeds are about what I would expect for the task that is going on.

<...>
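For reference, the 44KB figure falls out of the classic blkif protocol limits: each request carries at most 11 segments of one 4KiB page, and a single-page ring holds 32 requests in flight (both numbers come up again later in this thread). The arithmetic, as a quick sanity check:

# 11 segments x 4KiB pages = maximum data per blkfront request
echo $((11 * 4))KiB        # -> 44KiB
# 32 requests per single-page ring = maximum data in flight
echo $((32 * 44))KiB       # -> 1408KiB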
On 03/04/13 22:36, Felipe Franciosi wrote:
> Based on what you said later, your raid stripe size is 128KB. When you plug /dev/md3 into your guest, you will be going through blkback/blkfront, which only supports 44KB per request.

This is interesting. The default chunk size for a RAID5/6 via mdadm these days is 512Kb.

> So can I ask what is the "task that is going on"?
> I mean, what is your workload and what does it look like (in terms of reads/writes, (a)sync and size)?

The load is mostly sequential writes. Samba + NFS shares mainly. Files are usually 300Mb -> 10Gb. It's mostly writes, with occasional reads.

Interestingly, I get 220Mb/sec+ reads - but only ~50Mb/sec writes via the DomU.

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
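As an aside: for sequential writes on raid456, the md stripe cache is a generic knob worth ruling out before blaming the blkfront path. This is standard md sysfs, not something raised in the thread, and the 8192 figure is only an example:

# current stripe cache, in 4KiB pages per member disk (default is 256)
cat /sys/block/md3/md/stripe_cache_size
# try a larger cache; memory cost here is 8192 * 4KiB * 4 disks = 128MiB
echo 8192 > /sys/block/md3/md/stripe_cache_size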
Do you know the size of your requests? You used "iostat -m". Perhaps "iostat -xm" would be more meaningful, as it will tell you the average request size in sectors.

I still didn't understand your environment. I think this is the first time you mention NFS/SMB. Weren't you testing copies from /dev/zero to your device?

I'd also recommend you run "oprofile" to see if there are any time sinks that we are overlooking. You can find out how to use it here:
http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf

Felipe

-----Original Message-----
From: Steven Haigh [mailto:netwiz@crc.id.au]
Sent: 03 April 2013 12:42
To: Felipe Franciosi
Cc: Xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] Xen disk write slowness in kernel 3.8.x

<...>
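If oprofile proves awkward to set up on a Xen Dom0, perf can capture an equivalent profile; a generic sketch (the wiki page above covers both tools in detail):

# sample all CPUs in the Dom0, with call graphs, while the test runs in the DomU
perf record -a -g -- sleep 30
# then browse the hottest symbols
perf report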
On 03/04/13 23:29, Felipe Franciosi wrote:
> Do you know the size of your requests?
> You used "iostat -m". Perhaps "iostat -xm" would be more meaningful, as it will tell you the average request size in sectors.

Good call. I started just a sequential write with dd from /dev/zero on the DomU:

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 91.269 s, 47.1 MB/s

iostat -xm shows:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.11    0.00    2.71    0.23    0.56   96.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc            1208.40  5337.80  150.00  645.20     5.35    23.39    74.00     6.47    8.15   0.91  72.72
sdf            1199.80  5350.80  148.00  643.40     5.30    23.34    74.12     6.88    8.68   0.96  76.06
sdd            1203.40  5304.60  148.80  638.60     5.28    23.21    74.11     5.42    6.88   0.78  61.38
sde            1213.80  5382.40  148.40  652.40     5.37    23.59    74.08     6.40    8.00   0.94  75.20
md3               0.00     0.00    1.60 1312.20     0.05    42.68    66.60     0.00    0.00   0.00   0.00

Shutting down the DomU, mounting /dev/md3 and doing this directly from the Dom0 shows:

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 44.2801 s, 97.0 MB/s

details from iostat -xm:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    2.16    0.00    0.57   97.27

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc            1175.40  5220.60  143.40  633.60     5.13    22.85    73.76     6.01    7.73   0.87  67.60
sdf            1175.20  5154.00  147.40  624.40     5.14    22.55    73.46     6.87    8.88   0.97  74.90
sdd            1183.40  5133.80  145.40  625.20     5.19    22.50    73.60     5.19    6.73   0.77  59.60
sde            1176.40  5229.60  146.00  637.00     5.16    22.99    73.62     7.39    9.51   0.99  77.90
md3               0.00     0.00    0.60 1277.40     0.02    41.56    66.63     0.00    0.00   0.00   0.00

This about ties in with what bonnie++ gets as the write speed:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xenhost.lan.crc. 2G   635  91 97526  15 67864  16   952  97 295833  35 409.6  10

bonnie++ maxes out at about the same as dd in the DomU.

> I still didn't understand your environment. I think this is the first time you mention NFS/SMB.

/dev/md3 is passed through as a phy device to a DomU. This DomU runs both samba and NFS shares. It doesn't seem to matter what the source of the writes to the drive is (dd, bonnie++, sending a file via NFS or Samba) - the max write speed seems to be bottlenecking at ~50Mb/sec, which I believe is purely the write speed to the array via the DomU.

> Weren't you testing copies from /dev/zero to your device?

I have tested many ways - this being just one of them.

> I'd also recommend you run "oprofile" to see if there are any time sinks that we are overlooking. You can find out how to use it here:
> http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf

I ran 'perf top' on the Dom0, then started a write (dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096) on the DomU. This is what I'd see as the relevant output:

Samples: 574K of event 'cpu-clock', Event count (approx.): 60028
 88.37%  [kernel]    [k] xen_hypercall_sched_op
  4.73%  [kernel]    [k] xen_hypercall_xen_version
  1.00%  [kernel]    [k] xen_hypercall_grant_table_op
  0.99%  [raid456]   [k] handle_stripe
  0.77%  [raid6_pq]  [k] raid6_sse24_gen_syndrome
  0.66%  [raid456]   [k] ops_run_io
  0.57%  [kernel]    [k] memcpy
  0.21%  [kernel]    [k] xen_restore_fl_direct
  0.18%  [raid456]   [k] raid5_end_write_request
  0.18%  [raid456]   [k] __raid_run_ops
  0.14%  [kernel]    [k] xen_hypercall_event_channel_op
  0.11%  [kernel]    [k] get_phys_to_machine
  0.09%  [raid456]   [k] schedule_reconstruction

Now repeated, with the same on the DomU:

Samples: 300K of event 'cpu-clock', Event count (approx.): 84845
 97.63%  [kernel]  [k] hypercall_page
  0.44%  [kernel]  [k] copy_user_generic_string
  0.41%  [kernel]  [k] __clear_user
  0.08%  [kernel]  [k] __wake_up_bit
  0.07%  [kernel]  [k] xen_restore_fl_direct
  0.07%  [kernel]  [k] __mem_cgroup_commit_charge

Now, interestingly enough... The root drive (xvda) of the DomU in question is on a separate RAID1. This is passed as an LV.

# dd if=/dev/zero of=/output.bin bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 69.7036 s, 61.6 MB/s

Then from the Dom0 to the same RAID1:

# dd if=/dev/zero of=output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 48.4407 s, 88.7 MB/s

I don't really know what else I can do to try and see where the slowdown is here - I am open to suggestions though :)

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
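One note for reading the tables above: iostat reports avgrq-sz in 512-byte sectors, so the ~66.6 on the md3 rows works out to roughly 33KiB per request, below the 44KiB (88-sector) blkfront maximum mentioned earlier:

# avgrq-sz is in 512-byte sectors
echo "scale=1; 66.60 * 512 / 1024" | bc    # -> 33.3 (KiB)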
I think you should use i/oflag=direct on your "dd"s to bypass the buffer cache in the guest. Might provide more consistent results.

The result of oprofile is intriguing. Did you have a chance to try Roger's persistent grant implementation?

He mentions it in his 08/03 e-mail. It's here:
git://xenbits.xen.org/people/royger/linux.git xen-block-indirect

Felipe

-----Original Message-----
From: Steven Haigh [mailto:netwiz@crc.id.au]
Sent: 03 April 2013 14:27
To: xen-devel@lists.xen.org
Cc: Felipe Franciosi
Subject: Re: Xen disk write slowness in kernel 3.8.x

<...>
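For anyone wanting to try that tree, fetching it would look something like this (branch name as given above; build steps come up later in the thread):

git clone git://xenbits.xen.org/people/royger/linux.git
cd linux
git checkout xen-block-indirect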
On 04/04/13 01:30, Felipe Franciosi wrote:
> I think you should use i/oflag=direct on your "dd"s to bypass the buffer cache in the guest. Might provide more consistent results.

Ah - you are correct. I usually don't forget that. This has however uncovered something interesting...

Entirely from the DomU:

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 45.2753 s, 94.9 MB/s

# rm /mnt/fileshare/output.zero
# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 80.5808 s, 53.3 MB/s

I did this 3-4 times with the same trend...

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.693 s, 106 MB/s

# rm /mnt/fileshare/output.zero
# sync
# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 82.1511 s, 52.3 MB/s

So, when the DomU is using write buffers, write speed is halved. This I don't understand.

I haven't tweaked the array at all yet (as I've just been experimenting with this problem) - so with a bit of time, I can probably even increase its write speed some more - but without this speed decrease solved, it isn't really worth it yet...

> The result of oprofile is intriguing. Did you have a chance to try Roger's persistent grant implementation?
>
> He mentions it in his 08/03 e-mail. It's here:
> git://xenbits.xen.org/people/royger/linux.git xen-block-indirect

I can't say I have - although I package Xen for EL6, we're getting to the edge of my knowledge here.

<...>

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
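A generic way to see exactly what request sizes hit md3 in the buffered vs direct runs would be blktrace in the Dom0 (standard blktrace usage, not something suggested in the thread):

# capture 30 seconds of block-layer traffic on md3, then inspect it
blktrace -d /dev/md3 -w 30 -o md3trace
blkparse -i md3trace | less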
Resending this - hopefully for comment...

<...>

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
Hi Steven, sorry for the delay (doing some training followed by holidays).

Using "direct" from the guest does speed things up. I'm not entirely sure why. Especially since this is also true (at least with the guests I tested) in XCP/XenServer, where the datapath includes a blktap2/tapdisk model in which the virtual disks are always opened with O_DIRECT and the only valid operations are READ or WRITE (i.e. there is no special handling for guest operations such as barriers, etc).

Have a look at http://support.citrix.com/article/CTX136861 for an understanding of how things work in XCP 1.6 / XenServer 6.1.0.

My guess is that "direct" allows you to control the exact size of the requests (creating a consistent data flow). If you omit it and shove all your requests through the guest's buffer cache, what you get in the backend will vary largely in terms of request sizes. This will particularly affect the blkback/blkfront ring utilisation. This hypothesis can hopefully be supported by running iostat in dom0 (e.g. iostat -xm 1) and observing the request sizes. Let me know what you see in your environment. If anyone else on the list has other ideas on this, I'd like to hear them, as this has been puzzling me too.

Another thing you can try is to use "bs=11M" along with i/oflag=direct. The "11M" magic number comes from the ring size: with single-page rings, you can have up to 32 requests of 44KiB each in the ring.

32 * 44KiB = 1408 KiB

If you try that and observe iostat in dom0, you'll see that the average request size often drops below 44KiB (88 sectors). To keep it consistent I decided to use multiples of that number. The first "whole" number in the order of mebibytes you'll find is 8 rings' worth: 32 * 44 * 1024 * 8 = 11M.

Verify if that gives you more consistency in your results.

Last but not least, you should definitely try two other things:
1) oprofile.
2) Roger's kernel with persistent grants and indirect I/O reqs (see below).

Using oprofile you should be able to get more directions/ideas in the sense of where your throughput is getting killed.

Finally, high throughput is usually achieved with large bandwidth and low latency. Persistent granting should help a lot with reducing latency. The indirect I/O operations that have recently been discussed on this list should help you to achieve larger bandwidth. This should be particularly beneficial in arrays such as yours, where the stripe size is larger than 88K.

Link to Roger's blog post and kernel code:
http://blog.xen.org/index.php/2012/11/23/improving-block-protocol-scalability-with-persistent-grants/#more-5711
http://xenbits.xen.org/gitweb/?p=people/royger/linux.git;a=summary (xen-block-indirect branch)

I am very interested in your results, please share them.

Cheers,
Felipe

-----Original Message-----
From: Steven Haigh [mailto:netwiz@crc.id.au]
Sent: 09 April 2013 07:54
To: Felipe Franciosi
Cc: xen-devel@lists.xen.org
Subject: Re: Xen disk write slowness in kernel 3.8.x

<...>
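Spelled out, the arithmetic behind that magic number:

# 8 rings of 32 x 44KiB requests, expressed in MiB
echo $((32 * 44 * 8 / 1024))MiB    # -> 11MiB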
On 04/17/2013 12:39 AM, Felipe Franciosi wrote:
<...>
> Another thing you can try is to use "bs=11M" along with i/oflag=direct. The "11M" magic number comes from the ring size: with single-page rings, you can have up to 32 requests of 44KiB each in the ring.
> 32 * 44KiB = 1408 KiB
> If you try that and observe iostat in dom0, you'll see that the average request size often drops below 44KiB (88 sectors). To keep it consistent I decided to use multiples of that number. The first "whole" number in the order of mebibytes you'll find is 8 rings' worth: 32 * 44 * 1024 * 8 = 11M.

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300
300+0 records in
300+0 records out
3460300800 bytes (3.5 GB) copied, 73.2253 s, 47.3 MB/s

Dom0 iostat -xm 1:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    9.30    0.00    1.16   89.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sde            1267.44  6526.74  100.00  545.35     5.31    27.24   103.28    10.28   14.86   1.75 113.02
sdc            1316.28  6348.84  105.81  546.51     5.69    26.94   102.45     4.56    7.02   1.37  89.30
sdd            1139.53  6269.77  110.47  544.19     5.11    26.85    99.99     5.34    8.30   1.35  88.37
sdf            1286.05  6436.05   96.51  548.84     5.50    27.29   104.06     3.52    5.47   1.34  86.51
md2               0.00     0.00    0.00 1572.09     0.00    51.11    66.59     0.00    0.00   0.00   0.00

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300 oflag=direct
300+0 records in
300+0 records out
3460300800 bytes (3.5 GB) copied, 67.5844 s, 51.2 MB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.14    0.00   12.50    0.00    0.00   86.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sde            1189.77  7005.68   89.77  478.41     5.00    29.35   123.81     5.62    9.90   1.72  97.50
sdc            1040.91  6852.27   77.27  480.68     4.37    28.76   121.60     2.84    5.09   1.11  61.93
sdd             911.36  7057.95   67.05  477.27     3.66    29.46   124.63     4.31    7.51   1.51  82.39
sdf            1185.23  7227.27   72.73  495.45     4.90    30.28   126.80     7.79   13.73   1.82 103.64
md2               0.00     0.00    0.00 1727.27     0.00    56.54    67.04     0.00    0.00   0.00   0.00

It seems that using the larger block size on dd actually reduces the total write speed.
Compare this to using 1M block sizes:

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500 oflag=direct
3500+0 records in
3500+0 records out
3670016000 bytes (3.7 GB) copied, 27.6453 s, 133 MB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   27.78    0.00    0.00   72.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sde               0.00 19040.00    0.00  303.33     0.00    75.83   512.00     1.28    4.22   2.81  85.22
sdc               0.00 19040.00    0.00  302.22     0.00    75.56   512.00     0.50    1.67   1.10  33.33
sdd               0.00 19040.00    0.00  302.22     0.00    75.56   512.00     0.50    1.66   1.10  33.33
sdf               0.00 19040.00    0.00  303.33     0.00    75.83   512.00     1.28    4.22   2.81  85.22
md2               0.00     0.00    0.00 4684.44     0.00   151.11    66.06     0.00    0.00   0.00   0.00

# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500
3500+0 records in
3500+0 records out
3670016000 bytes (3.7 GB) copied, 81.3315 s, 45.1 MB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   10.47    0.00    0.00   89.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sde            1116.28  5995.35  101.16  555.81     5.02    25.84    96.20     5.99    9.31   1.52 100.00
sdc            1209.30  6019.77  111.63  536.05     5.19    25.07    95.70     5.66    8.58   1.45  93.95
sdd            1236.05  5967.44   97.67  546.51     5.08    25.54    97.34     8.16   12.67   1.60 103.02
sdf            1203.49  5939.53   97.67  554.65     5.12    25.63    96.54     4.76    7.48   1.44  93.95
md2               0.00     0.00    0.00 1486.05     0.00    48.31    66.59     0.00    0.00   0.00   0.00

Out of interest, trying 40k block sizes seems to show roughly the same results as the 11M tests... Maybe this is not related - but I'm not quite sure.

> Verify if that gives you more consistency in your results.
>
> Last but not least, you should definitely try two other things:
> 1) oprofile.
> 2) Roger's kernel with persistent grants and indirect I/O reqs (see below).
>
> Using oprofile you should be able to get more directions/ideas in the sense of where your throughput is getting killed.

I'll have to look at how to use oprofile to give you useful information from it... I assume you mean running this on the Dom0?

> Link to Roger's blog post and kernel code:
> http://blog.xen.org/index.php/2012/11/23/improving-block-protocol-scalability-with-persistent-grants/#more-5711
> http://xenbits.xen.org/gitweb/?p=people/royger/linux.git;a=summary (xen-block-indirect branch)

I'll do some research into this - if someone has a binary RPM package already built from this, I might be able to get results to you much sooner...

> I am very interested in your results, please share them.

<...>

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
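For the record, building Roger's tree into binary RPMs on an EL6-style Dom0 would go roughly like this - an untested sketch; "olddefconfig" exists from kernel 3.7 onwards, otherwise use "make oldconfig":

# in the tree cloned from Roger's git URL earlier in the thread
cd linux
cp /boot/config-$(uname -r) .config   # start from the running kernel's config
make olddefconfig
make -j"$(nproc)" rpm-pkg             # packages the build as binary RPMs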
Hi Steven, I''m a bit surprised with the results of your "dd" tests using bs=11M. I''d expect to see a constant avgrq-sz of 88 (44K). You seem to be experiencing a varying number near 66. I noticed your "dd" was done on top of a mount point. Could you repeat that test on top of the raw /dev/xvd* device? And yes, oprofile should be ran on dom0. If you provide the kernel symbols for your domU, it will be able to give you detailed information on that as well. I''m afraid I don''t have RPMs for Roger''s kernel tree. It shouldn''t be too hard to clone and build it, though. Let me know if you have problems with that. Cheers, Felipe -----Original Message----- From: Steven Haigh [mailto:netwiz@crc.id.au] Sent: 17 April 2013 08:53 To: Felipe Franciosi Cc: xen-devel@lists.xen.org Subject: Re: Xen disk write slowness in kernel 3.8.x On 04/17/2013 12:39 AM, Felipe Franciosi wrote:> Hi Steven, sorry for the delay (doing some training followed by holidays). > > Using "direct" from the guest does speed things up. I''m not entirely sure why. Especially since this is also true (at least with the guests I tested) in XCP/XenServer where the datapath includes a blktap2/tapdisk model where the virtual disks are always opened with O_DIRECT and the only valid operations are READ or WRITE (i.e. there is no special handling for guest operations such as barriers, etc). > > Have a look at http://support.citrix.com/article/CTX136861 for an understanding of how things work in XCP 1.6 / XenServer 6.1.0. > > My guess is that "direct" allows you to control the exact size of the requests (creating a consistent data flow). If you omit it and shove all your requests through the guest''s buffer cache, what you get in the backend will vary largely in terms of request sizes. This will particularly affect the blkback/blkfront ring utilisation. This hypothesis can hopefully be supported by running iostat in dom0 (e.g. iostat -xm 1) and observing the request sizes. Let me know what you see in your environment. If anyone else in the list have other ideas on this I''d like to hear them as this has been puzzling me too. > > Another thing you can try is to use "bs=11M" along i/oflag=direct. The "11M" magic number comes from the ring size: with single-page rings, you can have up to 32 requests of 44KiB each in the ring. > 32 * 44KiB = 1408 KiB > If you try that and observe iostat in dom0, you''ll see that the average request size often drops below 44KiB (88 sectors). To keep it consistent I decided to use multiples of that number. 
The first "whole" number in the order of mebibytes you''ll find is 8 rings: 32 * 44 * 1024 * 8 = 11 M.# dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300 300+0 records in 300+0 records out 3460300800 bytes (3.5 GB) copied, 73.2253 s, 47.3 MB/s Dom0 iostat -xm 1: avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 9.30 0.00 1.16 89.53 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 1267.44 6526.74 100.00 545.35 5.31 27.24 103.28 10.28 14.86 1.75 113.02 sdc 1316.28 6348.84 105.81 546.51 5.69 26.94 102.45 4.56 7.02 1.37 89.30 sdd 1139.53 6269.77 110.47 544.19 5.11 26.85 99.99 5.34 8.30 1.35 88.37 sdf 1286.05 6436.05 96.51 548.84 5.50 27.29 104.06 3.52 5.47 1.34 86.51 md2 0.00 0.00 0.00 1572.09 0.00 51.11 66.59 0.00 0.00 0.00 0.00 # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=11M count=300 oflag=direct 300+0 records in 300+0 records out 3460300800 bytes (3.5 GB) copied, 67.5844 s, 51.2 MB/s avg-cpu: %user %nice %system %iowait %steal %idle 1.14 0.00 12.50 0.00 0.00 86.36 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 1189.77 7005.68 89.77 478.41 5.00 29.35 123.81 5.62 9.90 1.72 97.50 sdc 1040.91 6852.27 77.27 480.68 4.37 28.76 121.60 2.84 5.09 1.11 61.93 sdd 911.36 7057.95 67.05 477.27 3.66 29.46 124.63 4.31 7.51 1.51 82.39 sdf 1185.23 7227.27 72.73 495.45 4.90 30.28 126.80 7.79 13.73 1.82 103.64 md2 0.00 0.00 0.00 1727.27 0.00 56.54 67.04 0.00 0.00 0.00 0.00 It seems that using a block size on dd actually reduces the total write speed. Compare this to using 1M block sizes: # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500 oflag=direct 3500+0 records in 3500+0 records out 3670016000 bytes (3.7 GB) copied, 27.6453 s, 133 MB/s avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 27.78 0.00 0.00 72.22 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 0.00 19040.00 0.00 303.33 0.00 75.83 512.00 1.28 4.22 2.81 85.22 sdc 0.00 19040.00 0.00 302.22 0.00 75.56 512.00 0.50 1.67 1.10 33.33 sdd 0.00 19040.00 0.00 302.22 0.00 75.56 512.00 0.50 1.66 1.10 33.33 sdf 0.00 19040.00 0.00 303.33 0.00 75.83 512.00 1.28 4.22 2.81 85.22 md2 0.00 0.00 0.00 4684.44 0.00 151.11 66.06 0.00 0.00 0.00 0.00 # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=3500 3500+0 records in 3500+0 records out 3670016000 bytes (3.7 GB) copied, 81.3315 s, 45.1 MB/s avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 10.47 0.00 0.00 89.53 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sde 1116.28 5995.35 101.16 555.81 5.02 25.84 96.20 5.99 9.31 1.52 100.00 sdc 1209.30 6019.77 111.63 536.05 5.19 25.07 95.70 5.66 8.58 1.45 93.95 sdd 1236.05 5967.44 97.67 546.51 5.08 25.54 97.34 8.16 12.67 1.60 103.02 sdf 1203.49 5939.53 97.67 554.65 5.12 25.63 96.54 4.76 7.48 1.44 93.95 md2 0.00 0.00 0.00 1486.05 0.00 48.31 66.59 0.00 0.00 0.00 0.00 Out of interest, trying 40k block sizes seems to show roughly the same results as the 11M tests... Maybe this is not related - but I''m not quite sure.> Verify if that gives you more consistency in your results. > > Last but not least, you should definitely try two other things: > 1) oprofile. > 2) Roger''s kernel with persistent grants and indirect I/O reqs (see below). > > Using oprofile you should be able to get more directions/ideas in the sense of where your throughput is getting killed.I''ll have to look at how to use oprofile to give you useful information from it... 
> Verify if that gives you more consistency in your results.
>
> Last but not least, you should definitely try two other things:
> 1) oprofile.
> 2) Roger's kernel with persistent grants and indirect I/O reqs (see below).
>
> Using oprofile you should be able to get more directions/ideas as to where your throughput is getting killed.

I'll have to look at how to use oprofile to give you useful information
from it... I assume you mean running this on the Dom0?

> Finally, high throughput is usually achieved with large bandwidth and low latency. Persistent granting should help a lot with reducing latency. The indirect I/O operations that have recently been discussed on this list should help you to achieve larger bandwidth. This should be particularly beneficial in arrays such as yours, where the stripe size is larger than 88K.
>
> Link to Roger's blog post and kernel code:
> http://blog.xen.org/index.php/2012/11/23/improving-block-protocol-scalability-with-persistent-grants/#more-5711
> http://xenbits.xen.org/gitweb/?p=people/royger/linux.git;a=summary
> (xen-block-indirect branch)

I'll do some research into this - if someone has a binary RPM package
already built from this, I might be able to get results to you much
sooner...

> I am very interested in your results, please share them.
>
> Cheers,
> Felipe
>
> -----Original Message-----
> From: Steven Haigh [mailto:netwiz@crc.id.au]
> Sent: 09 April 2013 07:54
> To: Felipe Franciosi
> Cc: xen-devel@lists.xen.org
> Subject: Re: Xen disk write slowness in kernel 3.8.x
>
> Resending this - hopefully for comment...
>
> On 04/04/13 01:30, Felipe Franciosi wrote:
>> I think you should use i/oflag=direct on your "dd"s to bypass the buffer cache in the guest. Might provide more consistent results.
>
> Ah - you are correct. I usually don't forget that. This has, however, turned up something interesting...
>
> Entirely from the DomU:
> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB) copied, 45.2753 s, 94.9 MB/s
> # rm /mnt/fileshare/output.zero
> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB) copied, 80.5808 s, 53.3 MB/s
>
> I did this 3-4 times with the same trend...
>
> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB) copied, 40.693 s, 106 MB/s
> # rm /mnt/fileshare/output.zero
> # sync
> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB) copied, 82.1511 s, 52.3 MB/s
>
> So, when the DomU is using write buffers, write speed is halved. This I don't understand.
>
> I haven't tweaked the array at all yet (as I've just been experimenting with this problem) - so with a bit of time, I can probably increase its write speed some more - but until this speed decrease is solved, it isn't really worth it...
>
>> The result of oprofile is intriguing. Did you have a chance to try Roger's persistent grant implementation?
>>
>> He mentions it in his 08/03 e-mail. It's here:
>> git://xenbits.xen.org/people/royger/linux.git xen-block-indirect
>
> I can't say I have - although I package Xen for EL6, we're getting to the edge of my knowledge here.
>
>>
>> Felipe
>>
>> -----Original Message-----
>> From: Steven Haigh [mailto:netwiz@crc.id.au]
>> Sent: 03 April 2013 14:27
>> To: xen-devel@lists.xen.org
>> Cc: Felipe Franciosi
>> Subject: Re: Xen disk write slowness in kernel 3.8.x
>>
>> On 03/04/13 23:29, Felipe Franciosi wrote:
>>> Do you know the size of your requests?
>>> You used "iostat -m". Perhaps "iostat -xm" could be more meaningful, as it will tell you the average request size in sectors.
>>
>> Good call.
>> I started just a sequential write with dd from /dev/zero on the DomU:
>> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
>> 4096+0 records in
>> 4096+0 records out
>> 4294967296 bytes (4.3 GB) copied, 91.269 s, 47.1 MB/s
>>
>> iostat -xm shows:
>> avg-cpu:  %user  %nice %system %iowait  %steal  %idle
>>            0.11   0.00    2.71    0.23    0.56  96.38
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
>> sdc     1208.40 5337.80 150.00  645.20   5.35  23.39    74.00     6.47   8.15  0.91  72.72
>> sdf     1199.80 5350.80 148.00  643.40   5.30  23.34    74.12     6.88   8.68  0.96  76.06
>> sdd     1203.40 5304.60 148.80  638.60   5.28  23.21    74.11     5.42   6.88  0.78  61.38
>> sde     1213.80 5382.40 148.40  652.40   5.37  23.59    74.08     6.40   8.00  0.94  75.20
>> md3        0.00    0.00   1.60 1312.20   0.05  42.68    66.60     0.00   0.00  0.00   0.00
>>
>> Shutting down the DomU, mounting /dev/md3 and doing this directly from the Dom0 shows:
>> # dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096
>> 4096+0 records in
>> 4096+0 records out
>> 4294967296 bytes (4.3 GB) copied, 44.2801 s, 97.0 MB/s
>>
>> details from iostat -xm:
>> avg-cpu:  %user  %nice %system %iowait  %steal  %idle
>>            0.00   0.00    2.16    0.00    0.57  97.27
>>
>> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
>> sdc     1175.40 5220.60 143.40  633.60   5.13  22.85    73.76     6.01   7.73  0.87  67.60
>> sdf     1175.20 5154.00 147.40  624.40   5.14  22.55    73.46     6.87   8.88  0.97  74.90
>> sdd     1183.40 5133.80 145.40  625.20   5.19  22.50    73.60     5.19   6.73  0.77  59.60
>> sde     1176.40 5229.60 146.00  637.00   5.16  22.99    73.62     7.39   9.51  0.99  77.90
>> md3        0.00    0.00   0.60 1277.40   0.02  41.56    66.63     0.00   0.00  0.00   0.00
>>
>> This about ties in with what bonnie++ gets as the write speed:
>> Version  1.96      ------Sequential Output------ --Sequential Input- --Random-
>> Concurrency   1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>> xenhost.lan.crc. 2G   635  91 97526  15 67864  16   952  97 295833  35 409.6  10
>>
>> bonnie++ maxes out at about the same as dd in the DomU.
>>
>>> I still didn't understand your environment. I think this is the first time you mention NFS/SMB.
>>
>> /dev/md3 is passed through as a phy device to a DomU. This DomU runs both samba and NFS shares. It doesn't seem to matter which method is the source of the writes to the drive (dd, bonnie++, sending a file via NFS or Samba) - the max write speed seems to bottleneck at ~50MB/sec, which I believe is purely the write speed to the array via the DomU.
>>
>>> Weren't you testing copies from /dev/zero to your device?
>>
>> I have tested many ways - this being just one of them.
>>
>>> I'd also recommend you run "oprofile" to see if there are any time sinks that we are overlooking. You can find out how to use it here:
>>> http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf
>>
>> I ran 'perf top' on the Dom0, then started a write (dd if=/dev/zero of=/mnt/fileshare/output.zero bs=1M count=4096) on the DomU.
>> This is what I'd see as the relevant output:
>>
>> Samples: 574K of event 'cpu-clock', Event count (approx.): 60028
>>   88.37%  [kernel]    [k] xen_hypercall_sched_op
>>    4.73%  [kernel]    [k] xen_hypercall_xen_version
>>    1.00%  [kernel]    [k] xen_hypercall_grant_table_op
>>    0.99%  [raid456]   [k] handle_stripe
>>    0.77%  [raid6_pq]  [k] raid6_sse24_gen_syndrome
>>    0.66%  [raid456]   [k] ops_run_io
>>    0.57%  [kernel]    [k] memcpy
>>    0.21%  [kernel]    [k] xen_restore_fl_direct
>>    0.18%  [raid456]   [k] raid5_end_write_request
>>    0.18%  [raid456]   [k] __raid_run_ops
>>    0.14%  [kernel]    [k] xen_hypercall_event_channel_op
>>    0.11%  [kernel]    [k] get_phys_to_machine
>>    0.09%  [raid456]   [k] schedule_reconstruction
>>
>> Now repeated, with the same on the DomU:
>> Samples: 300K of event 'cpu-clock', Event count (approx.): 84845
>>
>>   97.63%  [kernel]  [k] hypercall_page
>>    0.44%  [kernel]  [k] copy_user_generic_string
>>    0.41%  [kernel]  [k] __clear_user
>>    0.08%  [kernel]  [k] __wake_up_bit
>>    0.07%  [kernel]  [k] xen_restore_fl_direct
>>    0.07%  [kernel]  [k] __mem_cgroup_commit_charge
>>
>> Now, interestingly enough... The root drive (xvda) of the DomU in
>> question is on a separate RAID1. This is passed as an LV.
>>
>> # dd if=/dev/zero of=/output.bin bs=1M count=4096
>> 4096+0 records in
>> 4096+0 records out
>> 4294967296 bytes (4.3 GB) copied, 69.7036 s, 61.6 MB/s
>>
>> Then from the Dom0 to the same RAID1:
>> # dd if=/dev/zero of=output.zero bs=1M count=4096
>> 4096+0 records in
>> 4096+0 records out
>> 4294967296 bytes (4.3 GB) copied, 48.4407 s, 88.7 MB/s
>>
>> I don't really know what else I can do to try and see where the
>> slowdown is here - I am open to suggestions though :)

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
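For anyone wanting to capture the 'perf top' comparison above in a
reproducible form, a minimal recording session on the Dom0 might look
like the sketch below (standard perf options only; the 30-second
sampling window is an arbitrary choice, not something from the thread):

  # record system-wide samples with call graphs while the dd test runs
  perf record -a -g -- sleep 30
  # then summarise by symbol
  perf report --sort symbol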
On 18/04/2013, at 23:10, Felipe Franciosi <felipe.franciosi@citrix.com> wrote:
> Hi Steven,
>
> I'm a bit surprised by the results of your "dd" tests using bs=11M.
> I'd expect to see a constant avgrq-sz of 88 (44K). You seem to be seeing a varying number near 66.
>
> I noticed your "dd" was done on top of a mount point. Could you repeat that test on top of the raw /dev/xvd* device?

Sadly not. The filesystem takes up the whole RAID6 array and is XFS. As
previously stated, the device on the Dom0 (/dev/md2) is passed as xvdb
to the DomU.

XFS info from the DomU:
$ xfs_info /dev/xvdb
meta-data=/dev/xvdb              isize=256    agcount=32, agsize=30521632 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=976691648, imaxpct=5
         =                       sunit=32     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=476900, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> And yes, oprofile should be run on dom0. If you provide the kernel symbols for your domU, it will be able to give you detailed information on that as well.

Hmmm - not exactly sure how to do this. Any pointers to where I can
read up on it?

> I'm afraid I don't have RPMs for Roger's kernel tree. It shouldn't be too hard to clone and build it, though. Let me know if you have problems with that.

I'll try to have a look at this over the weekend.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
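As a pointer for the question above: the wiki page linked earlier in
the thread (http://wiki.xen.org/wiki/Xen_Profiling:_oprofile_and_perf)
covers profiling guests from the Dom0. With a xenoprof-enabled oprofile
build, a passive-domain session looks roughly like the sketch below -
the vmlinux paths are placeholders for wherever the Dom0 and DomU
kernel symbol files actually live, and the domain ID (1) is an
assumption:

  opcontrol --reset
  # --passive-domains takes the domU's domain ID; --passive-images points
  # at an uncompressed vmlinux with symbols for that guest's kernel
  opcontrol --start --vmlinux=/path/to/dom0-vmlinux \
            --passive-domains=1 --passive-images=/path/to/domU-vmlinux
  # ... run the dd/bonnie++ workload in the guest ...
  opcontrol --stop
  opreport -l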
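Since the thread leaves building Roger's tree as an exercise, here is a
rough, untested sketch of what that might look like on an EL6 Dom0.
Only the repository URL and the xen-block-indirect branch name come
from the messages above; the config and packaging steps are generic
kernel-build assumptions:

  git clone git://xenbits.xen.org/people/royger/linux.git
  cd linux
  git checkout xen-block-indirect

  # start from the running kernel's configuration and answer the
  # prompts for any new options
  cp /boot/config-$(uname -r) .config
  make oldconfig

  # build a binary kernel RPM using the kernel tree's own packaging target
  make -j4 binrpm-pkg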