The body of dm-ioband. This patch is an all-in-one patch of dm-ioband
so that it replaces dm-add-ioband.patch in the device-mapper development tree.
Signed-off-by: Ryo Tsuruta <ryov at valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka at valinux.co.jp>
---
Documentation/device-mapper/ioband.txt | 1113 +++++++++++++++++++++++++
Documentation/device-mapper/range-bw.txt | 99 ++
drivers/md/Kconfig | 13
drivers/md/Makefile | 3
drivers/md/dm-ioband-ctl.c | 1357 +++++++++++++++++++++++++++++++
drivers/md/dm-ioband-policy.c | 543 ++++++++++++
drivers/md/dm-ioband-rangebw.c | 669 +++++++++++++++
drivers/md/dm-ioband-type.c | 76 +
drivers/md/dm-ioband.h | 231 +++++
include/trace/events/dm-ioband.h | 242 +++++
10 files changed, 4346 insertions(+)
Index: linux-2.6.31/Documentation/device-mapper/ioband.txt
==================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/ioband.txt
@@ -0,0 +1,1113 @@
+ Block I/O bandwidth control: dm-ioband
+
+ -------------------------------------------------------
+
+ Table of Contents
+
+ [1]What's dm-ioband all about?
+
+ [2]Differences from the CFQ I/O scheduler
+
+ [3]How dm-ioband works.
+
+ [4]Setup and Installation
+
+ [5]Getting started
+
+ [6]Command Reference
+
+ [7]Examples
+
+What's dm-ioband all about?
+
+ dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+ driver. Several jobs using the same block device have to share the
+ bandwidth of the device. dm-ioband gives bandwidth to each job according
+ to bandwidth control policies.
+
+ A job is a group of processes that share the same pid, pgrp or uid, or a
+ virtual machine such as a KVM or Xen guest. A job can also be a cgroup if
+ the blkio-cgroup patch is applied; the patch can be found at
+ http://sourceforge.net/apps/trac/ioband/.
+
+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
+ |cgroup | |cgroup | | the | | pid | | pid | | the | jobs
+ | A | | B | |others | | X | | Y | |others |
+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+
+ | | | | | |
+ +-----|---------|---------|----+----|---------|---------|-----+
+ | | /dev/mapper/disk1 | | | /dev/mapper/disk2 | |
+ |-----|---------|---------|----+----|---------|---------|-----|
+ | +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ |
+ | | ioband| | ioband| |default| | ioband| | ioband| |default| |
+ | | group | | group | | group | | group | | group | | group | | dm-ioband
+ | |-------+-+-------+-+-------+-+-------+-+-------+-+-------| |
+ | | bandwidth control | |
+ | +-------------|-----------------------------|-------------+ |
+ ---------------|-----------------------------|---------------
+ | |
+ +---------------V--------------+--------------V---------------+
+ | /dev/sdb1 | /dev/sdb2 | partitions
+ +------------------------------+------------------------------+
+
+
+ --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+ dm-ioband offers flexible bandwidth configuration.
+
+ Because dm-ioband is implemented outside the I/O scheduling layer, it
+ works with any I/O scheduler, including the NOOP scheduler, which is
+ often chosen for high-end storage. It supports both partition-based
+ bandwidth control and job-based control, where a job is a group of
+ processes. In addition, each block device can be given its own bandwidth
+ configuration.
+
+ In contrast, the current implementation of the CFQ scheduler has 8 I/O
+ priority levels, and all jobs whose processes have the same I/O priority
+ share the bandwidth assigned to that level. I/O priority is also an
+ attribute of a process, so it affects all block devices equally.
+
+ --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+ The bandwidth of each job is determined by a bandwidth control policy.
+ dm-ioband provides three policies, "weight", "weight-iosize" and
+ "range-bw", and a user selects one of them at setup time.
+
+ --------------------------------------------------------------------------
+
+ weight and weight-iosize policy
+
+ Every ioband device has one ioband group, which by default is called the
+ default group, and can also have extra ioband groups. Each ioband group
+ has its own weight and tokens. The number of tokens given to each ioband
+ group is proportional to its weight.
+
+ The ioband group can pass on I/O requests that its job issues to the
+ underlying layer so long as it has tokens left, while requests are blocked
+ if there aren't any tokens left in the ioband group. The tokens are
+ refilled once all of the ioband groups that have requests on a given
+ underlying block device use up their tokens.
+
+ The weight policy makes dm-ioband consume one token per I/O request.
+ The weight-iosize policy makes dm-ioband consume one token per I/O
+ sector. For example, a single 4 Kbyte read request (8 sectors of 512
+ bytes each) consumes 8 tokens.
+
+ With this approach, a job running in an ioband group with a large weight
+ is guaranteed a correspondingly large share of the I/O bandwidth.
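+
+ The token accounting of the two policies can be sketched as below. This is
+ only an illustration, not dm-ioband code: the function name
+ tokens_for_request is hypothetical, 512-byte sectors are assumed, and
+ rounding a partial sector up to a whole one is an assumption as well.

```shell
#!/bin/sh
# Hypothetical helper: tokens charged for one I/O request under each policy.
#   $1  policy: "weight" (one token per request) or
#               "weight-iosize" (one token per sector)
#   $2  request size in bytes (512-byte sectors are assumed)
tokens_for_request() {
    policy=$1
    bytes=$2
    case "$policy" in
        weight)        echo 1 ;;
        weight-iosize) echo $(( (bytes + 511) / 512 )) ;;
        *)             echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}

tokens_for_request weight 4096          # prints 1
tokens_for_request weight-iosize 4096   # prints 8
```

+ Under "weight" a large request costs the same as a small one, while
+ under "weight-iosize" the cost grows with the request size.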
+
+ --------------------------------------------------------------------------
+
+ range-bw policy
+
+ range-bw provides predictable I/O bandwidth bounded by minimum and
+ maximum values defined by the administrator. It is also possible to set
+ only a maximum value, to act purely as an I/O limit. This lets you assign
+ a specific, fixed bandwidth to a group that satisfies its I/O requirement
+ regardless of the overall I/O bandwidth.
+
+ The minimum bandwidth guarantees stable performance or reliability for a
+ specific process group, and the maximum bandwidth throttles unnecessary
+ I/O usage or reserves I/O bandwidth for another use. range-bw therefore
+ keeps a group's I/O bandwidth between the minimum and maximum values.
+
+ Values are specified in Kbytes/sec. To allocate 3M~5Mbytes/sec of I/O
+ bandwidth to group X, set min-bw to 3000 and max-bw to 5000.
+
+ Attention
+
+ Although range-bw provides predictable I/O bandwidth, it must be
+ configured within the total I/O bandwidth of the I/O system in order to
+ guarantee the minimum I/O requirement. For example, if the total I/O
+ bandwidth is 40Mbytes/sec, the sum of the I/O bandwidth configured for
+ all process groups should be equal to or smaller than 40Mbytes/sec.
+ Check the total I/O bandwidth before setting range-bw up.
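+
+ The check described above can be sketched as a small shell helper. The
+ name min_bw_fits and the sample values are hypothetical; the total
+ bandwidth has to be measured separately, and all values are in Kbytes/sec.

```shell
#!/bin/sh
# Hypothetical check: do the configured min-bw values fit into the total
# I/O bandwidth? All values are in Kbytes/sec, as in the range-bw policy.
# usage: min_bw_fits TOTAL MIN_BW...
min_bw_fits() {
    total=$1
    shift
    sum=0
    for min_bw in "$@"; do
        sum=$(( sum + min_bw ))
    done
    [ "$sum" -le "$total" ]
}

# 40Mbytes/sec in total, written as 40000 to follow the 3000/5000
# convention used for min-bw and max-bw.
if min_bw_fits 40000 3000 10000 20000; then
    echo "min-bw values fit"
else
    echo "min-bw values exceed the total bandwidth"
fi
```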
+
+ --------------------------------------------------------------------------
+
+Setup and Installation
+
+ Build a kernel with these options enabled:
+
+ CONFIG_MD
+ CONFIG_BLK_DEV_DM
+ CONFIG_DM_IOBAND
+
+
+ If compiled as a module, use modprobe to load dm-ioband.
+
+ # make modules
+ # make modules_install
+ # depmod -a
+ # modprobe dm-ioband
+
+
+ The "dmsetup targets" command shows all available device-mapper targets.
+ "ioband" and the version number are displayed when dm-ioband has been
+ loaded.
+
+ # dmsetup targets | grep ioband
+ ioband v1.0.0
+
+
+ --------------------------------------------------------------------------
+
+Getting started
+
+ The following is a brief description of how to control the I/O bandwidth
+ of disks. In this description, we'll take one disk with two partitions as
+ an example target.
+
+ --------------------------------------------------------------------------
+
+ Create and map ioband devices
+
+ Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+ to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+ and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+ the bandwidth of "/dev/sda" while "ioband2" can use 20%.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+ "weight 0 :40" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+ "weight 0 :10" | dmsetup create ioband2
+
+
+ If the commands are successful then the device files
+ "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+ --------------------------------------------------------------------------
+
+ Additional bandwidth control
+
+ In this example two extra ioband groups are created on "ioband1."
+
+ First, set the ioband group type as user. Next, create two ioband groups
+ that have id 1000 and 2000. Then, give weights of 30 and 20 to the ioband
+ groups respectively.
+
+ # dmsetup message ioband1 0 type user
+ # dmsetup message ioband1 0 attach 1000
+ # dmsetup message ioband1 0 attach 2000
+ # dmsetup message ioband1 0 weight 1000:30
+ # dmsetup message ioband1 0 weight 2000:20
+
+
+ Now the processes owned by uid 1000 can use 30% --- 30/(30+20+40+10)*100
+ --- of the bandwidth of "/dev/sda" when the processes issue I/O requests
+ through "ioband1." The processes owned by uid 2000 can use 20% of the
+ bandwidth likewise.
+
+ Table 1. Weight assignments
+
+ +----------------------------------------------------------------+
+ | ioband device | ioband group | ioband weight |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | user id 1000 | 30 |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | user id 2000 | 20 |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | default group(the other users) | 40 |
+ |---------------+--------------------------------+---------------|
+ | ioband2 | default group | 10 |
+ +----------------------------------------------------------------+
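+
+ The percentages for the weights in Table 1 can be recomputed with a short
+ shell sketch. The helper name share_percent is hypothetical, and integer
+ division is used, so fractional percentages are truncated.

```shell
#!/bin/sh
# Hypothetical helper: bandwidth share (in percent) of one ioband group.
# usage: share_percent WEIGHT ALL_WEIGHTS...
# ALL_WEIGHTS lists every group sharing the bandwidth, including WEIGHT.
share_percent() {
    weight=$1
    shift
    total=0
    for w in "$@"; do
        total=$(( total + w ))
    done
    echo $(( weight * 100 / total ))
}

# Weights from Table 1: uid 1000, uid 2000, ioband1 default, ioband2 default.
share_percent 30 30 20 40 10   # prints 30
share_percent 20 30 20 40 10   # prints 20
share_percent 40 30 20 40 10   # prints 40
share_percent 10 30 20 40 10   # prints 10
```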
+
+ --------------------------------------------------------------------------
+
+ Remove the ioband devices
+
+ Remove the ioband devices when no longer used.
+
+ # dmsetup remove ioband1
+ # dmsetup remove ioband2
+
+
+ --------------------------------------------------------------------------
+
+Command Reference
+
+ Create an ioband device
+
+ SYNOPSIS
+
+ dmsetup create IOBAND_DEVICE
+
+ DESCRIPTION
+
+ Create an ioband device with the given name IOBAND_DEVICE.
+ Generally, dmsetup reads a table from standard input. Each line of
+ the table specifies a single target and is of the form:
+
+ start_sector num_sectors "ioband" device_file ioband_device_id \
+ io_throttle io_limit ioband_group_type policy policy_args...
+
+
+ start_sector, num_sectors
+
+ The sector range of the underlying device that
+ dm-ioband maps.
+
+ ioband
+
+ Specify the string "ioband" as the target type.
+
+ device_file
+
+ Underlying device name.
+
+ ioband_device_id
+
+ The ID for an ioband device can be symbolic,
+ numeric, or mixed. The same ID must be set on all
+ ioband devices that share the same bandwidth. This is
+ useful for grouping devices that are carved out of a
+ single underlying drive, such as a RAID drive or an
+ LVM striped logical volume.
+
+ io_throttle
+
+ When a device has a lot of tokens, and the number
+ of in-flight I/Os in dm-ioband exceeds io_throttle,
+ dm-ioband gives priority to the device and issues
+ I/Os to the device until no tokens of the device are
+ left. If 0 is specified, the default value is used.
+ This setting applies to all ioband devices that have
+ the same ioband device ID as the one specified by
+ "ioband_device_id."
+
+ io_limit
+
+ Dm-ioband blocks all I/O requests for IOBAND_DEVICE
+ when the number of BIOs in progress exceeds this
+ value. If 0 is specified, the default value is used.
+ This setting applies to all ioband devices that have
+ the same ioband device ID as the one specified by
+ "ioband_device_id."
+
+ ioband_group_type
+
+ Specify how to evaluate the ioband group ID. The
+ selectable group types are "none", "user", "gid",
+ "pid" or "pgrp." The type "cgroup" is enabled by
+ applying the blkio-cgroup patch. Specify "none" if
+ you don't need any ioband groups other than the
+ default ioband group.
+
+ policy and policy_args
+
+ Specify a bandwidth control policy. The selectable
+ policies are "weight", "weight-iosize" or "range-bw."
+ This setting applies to all ioband devices that have
+ the same ioband device ID as the one specified by
+ "ioband_device_id."
+
+ policy_args are specific to each policy. See below
+ for information on each policy.
+
+ WEIGHT AND WEIGHT-IOSIZE POLICIES
+
+ The "weight" and "weight-iosize" policies distribute bandwidth
+ in proportion to the weight of each ioband group. Each ioband group
+ is charged on an I/O count basis when the "weight" policy is used
+ and on an I/O size basis when the "weight-iosize" policy is used.
+ The arguments are of the form:
+
+ token_base :weight [ioband_group_id:weight...]
+
+
+ token_base
+
+ The number of tokens specified by token_base is
+ distributed to all ioband groups in proportion to
+ the weight of each ioband group. If 0 is specified,
+ the default value is used. This setting applies to
+ all ioband devices that have the same ioband device
+ ID as the one specified by "ioband_device_id."
+
+ :weight
+
+ Set the weight of the default ioband group.
+
+ ioband_group_id:weight
+
+ Create an extra ioband group with an
+ ioband_group_id and set its weight. The
+ ioband_group_id is an identification number that
+ corresponds to a pid, pgrp, uid and so on, depending
+ on the ioband group type setting.
+
+ RANGE-BW POLICY
+
+ The "range-bw" policy distributes predictable bandwidth to
+ each group according to its minimum and maximum bandwidth
+ values. range-bw is not based on I/O tokens, which the other
+ policies grant as permission to issue I/O.
+
+ For this reason, "0" is used for the token_base parameter in the
+ range-bw policy. The min-bw and max-bw parameters are generally
+ used together, but max-bw can be used alone to act purely as a
+ limit. The arguments are of the form:
+
+ token_base :min-bw:max-bw [ioband_group_id:min-bw:max-bw...]
+
+
+ token_base
+
+ "0" is used, because it is not meaningful in this
+ policy.
+
+ min-bw
+
+ Set the minimum bandwidth of the default ioband
+ group. This parameter can't be used alone.
+
+ max-bw
+
+ Set the maximum bandwidth of the default ioband
+ group.
+
+ ioband_group_id:min-bw:max-bw
+
+ Create an extra ioband group with an
+ ioband_group_id and set its min and max bandwidth.
+ The ioband_group_id is an identification number that
+ corresponds to a pid, pgrp, uid and so on, depending
+ on the ioband group type setting.
+
+ EXAMPLE
+
+ Create an ioband device with the following parameters:
+
+ * Starting sector = "0"
+
+ * The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+ * Target type = "ioband"
+
+ * Underlying device name = "/dev/sda1"
+
+ * Ioband device ID = "share1"
+
+ * I/O throttle = "10"
+
+ * I/O limit = "400"
+
+ * Ioband group type = "user"
+
+ * Bandwidth control policy = "weight"
+
+ * Token base = "2048"
+
+ * Weight for the default ioband group = "100"
+
+ * Weight for the ioband group 1000 = "80"
+
+ * Weight for the ioband group 2000 = "20"
+
+ * Ioband device name = "ioband1"
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+ "share1 10 400 user weight 2048 :100 1000:80 2000:20" \
+ | dmsetup create ioband1
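+
+ When several ioband devices are created from scripts, it can help to
+ assemble the table line in one place and inspect it before handing it to
+ dmsetup. The sketch below only prints the line and does not call dmsetup;
+ the helper name ioband_table is hypothetical and the sector count is an
+ illustrative value, not a measured one.

```shell
#!/bin/sh
# Hypothetical helper: print an ioband table line with start sector 0.
# usage: ioband_table NUM_SECTORS DEVICE_FILE IOBAND_DEVICE_ID IO_THROTTLE \
#                     IO_LIMIT IOBAND_GROUP_TYPE POLICY POLICY_ARGS...
ioband_table() {
    num_sectors=$1
    device_file=$2
    shift 2
    # "$*" joins the remaining arguments with single spaces.
    echo "0 $num_sectors ioband $device_file $*"
}

# Same parameters as the example above; 32129937 stands in for the value
# that "blockdev --getsize /dev/sda1" would report.
ioband_table 32129937 /dev/sda1 share1 10 400 user weight 2048 \
    :100 1000:80 2000:20
```

+ Once the printed line looks right, it can be piped into "dmsetup create
+ ioband1" in the same way as in the example above.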
+
+
+ Create two device groups (ID=1,2). The bandwidth of each
+ device group is controlled independently of the other.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+ "0 0 none weight 0 :80" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+ "0 0 none weight 0 :20" | dmsetup create ioband2
+ # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+ "0 0 none weight 0 :60" | dmsetup create ioband3
+ # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+ "0 0 none weight 0 :40" | dmsetup create ioband4
+
+
+ --------------------------------------------------------------------------
+
+ Remove the ioband device
+
+ SYNOPSIS
+
+ dmsetup remove IOBAND_DEVICE
+
+ DESCRIPTION
+
+ Remove the specified ioband device IOBAND_DEVICE. All the ioband
+ groups attached to the ioband device are also removed
+ automatically.
+
+ EXAMPLE
+
+ Remove ioband device "ioband1."
+
+ # dmsetup remove ioband1
+
+
+ --------------------------------------------------------------------------
+
+ Set an ioband group type
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 type TYPE
+
+ DESCRIPTION
+
+ Set the ioband group type of IOBAND_DEVICE. TYPE must be one of
+ "none", "user", "gid", "pid" or "pgrp." The type "cgroup" is
+ enabled by applying the blkio-cgroup patch. Once the type is set,
+ new ioband groups can be created on IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the ioband group type of ioband device "ioband1" to "user."
+
+ # dmsetup message ioband1 0 type user
+
+
+ --------------------------------------------------------------------------
+
+ Create an ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 attach ID
+
+ DESCRIPTION
+
+ Create an ioband group and attach it to IOBAND_DEVICE. ID
+ specifies user-id, group-id, process-id or process-group-id
+ depending on the ioband group type of IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Create an ioband group which consists of all processes with
+ user-id 1000 and attach it to ioband device "ioband1."
+
+ # dmsetup message ioband1 0 type user
+ # dmsetup message ioband1 0 attach 1000
+
+
+ --------------------------------------------------------------------------
+
+ Detach the ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 detach ID
+
+ DESCRIPTION
+
+ Detach the ioband group specified by ID from ioband device
+ IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Detach the ioband group with ID "2000" from ioband device
+ "ioband2."
+
+ # dmsetup message ioband2 0 detach 2000
+
+
+ --------------------------------------------------------------------------
+
+ Set bandwidth control policy
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 policy POLICY
+
+ DESCRIPTION
+
+ Set the bandwidth control policy to POLICY. The selectable
+ policies are "weight", "weight-iosize" and "range-bw." This
+ setting applies to all ioband devices that have the same ioband
+ device ID as IOBAND_DEVICE.
+
+ weight
+
+ This policy distributes bandwidth proportional to
+ the weight of each ioband group. Each ioband group is
+ charged on an I/O count basis.
+
+ weight-iosize
+
+ This policy distributes bandwidth proportional to
+ the weight of each ioband group. Each ioband group is
+ charged on an I/O size basis.
+
+ range-bw
+
+ This policy guarantees minimum bandwidth and limits
+ maximum bandwidth for each ioband group.
+
+ EXAMPLE
+
+ Set the bandwidth control policy of ioband devices which have the
+ same ioband device ID as "ioband1" to "weight-iosize."
+
+ # dmsetup message ioband1 0 policy weight-iosize
+
+
+ --------------------------------------------------------------------------
+
+ Set the weight of an ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 weight VAL
+
+ dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+ DESCRIPTION
+
+ Set the weight of the ioband group which belongs to
+ IOBAND_DEVICE. The group is determined by ID. If ID: is omitted,
+ the default ioband group is chosen.
+
+ The following example means that "ioband1" can use 80% ---
+ 40/(40+10)*100 --- of the bandwidth of the underlying block device
+ while "ioband2" can use 20%.
+
+ # dmsetup message ioband1 0 weight 40
+ # dmsetup message ioband2 0 weight 10
+
+
+ The following lines have the same effect as the above:
+
+ # dmsetup message ioband1 0 weight 4
+ # dmsetup message ioband2 0 weight 1
+
+
+ VAL must be an integer larger than 0. The default value, which
+ is assigned to newly created ioband groups, is 100.
+
+ EXAMPLE
+
+ Set the weight of the default ioband group of "ioband1" to 40.
+
+ # dmsetup message ioband1 0 weight 40
+
+
+ Set the weight of the ioband group of "ioband1" with ID "1000"
+ to 10.
+
+ # dmsetup message ioband1 0 weight 1000:10
+
+
+ --------------------------------------------------------------------------
+
+ Set the range-bw of an ioband group
+
+ SYNOPSIS
+
+ dmsetup -- message IOBAND_DEVICE 0 range-bw -1:MIN-BW:MAX-BW
+
+ dmsetup message IOBAND_DEVICE 0 range-bw ID:MIN-BW:MAX-BW
+
+ DESCRIPTION
+
+ Set the range-bw of the ioband group which belongs to
+ IOBAND_DEVICE. The group is determined by ID. If -1 is specified
+ as ID, the default ioband group is chosen.
+
+ The following example means that "ioband1" can use
+ 5M~6Mbytes/sec bandwidth of the underlying block device while
+ "ioband2" can use 900K~1Mbytes/sec bandwidth.
+
+ # dmsetup message -- ioband1 0 range-bw -1:5000:6000
+
+ # dmsetup message -- ioband2 0 range-bw -1:900:1000
+
+
+ MIN-BW and MAX-BW must be integers larger than 0, and their
+ unit is Kbytes/sec.
+
+ EXAMPLE
+
+ Set the range-bw of the default ioband group of "ioband1" to
+ 200K~300Kbytes/sec of I/O bandwidth.
+
+ # dmsetup -- message ioband1 0 range-bw -1:200:300
+
+
+ Set the range-bw of the ioband group of "ioband1" with ID "1000"
+ to 10M~12Mbytes/sec of I/O bandwidth.
+
+ # dmsetup message ioband1 0 range-bw 1000:10000:12000
+
+
+ --------------------------------------------------------------------------
+
+ Set the number of tokens
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 token VAL
+
+ DESCRIPTION
+
+ Set the number of tokens to VAL. The tokens are distributed to
+ all ioband groups in proportion to the weight of each ioband
+ group. If 0 is specified, the default value is used. This setting
+ applies to all ioband devices that have the same ioband device ID
+ as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the number of tokens to 256.
+
+ # dmsetup message ioband1 0 token 256
+
+
+ --------------------------------------------------------------------------
+
+ Set a limit of how many tokens are carried over
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 carryover VAL
+
+ DESCRIPTION
+
+ When dm-ioband refills an ioband group with tokens after
+ another ioband group has already been refilled several times,
+ it determines the number of tokens to refill by multiplying the
+ number of tokens given in one refill by the smaller of two
+ values: the number of times the other group has already been
+ refilled, or this limit. If 0 is specified, the default value is
+ used. This setting applies to all ioband devices that have the
+ same ioband device ID as IOBAND_DEVICE.
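+
+ The refill rule described above can be written out as a short shell
+ sketch. It only restates the description and is not taken from the
+ driver; the helper name refill_tokens and the sample numbers are
+ hypothetical.

```shell
#!/bin/sh
# Hypothetical restatement of the carryover rule:
#   refill = tokens per refill * min(times the other group was refilled,
#                                    carryover limit)
# usage: refill_tokens TOKENS_PER_REFILL TIMES_OTHER_REFILLED CARRYOVER
refill_tokens() {
    per_refill=$1
    times=$2
    carryover=$3
    if [ "$times" -lt "$carryover" ]; then
        factor=$times
    else
        factor=$carryover
    fi
    echo $(( per_refill * factor ))
}

refill_tokens 256 5 2   # prints 512: capped by the carryover limit of 2
refill_tokens 256 1 2   # prints 256: the other group was refilled once
```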
+
+ EXAMPLE
+
+ Set a limit for "ioband1" to 2.
+
+ # dmsetup message ioband1 0 carryover 2
+
+
+ --------------------------------------------------------------------------
+
+ Set I/O throttling
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+ DESCRIPTION
+
+ When a device has a lot of tokens, and the number of in-flight
+ I/Os in dm-ioband exceeds io_throttle, dm-ioband gives priority to
+ the device and issues I/Os to the device until no tokens of the
+ device are left. If 0 is specified, the default value is used.
+ This setting applies to all ioband devices that have the same
+ ioband device ID as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the I/O throttling value of "ioband1" to 16.
+
+ # dmsetup message ioband1 0 io_throttle 16
+
+
+ --------------------------------------------------------------------------
+
+ Set I/O limiting
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+ DESCRIPTION
+
+ Dm-ioband blocks all I/O requests for IOBAND_DEVICE when the
+ number of BIOs in progress exceeds this value. If 0 is specified,
+ the default value is used. This setting applies to all ioband
+ devices that have the same ioband device ID as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the I/O limiting value of "ioband1" to 128.
+
+ # dmsetup message ioband1 0 io_limit 128
+
+
+ --------------------------------------------------------------------------
+
+ Display settings
+
+ SYNOPSIS
+
+ dmsetup table --target ioband
+
+ DESCRIPTION
+
+ Display the current table of each ioband device. See the
+ "dmsetup create" command for information on the table format.
+
+ EXAMPLE
+
+ The following output shows the current table of "ioband1."
+
+ # dmsetup table --target ioband
+ ioband1: 0 32129937 ioband 8:29 128 10 400 user weight \
+ 2048 :100 1000:80 2000:20
+
+
+ --------------------------------------------------------------------------
+
+ Display Statistics
+
+ SYNOPSIS
+
+ dmsetup status --target ioband
+
+ DESCRIPTION
+
+ Display the statistics of all the ioband devices whose target
+ type is "ioband."
+
+ The output format is as below. The first five columns show:
+
+ * ioband device name
+
+ * logical start sector of the device (must be 0)
+
+ * device size in sectors
+
+ * target type (must be "ioband")
+
+ * device group ID
+
+ The remaining columns show the statistics of each ioband group
+ on the ioband device. Each group uses seven columns for its
+ statistics.
+
+ * ioband group ID (-1 means default)
+
+ * total read requests
+
+ * delayed read requests
+
+ * total read sectors
+
+ * total write requests
+
+ * delayed write requests
+
+ * total write sectors
+
+ EXAMPLE
+
+ The following output shows the statistics of two ioband devices.
+ Ioband2 only has the default ioband group and ioband1 has three
+ (default, 1001, 1002) ioband groups.
+
+ # dmsetup status
+ ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+ ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+ 166 107 472 139 95 352 1002 211 146 520 210 147 504
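+
+ The column layout described above can be split into one line per ioband
+ group with a short awk filter. This is a sketch that assumes the output
+ format is exactly as documented here; the function name
+ parse_ioband_status and the field labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: print one labelled line per ioband group from
# "dmsetup status" output. Columns 1-5 describe the device, and every
# following run of seven columns describes one ioband group.
parse_ioband_status() {
    awk '{
        name = $1
        sub(/:$/, "", name)
        for (i = 6; NF >= i + 6; i += 7)
            printf "%s group=%s reads=%s delayed_reads=%s read_sectors=%s writes=%s delayed_writes=%s write_sectors=%s\n",
                name, $i, $(i+1), $(i+2), $(i+3), $(i+4), $(i+5), $(i+6)
    }'
}

# Sample line taken from the example output above.
echo "ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352" |
    parse_ioband_status
```

+ In practice the input would come from "dmsetup status --target ioband"
+ instead of echo.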
+
+
+ --------------------------------------------------------------------------
+
+ Reset status counter
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 reset
+
+ DESCRIPTION
+
+ Reset the statistics of ioband device IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Reset the statistics of "ioband1."
+
+ # dmsetup message ioband1 0 reset
+
+
+ --------------------------------------------------------------------------
+
+Examples
+
+ Example #1: Bandwidth control on Partitions
+
+ This example describes how to control the bandwidth with disk
+ partitions. The following diagram illustrates the configuration of this
+ example. You may want to run a database on /dev/mapper/ioband1 and web
+ applications on /dev/mapper/ioband2.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | partitions
+ +---------------------------+---------------------------+
+
+
+ To set up the above configuration, follow these steps:
+
+ 1. Create ioband devices with the same device group ID and assign
+ weights of 80 and 40 to the default ioband groups respectively.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+ "none weight 0 :80" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+ "none weight 0 :40" | dmsetup create ioband2
+
+
+ 2. Create filesystems on the ioband devices and mount them.
+
+ # mkfs.ext3 /dev/mapper/ioband1
+ # mount /dev/mapper/ioband1 /mnt1
+
+ # mkfs.ext3 /dev/mapper/ioband2
+ # mount /dev/mapper/ioband2 /mnt2
+
+
+ --------------------------------------------------------------------------
+
+ Example #2: Bandwidth control on Logical Volumes
+
+ This example is similar to the example #1 but it uses LVM logical
+ volumes instead of disk partitions. This example shows how to configure
+ ioband devices on two striped logical volumes.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/lv0 | | /dev/mapper/lv1 | striped logical
+ | | | | volumes
+ +-------------------------------------------------------+
+ | vg0 | volume group
+ +-------------|----------------------------|------------+
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/sdb | | /dev/sdc | physical disks
+ +--------------------------+ +--------------------------+
+
+
+ To set up the above configuration, follow these steps:
+
+ 1. Initialize the disks for use by LVM.
+
+ # pvcreate /dev/sdb
+ # pvcreate /dev/sdc
+
+
+ 2. Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+ # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+ 3. Create two logical volumes in "vg0." The volumes have to be striped.
+
+ # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M
+ # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M
+
+
+ The remaining steps are the same as in example #1.
+
+ 4. Create ioband devices corresponding to each logical volume and
+ assign weights of 80 and 40 to the default ioband groups respectively.
+
+ # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+ "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+ dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+ "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+ dmsetup create ioband2
+
+
+ 5. Create filesystems on the ioband devices and mount them.
+
+ # mkfs.ext3 /dev/mapper/ioband1
+ # mount /dev/mapper/ioband1 /mnt1
+
+ # mkfs.ext3 /dev/mapper/ioband2
+ # mount /dev/mapper/ioband2 /mnt2
+
+
+ --------------------------------------------------------------------------
+
+ Example #3: Bandwidth control on processes
+
+ This example describes how to control the bandwidth with groups of
+ processes. You may also want to run an additional application on the same
+ machine described in the example #1. This example shows how to add a new
+ ioband group for this application.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +-------------+------------+ +-------------+------------+
+ | default | | user=1000 | default | ioband groups
+ | (80) | | (20) | (40) | (weight)
+ +-------------+------------+ +-------------+------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | partitions
+ +---------------------------+---------------------------+
+
+
+ The following shows how to set up a new ioband group on the machine that
+ is already configured as in example #1. The application will have a
+ weight of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+ 1. Set the type of ioband2 to "user."
+
+ # dmsetup message ioband2 0 type user
+
+
+ 2. Create a new ioband group on ioband2.
+
+ # dmsetup message ioband2 0 attach 1000
+
+
+ 3. Assign a weight of 20 to this newly created ioband group.
+
+ # dmsetup message ioband2 0 weight 1000:20
+
+
+ --------------------------------------------------------------------------
+
+ Example #4: Bandwidth control for Xen virtual block devices
+
+ This example describes how to control the bandwidth for Xen virtual
+ block devices. The following diagram illustrates the configuration of this
+ example.
+
+ Virtual Machine 1 Virtual Machine 2 virtual machines
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/xvda1 | | /dev/xvda1 | virtual block
+ +-------------|------------+ +-------------|------------+ devices
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | partitions
+ +---------------------------+---------------------------+
+
+
+ The following shows how to map ioband devices "ioband1" and "ioband2" to
+ virtual block devices "/dev/xvda1 on Virtual Machine 1" and "/dev/xvda1 on
+ Virtual Machine 2" respectively on the machine configured as in example
+ #1. Add the following lines to the configuration files that are referenced
+ when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+ For "Virtual Machine 1"
+ disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+ For "Virtual Machine 2"
+ disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+ --------------------------------------------------------------------------
+
+ Example #4: Bandwidth control for Xen blktap devices
+
+ This example describes how to control the bandwidth for Xen virtual
+ block devices when Xen blktap devices are used. The following diagram
+ illustrates the configuration of this example.
+
+ Virtual Machine 1 Virtual Machine 2 virtual machines
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/xvda1 | | /dev/xvda1 | virtual block
+ +-------------|------------+ +-------------|------------+ devices
+ | |
+ +----------V----------+ +-----------V---------+
+ | tapdisk | | tapdisk | tapdisk daemons
+ | (15011) | | (15276) | (daemon's pid)
+ +----------|----------+ +-----------|---------+
+ | |
+ +-------------|----------------------------|------------+
+ | | /dev/mapper/ioband1 | | ioband device
+ | | mount on /vmdisk | |
+ +-------------V-------------+--------------V------------+
+ | group for PID=15011 | group for PID=15276 | ioband groups
+ | (80) | (40) | (weight)
+ +-------------|----------------------------|------------+
+ | |
+ +-------------|----------------------------|------------+
+ | +----------V----------+ +-----------V---------+ |
+ | | vm1.img | | vm2.img | | disk image files
+ | +---------------------+ +---------------------+ |
+ | /dev/sda1 | partition
+ +-------------------------------------------------------+
+
+
+ To set up the above configuration, follow these steps:
+
+ 1. Create an ioband device.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+ "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+ 2. Add the following lines to the configuration files that are
+ referenced when creating "Virtual Machine 1" and "Virtual Machine 2".
+ Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+ For "Virtual Machine 1"
+ disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+ For "Virtual Machine 2"
+ disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+ 3. Run the virtual machines.
+
+ # xm create vm1
+ # xm create vm2
+
+
+ 4. Find out the process IDs of the daemons which control the blktap
+ devices.
+
+ # lsof /vmdisk/vm[12].img
+ COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
+ tapdisk 15011 root 11u REG 253,0 2147483648 48961 /vmdisk/vm1.img
+ tapdisk 15276 root 13u REG 253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+ 5. Create new ioband groups for PIDs 15011 and 15276, which are the
+ process IDs of the tapdisk daemons, and assign weights of 80 and 40 to
+ the groups respectively.
+
+ # dmsetup message ioband1 0 type pid
+ # dmsetup message ioband1 0 attach 15011
+ # dmsetup message ioband1 0 weight 15011:80
+ # dmsetup message ioband1 0 attach 15276
+ # dmsetup message ioband1 0 weight 15276:40
Index: linux-2.6.31/drivers/md/Kconfig
===================================================================
--- linux-2.6.31.orig/drivers/md/Kconfig
+++ linux-2.6.31/drivers/md/Kconfig
@@ -294,4 +294,17 @@ config DM_UEVENT
---help---
Generate udev events for DM events.
+config DM_IOBAND
+ tristate "I/O bandwidth control (EXPERIMENTAL)"
+ depends on BLK_DEV_DM && EXPERIMENTAL
+ ---help---
+ This device-mapper target lets you define how the
+ available bandwidth of a storage device should be
+ shared between processes, cgroups, partitions or LUNs.
+
+ Information on how to use dm-ioband is available in:
+ <file:Documentation/device-mapper/ioband.txt>.
+
+ If unsure, say N.
+
endif # MD
Index: linux-2.6.31/drivers/md/Makefile
===================================================================
--- linux-2.6.31.orig/drivers/md/Makefile
+++ linux-2.6.31/drivers/md/Makefile
@@ -8,6 +8,8 @@ dm-multipath-y += dm-path-selector.o dm-
dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
dm-snap-persistent.o
dm-mirror-y += dm-raid1.o
+dm-ioband-y += dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-rangebw.o \
+ dm-ioband-type.o
dm-log-userspace-y \
+= dm-log-userspace-base.o dm-log-userspace-transfer.o
md-mod-y += md.o bitmap.o
@@ -37,6 +39,7 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_IOBAND) += dm-ioband.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -0,0 +1,1357 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <taka at valinux.co.jp>
+ * Ryo Tsuruta <ryov at valinux.co.jp>
+ *
+ * I/O bandwidth control
+ *
+ * Some blktrace messages were added by Alan D. Brunelle <Alan.Brunelle at hp.com>
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/dm-ioband.h>
+
+static LIST_HEAD(ioband_device_list);
+/* lock up during configuration */
+static DEFINE_MUTEX(ioband_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, const char *, const char *);
+static int ioband_group_attach(struct ioband_group *, int, int, const char *);
+static int ioband_group_type_select(struct ioband_group *, const char *);
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, const char *name,
+ int argc, char **argv)
+{
+ const struct ioband_policy_type *p;
+ struct ioband_group *gp;
+ unsigned long flags;
+ int r;
+
+ for (p = dm_ioband_policy_type; p->p_name; p++) {
+ if (!strcmp(name, p->p_name))
+ break;
+ }
+ if (!p->p_name)
+ return -EINVAL;
+ /* do nothing if the same policy is already set */
+ if (dp->g_policy == p)
+ return 0;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ suspend_ioband_device(dp, flags, 1);
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ dp->g_group_dtr(gp);
+
+ /* switch to the new policy */
+ dp->g_policy = p;
+ r = p->p_policy_init(dp, argc, argv);
+ if (!r) {
+ if (!dp->g_hold_bio)
+ dp->g_hold_bio = ioband_hold_bio;
+ if (!dp->g_pop_bio)
+ dp->g_pop_bio = ioband_pop_bio;
+
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ dp->g_group_ctr(gp, NULL);
+ }
+ resume_ioband_device(dp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static struct ioband_device *alloc_ioband_device(const char *name,
+ int io_throttle, int io_limit)
+{
+ struct ioband_device *dp, *new_dp;
+
+ new_dp = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+ if (!new_dp)
+ return NULL;
+
+ /*
+ * Prepare its own workqueue as generic_make_request() may
+ * potentially block the workqueue when submitting BIOs.
+ */
+ new_dp->g_ioband_wq = create_workqueue("kioband");
+ if (!new_dp->g_ioband_wq) {
+ kfree(new_dp);
+ return NULL;
+ }
+
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ if (!strcmp(dp->g_name, name)) {
+ dp->g_ref++;
+ destroy_workqueue(new_dp->g_ioband_wq);
+ kfree(new_dp);
+ return dp;
+ }
+ }
+
+ INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
+ INIT_LIST_HEAD(&new_dp->g_groups);
+ INIT_LIST_HEAD(&new_dp->g_list);
+ INIT_LIST_HEAD(&new_dp->g_root_groups);
+ spin_lock_init(&new_dp->g_lock);
+ bio_list_init(&new_dp->g_urgent_bios);
+ new_dp->g_io_throttle = io_throttle;
+ new_dp->g_io_limit = io_limit;
+ new_dp->g_issued[BLK_RW_SYNC] = 0;
+ new_dp->g_issued[BLK_RW_ASYNC] = 0;
+ new_dp->g_blocked = 0;
+ new_dp->g_ref = 1;
+ new_dp->g_flags = 0;
+ strlcpy(new_dp->g_name, name, sizeof(new_dp->g_name));
+ new_dp->g_policy = NULL;
+ new_dp->g_hold_bio = NULL;
+ new_dp->g_pop_bio = NULL;
+ init_waitqueue_head(&new_dp->g_waitq);
+ init_waitqueue_head(&new_dp->g_waitq_suspend);
+ init_waitqueue_head(&new_dp->g_waitq_flush);
+ list_add_tail(&new_dp->g_list, &ioband_device_list);
+ return new_dp;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+ dp->g_ref--;
+ if (dp->g_ref > 0)
+ return;
+ list_del(&dp->g_list);
+ destroy_workqueue(dp->g_ioband_wq);
+ kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+ int wait_completion)
+{
+ struct ioband_group *gp;
+
+ if (wait_completion && nr_issued(dp) > 0)
+ return 0;
+ if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+ return 0;
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ if (waitqueue_active(&gp->c_waitq))
+ return 0;
+ return 1;
+}
+
+static void suspend_ioband_device(struct ioband_device *dp,
+ unsigned long flags, int wait_completion)
+{
+ struct ioband_group *gp;
+
+ /* block incoming bios */
+ set_device_suspended(dp);
+
+ /* wake up all blocked processes and bring down all ioband groups */
+ wake_up_all(&dp->g_waitq);
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (!is_group_down(gp)) {
+ set_group_down(gp);
+ set_group_need_up(gp);
+ }
+ wake_up_all(&gp->c_waitq);
+ }
+
+ /* flush the already mapped bios */
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ flush_workqueue(dp->g_ioband_wq);
+
+ /* wait for all processes to wake up and bios to release */
+ spin_lock_irqsave(&dp->g_lock, flags);
+ wait_event_lock_irq(dp->g_waitq_flush,
+ is_ioband_device_flushed(dp, wait_completion),
+ dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+ struct ioband_group *gp;
+
+ /* bring the ioband groups back up */
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (group_need_up(gp)) {
+ clear_group_need_up(gp);
+ clear_group_down(gp);
+ }
+ }
+
+ /* accept incoming bios */
+ wake_up_all(&dp->g_waitq_suspend);
+ clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(struct ioband_group *head, int id)
+{
+ struct rb_node *node = head->c_group_root.rb_node;
+
+ while (node) {
+ struct ioband_group *p =
+ rb_entry(node, struct ioband_group, c_group_node);
+
+ if (p->c_id == id || id == IOBAND_ID_ANY)
+ return p;
+ node = (id < p->c_id) ? node->rb_left : node->rb_right;
+ }
+ return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root, struct ioband_group *gp)
+{
+ struct rb_node **node = &root->rb_node, *parent = NULL;
+ struct ioband_group *p;
+
+ while (*node) {
+ p = rb_entry(*node, struct ioband_group, c_group_node);
+ parent = *node;
+ node = (gp->c_id < p->c_id) ?
+ &(*node)->rb_left : &(*node)->rb_right;
+ }
+
+ rb_link_node(&gp->c_group_node, parent, node);
+ rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_device *dp,
+ struct ioband_group *head,
+ struct ioband_group *parent,
+ struct ioband_group *gp,
+ int id, const char *param)
+{
+ unsigned long flags;
+ int r;
+
+ INIT_LIST_HEAD(&gp->c_list);
+ INIT_LIST_HEAD(&gp->c_sibling);
+ INIT_LIST_HEAD(&gp->c_children);
+ gp->c_parent = parent;
+ bio_list_init(&gp->c_blocked_bios);
+ bio_list_init(&gp->c_prio_bios);
+ gp->c_id = id; /* should be verified */
+ gp->c_blocked = 0;
+ gp->c_prio_blocked = 0;
+ memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+ init_waitqueue_head(&gp->c_waitq);
+ gp->c_flags = 0;
+ gp->c_group_root = RB_ROOT;
+ gp->c_banddev = dp;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (head && ioband_group_find(head, id)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("%s: id=%d already exists.", __func__, id);
+ return -EEXIST;
+ }
+
+ list_add_tail(&gp->c_list, &dp->g_groups);
+
+ if (!parent)
+ list_add_tail(&gp->c_sibling, &dp->g_root_groups);
+ else
+ list_add_tail(&gp->c_sibling, &parent->c_children);
+
+ r = dp->g_group_ctr(gp, param);
+ if (r) {
+ list_del(&gp->c_list);
+ list_del(&gp->c_sibling);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+ }
+
+ if (head) {
+ ioband_group_add_node(&head->c_group_root, gp);
+ gp->c_dev = head->c_dev;
+ gp->c_target = head->c_target;
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+ struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ list_del(&gp->c_list);
+ list_del(&gp->c_sibling);
+ if (head)
+ rb_erase(&gp->c_group_node, &head->c_group_root);
+ dp->g_group_dtr(gp);
+ kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ while ((p = ioband_group_find(gp, IOBAND_ID_ANY)))
+ ioband_group_release(gp, p);
+ ioband_group_release(NULL, gp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *p;
+ struct rb_node *node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ set_group_down(p);
+ if (suspend)
+ set_group_suspended(p);
+ }
+ set_group_down(head);
+ if (suspend)
+ set_group_suspended(head);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *p;
+ struct rb_node *node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ clear_group_down(p);
+ clear_group_suspended(p);
+ }
+ clear_group_down(head);
+ clear_group_suspended(head);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int parse_group_param(const char *param, long *id, char const **value)
+{
+ char *s, *endp;
+ long n;
+
+ s = strpbrk(param, POLICY_PARAM_DELIM);
+ if (!s) {
+ *id = IOBAND_ID_ANY;
+ *value = param;
+ return 0;
+ }
+
+ n = simple_strtol(param, &endp, 0);
+ if (endp != s)
+ return -EINVAL;
+
+ *id = (endp == param) ? IOBAND_ID_ANY : n;
+ *value = endp + 1;
+ return 0;
+}
+
+/*
+ * Create a new band device:
+ * parameters: <device> <device-group-id> <io_throttle> <io_limit>
+ * <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+ struct ioband_group *gp;
+ struct ioband_device *dp;
+ struct dm_dev *dev;
+ int io_throttle;
+ int io_limit;
+ int i, r, start;
+ long val, id;
+ const char *param;
+ char *s;
+
+ if (argc < POLICY_PARAM_START) {
+ ti->error = "Requires " __stringify(POLICY_PARAM_START)
+ " or more arguments";
+ return -EINVAL;
+ }
+
+ if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+ ti->error = "Ioband device name is too long";
+ return -EINVAL;
+ }
+
+ r = strict_strtol(argv[2], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX) {
+ ti->error = "Invalid io_throttle";
+ return -EINVAL;
+ }
+ io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+ r = strict_strtol(argv[3], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX) {
+ ti->error = "Invalid io_limit";
+ return -EINVAL;
+ }
+ io_limit = val;
+
+ r = dm_get_device(ti, argv[0], 0, ti->len,
+ dm_table_get_mode(ti->table), &dev);
+ if (r) {
+ ti->error = "Device lookup failed";
+ return r;
+ }
+
+ if (io_limit == 0) {
+ struct request_queue *q;
+
+ q = bdev_get_queue(dev->bdev);
+ if (!q) {
+ ti->error = "Can't get queue size";
+ r = -ENXIO;
+ goto release_dm_device;
+ }
+ /*
+ * The block layer accepts I/O requests up to 50% over
+ * nr_requests when the requests are issued from a
+ * "batcher" process.
+ */
+ io_limit = (3 * q->nr_requests / 2);
+ }
+
+ if (io_limit < io_throttle)
+ io_limit = io_throttle;
+
+ mutex_lock(&ioband_lock);
+ dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+ if (!dp) {
+ ti->error = "Cannot create ioband device";
+ r = -EINVAL;
+ mutex_unlock(&ioband_lock);
+ goto release_dm_device;
+ }
+
+ r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+ argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+ if (r) {
+ ti->error = "Invalid policy parameter";
+ goto release_ioband_device;
+ }
+
+ gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+ if (!gp) {
+ ti->error = "Cannot allocate memory for ioband group";
+ r = -ENOMEM;
+ goto release_ioband_device;
+ }
+
+ ti->num_flush_requests = 1;
+ ti->private = gp;
+ gp->c_target = ti;
+ gp->c_dev = dev;
+
+ /* Find a default group parameter */
+ for (start = POLICY_PARAM_START; start < argc; start++) {
+ s = strpbrk(argv[start], POLICY_PARAM_DELIM);
+ if (s == argv[start])
+ break;
+ }
+ param = (start < argc) ? &argv[start][1] : NULL;
+
+ /* Create a default ioband group */
+ r = ioband_group_init(dp, NULL, NULL, gp, IOBAND_ID_ANY, param);
+ if (r) {
+ kfree(gp);
+ ti->error = "Cannot create default ioband group";
+ goto release_ioband_device;
+ }
+
+ r = ioband_group_type_select(gp, argv[4]);
+ if (r) {
+ ti->error = "Cannot set ioband group type";
+ goto release_ioband_group;
+ }
+
+ /* Create sub ioband groups */
+ for (i = start + 1; i < argc; i++) {
+ r = parse_group_param(argv[i], &id, &param);
+ if (r) {
+ ti->error = "Invalid ioband group parameter";
+ goto release_ioband_group;
+ }
+ r = ioband_group_attach(gp, 0, id, param);
+ if (r) {
+ ti->error = "Cannot create ioband group";
+ goto release_ioband_group;
+ }
+ }
+ mutex_unlock(&ioband_lock);
+ return 0;
+
+release_ioband_group:
+ ioband_group_destroy_all(gp);
+release_ioband_device:
+ release_ioband_device(dp);
+ mutex_unlock(&ioband_lock);
+release_dm_device:
+ dm_put_device(ti, dev);
+ return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ struct dm_dev *dev = gp->c_dev;
+
+ mutex_lock(&ioband_lock);
+
+ ioband_group_stop_all(gp, 0);
+ cancel_delayed_work_sync(&dp->g_conductor);
+ ioband_group_destroy_all(gp);
+
+ release_ioband_device(dp);
+ mutex_unlock(&ioband_lock);
+
+ dm_put_device(ti, dev);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+ /* Todo: The list should be split into a sync list and an async list */
+ bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+ return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ /*
+ * ToDo: A new flag should be added to struct bio, which indicates
+ * it contains urgent I/O requests.
+ */
+ if (!PageReclaim(page))
+ return 0;
+ if (PageSwapCache(page))
+ return 2;
+ return 1;
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_group_down(gp))
+ return 0;
+ if (is_device_blocked(dp))
+ return 1;
+ if (dp->g_blocked >= dp->g_io_limit * 2) {
+ set_device_blocked(dp);
+ return 1;
+ }
+ return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_group_down(gp))
+ return 0;
+ if (is_group_blocked(gp))
+ return 1;
+ if (dp->g_should_block(gp)) {
+ set_group_blocked(gp);
+ return 1;
+ }
+ return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+ /*
+ * Kernel threads shouldn't be blocked easily since each of
+ * them may handle BIOs for several groups on several
+ * partitions.
+ */
+ wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+ dp->g_lock, do_nothing());
+ } else {
+ wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+ dp->g_lock, do_nothing());
+ }
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+ return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline bool bio_is_sync(struct bio *bio)
+{
+ /* Must be the same condition as rw_is_sync() in blkdev.h */
+ return !bio_data_dir(bio) || bio_sync(bio);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_issued[bio_is_sync(bio)]++;
+ return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+ return dp->g_issued[BLK_RW_SYNC] < dp->g_io_limit
+ || dp->g_issued[BLK_RW_ASYNC] < dp->g_io_limit;
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_blocked++;
+ if (is_urgent_bio(bio)) {
+ dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+ bio_list_add(&dp->g_urgent_bios, bio);
+ trace_ioband_hold_urgent_bio(gp, bio);
+ } else {
+ gp->c_blocked++;
+ dp->g_hold_bio(gp, bio);
+ trace_ioband_hold_bio(gp, bio);
+ }
+}
+
+static inline int room_for_bio_sync(struct ioband_device *dp, int sync)
+{
+ return dp->g_issued[sync] < dp->g_io_limit;
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int sync)
+{
+ if (bio_list_empty(&gp->c_prio_bios))
+ set_prio_queue(gp, sync);
+ bio_list_add(&gp->c_prio_bios, bio);
+ gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+ struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+ if (bio_list_empty(&gp->c_prio_bios))
+ clear_prio_queue(gp);
+
+ if (bio)
+ gp->c_prio_blocked--;
+ return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_blocked--;
+ gp->c_blocked--;
+ if (!gp->c_blocked && is_group_blocked(gp)) {
+ clear_group_blocked(gp);
+ wake_up_all(&gp->c_waitq);
+ }
+ if (should_pushback_bio(gp)) {
+ bio_list_add(pushback_list, bio);
+ trace_ioband_make_pback_list(gp, bio);
+ } else {
+ int rw = bio_data_dir(bio);
+
+ gp->c_stats.sectors[rw] += bio_sectors(bio);
+ gp->c_stats.ios[rw]++;
+ bio_list_add(issue_list, bio);
+ trace_ioband_make_issue_list(gp, bio);
+ }
+ return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct bio *bio;
+
+ if (bio_list_empty(&dp->g_urgent_bios))
+ return;
+ while (room_for_bio_sync(dp, BLK_RW_ASYNC)) {
+ bio = bio_list_pop(&dp->g_urgent_bios);
+ if (!bio)
+ return;
+ dp->g_blocked--;
+ dp->g_issued[bio_is_sync(bio)]++;
+ bio_list_add(issue_list, bio);
+ trace_ioband_release_urgent_bios(dp, bio);
+ }
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct bio *bio;
+ int sync;
+ int ret;
+
+ if (bio_list_empty(&gp->c_prio_bios))
+ return R_OK;
+ sync = prio_queue_sync(gp);
+ while (gp->c_prio_blocked) {
+ if (!dp->g_can_submit(gp))
+ return R_BLOCK;
+ if (!room_for_bio_sync(dp, sync))
+ return R_OK;
+ bio = pop_prio_bio(gp);
+ if (!bio)
+ return R_OK;
+ ret = make_issue_list(gp, bio, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ }
+ return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct bio *bio;
+ int sync, ret;
+
+ while (gp->c_blocked - gp->c_prio_blocked) {
+ if (!dp->g_can_submit(gp))
+ return R_BLOCK;
+ if (!room_for_bio(dp))
+ return R_OK;
+ bio = dp->g_pop_bio(gp);
+ if (!bio)
+ return R_OK;
+
+ sync = bio_is_sync(bio);
+ if (!room_for_bio_sync(dp, sync)) {
+ push_prio_bio(gp, bio, sync);
+ continue;
+ }
+ ret = make_issue_list(gp, bio, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ }
+ return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ int ret = release_prio_bios(gp, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+ struct bio *bio)
+{
+ struct ioband_group *gp;
+
+ if (!head->c_type->t_getid)
+ return head;
+
+ gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+ if (!gp)
+ gp = head;
+ return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+ union map_info *map_context)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long flags;
+ int rw;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+
+ /*
+ * The device is suspended while some of the ioband device
+ * configurations are being changed.
+ */
+ if (is_device_suspended(dp))
+ wait_event_lock_irq(dp->g_waitq_suspend,
+ !is_device_suspended(dp), dp->g_lock,
+ do_nothing());
+
+ gp = ioband_group_get(gp, bio);
+ prevent_burst_bios(gp, bio);
+ if (should_pushback_bio(gp)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return DM_MAPIO_REQUEUE;
+ }
+
+ bio->bi_bdev = gp->c_dev->bdev;
+ if (bio_sectors(bio))
+ bio->bi_sector -= ti->begin;
+
+ if (!gp->c_blocked && room_for_bio_sync(dp, bio_is_sync(bio))) {
+ if (dp->g_can_submit(gp)) {
+ prepare_to_issue(gp, bio);
+ rw = bio_data_dir(bio);
+ gp->c_stats.sectors[rw] += bio_sectors(bio);
+ gp->c_stats.ios[rw]++;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return DM_MAPIO_REMAPPED;
+ } else if (!dp->g_blocked && nr_issued(dp) == 0) {
+ DMDEBUG("%s: token expired gp:%p", __func__, gp);
+ queue_delayed_work(dp->g_ioband_wq,
+ &dp->g_conductor, 1);
+ }
+ }
+ hold_bio(gp, bio);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+ struct ioband_group *gp;
+ struct ioband_group *best = NULL;
+ int highest = 0;
+ int pri;
+
+ /* Todo: The algorithm should be optimized.
+ * It would be better to use rbtree.
+ */
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (!gp->c_blocked || !room_for_bio(dp))
+ continue;
+ if (gp->c_blocked == gp->c_prio_blocked &&
+ !room_for_bio_sync(dp, prio_queue_sync(gp))) {
+ continue;
+ }
+ pri = dp->g_can_submit(gp);
+ if (pri > highest) {
+ highest = pri;
+ best = gp;
+ }
+ }
+
+ return best;
+}
+
+/*
+ * This function is called right after it becomes possible to resubmit BIOs.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+ struct ioband_device *dp =
+ container_of(work, struct ioband_device, g_conductor.work);
+ struct ioband_group *gp = NULL;
+ struct bio *bio;
+ unsigned long flags;
+ struct bio_list issue_list, pushback_list;
+
+ bio_list_init(&issue_list);
+ bio_list_init(&pushback_list);
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ release_urgent_bios(dp, &issue_list, &pushback_list);
+ if (dp->g_blocked) {
+ gp = choose_best_group(dp);
+ if (gp &&
+ release_bios(gp, &issue_list, &pushback_list) == R_YIELD)
+ queue_delayed_work(dp->g_ioband_wq,
+ &dp->g_conductor, 0);
+ }
+
+ if (is_device_blocked(dp) && dp->g_blocked < dp->g_io_limit * 2) {
+ clear_device_blocked(dp);
+ wake_up_all(&dp->g_waitq);
+ }
+
+ if (dp->g_blocked &&
+ room_for_bio_sync(dp, BLK_RW_SYNC) &&
+ room_for_bio_sync(dp, BLK_RW_ASYNC) &&
+ bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+ dp->g_restart_bios(dp)) {
+ DMDEBUG("%s: token expired dp:%p issued(%d,%d) g_blocked(%d)",
+ __func__, dp,
+ dp->g_issued[BLK_RW_SYNC], dp->g_issued[BLK_RW_ASYNC],
+ dp->g_blocked);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ while ((bio = bio_list_pop(&issue_list))) {
+ trace_ioband_make_request(dp, bio);
+ generic_make_request(bio);
+ }
+
+ while ((bio = bio_list_pop(&pushback_list))) {
+ trace_ioband_pushback_bio(dp, bio);
+ bio_endio(bio, -EIO);
+ }
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+ int error, union map_info *map_context)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long flags;
+ int r = error;
+
+ /*
+ * XXX: A new error code for device mapper devices should be used
+ * rather than EIO.
+ */
+ if (error == -EIO && should_pushback_bio(gp)) {
+ /* This ioband device is suspending */
+ r = DM_ENDIO_REQUEUE;
+ }
+ /*
+ * Todo: The algorithm should be optimized to eliminate the spinlock.
+ */
+ spin_lock_irqsave(&dp->g_lock, flags);
+ dp->g_issued[bio_is_sync(bio)]--;
+
+ /*
+ * Todo: It would be better to introduce high/low water marks here
+ * not to kick the workqueues so often.
+ */
+ if (dp->g_blocked)
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ else if (is_device_suspended(dp) && nr_issued(dp) == 0)
+ wake_up_all(&dp->g_waitq_flush);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+
+ ioband_group_stop_all(gp, 1);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+
+ ioband_group_resume_all(gp);
+}
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+ char *result, unsigned maxlen)
+{
+ int sz = *szp; /* used in DMEMIT() */
+ struct disk_stats *st = &gp->c_stats;
+
+ DMEMIT(" %d %lu %lu %lu %lu %lu %lu %lu %lu %d %lu %lu",
+ gp->c_id,
+ st->ios[0], st->merges[0], st->sectors[0], st->ticks[0],
+ st->ios[1], st->merges[1], st->sectors[1], st->ticks[1],
+ gp->c_blocked, st->io_ticks, st->time_in_queue);
+ *szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+ char *result, unsigned maxlen)
+{
+ struct ioband_group *gp = ti->private, *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = 0; /* used in DMEMIT() */
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%s", dp->g_name);
+ ioband_group_status(gp, &sz, result, maxlen);
+ for (node = rb_first(&gp->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ ioband_group_status(p, &sz, result, maxlen);
+ }
+ break;
+
+ case STATUSTYPE_TABLE:
+ DMEMIT("%s %s %d %d %s %s",
+ gp->c_dev->name, dp->g_name,
+ dp->g_io_throttle, dp->g_io_limit,
+ gp->c_type->t_name, dp->g_policy->p_name);
+ dp->g_show(gp, &sz, result, maxlen);
+ break;
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, const char *name)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ const struct ioband_group_type *t;
+ unsigned long flags;
+
+ for (t = dm_ioband_group_type; (t->t_name); t++) {
+ if (!strcmp(name, t->t_name))
+ break;
+ }
+ if (!t->t_name) {
+ DMWARN("%s: %s isn't supported.", __func__, name);
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EBUSY;
+ }
+ gp->c_type = t;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp,
+ const char *cmd, const char *value)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ const char *val_str;
+ long id;
+ unsigned long flags;
+ int r;
+
+ r = parse_group_param(value, &id, &val_str);
+ if (r)
+ return r;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (id != IOBAND_ID_ANY) {
+ gp = ioband_group_find(gp, id);
+ if (!gp) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("%s: id=%ld not found.", __func__, id);
+ return -EINVAL;
+ }
+ }
+ r = dp->g_set_param(gp, cmd, val_str);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static int ioband_group_attach(struct ioband_group *head, int parent_id,
+ int id, const char *param)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *parent, *gp;
+ int r;
+
+ if (id < 0) {
+ DMWARN("%s: invalid id:%d", __func__, id);
+ return -EINVAL;
+ }
+ if (!head->c_type->t_getid) {
+ DMWARN("%s: no ioband group type is specified", __func__);
+ return -EINVAL;
+ }
+
+ /* Determines a parent ioband group */
+ switch (parent_id) {
+ case 0:
+ /* Non-hierarchical configuration */
+ parent = NULL;
+ break;
+ case 1:
+ /* The root of a tree, the parent is a default ioband group */
+ parent = head;
+ break;
+ default:
+ /* The node in a tree. */
+ parent = ioband_group_find(head, parent_id);
+ if (!parent) {
+ DMWARN("%s: parent group is not configured", __func__);
+ return -EINVAL;
+ }
+ break;
+ }
+
+ gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+ if (!gp)
+ return -ENOMEM;
+
+ r = ioband_group_init(dp, head, parent, gp, id, param);
+ if (r < 0) {
+ kfree(gp);
+ return r;
+ }
+ return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *head, int id)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *gp;
+ unsigned long flags;
+ int r = 0;
+
+ if (id < 0) {
+ DMWARN("%s: invalid id:%d", __func__, id);
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&dp->g_lock, flags);
+ gp = ioband_group_find(head, id);
+ if (!gp) {
+ DMWARN("%s: invalid id:%d", __func__, id);
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (!list_empty(&gp->c_children)) {
+ DMWARN("%s: group has children", __func__);
+ r = -EBUSY;
+ goto out;
+ }
+
+ /*
+ * Todo: Calling suspend_ioband_device() before releasing the
+ * ioband group has a large overhead. Need improvement.
+ */
+ suspend_ioband_device(dp, flags, 0);
+ ioband_group_release(head, gp);
+ resume_ioband_device(dp);
+out:
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+/*
+ * Message parameters:
+ * "policy" <name>
+ * ex)
+ * "policy" "weight"
+ * "type" "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ * "io_throttle" <value>
+ * "io_limit" <value>
+ * "attach" <group id>
+ * "detach" <group id>
+ * "any-command" <group id>:<value>
+ * ex)
+ * "weight" 0:<value>
+ * "token" 24:<value>
+ */
+static int __ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+ struct ioband_group *gp = ti->private, *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ long val;
+ int r = 0;
+ unsigned long flags;
+
+ if (argc == 1 && !strcmp(argv[0], "reset")) {
+ spin_lock_irqsave(&dp->g_lock, flags);
+ memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+ for (node = rb_first(&gp->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ memset(&p->c_stats, 0, sizeof(p->c_stats));
+ }
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+ }
+
+ if (argc != 2) {
+ DMWARN("Unrecognised band message received.");
+ return -EINVAL;
+ }
+ if (!strcmp(argv[0], "io_throttle")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX)
+ return -EINVAL;
+ if (val == 0)
+ val = DEFAULT_IO_THROTTLE;
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (val > dp->g_io_limit) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EINVAL;
+ }
+ dp->g_io_throttle = val;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ ioband_set_param(gp, argv[0], argv[1]);
+ return 0;
+ } else if (!strcmp(argv[0], "io_limit")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX)
+ return -EINVAL;
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (val == 0) {
+ struct request_queue *q;
+
+ q = bdev_get_queue(gp->c_dev->bdev);
+ if (!q) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -ENXIO;
+ }
+ /*
+ * The block layer accepts I/O requests up to
+ * 50% over nr_requests when the requests are
+ * issued from a "batcher" process.
+ */
+ val = (3 * q->nr_requests / 2);
+ }
+ if (val < dp->g_io_throttle) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EINVAL;
+ }
+ dp->g_io_limit = val;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ ioband_set_param(gp, argv[0], argv[1]);
+ return 0;
+ } else if (!strcmp(argv[0], "type")) {
+ return ioband_group_type_select(gp, argv[1]);
+ } else if (!strcmp(argv[0], "attach")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r)
+ return r;
+ return ioband_group_attach(gp, 0, val, NULL);
+ } else if (!strcmp(argv[0], "detach")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r)
+ return r;
+ return ioband_group_detach(gp, val);
+ } else if (!strcmp(argv[0], "policy")) {
+ r = policy_init(dp, argv[1], 0, &argv[2]);
+ return r;
+ } else {
+ /* message anycommand <group-id>:<value> */
+ r = ioband_set_param(gp, argv[0], argv[1]);
+ if (r < 0)
+ DMWARN("Unrecognised band message received.");
+ return r;
+ }
+ return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+ int r;
+
+ mutex_lock(&ioband_lock);
+ r = __ioband_message(ti, argc, argv);
+ mutex_unlock(&ioband_lock);
+ return r;
+}
+
+static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+ struct bio_vec *biovec, int max_size)
+{
+ struct ioband_group *gp = ti->private;
+ struct request_queue *q = bdev_get_queue(gp->c_dev->bdev);
+
+ if (!q->merge_bvec_fn)
+ return max_size;
+
+ bvm->bi_bdev = gp->c_dev->bdev;
+ bvm->bi_sector -= ti->begin;
+
+ return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static int ioband_iterate_devices(struct dm_target *ti,
+ iterate_devices_callout_fn fn, void *data)
+{
+ struct ioband_group *gp = ti->private;
+
+ return fn(ti, gp->c_dev, 0, ti->len, data);
+}
+
+static struct target_type ioband_target = {
+ .name = "ioband",
+ .module = THIS_MODULE,
+ .version = {1, 13, 0},
+ .ctr = ioband_ctr,
+ .dtr = ioband_dtr,
+ .map = ioband_map,
+ .end_io = ioband_end_io,
+ .presuspend = ioband_presuspend,
+ .resume = ioband_resume,
+ .status = ioband_status,
+ .message = ioband_message,
+ .merge = ioband_merge,
+ .iterate_devices = ioband_iterate_devices,
+};
+
+static int __init dm_ioband_init(void)
+{
+ int r;
+
+ r = dm_register_target(&ioband_target);
+ if (r < 0)
+ DMERR("register failed %d", r);
+ return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+ dm_unregister_target(&ioband_target);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi, Ryo Tsuruta, Dong-Jae Kang");
+MODULE_LICENSE("GPL");
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -0,0 +1,543 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * A new BIO scheduling policy can be added by implementing them.
+ */
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT 100
+#define DEFAULT_TOKENPOOL 2048
+#define DEFAULT_BUCKET 2
+#define IOBAND_IOPRIO_BASE 100
+#define TOKEN_BATCH_UNIT 20
+#define PROCEED_THRESHOLD 8
+#define LOCAL_ACTIVE_RATIO 8
+#define GLOBAL_ACTIVE_RATIO 16
+#define OVERCOMMIT_RATE 4
+#define WEIGHT_MAX 100
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int token = gp->c_token;
+ int allowance = dp->g_epoch - gp->c_my_epoch;
+
+ if (allowance) {
+ if (allowance > dp->g_carryover)
+ allowance = dp->g_carryover;
+ token += gp->c_token_initial * allowance;
+ }
+ if (is_group_down(gp))
+ token += gp->c_token_initial * dp->g_carryover * 2;
+
+ return token;
+}
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+ return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active groups on the same ioband
+ * device have used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+ struct ioband_group *gp = dp->g_dominant;
+
+ /*
+ * Don't make a new epoch if the dominant group still has a lot of
+ * tokens, except when the I/O load is low.
+ */
+ if (gp) {
+ int iopri = iopriority(gp);
+ if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+ nr_issued(dp) >= dp->g_io_throttle)
+ return 0;
+ }
+
+ dp->g_epoch++;
+ DMDEBUG("make_epoch %d", dp->g_epoch);
+
+ /* The leftover tokens will be used in the next epoch. */
+ dp->g_token_extra = dp->g_token_left;
+ if (dp->g_token_extra < 0)
+ dp->g_token_extra = 0;
+ dp->g_token_left = dp->g_token_bucket;
+
+ dp->g_expired = NULL;
+ dp->g_dominant = NULL;
+
+ return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It will check whether it's possible to make a new epoch of this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int allowance = dp->g_epoch - gp->c_my_epoch;
+
+ if (!allowance)
+ return 0;
+ if (allowance > dp->g_carryover)
+ allowance = dp->g_carryover;
+ gp->c_my_epoch = dp->g_epoch;
+ return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int allowance;
+ int delta;
+ int extra;
+
+ if (gp->c_token > 0)
+ return iopriority(gp);
+
+ if (is_group_down(gp)) {
+ gp->c_token = gp->c_token_initial;
+ return iopriority(gp);
+ }
+ allowance = make_epoch(gp);
+ if (!allowance)
+ return 0;
+ /*
+ * If this group has the right to get tokens for several epochs,
+ * give all of them to the group here.
+ */
+ delta = gp->c_token_initial * allowance;
+ dp->g_token_left -= delta;
+ /*
+ * Give some extra tokens to this group when there are unused
+ * tokens left on this ioband device from the previous epoch.
+ */
+ extra = dp->g_token_extra * gp->c_token_initial /
+ (dp->g_token_bucket - dp->g_token_extra / 2);
+ delta += extra;
+ gp->c_token += delta;
+ gp->c_consumed = 0;
+
+ if (gp == dp->g_current)
+ dp->g_yield_mark += delta;
+ DMDEBUG("refill token: gp:%p token:%d->%d extra(%d) allowance(%d)",
+ gp, gp->c_token - delta, gp->c_token, extra, allowance);
+ if (gp->c_token > 0)
+ return iopriority(gp);
+ DMDEBUG("refill token: yet empty gp:%p token:%d", gp, gp->c_token);
+ return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become negative, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+ gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+ ; /* Do nothing unless this group is really active. */
+ } else if (!dp->g_dominant ||
+ get_token(gp) > get_token(dp->g_dominant)) {
+ /*
+ * Regard this group as the dominant group on this
+ * ioband device when it has more tokens than the
+ * previous dominant group.
+ */
+ dp->g_dominant = gp;
+ }
+ if (dp->g_epoch == gp->c_my_epoch &&
+ gp->c_token > 0 && gp->c_token - count <= 0) {
+ /* Remember the last group which used up its own tokens. */
+ dp->g_expired = gp;
+ if (dp->g_dominant == gp)
+ dp->g_dominant = NULL;
+ }
+
+ if (gp != dp->g_current) {
+ /* Make this group the current one. */
+ dp->g_current = gp;
+ dp->g_yield_mark =
+ gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+ }
+ gp->c_token -= count;
+ gp->c_consumed += count;
+ if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+ /*
+ * R_YIELD means that this policy requests dm-ioband
+ * to give a chance to another group to be selected since
+ * this group has already issued enough I/Os.
+ */
+ dp->g_current = NULL;
+ return R_YIELD;
+ }
+ /*
+ * R_OK means that this policy allows dm-ioband to select
+ * this group to issue I/Os without a break.
+ */
+ return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+ return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+ return gp->c_blocked >= gp->c_limit;
+}
+
+static void __set_weight(struct ioband_group *gp, int weight_total,
+ int token_bucket, int limit_bucket)
+{
+ int token, limit;
+
+ if (weight_total > 0) {
+ token = token_bucket * gp->c_weight / weight_total;
+ if (token < 1)
+ token = 1;
+ limit = limit_bucket * gp->c_weight / weight_total;
+ if (limit < 1)
+ limit = 1;
+
+ /*
+ * In the hierarchical configuration,
+ * child's tokens are distributed from the parent.
+ */
+ if (gp->c_parent) {
+ gp->c_parent->c_token_initial -= token;
+ if (gp->c_parent->c_token_initial < 1)
+ gp->c_parent->c_token_initial = 1;
+
+ gp->c_parent->c_limit -= limit / OVERCOMMIT_RATE;
+ if (gp->c_parent->c_limit < 1)
+ gp->c_parent->c_limit = 1;
+ }
+ } else
+ token = limit = 1;
+
+ gp->c_token = gp->c_token_initial = gp->c_token_bucket = token;
+ gp->c_limit_bucket = limit;
+ gp->c_limit = limit / OVERCOMMIT_RATE;
+ if (gp->c_limit < 1)
+ gp->c_limit = 1;
+}
+
+static int set_weight(struct ioband_group *group, int new)
+{
+ struct ioband_device *dp = group->c_banddev;
+ struct ioband_group *parent = group->c_parent, *gp;
+ struct list_head *siblings;
+ int weight_total = 0, token_bucket, limit;
+
+ group->c_weight = new;
+
+ if (!parent) {
+ siblings = &dp->g_root_groups;
+ token_bucket = dp->g_token_bucket;
+ limit = dp->g_io_limit * 2;
+ } else {
+ siblings = &parent->c_children;
+ token_bucket = parent->c_token_bucket;
+ limit = parent->c_limit_bucket;
+ }
+
+ list_for_each_entry(gp, siblings, c_sibling)
+ weight_total += gp->c_weight;
+
+ if (parent) {
+ /*
+ * In the hierarchical configuration, each child's
+ * weight is evaluated as a percentage of its parent's
+ * bandwidth.
+ */
+ if (weight_total > WEIGHT_MAX)
+ return -EINVAL;
+ weight_total = WEIGHT_MAX;
+ }
+
+ list_for_each_entry(parent, siblings, c_sibling) {
+ struct ioband_group *this_parent = parent;
+ struct list_head *next;
+
+ __set_weight(parent, weight_total, token_bucket, limit);
+
+ repeat:
+ next = this_parent->c_children.next;
+ resume:
+ while (next != &this_parent->c_children) {
+ /* Descend the hierarchy */
+ struct list_head *tmp = next;
+
+ gp = list_entry(tmp, struct ioband_group, c_sibling);
+ next = tmp->next;
+
+ __set_weight(gp, WEIGHT_MAX,
+ this_parent->c_token_bucket,
+ this_parent->c_limit_bucket);
+
+ if (!list_empty(&gp->c_children)) {
+ this_parent = gp;
+ goto repeat;
+ }
+ }
+
+ if (this_parent != parent) {
+ /* Ascend and resume the search */
+ next = this_parent->c_sibling.next;
+ this_parent = this_parent->c_parent;
+ goto resume;
+ }
+ }
+
+ return 0;
+}
+
+static void init_token_bucket(struct ioband_device *dp,
+ int token_bucket, int carryover)
+{
+ if (!token_bucket)
+ dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+ dp->g_token_unit;
+ else
+ dp->g_token_bucket = token_bucket;
+ if (!carryover)
+ dp->g_carryover = (DEFAULT_TOKENPOOL << dp->g_token_unit) /
+ dp->g_token_bucket;
+ else
+ dp->g_carryover = carryover;
+ if (dp->g_carryover < 1)
+ dp->g_carryover = 1;
+ dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp,
+ const char *cmd, const char *value)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ long val = 0;
+ int r = 0, err = 0;
+
+ if (value)
+ err = strict_strtol(value, 0, &val);
+
+ if (!strcmp(cmd, "weight")) {
+ if (!value)
+ r = set_weight(gp, DEFAULT_WEIGHT);
+ else if (!err && 0 < val && val <= SHORT_MAX)
+ r = set_weight(gp, val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "token")) {
+ if (!err && 0 <= val && val <= INT_MAX) {
+ init_token_bucket(dp, val, 0);
+ set_weight(gp, gp->c_weight);
+ dp->g_token_extra = 0;
+ } else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "carryover")) {
+ if (!err && 0 <= val && val <= INT_MAX) {
+ init_token_bucket(dp, dp->g_token_bucket, val);
+ set_weight(gp, gp->c_weight);
+ dp->g_token_extra = 0;
+ } else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "io_limit")) {
+ init_token_bucket(dp, 0, 0);
+ set_weight(gp, gp->c_weight);
+ } else {
+ r = -EINVAL;
+ }
+ return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, const char *arg)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ gp->c_my_epoch = dp->g_epoch;
+ gp->c_weight = 0;
+ gp->c_consumed = 0;
+ return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ set_weight(gp, 0);
+ dp->g_dominant = NULL;
+ dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+ char *result, unsigned maxlen)
+{
+ struct ioband_group *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d :%d", dp->g_token_bucket, gp->c_weight);
+
+ for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ DMEMIT(" %d:%d", p->c_id, p->c_weight);
+ }
+ *szp = sz;
+}
+
+/*
+ * <Method> <description>
+ * g_can_submit : To determine whether a given group has the right to
+ * submit BIOs. The larger the return value the higher the
+ * priority to submit. Zero means it has no right.
+ * g_prepare_bio : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ * of them can be submitted now. This method has to
+ * reinitialize the data to restart to submit BIOs and return
+ * 0 or 1.
+ * The return value 0 means that it has become able to submit
+ * them now so that this ioband device will continue its work.
+ * The return value 1 means that it is still unable to submit
+ * them so that this device will stop its work. And this
+ * policy module has to reactivate the device when it gets
+ * to be able to submit BIOs.
+ * g_hold_bio : To hold a given BIO until it is submitted.
+ * The default function is used when this method is undefined.
+ * g_pop_bio : To select and get the best BIO to submit.
+ * g_group_ctr : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr : Called when struct ioband_group is removed.
+ * g_set_param : To update the policy's own data.
+ * The parameters can be passed through the "dmsetup message"
+ * command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ * Return 1 if a given group can't receive any more BIOs,
+ * otherwise return 0.
+ * g_show : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0 || val > INT_MAX)
+ return -EINVAL;
+ }
+
+ dp->g_can_submit = is_token_left;
+ dp->g_prepare_bio = prepare_token;
+ dp->g_restart_bios = make_global_epoch;
+ dp->g_group_ctr = policy_weight_ctr;
+ dp->g_group_dtr = policy_weight_dtr;
+ dp->g_set_param = policy_weight_param;
+ dp->g_should_block = is_queue_full;
+ dp->g_show = policy_weight_show;
+
+ dp->g_epoch = 0;
+ dp->g_weight_total = 0;
+ dp->g_current = NULL;
+ dp->g_dominant = NULL;
+ dp->g_expired = NULL;
+ dp->g_token_extra = 0;
+ dp->g_token_unit = 0;
+ init_token_bucket(dp, val, 0);
+ dp->g_token_left = dp->g_token_bucket;
+
+ return 0;
+}
+
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int iosize_prepare_token(struct ioband_group *gp,
+ struct bio *bio, int flag)
+{
+ /* Consume tokens depending on the size of a given bio. */
+ return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int policy_weight_iosize_init(struct ioband_device *dp,
+ int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0 || val > INT_MAX)
+ return -EINVAL;
+ }
+
+ r = policy_weight_init(dp, argc, argv);
+ if (r < 0)
+ return r;
+
+ dp->g_prepare_bio = iosize_prepare_token;
+ dp->g_token_unit = PAGE_SHIFT - 9;
+ init_token_bucket(dp, val, 0);
+ dp->g_token_left = dp->g_token_bucket;
+ return 0;
+}
+
+/* weight balancing policy based on I/O size. --- End --- */
+
+static int policy_default_init(struct ioband_device *dp, int argc, char **argv)
+{
+ return policy_weight_init(dp, argc, argv);
+}
+
+const struct ioband_policy_type dm_ioband_policy_type[] = {
+ { "default", policy_default_init },
+ { "weight", policy_weight_init },
+ { "weight-iosize", policy_weight_iosize_init },
+ { "range-bw", policy_range_bw_init },
+ { NULL, policy_default_init }
+};
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+ /*
+ * This function will work for KVM and Xen.
+ */
+ return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+ return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+ return (int)current_uid();
+}
+
+static int ioband_gid(struct bio *bio)
+{
+ return (int)current_gid();
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+ return 0; /* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+ return 0; /* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+ /*
+ * This function should return the ID of the cgroup which
+ * issued "bio". The ID of the cgroup which the current
+ * process belongs to won't be suitable ID for this purpose,
+ * since some BIOs will be handled by kernel threads like aio
+ * or pdflush on behalf of the process requesting the BIOs.
+ */
+ return 0; /* not implemented yet */
+}
+
+const struct ioband_group_type dm_ioband_group_type[] = {
+ { "none", NULL },
+ { "pgrp", ioband_process_group },
+ { "pid", ioband_process_id },
+ { "node", ioband_node },
+ { "cpuset", ioband_cpuset },
+ { "cgroup", ioband_cgroup },
+ { "user", ioband_uid },
+ { "uid", ioband_uid },
+ { "gid", ioband_gid },
+ { NULL, NULL}
+};
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -0,0 +1,231 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_IOBAND_H
+#define DM_IOBAND_H
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "ioband"
+
+#define DEFAULT_IO_THROTTLE 4
+#define IOBAND_NAME_MAX 31
+#define IOBAND_ID_ANY (-1)
+#define POLICY_PARAM_START 6
+#define POLICY_PARAM_DELIM "=:,"
+
+#define MAX_BW_OVER 1
+#define MAX_BW_UNDER 0
+#define NO_IO_MODE 4
+
+#define TIME_COMPENSATOR 10
+
+struct ioband_group;
+
+struct ioband_device {
+ struct list_head g_groups;
+ struct delayed_work g_conductor;
+ struct workqueue_struct *g_ioband_wq;
+ struct bio_list g_urgent_bios;
+ int g_io_throttle;
+ int g_io_limit;
+ int g_issued[2];
+ int g_blocked;
+ spinlock_t g_lock;
+ wait_queue_head_t g_waitq;
+ wait_queue_head_t g_waitq_suspend;
+ wait_queue_head_t g_waitq_flush;
+
+ int g_ref;
+ struct list_head g_list;
+ struct list_head g_root_groups;
+ int g_flags;
+ char g_name[IOBAND_NAME_MAX + 1];
+ const struct ioband_policy_type *g_policy;
+
+ /* policy dependent */
+ int (*g_can_submit) (struct ioband_group *);
+ int (*g_prepare_bio) (struct ioband_group *, struct bio *, int);
+ int (*g_restart_bios) (struct ioband_device *);
+ void (*g_hold_bio) (struct ioband_group *, struct bio *);
+ struct bio *(*g_pop_bio) (struct ioband_group *);
+ int (*g_group_ctr) (struct ioband_group *, const char *);
+ void (*g_group_dtr) (struct ioband_group *);
+ int (*g_set_param) (struct ioband_group *, const char *, const char *);
+ int (*g_should_block) (struct ioband_group *);
+ void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+
+ /* members for weight balancing policy */
+ int g_epoch;
+ int g_weight_total;
+ /* the number of tokens which can be used in every epoch */
+ int g_token_bucket;
+ /* how many epochs tokens can be carried over */
+ int g_carryover;
+ /* how many tokens should be used for one page-sized I/O */
+ int g_token_unit;
+ /* the last group which used a token */
+ struct ioband_group *g_current;
+ /* give another group a chance to be scheduled when the rest
+ of tokens of the current group reaches this mark */
+ int g_yield_mark;
+ /* the latest group which used up its tokens */
+ struct ioband_group *g_expired;
+ /* the group which has the largest number of tokens in the
+ active groups */
+ struct ioband_group *g_dominant;
+ /* the number of unused tokens in this epoch */
+ int g_token_left;
+ /* left-over tokens from the previous epoch */
+ int g_token_extra;
+
+ /* members for range-bw policy */
+ int g_min_bw_total;
+ int g_max_bw_total;
+ unsigned long g_next_time_period;
+ int g_time_period_expired;
+ struct ioband_group *g_running_gp;
+ int g_total_min_bw_token;
+ int g_consumed_min_bw_token;
+ int g_io_mode;
+
+};
+
+struct ioband_group {
+ struct list_head c_list;
+ struct list_head c_sibling;
+ struct list_head c_children;
+ struct ioband_group *c_parent;
+ struct ioband_device *c_banddev;
+ struct dm_dev *c_dev;
+ struct dm_target *c_target;
+ struct bio_list c_blocked_bios;
+ struct bio_list c_prio_bios;
+ struct rb_root c_group_root;
+ struct rb_node c_group_node;
+ int c_id; /* should be unsigned long or unsigned long long */
+ char c_name[IOBAND_NAME_MAX + 1]; /* rfu */
+ int c_blocked;
+ int c_prio_blocked;
+ wait_queue_head_t c_waitq;
+ int c_flags;
+ struct disk_stats c_stats; /* hold rd/wr status */
+ const struct ioband_group_type *c_type;
+
+ /* members for weight balancing policy */
+ int c_weight;
+ int c_my_epoch;
+ int c_token;
+ int c_token_initial;
+ int c_token_bucket;
+ int c_limit;
+ int c_limit_bucket;
+ int c_consumed;
+
+ /* rfu */
+ /* struct bio_list c_ordered_tag_bios; */
+
+ /* members for range-bw policy */
+ wait_queue_head_t c_max_bw_over_waitq;
+ struct timer_list *c_timer;
+ int timer_set;
+ int c_min_bw;
+ int c_max_bw;
+ int c_time_slice_expired;
+ int c_min_bw_token;
+ int c_max_bw_token;
+ int c_consumed_min_bw_token;
+ int c_is_over_max_bw;
+ int c_io_mode;
+ unsigned long c_time_slice;
+ unsigned long c_time_slice_start;
+ unsigned long c_time_slice_end;
+ int c_wait_p_count;
+
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED 1
+#define DEV_SUSPENDED 2
+
+#define set_device_blocked(dp) ((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp) ((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp) ((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp) ((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp) ((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp) ((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_SYNC 1
+#define IOG_PRIO_QUEUE 2
+#define IOG_BIO_BLOCKED 4
+#define IOG_GOING_DOWN 8
+#define IOG_SUSPENDED 16
+#define IOG_NEED_UP 32
+
+#define R_OK 0
+#define R_BLOCK 1
+#define R_YIELD 2
+
+#define set_group_blocked(gp) ((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp) ((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp) ((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp) ((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp) ((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp) ((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp) ((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp) ((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp) ((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp) ((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp) ((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp) ((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_async(gp) ((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_async(gp) ((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_async(gp) \
+ (((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == IOG_PRIO_QUEUE)
+
+#define set_prio_sync(gp) \
+ ((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define clear_prio_sync(gp) \
+ ((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define is_prio_sync(gp) \
+ (((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == \
+ (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+
+#define set_prio_queue(gp, sync) \
+ ((gp)->c_flags |= (IOG_PRIO_QUEUE|(sync)))
+#define clear_prio_queue(gp) clear_prio_sync(gp)
+#define is_prio_queue(gp) ((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_sync(gp) ((gp)->c_flags & IOG_PRIO_BIO_SYNC)
+
+#define nr_issued(dp) \
+ ((dp)->g_issued[BLK_RW_SYNC] + (dp)->g_issued[BLK_RW_ASYNC])
+
+struct ioband_policy_type {
+ const char *p_name;
+ int (*p_policy_init) (struct ioband_device *, int, char **);
+};
+
+extern const struct ioband_policy_type dm_ioband_policy_type[];
+
+struct ioband_group_type {
+ const char *t_name;
+ int (*t_getid) (struct bio *);
+};
+
+extern const struct ioband_group_type dm_ioband_group_type[];
+
+extern int policy_range_bw_init(struct ioband_device *, int, char **);
+
+#endif /* DM_IOBAND_H */
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -0,0 +1,669 @@
+/*
+ * dm-ioband-rangebw.c
+ *
+ * This is an I/O control policy to support range bandwidth in disk I/O.
+ * This policy is for the dm-ioband controller by Ryo Tsuruta and
+ * Hirokazu Takahashi.
+ *
+ * Copyright (C) 2008 - 2011
+ * Electronics and Telecommunications Research Institute(ETRI)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License(GPL) as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Contact Information:
+ * Dong-Jae, Kang <djkang at etri.re.kr>, Chei-Yol,Kim <gauri at etri.re.kr>,
+ * Sung-In,Jung <sijung at etri.re.kr>
+ */
+
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include <linux/jiffies.h>
+#include <linux/random.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+static void range_bw_timeover(unsigned long);
+static void range_bw_timer_register(struct timer_list *,
+ unsigned long, unsigned long);
+
+/*
+ * Functions for Range Bandwidth(range-bw) policy based on
+ * the time slice and token.
+ */
+#define DEFAULT_BUCKET 2
+#define DEFAULT_TOKENPOOL 2048
+
+#define TIME_SLICE_EXPIRED 1
+#define TIME_SLICE_NOT_EXPIRED 0
+
+#define MINBW_IO_MODE 0
+#define LEFTOVER_IO_MODE 1
+#define RANGE_IO_MODE 2
+#define DEFAULT_IO_MODE 3
+#define NO_IO_MODE 4
+
+#define MINBW_PRIO_BASE 10
+#define OVER_IO_RATE 4
+
+#define DEFAULT_RANGE_BW "0:0"
+#define DEFAULT_MIN_BW 0
+#define DEFAULT_MAX_BW 0
+
+static const int time_slice_base = HZ / 10;
+static const int range_time_slice_base = HZ / 50;
+static void do_nothing(void) {}
+/*
+ * g_restart_bios function for range-bw policy
+ */
+static int range_bw_restart_bios(struct ioband_device *dp)
+{
+ return 1;
+}
+
+/*
+ * Allocate the time slice when IO mode is MINBW_IO_MODE,
+ * RANGE_IO_MODE or LEFTOVER_IO_MODE
+ */
+static int set_time_slice(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int dp_io_mode, gp_io_mode;
+ unsigned long now = jiffies;
+
+ dp_io_mode = dp->g_io_mode;
+ gp_io_mode = gp->c_io_mode;
+
+ gp->c_time_slice_start = now;
+
+ if (dp_io_mode == LEFTOVER_IO_MODE) {
+ gp->c_time_slice_end = now + gp->c_time_slice;
+ return 0;
+ }
+
+ if (gp_io_mode == MINBW_IO_MODE)
+ gp->c_time_slice_end = now + gp->c_time_slice;
+ else if (gp_io_mode == RANGE_IO_MODE)
+ gp->c_time_slice_end = now + range_time_slice_base;
+ else if (gp_io_mode == DEFAULT_IO_MODE)
+ gp->c_time_slice_end = now + time_slice_base;
+ else if (gp_io_mode == NO_IO_MODE) {
+ gp->c_time_slice_end = 0;
+ gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+ return 0;
+ }
+
+ gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+
+ return 0;
+}
+
+/*
+ * Calculate the priority of given ioband_group
+ */
+static int range_bw_priority(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int prio = 0;
+
+ if (dp->g_io_mode == LEFTOVER_IO_MODE) {
+ prio = random32() % MINBW_PRIO_BASE;
+ if (prio == 0)
+ prio = 1;
+ } else if (gp->c_io_mode == MINBW_IO_MODE) {
+ prio = (gp->c_min_bw_token - gp->c_consumed_min_bw_token) *
+ MINBW_PRIO_BASE;
+ } else if (gp->c_io_mode == DEFAULT_IO_MODE) {
+ prio = MINBW_PRIO_BASE;
+ } else if (gp->c_io_mode == RANGE_IO_MODE) {
+ prio = MINBW_PRIO_BASE / 2;
+ } else {
+ prio = 0;
+ }
+
+ return prio;
+}
+
+/*
+ * Check whether this group has the right to issue an I/O in range-bw policy
+ * mode. Return 0 if it does not have the right, otherwise a non-zero value.
+ */
+static int has_right_to_issue(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int prio;
+
+ if (gp->c_prio_blocked > 0 || gp->c_blocked - gp->c_prio_blocked > 0) {
+ prio = range_bw_priority(gp);
+ if (prio <= 0)
+ return 1;
+ return prio;
+ }
+
+ if (gp == dp->g_running_gp) {
+
+ if (gp->c_time_slice_expired == TIME_SLICE_EXPIRED) {
+
+ gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+ gp->c_time_slice_end = 0;
+
+ return 0;
+ }
+
+ if (gp->c_time_slice_end == 0)
+ set_time_slice(gp);
+
+ return range_bw_priority(gp);
+
+ }
+
+ dp->g_running_gp = gp;
+ set_time_slice(gp);
+
+ return range_bw_priority(gp);
+}
+
+/*
+ * Reset all variables related to the range-bw token and time slice
+ */
+static int reset_range_bw_token(struct ioband_group *gp, unsigned long now)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_consumed_min_bw_token = 0;
+ p->c_is_over_max_bw = MAX_BW_UNDER;
+ if (p->c_io_mode != DEFAULT_IO_MODE)
+ p->c_io_mode = MINBW_IO_MODE;
+ }
+
+ dp->g_consumed_min_bw_token = 0;
+
+ dp->g_next_time_period = now + HZ;
+ dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+ dp->g_io_mode = MINBW_IO_MODE;
+
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ if (waitqueue_active(&p->c_max_bw_over_waitq))
+ wake_up_all(&p->c_max_bw_over_waitq);
+ }
+ return 0;
+}
+
+/*
+ * Use tokens (increase the number of consumed tokens) to issue an I/O
+ * for guaranteeing the range-bw, and check the expiration of the local and
+ * global time slices and the overflow of max bw
+ */
+static int range_bw_consume_token(struct ioband_group *gp, int count, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ unsigned long now = jiffies;
+
+ dp->g_current = gp;
+
+ if (dp->g_next_time_period == 0) {
+ dp->g_next_time_period = now + HZ;
+ dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+ }
+
+ if (time_after(now, dp->g_next_time_period)) {
+ reset_range_bw_token(gp, now);
+ } else {
+ gp->c_consumed_min_bw_token += count;
+ dp->g_consumed_min_bw_token += count;
+
+ if (gp->c_max_bw > 0 && gp->c_consumed_min_bw_token >=
+ gp->c_max_bw_token) {
+ gp->c_is_over_max_bw = MAX_BW_OVER;
+ gp->c_io_mode = NO_IO_MODE;
+ return R_YIELD;
+ }
+
+ if (gp->c_io_mode != RANGE_IO_MODE && gp->c_min_bw_token <=
+ gp->c_consumed_min_bw_token) {
+ gp->c_io_mode = RANGE_IO_MODE;
+
+ if (dp->g_total_min_bw_token <=
+ dp->g_consumed_min_bw_token) {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ if (p->c_io_mode != RANGE_IO_MODE &&
+ p->c_io_mode != DEFAULT_IO_MODE)
+ goto out;
+ }
+
+ if (dp->g_io_mode == MINBW_IO_MODE)
+ dp->g_io_mode = LEFTOVER_IO_MODE;
+ out:;
+ }
+ }
+ }
+
+ if (gp->c_time_slice_end != 0 &&
+ time_after(now, gp->c_time_slice_end)) {
+ gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+ return R_YIELD;
+ }
+
+ return R_OK;
+}
+
+static int is_no_io_mode(struct ioband_group *gp)
+{
+ if (gp->c_io_mode == NO_IO_MODE)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ * In the range-bw policy, we only check whether the ioband device should be blocked.
+ */
+static int range_bw_queue_full(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long now, time_step;
+
+ if (is_no_io_mode(gp)) {
+ now = jiffies;
+ if (time_after(dp->g_next_time_period, now)) {
+ time_step = dp->g_next_time_period - now;
+ range_bw_timer_register(gp->c_timer,
+ (time_step + TIME_COMPENSATOR),
+ (unsigned long)gp);
+ wait_event_lock_irq(gp->c_max_bw_over_waitq,
+ !is_no_io_mode(gp),
+ dp->g_lock, do_nothing());
+ }
+ }
+
+ return (gp->c_blocked >= gp->c_limit);
+}
+
+/*
+ * Convert the bw value to the number of bw tokens
+ * bw : bandwidth in Kbytes
+ * token_unit : (1 << token_unit) / 4 tokens are used for one 1Kbyte-size IO
+ * -- Attention : Currently, we support 512 bytes or 1 Kbyte per token
+ */
+static int convert_bw_to_token(int bw, int token_unit)
+{
+ int token;
+ int token_base;
+
+ token_base = (1 << token_unit) / 4;
+ token = bw * token_base;
+
+ return token;
+}
+
+
+/*
+ * Allocate the time slice for MINBW_IO_MODE to each group
+ */
+static void range_bw_time_slice_init(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+
+ if (dp->g_min_bw_total == 0)
+ p->c_time_slice = time_slice_base;
+ else
+ p->c_time_slice = time_slice_base +
+ ((time_slice_base *
+ ((p->c_min_bw + p->c_max_bw) / 2)) /
+ dp->g_min_bw_total);
+ }
+}
+
+/*
+ * Allocate the range_bw and range_bw_token to the given group
+ */
+static void set_range_bw(struct ioband_group *gp, int new_min, int new_max)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ int token_unit;
+
+ dp->g_min_bw_total += (new_min - gp->c_min_bw);
+ gp->c_min_bw = new_min;
+
+ dp->g_max_bw_total += (new_max - gp->c_max_bw);
+ gp->c_max_bw = new_max;
+
+ if (new_min)
+ gp->c_io_mode = MINBW_IO_MODE;
+ else
+ gp->c_io_mode = DEFAULT_IO_MODE;
+
+ range_bw_time_slice_init(gp);
+
+ token_unit = dp->g_token_unit;
+ gp->c_min_bw_token = convert_bw_to_token(new_min, token_unit);
+ dp->g_total_min_bw_token =
+ convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+ gp->c_max_bw_token = convert_bw_to_token(new_max, token_unit);
+
+ if (dp->g_min_bw_total == 0) {
+ list_for_each_entry(p, &dp->g_groups, c_list)
+ p->c_limit = 1;
+ } else {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+ dp->g_min_bw_total / OVER_IO_RATE + 1;
+ }
+ }
+
+ return;
+}
+
+/*
+ * Allocate the min_bw and min_bw_token to the given group
+ */
+static void set_min_bw(struct ioband_group *gp, int new)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ int token_unit;
+
+ dp->g_min_bw_total += (new - gp->c_min_bw);
+ gp->c_min_bw = new;
+
+ if (new)
+ gp->c_io_mode = MINBW_IO_MODE;
+ else
+ gp->c_io_mode = DEFAULT_IO_MODE;
+
+ range_bw_time_slice_init(gp);
+
+ token_unit = dp->g_token_unit;
+ gp->c_min_bw_token = convert_bw_to_token(gp->c_min_bw, token_unit);
+ dp->g_total_min_bw_token =
+ convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+ if (dp->g_min_bw_total == 0) {
+ list_for_each_entry(p, &dp->g_groups, c_list)
+ p->c_limit = 1;
+ } else {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+ dp->g_min_bw_total / OVER_IO_RATE + 1;
+ }
+ }
+
+ return;
+}
+
+/*
+ * Allocate the max_bw and max_bw_token to the given group
+ */
+static void set_max_bw(struct ioband_group *gp, int new)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int token_unit;
+
+ token_unit = dp->g_token_unit;
+
+ dp->g_max_bw_total += (new - gp->c_max_bw);
+ gp->c_max_bw = new;
+ gp->c_max_bw_token = convert_bw_to_token(new, token_unit);
+
+ range_bw_time_slice_init(gp);
+
+ return;
+
+}
+
+static void init_range_bw_token_bucket(struct ioband_device *dp, int val)
+{
+ dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+ dp->g_token_unit;
+ if (!val)
+ val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+ if (val < dp->g_token_bucket)
+ val = dp->g_token_bucket;
+ dp->g_carryover = val/dp->g_token_bucket;
+ dp->g_token_left = 0;
+}
+
+static int policy_range_bw_param(struct ioband_group *gp,
+ const char *cmd, const char *value)
+{
+ long val = 0, min_val = DEFAULT_MIN_BW, max_val = DEFAULT_MAX_BW;
+ int r = 0, err = 0;
+ char *endp;
+
+ if (value) {
+ min_val = simple_strtol(value, &endp, 0);
+ if (strchr(POLICY_PARAM_DELIM, *endp)) {
+ max_val = simple_strtol(endp + 1, &endp, 0);
+ if (*endp != '\0')
+ err++;
+ } else
+ err++;
+ }
+
+ if (!strcmp(cmd, "range-bw")) {
+ if (!err && 0 <= min_val &&
+ min_val <= (INT_MAX / 2) && 0 <= max_val &&
+ max_val <= (INT_MAX / 2) && min_val <= max_val)
+ set_range_bw(gp, min_val, max_val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "min-bw")) {
+ if (!err && 0 <= val && val <= (INT_MAX / 2))
+ set_min_bw(gp, val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "max-bw")) {
+ if ((!err && 0 <= val && val <= (INT_MAX / 2) &&
+ gp->c_min_bw <= val) || val == 0)
+ set_max_bw(gp, val);
+ else
+ r = -EINVAL;
+ } else {
+ r = -EINVAL;
+ }
+ return r;
+}
+
+static int policy_range_bw_ctr(struct ioband_group *gp, const char *arg)
+{
+ int ret;
+
+ init_waitqueue_head(&gp->c_max_bw_over_waitq);
+
+ gp->c_min_bw = 0;
+ gp->c_max_bw = 0;
+ gp->c_io_mode = DEFAULT_IO_MODE;
+ gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+ gp->c_min_bw_token = 0;
+ gp->c_max_bw_token = 0;
+ gp->c_consumed_min_bw_token = 0;
+ gp->c_is_over_max_bw = MAX_BW_UNDER;
+ gp->c_time_slice_start = 0;
+ gp->c_time_slice_end = 0;
+ gp->c_wait_p_count = 0;
+
+ gp->c_time_slice = time_slice_base;
+
+ gp->c_timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL);
+ if (gp->c_timer == NULL)
+ return -ENOMEM;
+ memset(gp->c_timer, 0, sizeof(struct timer_list));
+ gp->timer_set = 0;
+
+ ret = policy_range_bw_param(gp, "range-bw", arg);
+
+ return ret;
+}
+
+static void policy_range_bw_dtr(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ gp->c_time_slice = 0;
+ set_range_bw(gp, 0, 0);
+
+ dp->g_running_gp = NULL;
+
+ if (gp->c_timer != NULL) {
+ del_timer(gp->c_timer);
+ kfree(gp->c_timer);
+ }
+}
+
+static void policy_range_bw_show(struct ioband_group *gp, int *szp,
+ char *result, unsigned int maxlen)
+{
+ struct ioband_group *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d :%d:%d", dp->g_token_bucket * dp->g_carryover,
+ gp->c_min_bw, gp->c_max_bw);
+
+ for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ DMEMIT(" %d:%d:%d", p->c_id, p->c_min_bw, p->c_max_bw);
+ }
+ *szp = sz;
+}
+
+static int range_bw_prepare_token(struct ioband_group *gp,
+ struct bio *bio, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int unit;
+ int bio_count;
+ int token_count = 0;
+
+ unit = (1 << dp->g_token_unit);
+ bio_count = bio_sectors(bio);
+
+ if (unit == 8)
+ token_count = bio_count;
+ else if (unit == 4)
+ token_count = bio_count / 2;
+ else if (unit == 2)
+ token_count = bio_count / 4;
+ else if (unit == 1)
+ token_count = bio_count / 8;
+
+ return range_bw_consume_token(gp, token_count, flag);
+}
+
+static void range_bw_timer_register(struct timer_list *ptimer,
+ unsigned long timeover, unsigned long gp)
+{
+ struct ioband_group *group = (struct ioband_group *)gp;
+
+ if (group->timer_set == 0) {
+ init_timer(ptimer);
+ ptimer->expires = get_jiffies_64() + timeover;
+ ptimer->data = gp;
+ ptimer->function = range_bw_timeover;
+ add_timer(ptimer);
+ group->timer_set = 1;
+ }
+}
+
+/*
+ * Timer handler function to prevent all processes from hanging in a
+ * low min-bw configuration
+ */
+static void range_bw_timeover(unsigned long gp)
+{
+ struct ioband_group *group = (struct ioband_group *)gp;
+
+ if (group->c_is_over_max_bw == MAX_BW_OVER)
+ group->c_is_over_max_bw = MAX_BW_UNDER;
+
+ if (group->c_io_mode == NO_IO_MODE)
+ group->c_io_mode = MINBW_IO_MODE;
+
+ if (waitqueue_active(&group->c_max_bw_over_waitq))
+ wake_up_all(&group->c_max_bw_over_waitq);
+
+ group->timer_set = 0;
+}
+
+/*
+ * <Method> <description>
+ * g_can_submit : To determine whether a given group has the right to
+ * submit BIOs. The larger the return value the higher the
+ * priority to submit. Zero means it has no right.
+ * g_prepare_bio : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ * of them can be submitted now. This method has to
+ * reinitialize the data to restart to submit BIOs and return
+ * 0 or 1.
+ * The return value 0 means that it has become able to submit
+ * them now so that this ioband device will continue its work.
+ * The return value 1 means that it is still unable to submit
+ * them so that this device will stop its work. And this
+ * policy module has to reactivate the device when it gets
+ * to be able to submit BIOs.
+ * g_hold_bio : To hold a given BIO until it is submitted.
+ * The default function is used when this method is undefined.
+ * g_pop_bio : To select and get the best BIO to submit.
+ * g_group_ctr : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr : Called when struct ioband_group is removed.
+ * g_set_param : To update the policy's own data.
+ * The parameters can be passed through the "dmsetup message"
+ * command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ * Return 1 if a given group can't receive any more BIOs,
+ * otherwise return 0.
+ * g_show : Show the configuration.
+ */
+
+int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0)
+ return -EINVAL;
+ }
+
+ dp->g_can_submit = has_right_to_issue;
+ dp->g_prepare_bio = range_bw_prepare_token;
+ dp->g_restart_bios = range_bw_restart_bios;
+ dp->g_group_ctr = policy_range_bw_ctr;
+ dp->g_group_dtr = policy_range_bw_dtr;
+ dp->g_set_param = policy_range_bw_param;
+ dp->g_should_block = range_bw_queue_full;
+ dp->g_show = policy_range_bw_show;
+
+ dp->g_min_bw_total = 0;
+ dp->g_running_gp = NULL;
+ dp->g_total_min_bw_token = 0;
+ dp->g_io_mode = MINBW_IO_MODE;
+ dp->g_consumed_min_bw_token = 0;
+ dp->g_current = NULL;
+ dp->g_next_time_period = 0;
+ dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+
+ dp->g_token_unit = PAGE_SHIFT - 9;
+ init_range_bw_token_bucket(dp, val);
+
+ return 0;
+}
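
Reviewer note on the arithmetic in dm-ioband-rangebw.c: the token conversion
and the per-group time slice are easier to check with concrete numbers. The
user-space sketch below copies the integer arithmetic of
convert_bw_to_token(), range_bw_prepare_token() and
range_bw_time_slice_init() into helpers with hypothetical demo_ names. It
assumes 4 Kbyte pages (so g_token_unit is PAGE_SHIFT - 9 = 3) and takes the
time slice base as a parameter, because HZ depends on the kernel
configuration.

```c
/* Tokens granted per period for bw_kb Kbytes (as in convert_bw_to_token). */
static int demo_bw_to_token(int bw_kb, int token_unit)
{
	int token_base = (1 << token_unit) / 4;	/* tokens per Kbyte */

	return bw_kb * token_base;
}

/* Tokens charged for a bio of the given size in 512-byte sectors
 * (same cases as the if/else chain in range_bw_prepare_token). */
static int demo_sectors_to_token(int sectors, int token_unit)
{
	switch (1 << token_unit) {
	case 8:
		return sectors;
	case 4:
		return sectors / 2;
	case 2:
		return sectors / 4;
	case 1:
		return sectors / 8;
	default:
		return 0;	/* the original leaves token_count at 0 here */
	}
}

/* Time slice of one group (as in range_bw_time_slice_init). */
static int demo_time_slice(int base, int min_bw, int max_bw, int min_bw_total)
{
	if (min_bw_total == 0)
		return base;
	return base + (base * ((min_bw + max_bw) / 2)) / min_bw_total;
}
```

With token_unit = 3, a minimum of 100 Kbytes maps to 200 tokens and a 4 Kbyte
bio (8 sectors) costs 8 tokens, so 25 such bios use up the minimum for one
period. Taking a base of 100 jiffies (HZ / 10 when HZ is 1000) and the three
groups from the documentation example (minimum total 30800), the 30000:35000
group gets a slice of 205 jiffies while the 100:500 group stays at 100
because the integer division rounds its share down to zero.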
Index: linux-2.6.31/Documentation/device-mapper/range-bw.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/range-bw.txt
@@ -0,0 +1,99 @@
+Range-BW I/O controller by Dong-Jae Kang <djkang at etri.re.kr>
+
+
+1. Introduction
+===============
+
+The design of Range-BW is related to three other parts, Cgroup,
+bio-cgroup (or blkio-cgroup) and dm-ioband, and it was implemented as
+an additional controller for dm-ioband.
+The Cgroup framework is used to support the process grouping mechanism
+and bio-cgroup is used to control delayed I/O or non-direct I/O. Finally,
+dm-ioband is a kind of I/O controller that allocates proportional I/O
+bandwidth to process groups based on their priority.
+The proposed controller supports process group-based range
+bandwidth according to the priority or importance of the group. Range
+bandwidth means predictable I/O bandwidth with minimum and maximum
+values defined by the administrator.
+
+Minimum I/O bandwidth should be guaranteed for stable performance or
+reliability of a specific service, and I/O bandwidth over the maximum
+should be throttled to protect the limited I/O resource from
+over-provisioning by unnecessary usage or to reserve the I/O bandwidth
+for another use.
+So, Range-BW was implemented to include the two concepts, guaranteeing
+the minimum I/O requirement and limiting unnecessary bandwidth
+depending on its priority.
+It was implemented as a device-mapper driver, like dm-ioband.
+So, it is independent of the specific underlying I/O scheduler, for
+example, CFQ, AS, NOOP, deadline and so on.
+
+* Attention
+Range-BW supports predictable I/O bandwidth, but it should be
+configured within the total I/O bandwidth of the I/O system in
+order to guarantee the minimum I/O requirement. For example, if
+the total I/O bandwidth is 40 Mbytes/sec, the sum of the minimum
+I/O bandwidths configured for all process groups should be equal
+to or smaller than 40 Mbytes/sec.
+So, you need to check the total I/O bandwidth of the system
+before setting it up.
+
+2. Setup and Installation
+=========================
+
+This part is the same as for dm-ioband; see
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband/man/setup
+except for the allocation of range-bw values.
+
+3. Usage
+========
+
+It is very useful to refer to the documentation for dm-ioband in
+../../Documentation/device-mapper/ioband.txt or at
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband, because
+Range-BW follows the basic semantics of dm-ioband.
+
+The following example shows a range-bw configuration.
+
+# mount the cgroup
+mount -t cgroup -o blkio none /root/cgroup/blkio
+
+# create the process groups (3 groups)
+mkdir /root/cgroup/blkio/bgroup1
+mkdir /root/cgroup/blkio/bgroup2
+mkdir /root/cgroup/blkio/bgroup3
+
+# create the ioband device ( name : ioband1 )
+echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none \
+range-bw 0 :0:0" | dmsetup create ioband1
+: Attention - the device name (/dev/sdb2) should be changed to match
+your system
+
+# init ioband device ( type and policy )
+dmsetup message ioband1 0 type cgroup
+dmsetup message ioband1 0 policy range-bw
+
+# attach the groups to the ioband device
+dmsetup message ioband1 0 attach 2
+dmsetup message ioband1 0 attach 3
+dmsetup message ioband1 0 attach 4
+: the group number can be found in /root/cgroup/blkio/bgroup1/blkio.id
+
+# allocate the values ( range-bw ) : XXX Kbytes/sec
+: the sum of the minimum I/O bandwidths of all groups should be equal to
+or smaller than the total bandwidth supported by your system
+
+# range : about 100~500 Kbytes/sec
+dmsetup message ioband1 0 range-bw 2:100:500
+
+# range : about 700~1000 Kbytes/sec
+dmsetup message ioband1 0 range-bw 3:700:1000
+
+# range : about 30~35 Mbytes/sec
+dmsetup message ioband1 0 range-bw 4:30000:35000
+
+You can confirm the configuration of range-bw by using this command :
+[root at localhost range-bw]# dmsetup table --target ioband
+ioband1: 0 305235000 ioband 8:18 1 4 128 cgroup \
+ range-bw 16384 :0:0 2:100:500 3:700:1000 4:30000:35000
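
Reviewer note on the Attention paragraph of range-bw.txt: the rule that the
minimum bandwidths must fit inside the total device bandwidth can be checked
before any dmsetup message is sent. The sketch below is a user-space helper,
not part of the patch. The names demo_range, demo_groups and
demo_range_config_ok are hypothetical, the min <= max test mirrors the check
that policy_range_bw_param() applies to the range-bw command, and treating 40
Mbytes/sec as 40000 Kbytes/sec is an assumption that follows the
documentation's own use of 30000 for 30 Mbytes.

```c
#include <stddef.h>

/* One "id:min:max" triple as passed to "dmsetup message ... range-bw". */
struct demo_range {
	int id;
	long min_kb;
	long max_kb;
};

/* The three groups from the documentation example. */
static const struct demo_range demo_groups[] = {
	{ 2, 100, 500 },
	{ 3, 700, 1000 },
	{ 4, 30000, 35000 },
};

/*
 * Return 1 if every range is well formed (0 <= min <= max) and the sum of
 * the minimum values does not exceed total_kb, otherwise return 0.
 */
static int demo_range_config_ok(const struct demo_range *r, size_t n,
				long total_kb)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (r[i].min_kb < 0 || r[i].min_kb > r[i].max_kb)
			return 0;
		sum += r[i].min_kb;
	}
	return sum <= total_kb;
}
```

For the example groups the minimum values add up to 30800 Kbytes/sec, so the
configuration fits a 40000 Kbytes/sec device but would not fit one that
sustains only 30000 Kbytes/sec.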
Index: linux-2.6.31/include/trace/events/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/include/trace/events/dm-ioband.h
@@ -0,0 +1,242 @@
+#if !defined(_TRACE_DM_IOBAND_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DM_IOBAND_H
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dm-ioband
+
+TRACE_EVENT(ioband_hold_urgent_bio,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_hold_bio,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_pback_list,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_issue_list,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_release_urgent_bios,
+
+ TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+ TP_ARGS(dp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, dp->g_name )
+ __field( int, g_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, dp->g_name);
+ __entry->g_blocked = dp->g_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s: %d,%d %c %llu + %u %d",
+ __get_str(g_name),
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked)
+);
+
+TRACE_EVENT(ioband_make_request,
+
+ TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+ TP_ARGS(dp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, dp->g_name )
+ __field( int, c_id )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, dp->g_name);
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s: %d,%d %c %llu + %u",
+ __get_str(g_name),
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector)
+);
+
+TRACE_EVENT(ioband_pushback_bio,
+
+ TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+ TP_ARGS(dp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, dp->g_name )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, dp->g_name);
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s: %d,%d %c %llu + %u",
+ __get_str(g_name),
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector)
+);
+
+#endif /* _TRACE_DM_IOBAND_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>