Justin T. Gibbs
2012-Feb-20 18:07 UTC
[PATCH 0 of 4 v3] blkif.h: Document protocol and existing extensions
This patch series attempts to document the blkif PV interface and the various extensions to it that are out in the wild. Changes in v3: patch 3 (blkif.h: Document the RedHat and Citrix blkif multi-page ring extensions.): o Mark the RedHat/Amazon EC2 ring extension nodes as deprecated. While there is nothing fundamentally wrong with the method used there, it was not publicly documented (not even in patch form), and differs from the "page-order" large ring XenBus node convention used for other drivers. patch 4 (blkif.h: Define and document the request number/size/segments extension.) o Fix typo "max-ring_pages" -> "max-ring-pages". Changes in v2: patch 2 (blkif.h: Provide more complete documentation of the blkif interface.): o Mark backend device identification section as private to the backend driver. o Refer to docs/misc/vbd-interface.txt for the format of the virtual-device front-end node. o Correct field size for the virtual-device front-end node. patch 3 (blkif.h: Document the RedHat and Citrix blkif multi-page ring extensions.): o Correct node names for the RedHat/Amazon multi-ring extension. The previous patch mistakenly documented the now defunct FreeBSD extension. o Clarify the note on multi-page ring interoperability to indicate that identical ring parameters should be published to the XenStore for all supported schemes. v1 patch 4 (Deleted): o Remove patch that added the XEN_*_MAJOR definitions. patch 4 (blkif.h: Define and document the request number/size/segments extension.) o Bump __XEN_LATEST_INTERFACE_VERSION to 0x00040201 and use this version to guard the change to BLKIF_MAX_SEGMENTS_PER_REQUEST. -- Justin
o Remove trailing whitespace. o Remove a blank line so that a comment block is adjacent to the code it documents. No functional changes. Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> diff -r 87218bd367be -r 28137a4e39a3 xen/include/public/io/blkif.h --- a/xen/include/public/io/blkif.h Fri Feb 17 12:24:38 2012 +0000 +++ b/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 @@ -1,8 +1,8 @@ /****************************************************************************** * blkif.h - * + * * Unified block-device I/O interface for Xen guest OSes. - * + * * Permission is hereby granted, free of charge, to any person obtaining a copy * of this software and associated documentation files (the "Software"), to * deal in the Software without restriction, including without limitation the @@ -35,7 +35,7 @@ * notification can be made conditional on req_event (i.e., the generic * hold-off mechanism provided by the ring macros). Backends must set * req_event appropriately (e.g., using RING_FINAL_CHECK_FOR_REQUESTS()). - * + * * Back->front notifications: When enqueuing a new response, sending a * notification can be made conditional on rsp_event (i.e., the generic * hold-off mechanism provided by the ring macros). Frontends must set @@ -133,7 +133,7 @@ */ #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11 -/* +/* * NB. first_sect and last_sect in blkif_request_segment, as well as * sector_number in blkif_request, are always expressed in 512-byte units. * However they must be properly aligned to the real sector size of the @@ -192,7 +192,6 @@ typedef struct blkif_response blkif_resp /* * Generate blkif ring structures and types. */ - DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response); #define VDISK_CDROM 0x1
Justin T. Gibbs
2012-Feb-20 18:07 UTC
[PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
o Document the XenBus nodes used in this protocol. o Add a state diagram illustrating the roles and responsibilities of both the front and backend during startup. o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. No functional changes. Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> diff -r 28137a4e39a3 -r e79902456819 xen/include/public/io/blkif.h --- a/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 +++ b/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 @@ -22,6 +22,7 @@ * DEALINGS IN THE SOFTWARE. * * Copyright (c) 2003-2004, Keir Fraser + * Copyright (c) 2012, Spectra Logic Corporation */ #ifndef __XEN_PUBLIC_IO_BLKIF_H__ @@ -48,32 +49,253 @@ #define blkif_sector_t uint64_t /* + * Feature and Parameter Negotiation + * ================================+ * The two halves of a Xen block driver utilize nodes within the XenStore to + * communicate capabilities and to negotiate operating parameters. This + * section enumerates these nodes which reside in the respective front and + * backend portions of the XenStore, following the XenBus convention. + * + * All data in the XenStore is stored as strings. Nodes specifying numeric + * values are encoded in decimal. Integer value ranges listed below are + * expressed as fixed sized integer types capable of storing the conversion + * of a properly formated node string, without loss of information. + * + * Any specified default value is in effect if the corresponding XenBus node + * is not present in the XenStore. + * + * XenStore nodes in sections marked "PRIVATE" are solely for use by the + * driver side whose XenBus tree contains them. + * + * See the XenBus state transition diagram below for details on when XenBus + * nodes must be published and when they can be queried. + * + ***************************************************************************** + * Backend XenBus Nodes + ***************************************************************************** + * + *------------------ Backend Device Identification (PRIVATE) ------------------ + * + * mode + * Values: "r" (read only), "w" (writable) + * + * The read or write access permissions to the backing store to be + * granted to the frontend. + * + * params + * Values: string + * + * A free formatted string providing sufficient information for the + * backend driver to open the backing device. (e.g. the path to the + * file or block device representing the backing store.) + * + * type + * Values: "file", "phy", "tap" + * + * The type of the backing device/object. + * + *--------------------------------- Features --------------------------------- + * + * feature-barrier + * Values: 0/1 (boolean) + * Default Value: 0 + * + * A value of "1" indicates that the backend can process requests + * containing the BLKIF_OP_WRITE_BARRIER request opcode. Requests + * of this type may still be returned at any time with the + * BLKIF_RSP_EOPNOTSUPP result code. + * + * feature-flush-cache + * Values: 0/1 (boolean) + * Default Value: 0 + * + * A value of "1" indicates that the backend can process requests + * containing the BLKIF_OP_FLUSH_DISKCACHE request opcode. Requests + * of this type may still be returned at any time with the + * BLKIF_RSP_EOPNOTSUPP result code. + * + * feature-discard + * Values: 0/1 (boolean) + * Default Value: 0 + * + * A value of "1" indicates that the backend can process requests + * containing the BLKIF_OP_DISCARD request opcode. Requests + * of this type may still be returned at any time with the + * BLKIF_RSP_EOPNOTSUPP result code. + * + *------------------------- Backend Device Properties ------------------------- + * + * discard-aligment + * Values: <uint32_t> + * Default Value: 0 + * Notes: 1, 2 + * + * The offset, in bytes from the beginning of the virtual block device, + * to the first, addressable, discard extent on the underlying device. + * + * discard-granularity + * Values: <uint32_t> + * Default Value: <"sector-size"> + * Notes: 1 + * + * The size, in bytes, of the individually addressable discard extents + * of the underlying device. + * + * discard-secure + * Values: 0/1 (boolean) + * Default Value: 0 + * + * A value of "1" indicates that the backend can process BLKIF_OP_DISCARD + * requests with the BLKIF_DISCARD_SECURE flag set. + * + * info + * Values: <uint32_t> (bitmap) + * + * A collection of bit flags describing attributes of the backing + * device. The VDISK_* macros define the meaning of each bit + * location. + * + * sector-size + * Values: <uint32_t> + * + * The native sector size, in bytes, of the backend device. + * + * sectors + * Values: <uint64_t> + * + * The size of the backend device, expressed in units of its native + * sector size ("sector-size"). + * + ***************************************************************************** + * Frontend XenBus Nodes + ***************************************************************************** + * + *----------------------- Request Transport Parameters ----------------------- + * + * event-channel + * Values: <uint32_t> + * + * The identifier of the Xen event channel used to signal activity + * in the ring buffer. + * + * ring-ref + * Values: <uint32_t> + * + * The Xen grant reference granting permission for the backend to map + * the sole page in a single page sized ring buffer. + * + * protocol + * Values: string (XEN_IO_PROTO_ABI_*) + * Default Value: XEN_IO_PROTO_ABI_NATIVE + * + * The machine ABI rules governing the format of all ring request and + * response structures. + * + *------------------------- Virtual Device Properties ------------------------- + * + * device-type + * Values: "disk", "cdrom", "floppy", etc. + * + * virtual-device + * Values: <uint32_t> + * + * A value indicating the physical device to virtualize within the + * frontend''s domain. (e.g. "The first ATA disk", "The third SCSI + * disk", etc.) + * + * See docs/misc/vbd-interface.txt for details on the format of this + * value. + * + * Notes + * ----- + * (1) Devices that support discard functionality may internally allocate + * space (discardable extents) in units that are larger than the + * exported logical block size. + * (2) The discard-alignment parameter allows a physical device to be + * partitioned into virtual devices that do not necessarily begin or + * end on a discardable extent boundary. + */ + +/* + * STATE DIAGRAMS + * + ***************************************************************************** + * Startup * + ***************************************************************************** + * + * Tool stack creates front and back nodes with state XenbusStateInitialising. + * + * Front Back + * ================================= ====================================+ * XenbusStateInitialising XenbusStateInitialising + * o Query virtual device o Query backend device identification + * properties. data. + * o Setup OS device instance. o Open and validate backend device. + * o Publish backend features. + * | + * | + * V + * XenbusStateInitWait + * + * o Query backend features. + * o Allocate and initialize the + * request ring. + * | + * | + * V + * XenbusStateInitialised + * + * o Connect to the request ring and + * event channel. + * o Publish backend device properties. + * | + * | + * V + * XenbusStateConnected + * + * o Query backend device properties. + * o Finalize OS virtual device + * instance. + * | + * | + * V + * XenbusStateConnected + * + * Note: Drivers that do not support any optional features can skip certain + * states in the state machine: + * + * o A frontend may transition to XenbusStateInitialised without + * waiting for the backend to enter XenbusStateInitWait. + * + * o A backend may transition to XenbusStateInitialised, bypassing + * XenbusStateInitWait, without waiting for the frontend to first + * enter the XenbusStateInitialised state. + * + * Drivers that support optional features must tolerate these additional + * state transition paths. In general this means performing the work of + * any skipped state transition, if it has not already been performed, + * in addition to the work associated with entry into the current state. + */ + +/* * REQUEST CODES. */ #define BLKIF_OP_READ 0 #define BLKIF_OP_WRITE 1 /* - * Recognised only if "feature-barrier" is present in backend xenbus info. - * The "feature-barrier" node contains a boolean indicating whether barrier - * requests are likely to succeed or fail. Either way, a barrier request - * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by - * the underlying block-device hardware. The boolean simply indicates whether - * or not it is worthwhile for the frontend to attempt barrier requests. - * If a backend does not recognise BLKIF_OP_WRITE_BARRIER, it should *not* - * create the "feature-barrier" node! + * All writes issued prior to a request with the BLKIF_OP_WRITE_BARRIER + * operation code ("barrier request") must be completed prior to the + * execution of the barrier request. All writes issued after the barrier + * request must not execute until after the completion of the barrier request. + * + * Optional. See "feature-barrier" XenBus node documentation above. */ #define BLKIF_OP_WRITE_BARRIER 2 /* - * Recognised if "feature-flush-cache" is present in backend xenbus - * info. A flush will ask the underlying storage hardware to flush its - * non-volatile caches as appropriate. The "feature-flush-cache" node - * contains a boolean indicating whether flush requests are likely to - * succeed or fail. Either way, a flush request may fail at any time - * with BLKIF_RSP_EOPNOTSUPP if it is unsupported by the underlying - * block-device hardware. The boolean simply indicates whether or not it - * is worthwhile for the frontend to attempt flushes. If a backend does - * not recognise BLKIF_OP_WRITE_FLUSH_CACHE, it should *not* create the - * "feature-flush-cache" node! + * Commit any uncommitted contents of the backing device''s volatile cache + * to stable storage. + * + * Optional. See "feature-flush-cache" XenBus node documentation above. */ #define BLKIF_OP_FLUSH_DISKCACHE 3 /* @@ -82,47 +304,24 @@ */ #define BLKIF_OP_RESERVED_1 4 /* - * Recognised only if "feature-discard" is present in backend xenbus info. - * The "feature-discard" node contains a boolean indicating whether trim - * (ATA) or unmap (SCSI) - conviently called discard requests are likely - * to succeed or fail. Either way, a discard request - * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by - * the underlying block-device hardware. The boolean simply indicates whether - * or not it is worthwhile for the frontend to attempt discard requests. - * If a backend does not recognise BLKIF_OP_DISCARD, it should *not* - * create the "feature-discard" node! + * Indicate to the backend device that a region of storage is no longer in + * use, and may be discarded at any time without impact to the client. If + * the BLKIF_DISCARD_SECURE flag is set on the request, all copies of the + * discarded region on the device must be rendered unrecoverable before the + * command returns. * - * Discard operation is a request for the underlying block device to mark - * extents to be erased. However, discard does not guarantee that the blocks - * will be erased from the device - it is just a hint to the device - * controller that these blocks are no longer in use. What the device - * controller does with that information is left to the controller. - * Discard operations are passed with sector_number as the - * sector index to begin discard operations at and nr_sectors as the number of - * sectors to be discarded. The specified sectors should be discarded if the - * underlying block device supports trim (ATA) or unmap (SCSI) operations, - * or a BLKIF_RSP_EOPNOTSUPP should be returned. - * More information about trim/unmap operations at: + * This operation is analogous to performing a trim (ATA) or unamp (SCSI), + * command on a native device. + * + * More information about trim/unmap operations can be found at: * http://t13.org/Documents/UploadedDocuments/docs2008/ * e07154r6-Data_Set_Management_Proposal_for_ATA-ACS2.doc * http://www.seagate.com/staticfiles/support/disc/manuals/ * Interface%20manuals/100293068c.pdf - * The backend can optionally provide these extra XenBus attributes to - * further optimize the discard functionality: - * ''discard-aligment'' - Devices that support discard functionality may - * internally allocate space in units that are bigger than the exported - * logical block size. The discard-alignment parameter indicates how many bytes - * the beginning of the partition is offset from the internal allocation unit''s - * natural alignment. Do not confuse this with natural disk alignment offset. - * ''discard-granularity'' - Devices that support discard functionality may - * internally allocate space using units that are bigger than the logical block - * size. The discard-granularity parameter indicates the size of the internal - * allocation unit in bytes if reported by the device. Otherwise the - * discard-granularity will be set to match the device''s physical block size. - * It is the minimum size you can discard. - * ''discard-secure'' - All copies of the discarded sectors (potentially created - * by garbage collection) must also be erased. To use this feature, the flag - * BLKIF_DISCARD_SECURE must be set in the blkif_request_discard. + * + * Optional. See "feature-discard", "discard-alignment", + * "discard-granularity", and "discard-secure" in the XenBus node + * documentation above. */ #define BLKIF_OP_DISCARD 5 @@ -147,6 +346,9 @@ struct blkif_request_segment { uint8_t first_sect, last_sect; }; +/* + * Starting ring element for any I/O request. + */ struct blkif_request { uint8_t operation; /* BLKIF_OP_??? */ uint8_t nr_segments; /* number of segments */ @@ -158,7 +360,7 @@ struct blkif_request { typedef struct blkif_request blkif_request_t; /* - * Cast to this structure when blkif_request.operation == BLKIF_OP_TRIM + * Cast to this structure when blkif_request.operation == BLKIF_OP_DISCARD * sizeof(struct blkif_request_discard) <= sizeof(struct blkif_request) */ struct blkif_request_discard {
Justin T. Gibbs
2012-Feb-20 18:07 UTC
[PATCH 3 of 4 v3] blkif.h: Document the RedHat and Citrix blkif multi-page ring extensions
No functional changes. Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> diff -r e79902456819 -r 09051133e2fe xen/include/public/io/blkif.h --- a/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 @@ -67,6 +67,9 @@ * XenStore nodes in sections marked "PRIVATE" are solely for use by the * driver side whose XenBus tree contains them. * + * XenStore nodes marked "DEPRECATED" in their notes section should only be + * used to provide interoperability with legacy implementations. + * * See the XenBus state transition diagram below for details on when XenBus * nodes must be published and when they can be queried. * @@ -123,12 +126,31 @@ * of this type may still be returned at any time with the * BLKIF_RSP_EOPNOTSUPP result code. * + *----------------------- Request Transport Parameters ------------------------ + * + * max-ring-page-order + * Values: <uint32_t> + * Default Value: 0 + * Notes: 1, 3 + * + * The maximum supported size of the request ring buffer in units of + * lb(machine pages). (e.g. 0 == 1 page, 1 = 2 pages, 2 == 4 pages, + * etc.). + * + * max-ring-pages + * Values: <uint32_t> + * Default Value: 1 + * Notes: DEPRECATED, 2, 3 + * + * The maximum supported size of the request ring buffer in units of + * machine pages. The value must be a power of 2. + * *------------------------- Backend Device Properties ------------------------- * * discard-aligment * Values: <uint32_t> * Default Value: 0 - * Notes: 1, 2 + * Notes: 4, 5 * * The offset, in bytes from the beginning of the virtual block device, * to the first, addressable, discard extent on the underlying device. @@ -136,7 +158,7 @@ * discard-granularity * Values: <uint32_t> * Default Value: <"sector-size"> - * Notes: 1 + * Notes: 4 * * The size, in bytes, of the individually addressable discard extents * of the underlying device. @@ -180,10 +202,20 @@ * * ring-ref * Values: <uint32_t> + * Notes: 6 * * The Xen grant reference granting permission for the backend to map * the sole page in a single page sized ring buffer. * + * ring-ref%u + * Values: <uint32_t> + * Notes: 6 + * + * For a frontend providing a multi-page ring, a "number of ring pages" + * sized list of nodes, each containing a Xen grant reference granting + * permission for the backend to map the page of the ring located + * at page index "%u". Page indexes are zero based. + * * protocol * Values: string (XEN_IO_PROTO_ABI_*) * Default Value: XEN_IO_PROTO_ABI_NATIVE @@ -191,6 +223,25 @@ * The machine ABI rules governing the format of all ring request and * response structures. * + * ring-page-order + * Values: <uint32_t> + * Default Value: 0 + * Maximum Value: MAX(ffs(max-ring-pages) - 1, max-ring-page-order) + * Notes: 1, 3 + * + * The size of the frontend allocated request ring buffer in units + * of lb(machine pages). (e.g. 0 == 1 page, 1 = 2 pages, 2 == 4 pages, + * etc.). + * + * num-ring-pages + * Values: <uint32_t> + * Default Value: 1 + * Maximum Value: MAX(max-ring-pages,(0x1 << max-ring-page-order)) + * Notes: DEPRECATED, 2, 3 + * + * The size of the frontend allocated request ring buffer in units of + * machine pages. The value must be a power of 2. + * *------------------------- Virtual Device Properties ------------------------- * * device-type @@ -208,12 +259,26 @@ * * Notes * ----- - * (1) Devices that support discard functionality may internally allocate + * (1) Multi-page ring buffer scheme first developed in the Citrix XenServer + * PV drivers. + * (2) Multi-page ring buffer scheme first used in some RedHat distributions + * including a distribution deployed on certain nodes of the Amazon + * EC2 cluster. + * (3) Support for multi-page ring buffers was implemented independently, + * in slightly different forms, by both Citrix and RedHat/Amazon. + * For full interoperability, block front and backends should publish + * identical ring parameters, adjusted for unit differences, to the + * XenStore nodes used in both schemes. + * (4) Devices that support discard functionality may internally allocate * space (discardable extents) in units that are larger than the * exported logical block size. - * (2) The discard-alignment parameter allows a physical device to be + * (5) The discard-alignment parameter allows a physical device to be * partitioned into virtual devices that do not necessarily begin or * end on a discardable extent boundary. + * (6) When there is only a single page allocated to the request ring, + * ''ring-ref'' is used to communicate the grant reference for this + * page to the backend. When using a multi-page ring, the ''ring-ref'' + * node is not created. Instead ''ring-ref0'' - ''ring-refN'' are used. */ /* @@ -231,20 +296,26 @@ * o Query virtual device o Query backend device identification * properties. data. * o Setup OS device instance. o Open and validate backend device. - * o Publish backend features. + * o Publish backend features and + * transport parameters. * | * | * V * XenbusStateInitWait * - * o Query backend features. + * o Query backend features and + * transport parameters. * o Allocate and initialize the * request ring. + * o Publish transport parameters + * that will be in effect during + * this connection. * | * | * V * XenbusStateInitialised * + * o Query frontend transport parameters. * o Connect to the request ring and * event channel. * o Publish backend device properties. @@ -261,20 +332,26 @@ * V * XenbusStateConnected * - * Note: Drivers that do not support any optional features can skip certain - * states in the state machine: + * Note: Drivers that do not support any optional features, or the negotiation + * of transport parameters, can skip certain states in the state machine: * * o A frontend may transition to XenbusStateInitialised without - * waiting for the backend to enter XenbusStateInitWait. + * waiting for the backend to enter XenbusStateInitWait. In this + * case, default transport parameters are in effect and any + * transport parameters published by the frontend must contain + * their default values. * * o A backend may transition to XenbusStateInitialised, bypassing * XenbusStateInitWait, without waiting for the frontend to first - * enter the XenbusStateInitialised state. + * enter the XenbusStateInitialised state. In this case, default + * transport parameters are in effect and any transport parameters + * published by the backend must contain their default values. * - * Drivers that support optional features must tolerate these additional - * state transition paths. In general this means performing the work of - * any skipped state transition, if it has not already been performed, - * in addition to the work associated with entry into the current state. + * Drivers that support optional features and/or transport parameter + * negotiation must tolerate these additional state transition paths. + * In general this means performing the work of any skipped state + * transition, if it has not already been performed, in addition to the + * work associated with entry into the current state. */ /*
Justin T. Gibbs
2012-Feb-20 18:07 UTC
[PATCH 4 of 4 v3] blkif.h: Define and document the request number/size/segments extension
Note: As of __XEN_INTERFACE_VERSION__ 0x00040201 the definition of BLKIF_MAX_SEGMENTS_PER_REQUEST has changed. Drivers must be updated to, at minimum, use BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, before being recompiled with a __XEN_INTERFACE_VERSION greater than or equal to this value. This extension first appeared in the FreeBSD Operating System. Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/io/blkif.h --- a/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:03:01 2012 -0700 @@ -145,6 +145,32 @@ * The maximum supported size of the request ring buffer in units of * machine pages. The value must be a power of 2. * + * max-requests <uint32_t> + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) + * + * The maximum number of concurrent, logical requests that will be + * issued by the backend. + * + * Note: A logical request may span multiple ring entries. + * + * max-request-segments + * Values: <uint8_t> + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK + * Maximum Value: BLKIF_MAX_SEGMENTS_PER_REQUEST + * + * The maximum value of blkif_request.nr_segments supported by + * the backend. + * + * max-request-size + * Values: <uint32_t> + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK * PAGE_SIZE + * Maximum Value: BLKIF_MAX_SEGMENTS_PER_REQUEST * PAGE_SIZE + * + * The maximum amount of data, in bytes, that can be referenced by a + * request type that accesses frontend memory (currently BLKIF_OP_READ, + * BLKIF_OP_WRITE, or BLKIF_OP_WRITE_BARRIER). + * *------------------------- Backend Device Properties ------------------------- * * discard-aligment @@ -242,6 +268,33 @@ * The size of the frontend allocated request ring buffer in units of * machine pages. The value must be a power of 2. * + * max-requests + * Values: <uint32_t> + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) + * + * The maximum number of concurrent, logical requests that will be + * issued by the frontend. + * + * Note: A logical request may span multiple ring entries. + * + * max-request-segments + * Values: <uint8_t> + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK + * Maximum Value: MIN(255, backend/max-request-segments) + * + * The maximum value the frontend will set in the + * blkif_request.nr_segments field. + * + * max-request-size + * Values: <uint32_t> + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK * PAGE_SIZE + * Maximum Value: max-request-segments * PAGE_SIZE + * + * The maximum amount of data, in bytes, that can be referenced by + * a request type that accesses frontend memory (currently BLKIF_OP_READ, + * BLKIF_OP_WRITE, or BLKIF_OP_WRITE_BARRIER). + * *------------------------- Virtual Device Properties ------------------------- * * device-type @@ -403,11 +456,28 @@ #define BLKIF_OP_DISCARD 5 /* - * Maximum scatter/gather segments per request. + * Maximum scatter/gather segments associated with a request header block. * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE. * NB. This could be 12 if the ring indexes weren''t stored in the same page. */ -#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11 +#define BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK 11 + +/* + * Maximum scatter/gather segments associated with a segment block. + */ +#define BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK 14 + +#if __XEN_INTERFACE_VERSION__ >= 0x00040201 +/* + * Maximum scatter/gather segments per request (header + segment blocks). + */ +#define BLKIF_MAX_SEGMENTS_PER_REQUEST 255 +#else +/* + * Maximum scatter/gather segments per request (header block only). + */ +#define BLKIF_MAX_SEGMENTS_PER_REQUEST BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK +#endif /* * NB. first_sect and last_sect in blkif_request_segment, as well as @@ -422,9 +492,25 @@ struct blkif_request_segment { /* @last_sect: last sector in frame to transfer (inclusive). */ uint8_t first_sect, last_sect; }; +typedef struct blkif_request_segment blkif_request_segment_t; /* * Starting ring element for any I/O request. + * + * One or more segment blocks can be inserted into the request ring + * just after a blkif_request_t, allowing requests to operate on + * up to BLKIF_MAX_SEGMENTS_PER_REQUEST. + * + * BLKIF_SEGS_TO_BLOCKS() can be used on blkif_requst.nr_segments + * to determine the number of contiguous ring entries associated + * with this request. + * + * Note: Due to the way Xen request rings operate, the producer and + * consumer indices of the ring must be incremented by the + * BLKIF_SEGS_TO_BLOCKS() value of the associated request. + * (e.g. a response to a 3 ring entry request must also consume + * 3 entries in the ring, even though only the first ring entry + * in the response has any data.) */ struct blkif_request { uint8_t operation; /* BLKIF_OP_??? */ @@ -432,11 +518,22 @@ struct blkif_request { blkif_vdev_t handle; /* only for read/write requests */ uint64_t id; /* private guest value, echoed in resp */ blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */ - struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; + blkif_request_segment_t seg[BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK]; }; typedef struct blkif_request blkif_request_t; /* + * A segment block is a ring request structure that contains only + * segment data. + * + * sizeof(struct blkif_segment_block) <= sizeof(struct blkif_request) + */ +struct blkif_segment_block { + blkif_request_segment_t seg[BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK]; +}; +typedef struct blkif_segment_block blkif_segment_block_t; + +/* * Cast to this structure when blkif_request.operation == BLKIF_OP_DISCARD * sizeof(struct blkif_request_discard) <= sizeof(struct blkif_request) */ @@ -473,6 +570,21 @@ typedef struct blkif_response blkif_resp */ DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response); +/* + * Index to, and treat as a segment block, an entry in the ring. + */ +#define BLKRING_GET_SEG_BLOCK(_r, _idx) \ + (((blkif_segment_block_t *)RING_GET_REQUEST(_r, _idx))->seg) + +/* + * The number of ring request blocks required to handle an I/O + * request containing _segs segments. + */ +#define BLKIF_SEGS_TO_BLOCKS(_segs) \ + ((((_segs - BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK) \ + + (BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK - 1)) \ + / BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK) + /*header_block*/1) + #define VDISK_CDROM 0x1 #define VDISK_REMOVABLE 0x2 #define VDISK_READONLY 0x4 diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/xen-compat.h --- a/xen/include/public/xen-compat.h Mon Feb 20 11:02:53 2012 -0700 +++ b/xen/include/public/xen-compat.h Mon Feb 20 11:03:01 2012 -0700 @@ -27,7 +27,7 @@ #ifndef __XEN_PUBLIC_XEN_COMPAT_H__ #define __XEN_PUBLIC_XEN_COMPAT_H__ -#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040200 +#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040201 #if defined(__XEN__) || defined(__XEN_TOOLS__) /* Xen is built with matching headers and implements the latest interface. */
Konrad Rzeszutek Wilk
2012-Feb-21 14:02 UTC
Re: [PATCH 0 of 4 v3] blkif.h: Document protocol and existing extensions
On Mon, Feb 20, 2012 at 11:07:21AM -0700, Justin T. Gibbs wrote:> This patch series attempts to document the blkif PV interface and > the various extensions to it that are out in the wild. > > Changes in v3: > > patch 3 (blkif.h: Document the RedHat and Citrix blkif multi-page > ring extensions.):It is called ''Red Hat'', not RedHat.> o Mark the RedHat/Amazon EC2 ring extension nodes as deprecated. While > there is nothing fundamentally wrong with the method used there, it > was not publicly documented (not even in patch form), and differs from > the "page-order" large ring XenBus node convention used for other > drivers. > > patch 4 (blkif.h: Define and document the request number/size/segments > extension.) > o Fix typo "max-ring_pages" -> "max-ring-pages". > > Changes in v2: > > patch 2 (blkif.h: Provide more complete documentation of the blkif interface.): > o Mark backend device identification section as private to the > backend driver. > o Refer to docs/misc/vbd-interface.txt for the format of the > virtual-device front-end node. > o Correct field size for the virtual-device front-end node. > > patch 3 (blkif.h: Document the RedHat and Citrix blkif multi-page > ring extensions.): > o Correct node names for the RedHat/Amazon multi-ring extension. The > previous patch mistakenly documented the now defunct FreeBSD extension. > o Clarify the note on multi-page ring interoperability to indicate that > identical ring parameters should be published to the XenStore for > all supported schemes. > > v1 patch 4 (Deleted): > o Remove patch that added the XEN_*_MAJOR definitions. > > patch 4 (blkif.h: Define and document the request number/size/segments > extension.) > o Bump __XEN_LATEST_INTERFACE_VERSION to 0x00040201 and use this version > to guard the change to BLKIF_MAX_SEGMENTS_PER_REQUEST. > > -- > Justin > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2012-Feb-21 14:10 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Mon, Feb 20, 2012 at 11:07:23AM -0700, Justin T. Gibbs wrote:> o Document the XenBus nodes used in this protocol. > o Add a state diagram illustrating the roles and responsibilities > of both the front and backend during startup. > o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. > > No functional changes.Some comments below.> > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > > diff -r 28137a4e39a3 -r e79902456819 xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > @@ -22,6 +22,7 @@ > * DEALINGS IN THE SOFTWARE. > * > * Copyright (c) 2003-2004, Keir Fraser > + * Copyright (c) 2012, Spectra Logic Corporation > */ > > #ifndef __XEN_PUBLIC_IO_BLKIF_H__ > @@ -48,32 +49,253 @@ > #define blkif_sector_t uint64_t > > /* > + * Feature and Parameter Negotiation > + * ================================> + * The two halves of a Xen block driver utilize nodes within the XenStore to > + * communicate capabilities and to negotiate operating parameters. This > + * section enumerates these nodes which reside in the respective front and > + * backend portions of the XenStore, following the XenBus convention. > + * > + * All data in the XenStore is stored as strings. Nodes specifying numeric > + * values are encoded in decimal. Integer value ranges listed below are > + * expressed as fixed sized integer types capable of storing the conversion > + * of a properly formated node string, without loss of information. > + * > + * Any specified default value is in effect if the corresponding XenBus node > + * is not present in the XenStore. > + * > + * XenStore nodes in sections marked "PRIVATE" are solely for use by the > + * driver side whose XenBus tree contains them. > + * > + * See the XenBus state transition diagram below for details on when XenBus > + * nodes must be published and when they can be queried. > + * > + ***************************************************************************** > + * Backend XenBus Nodes > + ***************************************************************************** > + * > + *------------------ Backend Device Identification (PRIVATE) ------------------ > + * > + * mode > + * Values: "r" (read only), "w" (writable) > + * > + * The read or write access permissions to the backing store to be > + * granted to the frontend. > + * > + * params > + * Values: string > + * > + * A free formatted string providing sufficient information for the > + * backend driver to open the backing device. (e.g. the path to the > + * file or block device representing the backing store.) > + * > + * type > + * Values: "file", "phy", "tap" > + * > + * The type of the backing device/object. > + * > + *--------------------------------- Features --------------------------------- > + * > + * feature-barrier > + * Values: 0/1 (boolean) > + * Default Value: 0 > + * > + * A value of "1" indicates that the backend can process requests > + * containing the BLKIF_OP_WRITE_BARRIER request opcode. Requests > + * of this type may still be returned at any time with the > + * BLKIF_RSP_EOPNOTSUPP result code. > + * > + * feature-flush-cache > + * Values: 0/1 (boolean) > + * Default Value: 0 > + * > + * A value of "1" indicates that the backend can process requests > + * containing the BLKIF_OP_FLUSH_DISKCACHE request opcode. Requests > + * of this type may still be returned at any time with the > + * BLKIF_RSP_EOPNOTSUPP result code. > + * > + * feature-discard > + * Values: 0/1 (boolean) > + * Default Value: 0 > + * > + * A value of "1" indicates that the backend can process requests > + * containing the BLKIF_OP_DISCARD request opcode. Requests > + * of this type may still be returned at any time with the > + * BLKIF_RSP_EOPNOTSUPP result code. > + * > + *------------------------- Backend Device Properties ------------------------- > + * > + * discard-aligment > + * Values: <uint32_t> > + * Default Value: 0 > + * Notes: 1, 2 > + * > + * The offset, in bytes from the beginning of the virtual block device, > + * to the first, addressable, discard extent on the underlying device.You are also missing the rest of the information about it. Please include the details that the previous comment had.> + * > + * discard-granularity > + * Values: <uint32_t> > + * Default Value: <"sector-size"> > + * Notes: 1 > + * > + * The size, in bytes, of the individually addressable discard extents > + * of the underlying device.Please include more data - the old comment had more contents.> + * > + * discard-secure > + * Values: 0/1 (boolean) > + * Default Value: 0 > + * > + * A value of "1" indicates that the backend can process BLKIF_OP_DISCARD > + * requests with the BLKIF_DISCARD_SECURE flag set.That is not very specific to what this does. It just says X will do Y.> + * > + * info > + * Values: <uint32_t> (bitmap) > + * > + * A collection of bit flags describing attributes of the backing > + * device. The VDISK_* macros define the meaning of each bit > + * location. > + * > + * sector-size > + * Values: <uint32_t> > + * > + * The native sector size, in bytes, of the backend device. > + * > + * sectors > + * Values: <uint64_t> > + * > + * The size of the backend device, expressed in units of its native > + * sector size ("sector-size"). > + * > + ***************************************************************************** > + * Frontend XenBus Nodes > + ***************************************************************************** > + * > + *----------------------- Request Transport Parameters ----------------------- > + * > + * event-channel > + * Values: <uint32_t> > + * > + * The identifier of the Xen event channel used to signal activity > + * in the ring buffer. > + * > + * ring-ref > + * Values: <uint32_t> > + * > + * The Xen grant reference granting permission for the backend to map > + * the sole page in a single page sized ring buffer. > + * > + * protocol > + * Values: string (XEN_IO_PROTO_ABI_*) > + * Default Value: XEN_IO_PROTO_ABI_NATIVE > + * > + * The machine ABI rules governing the format of all ring request and > + * response structures. > + * > + *------------------------- Virtual Device Properties ------------------------- > + * > + * device-type > + * Values: "disk", "cdrom", "floppy", etc. > + * > + * virtual-device > + * Values: <uint32_t> > + * > + * A value indicating the physical device to virtualize within the > + * frontend''s domain. (e.g. "The first ATA disk", "The third SCSI > + * disk", etc.) > + * > + * See docs/misc/vbd-interface.txt for details on the format of this > + * value. > + * > + * Notes > + * ----- > + * (1) Devices that support discard functionality may internally allocate > + * space (discardable extents) in units that are larger than the > + * exported logical block size. > + * (2) The discard-alignment parameter allows a physical device to be > + * partitioned into virtual devices that do not necessarily begin or > + * end on a discardable extent boundary. > + */ > + > +/* > + * STATE DIAGRAMS > + * > + ***************************************************************************** > + * Startup * > + ***************************************************************************** > + * > + * Tool stack creates front and back nodes with state XenbusStateInitialising. > + * > + * Front Back > + * ================================= ====================================> + * XenbusStateInitialising XenbusStateInitialising > + * o Query virtual device o Query backend device identification > + * properties. data. > + * o Setup OS device instance. o Open and validate backend device. > + * o Publish backend features. > + * | > + * | > + * V > + * XenbusStateInitWait > + * > + * o Query backend features. > + * o Allocate and initialize the > + * request ring. > + * | > + * | > + * V > + * XenbusStateInitialised > + * > + * o Connect to the request ring and > + * event channel. > + * o Publish backend device properties. > + * | > + * | > + * V > + * XenbusStateConnected > + * > + * o Query backend device properties. > + * o Finalize OS virtual device > + * instance. > + * | > + * | > + * V > + * XenbusStateConnected > + * > + * Note: Drivers that do not support any optional features can skip certain > + * states in the state machine: > + * > + * o A frontend may transition to XenbusStateInitialised without > + * waiting for the backend to enter XenbusStateInitWait. > + * > + * o A backend may transition to XenbusStateInitialised, bypassing > + * XenbusStateInitWait, without waiting for the frontend to first > + * enter the XenbusStateInitialised state. > + * > + * Drivers that support optional features must tolerate these additional > + * state transition paths. In general this means performing the work of > + * any skipped state transition, if it has not already been performed, > + * in addition to the work associated with entry into the current state. > + */ > + > +/* > * REQUEST CODES. > */ > #define BLKIF_OP_READ 0 > #define BLKIF_OP_WRITE 1 > /* > - * Recognised only if "feature-barrier" is present in backend xenbus info. > - * The "feature-barrier" node contains a boolean indicating whether barrier > - * requests are likely to succeed or fail. Either way, a barrier request > - * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by > - * the underlying block-device hardware. The boolean simply indicates whether > - * or not it is worthwhile for the frontend to attempt barrier requests. > - * If a backend does not recognise BLKIF_OP_WRITE_BARRIER, it should *not* > - * create the "feature-barrier" node! > + * All writes issued prior to a request with the BLKIF_OP_WRITE_BARRIER > + * operation code ("barrier request") must be completed prior to the > + * execution of the barrier request. All writes issued after the barrier > + * request must not execute until after the completion of the barrier request. > + * > + * Optional. See "feature-barrier" XenBus node documentation above. > */ > #define BLKIF_OP_WRITE_BARRIER 2 > /* > - * Recognised if "feature-flush-cache" is present in backend xenbus > - * info. A flush will ask the underlying storage hardware to flush its > - * non-volatile caches as appropriate. The "feature-flush-cache" node > - * contains a boolean indicating whether flush requests are likely to > - * succeed or fail. Either way, a flush request may fail at any time > - * with BLKIF_RSP_EOPNOTSUPP if it is unsupported by the underlying > - * block-device hardware. The boolean simply indicates whether or not it > - * is worthwhile for the frontend to attempt flushes. If a backend does > - * not recognise BLKIF_OP_WRITE_FLUSH_CACHE, it should *not* create the > - * "feature-flush-cache" node! > + * Commit any uncommitted contents of the backing device''s volatile cache > + * to stable storage. > + * > + * Optional. See "feature-flush-cache" XenBus node documentation above. > */ > #define BLKIF_OP_FLUSH_DISKCACHE 3 > /* > @@ -82,47 +304,24 @@ > */ > #define BLKIF_OP_RESERVED_1 4 > /* > - * Recognised only if "feature-discard" is present in backend xenbus info. > - * The "feature-discard" node contains a boolean indicating whether trim > - * (ATA) or unmap (SCSI) - conviently called discard requests are likely > - * to succeed or fail. Either way, a discard request > - * may fail at any time with BLKIF_RSP_EOPNOTSUPP if it is unsupported by > - * the underlying block-device hardware. The boolean simply indicates whether > - * or not it is worthwhile for the frontend to attempt discard requests. > - * If a backend does not recognise BLKIF_OP_DISCARD, it should *not* > - * create the "feature-discard" node! > + * Indicate to the backend device that a region of storage is no longer in > + * use, and may be discarded at any time without impact to the client. If > + * the BLKIF_DISCARD_SECURE flag is set on the request, all copies of the > + * discarded region on the device must be rendered unrecoverable before the > + * command returns. > * > - * Discard operation is a request for the underlying block device to mark > - * extents to be erased. However, discard does not guarantee that the blocks > - * will be erased from the device - it is just a hint to the device > - * controller that these blocks are no longer in use. What the device > - * controller does with that information is left to the controller. > - * Discard operations are passed with sector_number as the > - * sector index to begin discard operations at and nr_sectors as the number of > - * sectors to be discarded. The specified sectors should be discarded if the > - * underlying block device supports trim (ATA) or unmap (SCSI) operations, > - * or a BLKIF_RSP_EOPNOTSUPP should be returned. > - * More information about trim/unmap operations at: > + * This operation is analogous to performing a trim (ATA) or unamp (SCSI),unmap.> + * command on a native device.And you might want to mention what the previous comment did - that this is a hint to the backend.> + * > + * More information about trim/unmap operations can be found at: > * http://t13.org/Documents/UploadedDocuments/docs2008/ > * e07154r6-Data_Set_Management_Proposal_for_ATA-ACS2.doc > * http://www.seagate.com/staticfiles/support/disc/manuals/ > * Interface%20manuals/100293068c.pdf > - * The backend can optionally provide these extra XenBus attributes to > - * further optimize the discard functionality: > - * ''discard-aligment'' - Devices that support discard functionality may > - * internally allocate space in units that are bigger than the exported > - * logical block size. The discard-alignment parameter indicates how many bytes > - * the beginning of the partition is offset from the internal allocation unit''s > - * natural alignment. Do not confuse this with natural disk alignment offset. > - * ''discard-granularity'' - Devices that support discard functionality may > - * internally allocate space using units that are bigger than the logical block > - * size. The discard-granularity parameter indicates the size of the internal > - * allocation unit in bytes if reported by the device. Otherwise the > - * discard-granularity will be set to match the device''s physical block size. > - * It is the minimum size you can discard. > - * ''discard-secure'' - All copies of the discarded sectors (potentially created > - * by garbage collection) must also be erased. To use this feature, the flag > - * BLKIF_DISCARD_SECURE must be set in the blkif_request_discard. > + * > + * Optional. See "feature-discard", "discard-alignment", > + * "discard-granularity", and "discard-secure" in the XenBus node > + * documentation above. > */ > #define BLKIF_OP_DISCARD 5 > > @@ -147,6 +346,9 @@ struct blkif_request_segment { > uint8_t first_sect, last_sect; > }; > > +/* > + * Starting ring element for any I/O request. > + */ > struct blkif_request { > uint8_t operation; /* BLKIF_OP_??? */ > uint8_t nr_segments; /* number of segments */ > @@ -158,7 +360,7 @@ struct blkif_request { > typedef struct blkif_request blkif_request_t; > > /* > - * Cast to this structure when blkif_request.operation == BLKIF_OP_TRIM > + * Cast to this structure when blkif_request.operation == BLKIF_OP_DISCARD > * sizeof(struct blkif_request_discard) <= sizeof(struct blkif_request) > */ > struct blkif_request_discard { > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2012-Feb-21 14:10 UTC
Re: [PATCH 1 of 4 v3] blkif.h: Miscelaneous style fixes
On Mon, Feb 20, 2012 at 11:07:22AM -0700, Justin T. Gibbs wrote:> o Remove trailing whitespace. > o Remove a blank line so that a comment block is adjacent to the > code it documents. > > No functional changes. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com>OK, looks good.> > diff -r 87218bd367be -r 28137a4e39a3 xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Fri Feb 17 12:24:38 2012 +0000 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > @@ -1,8 +1,8 @@ > /****************************************************************************** > * blkif.h > - * > + * > * Unified block-device I/O interface for Xen guest OSes. > - * > + * > * Permission is hereby granted, free of charge, to any person obtaining a copy > * of this software and associated documentation files (the "Software"), to > * deal in the Software without restriction, including without limitation the > @@ -35,7 +35,7 @@ > * notification can be made conditional on req_event (i.e., the generic > * hold-off mechanism provided by the ring macros). Backends must set > * req_event appropriately (e.g., using RING_FINAL_CHECK_FOR_REQUESTS()). > - * > + * > * Back->front notifications: When enqueuing a new response, sending a > * notification can be made conditional on rsp_event (i.e., the generic > * hold-off mechanism provided by the ring macros). Frontends must set > @@ -133,7 +133,7 @@ > */ > #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11 > > -/* > +/* > * NB. first_sect and last_sect in blkif_request_segment, as well as > * sector_number in blkif_request, are always expressed in 512-byte units. > * However they must be properly aligned to the real sector size of the > @@ -192,7 +192,6 @@ typedef struct blkif_response blkif_resp > /* > * Generate blkif ring structures and types. > */ > - > DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response); > > #define VDISK_CDROM 0x1 > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2012-Feb-21 14:12 UTC
Re: [PATCH 3 of 4 v3] blkif.h: Document the RedHat and Citrix blkif multi-page ring extensions
On Mon, Feb 20, 2012 at 11:07:24AM -0700, Justin T. Gibbs wrote:> No functional changes. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > > diff -r e79902456819 -r 09051133e2fe xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 > @@ -67,6 +67,9 @@ > * XenStore nodes in sections marked "PRIVATE" are solely for use by the > * driver side whose XenBus tree contains them. > * > + * XenStore nodes marked "DEPRECATED" in their notes section should only be > + * used to provide interoperability with legacy implementations. > + * > * See the XenBus state transition diagram below for details on when XenBus > * nodes must be published and when they can be queried. > * > @@ -123,12 +126,31 @@ > * of this type may still be returned at any time with the > * BLKIF_RSP_EOPNOTSUPP result code. > * > + *----------------------- Request Transport Parameters ------------------------ > + * > + * max-ring-page-order > + * Values: <uint32_t> > + * Default Value: 0 > + * Notes: 1, 3 > + * > + * The maximum supported size of the request ring buffer in units of > + * lb(machine pages). (e.g. 0 == 1 page, 1 = 2 pages, 2 == 4 pages, > + * etc.). > + * > + * max-ring-pages > + * Values: <uint32_t> > + * Default Value: 1 > + * Notes: DEPRECATED, 2, 3 > + * > + * The maximum supported size of the request ring buffer in units of > + * machine pages. The value must be a power of 2. > + * > *------------------------- Backend Device Properties ------------------------- > * > * discard-aligment > * Values: <uint32_t> > * Default Value: 0 > - * Notes: 1, 2 > + * Notes: 4, 5 > * > * The offset, in bytes from the beginning of the virtual block device, > * to the first, addressable, discard extent on the underlying device. > @@ -136,7 +158,7 @@ > * discard-granularity > * Values: <uint32_t> > * Default Value: <"sector-size"> > - * Notes: 1 > + * Notes: 4 > * > * The size, in bytes, of the individually addressable discard extents > * of the underlying device. > @@ -180,10 +202,20 @@ > * > * ring-ref > * Values: <uint32_t> > + * Notes: 6 > * > * The Xen grant reference granting permission for the backend to map > * the sole page in a single page sized ring buffer. > * > + * ring-ref%u > + * Values: <uint32_t> > + * Notes: 6 > + * > + * For a frontend providing a multi-page ring, a "number of ring pages" > + * sized list of nodes, each containing a Xen grant reference granting > + * permission for the backend to map the page of the ring located > + * at page index "%u". Page indexes are zero based. > + * > * protocol > * Values: string (XEN_IO_PROTO_ABI_*) > * Default Value: XEN_IO_PROTO_ABI_NATIVE > @@ -191,6 +223,25 @@ > * The machine ABI rules governing the format of all ring request and > * response structures. > * > + * ring-page-order > + * Values: <uint32_t> > + * Default Value: 0 > + * Maximum Value: MAX(ffs(max-ring-pages) - 1, max-ring-page-order) > + * Notes: 1, 3 > + * > + * The size of the frontend allocated request ring buffer in units > + * of lb(machine pages). (e.g. 0 == 1 page, 1 = 2 pages, 2 == 4 pages, > + * etc.). > + * > + * num-ring-pages > + * Values: <uint32_t> > + * Default Value: 1 > + * Maximum Value: MAX(max-ring-pages,(0x1 << max-ring-page-order)) > + * Notes: DEPRECATED, 2, 3 > + * > + * The size of the frontend allocated request ring buffer in units of > + * machine pages. The value must be a power of 2. > + * > *------------------------- Virtual Device Properties ------------------------- > * > * device-type > @@ -208,12 +259,26 @@ > * > * Notes > * ----- > - * (1) Devices that support discard functionality may internally allocate > + * (1) Multi-page ring buffer scheme first developed in the Citrix XenServer > + * PV drivers. > + * (2) Multi-page ring buffer scheme first used in some RedHat distributionsRed Hat.> + * including a distribution deployed on certain nodes of the Amazon > + * EC2 cluster. > + * (3) Support for multi-page ring buffers was implemented independently, > + * in slightly different forms, by both Citrix and RedHat/Amazon.Red Hat.> + * For full interoperability, block front and backends should publish > + * identical ring parameters, adjusted for unit differences, to the > + * XenStore nodes used in both schemes. > + * (4) Devices that support discard functionality may internally allocate > * space (discardable extents) in units that are larger than the > * exported logical block size. > - * (2) The discard-alignment parameter allows a physical device to be > + * (5) The discard-alignment parameter allows a physical device to be > * partitioned into virtual devices that do not necessarily begin or > * end on a discardable extent boundary. > + * (6) When there is only a single page allocated to the request ring, > + * ''ring-ref'' is used to communicate the grant reference for this > + * page to the backend. When using a multi-page ring, the ''ring-ref'' > + * node is not created. Instead ''ring-ref0'' - ''ring-refN'' are used. > */ > > /* > @@ -231,20 +296,26 @@ > * o Query virtual device o Query backend device identification > * properties. data. > * o Setup OS device instance. o Open and validate backend device. > - * o Publish backend features. > + * o Publish backend features and > + * transport parameters. > * | > * | > * V > * XenbusStateInitWait > * > - * o Query backend features. > + * o Query backend features and > + * transport parameters. > * o Allocate and initialize the > * request ring. > + * o Publish transport parameters > + * that will be in effect during > + * this connection. > * | > * | > * V > * XenbusStateInitialised > * > + * o Query frontend transport parameters. > * o Connect to the request ring and > * event channel. > * o Publish backend device properties. > @@ -261,20 +332,26 @@ > * V > * XenbusStateConnected > * > - * Note: Drivers that do not support any optional features can skip certain > - * states in the state machine: > + * Note: Drivers that do not support any optional features, or the negotiation > + * of transport parameters, can skip certain states in the state machine: > * > * o A frontend may transition to XenbusStateInitialised without > - * waiting for the backend to enter XenbusStateInitWait. > + * waiting for the backend to enter XenbusStateInitWait. In this > + * case, default transport parameters are in effect and any > + * transport parameters published by the frontend must contain > + * their default values. > * > * o A backend may transition to XenbusStateInitialised, bypassing > * XenbusStateInitWait, without waiting for the frontend to first > - * enter the XenbusStateInitialised state. > + * enter the XenbusStateInitialised state. In this case, default > + * transport parameters are in effect and any transport parameters > + * published by the backend must contain their default values. > * > - * Drivers that support optional features must tolerate these additional > - * state transition paths. In general this means performing the work of > - * any skipped state transition, if it has not already been performed, > - * in addition to the work associated with entry into the current state. > + * Drivers that support optional features and/or transport parameter > + * negotiation must tolerate these additional state transition paths. > + * In general this means performing the work of any skipped state > + * transition, if it has not already been performed, in addition to the > + * work associated with entry into the current state. > */ > > /* > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xensource.com/xen-devel
Ian Campbell
2012-Feb-21 14:15 UTC
Re: [PATCH 1 of 4 v3] blkif.h: Miscelaneous style fixes
On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote:> o Remove trailing whitespace. > o Remove a blank line so that a comment block is adjacent to the > code it documents. > > No functional changes. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com>Acked-by: Ian Campbell <ian.campbell@citrix.com>> > diff -r 87218bd367be -r 28137a4e39a3 xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Fri Feb 17 12:24:38 2012 +0000 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > @@ -1,8 +1,8 @@ > /****************************************************************************** > * blkif.h > - * > + * > * Unified block-device I/O interface for Xen guest OSes. > - * > + * > * Permission is hereby granted, free of charge, to any person obtaining a copy > * of this software and associated documentation files (the "Software"), to > * deal in the Software without restriction, including without limitation the > @@ -35,7 +35,7 @@ > * notification can be made conditional on req_event (i.e., the generic > * hold-off mechanism provided by the ring macros). Backends must set > * req_event appropriately (e.g., using RING_FINAL_CHECK_FOR_REQUESTS()). > - * > + * > * Back->front notifications: When enqueuing a new response, sending a > * notification can be made conditional on rsp_event (i.e., the generic > * hold-off mechanism provided by the ring macros). Frontends must set > @@ -133,7 +133,7 @@ > */ > #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11 > > -/* > +/* > * NB. first_sect and last_sect in blkif_request_segment, as well as > * sector_number in blkif_request, are always expressed in 512-byte units. > * However they must be properly aligned to the real sector size of the > @@ -192,7 +192,6 @@ typedef struct blkif_response blkif_resp > /* > * Generate blkif ring structures and types. > */ > - > DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response); > > #define VDISK_CDROM 0x1 > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2012-Feb-21 14:22 UTC
Re: [PATCH 4 of 4 v3] blkif.h: Define and document the request number/size/segments extension
On Mon, Feb 20, 2012 at 11:07:25AM -0700, Justin T. Gibbs wrote:> Note: As of __XEN_INTERFACE_VERSION__ 0x00040201 the definition of > BLKIF_MAX_SEGMENTS_PER_REQUEST has changed. Drivers must be updated > to, at minimum, use BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, before being > recompiled with a __XEN_INTERFACE_VERSION greater than or equal to > this value. > > This extension first appeared in the FreeBSD Operating System. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > > diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:03:01 2012 -0700 > @@ -145,6 +145,32 @@ > * The maximum supported size of the request ring buffer in units of > * machine pages. The value must be a power of 2. > * > + * max-requests <uint32_t> > + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) > + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) > + * > + * The maximum number of concurrent, logical requests that will be > + * issued by the backend.Don''t you mean responses?> + * > + * Note: A logical request may span multiple ring entries. > + * > + * max-request-segments > + * Values: <uint8_t> > + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK > + * Maximum Value: BLKIF_MAX_SEGMENTS_PER_REQUEST > + * > + * The maximum value of blkif_request.nr_segments supported by > + * the backend. > + * > + * max-request-size > + * Values: <uint32_t> > + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK * PAGE_SIZE > + * Maximum Value: BLKIF_MAX_SEGMENTS_PER_REQUEST * PAGE_SIZE > + * > + * The maximum amount of data, in bytes, that can be referenced by a > + * request type that accesses frontend memory (currently BLKIF_OP_READ, > + * BLKIF_OP_WRITE, or BLKIF_OP_WRITE_BARRIER).So just to make sure my math is correct. That would be 1MB right?> + * > *------------------------- Backend Device Properties ------------------------- > * > * discard-aligment > @@ -242,6 +268,33 @@ > * The size of the frontend allocated request ring buffer in units of > * machine pages. The value must be a power of 2. > * > + * max-requests > + * Values: <uint32_t> > + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) > + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) > + * > + * The maximum number of concurrent, logical requests that will be > + * issued by the frontend. > + * > + * Note: A logical request may span multiple ring entries. > + * > + * max-request-segments > + * Values: <uint8_t> > + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK > + * Maximum Value: MIN(255, backend/max-request-segments)Um, that looks like its referencing itself? This is the backend section.> + * > + * The maximum value the frontend will set in the > + * blkif_request.nr_segments field. > + * > + * max-request-size > + * Values: <uint32_t> > + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK * PAGE_SIZE > + * Maximum Value: max-request-segments * PAGE_SIZE > + * > + * The maximum amount of data, in bytes, that can be referenced by > + * a request type that accesses frontend memory (currently BLKIF_OP_READ, > + * BLKIF_OP_WRITE, or BLKIF_OP_WRITE_BARRIER). > + * > *------------------------- Virtual Device Properties ------------------------- > * > * device-type > @@ -403,11 +456,28 @@ > #define BLKIF_OP_DISCARD 5 > > /* > - * Maximum scatter/gather segments per request. > + * Maximum scatter/gather segments associated with a request header block. > * This is carefully chosen so that sizeof(blkif_ring_t) <= PAGE_SIZE. > * NB. This could be 12 if the ring indexes weren''t stored in the same page. > */ > -#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11 > +#define BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK 11 > + > +/* > + * Maximum scatter/gather segments associated with a segment block. > + */ > +#define BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK 14 > + > +#if __XEN_INTERFACE_VERSION__ >= 0x00040201 > +/* > + * Maximum scatter/gather segments per request (header + segment blocks). > + */ > +#define BLKIF_MAX_SEGMENTS_PER_REQUEST 255 > +#else > +/* > + * Maximum scatter/gather segments per request (header block only). > + */ > +#define BLKIF_MAX_SEGMENTS_PER_REQUEST BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK > +#endif > > /* > * NB. first_sect and last_sect in blkif_request_segment, as well as > @@ -422,9 +492,25 @@ struct blkif_request_segment { > /* @last_sect: last sector in frame to transfer (inclusive). */ > uint8_t first_sect, last_sect; > }; > +typedef struct blkif_request_segment blkif_request_segment_t; > > /* > * Starting ring element for any I/O request. > + * > + * One or more segment blocks can be inserted into the request ring > + * just after a blkif_request_t, allowing requests to operate on > + * up to BLKIF_MAX_SEGMENTS_PER_REQUEST. > + * > + * BLKIF_SEGS_TO_BLOCKS() can be used on blkif_requst.nr_segments > + * to determine the number of contiguous ring entries associated > + * with this request. > + * > + * Note: Due to the way Xen request rings operate, the producer and > + * consumer indices of the ring must be incremented by the > + * BLKIF_SEGS_TO_BLOCKS() value of the associated request. > + * (e.g. a response to a 3 ring entry request must also consume > + * 3 entries in the ring, even though only the first ring entry > + * in the response has any data.) > */ > struct blkif_request { > uint8_t operation; /* BLKIF_OP_??? */ > @@ -432,11 +518,22 @@ struct blkif_request { > blkif_vdev_t handle; /* only for read/write requests */ > uint64_t id; /* private guest value, echoed in resp */ > blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */ > - struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > + blkif_request_segment_t seg[BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK]; > }; > typedef struct blkif_request blkif_request_t; > > /* > + * A segment block is a ring request structure that contains only > + * segment data. > + * > + * sizeof(struct blkif_segment_block) <= sizeof(struct blkif_request) > + */ > +struct blkif_segment_block { > + blkif_request_segment_t seg[BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK]; > +}; > +typedef struct blkif_segment_block blkif_segment_block_t; > + > +/* > * Cast to this structure when blkif_request.operation == BLKIF_OP_DISCARD > * sizeof(struct blkif_request_discard) <= sizeof(struct blkif_request) > */ > @@ -473,6 +570,21 @@ typedef struct blkif_response blkif_resp > */ > DEFINE_RING_TYPES(blkif, struct blkif_request, struct blkif_response); > > +/* > + * Index to, and treat as a segment block, an entry in the ring. > + */ > +#define BLKRING_GET_SEG_BLOCK(_r, _idx) \ > + (((blkif_segment_block_t *)RING_GET_REQUEST(_r, _idx))->seg) > + > +/* > + * The number of ring request blocks required to handle an I/O > + * request containing _segs segments. > + */ > +#define BLKIF_SEGS_TO_BLOCKS(_segs) \ > + ((((_segs - BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK) \ > + + (BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK - 1)) \ > + / BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK) + /*header_block*/1) > + > #define VDISK_CDROM 0x1 > #define VDISK_REMOVABLE 0x2 > #define VDISK_READONLY 0x4 > diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/xen-compat.h > --- a/xen/include/public/xen-compat.h Mon Feb 20 11:02:53 2012 -0700 > +++ b/xen/include/public/xen-compat.h Mon Feb 20 11:03:01 2012 -0700 > @@ -27,7 +27,7 @@ > #ifndef __XEN_PUBLIC_XEN_COMPAT_H__ > #define __XEN_PUBLIC_XEN_COMPAT_H__ > > -#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040200 > +#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040201 > > #if defined(__XEN__) || defined(__XEN_TOOLS__) > /* Xen is built with matching headers and implements the latest interface. */ > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xensource.com/xen-devel
Ian Campbell
2012-Feb-21 14:27 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote:> o Document the XenBus nodes used in this protocol. > o Add a state diagram illustrating the roles and responsibilities > of both the front and backend during startup. > o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. > > No functional changes. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com>Acked-by: Ian Campbell <ian.campbell@citrix.com> I''ve made some very minor comments below but I think getting the basic documentation of this stuff in as a baseline to improve and correct over time is more important than any of them. Thanks again for doing this -- it is really valuable!> diff -r 28137a4e39a3 -r e79902456819 xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 10:48:09 2012 -0700 > @@ -22,6 +22,7 @@ > * DEALINGS IN THE SOFTWARE. > * > * Copyright (c) 2003-2004, Keir Fraser > + * Copyright (c) 2012, Spectra Logic Corporation > */ > > #ifndef __XEN_PUBLIC_IO_BLKIF_H__ > @@ -48,32 +49,253 @@ > #define blkif_sector_t uint64_t > > /* > + * Feature and Parameter Negotiation > + * ================================> + * The two halves of a Xen block driver utilize nodes within the XenStore to > + * communicate capabilities and to negotiate operating parameters. This > + * section enumerates these nodes which reside in the respective front and > + * backend portions of the XenStore, following the XenBus convention. > + * > + * All data in the XenStore is stored as strings. Nodes specifying numeric > + * values are encoded in decimal. Integer value ranges listed below are > + * expressed as fixed sized integer types capable of storing the conversion > + * of a properly formated node string, without loss of information.formatted> + * > + * Any specified default value is in effect if the corresponding XenBus node > + * is not present in the XenStore. > + * > + * XenStore nodes in sections marked "PRIVATE" are solely for use by the > + * driver side whose XenBus tree contains them. > + * > + * See the XenBus state transition diagram below for details on when XenBus > + * nodes must be published and when they can be queried. > + * > + ***************************************************************************** > + * Backend XenBus Nodes > + ***************************************************************************** > + * > + *------------------ Backend Device Identification (PRIVATE) ------------------ > + * > + * mode > + * Values: "r" (read only), "w" (writable) > + * > + * The read or write access permissions to the backing store to be > + * granted to the frontend. > + * > + * params > + * Values: string > + * > + * A free formatted string providing sufficient information for the > + * backend driver to open the backing device. (e.g. the path to the > + * file or block device representing the backing store.)The syntax and semantics of params is defined by the particular backend, rather than being "free formatted" as such. I think it would be worth saying that explicitly. Ian.
Ian Campbell
2012-Feb-21 14:32 UTC
Re: [PATCH 3 of 4 v3] blkif.h: Document the RedHat and Citrix blkif multi-page ring extensions
On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote:> No functional changes. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com>Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell
2012-Feb-21 14:38 UTC
Re: [PATCH 4 of 4 v3] blkif.h: Define and document the request number/size/segments extension
On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote:> Note: As of __XEN_INTERFACE_VERSION__ 0x00040201 the definition of > BLKIF_MAX_SEGMENTS_PER_REQUEST has changed. Drivers must be updated > to, at minimum, use BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, before being > recompiled with a __XEN_INTERFACE_VERSION greater than or equal to > this value. > > This extension first appeared in the FreeBSD Operating System. > > Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > > diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/io/blkif.h > --- a/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 > +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:03:01 2012 -0700 > @@ -145,6 +145,32 @@ > * The maximum supported size of the request ring buffer in units of > * machine pages. The value must be a power of 2. > * > + * max-requests <uint32_t> > + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) > + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) > + * > + * The maximum number of concurrent, logical requests that will be > + * issued by the backend.by "issued" do you mean the maximum number that the backend is willing to consume/have-in-flight at once? I guess that means issued to the underlying storage? (backend doesn''t issue requests to the frontend does it? That''s how I first read it, hence my question) Ian.
Justin T. Gibbs
2012-Feb-21 18:03 UTC
Re: [PATCH 4 of 4 v3] blkif.h: Define and document the request number/size/segments extension
On Feb 21, 2012, at 7:38 AM, Ian Campbell wrote:> On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote: >> Note: As of __XEN_INTERFACE_VERSION__ 0x00040201 the definition of >> BLKIF_MAX_SEGMENTS_PER_REQUEST has changed. Drivers must be updated >> to, at minimum, use BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, before being >> recompiled with a __XEN_INTERFACE_VERSION greater than or equal to >> this value. >> >> This extension first appeared in the FreeBSD Operating System. >> >> Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> >> >> diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/io/blkif.h >> --- a/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 >> +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:03:01 2012 -0700 >> @@ -145,6 +145,32 @@ >> * The maximum supported size of the request ring buffer in units of >> * machine pages. The value must be a power of 2. >> * >> + * max-requests <uint32_t> >> + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) >> + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) >> + * >> + * The maximum number of concurrent, logical requests that will be >> + * issued by the backend. > > by "issued" do you mean the maximum number that the backend is willing > to consume/have-in-flight at once? I guess that means issued to the > underlying storage? (backend doesn''t issue requests to the frontend does > it? That''s how I first read it, hence my question) > > Ian. >I should have used the same language as in the other entries. Does this make more sense? * The maximum number of concurrent, logical requests supported by * the backend. -- Justin
Justin T. Gibbs
2012-Feb-21 18:11 UTC
Re: [PATCH 4 of 4 v3] blkif.h: Define and document the request number/size/segments extension
On Feb 21, 2012, at 7:22 AM, Konrad Rzeszutek Wilk wrote:> On Mon, Feb 20, 2012 at 11:07:25AM -0700, Justin T. Gibbs wrote: >> Note: As of __XEN_INTERFACE_VERSION__ 0x00040201 the definition of >> BLKIF_MAX_SEGMENTS_PER_REQUEST has changed. Drivers must be updated >> to, at minimum, use BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, before being >> recompiled with a __XEN_INTERFACE_VERSION greater than or equal to >> this value. >> >> This extension first appeared in the FreeBSD Operating System. >> >> Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> >> >> diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/io/blkif.h >> --- a/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 >> +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:03:01 2012 -0700 >> @@ -145,6 +145,32 @@ >> * The maximum supported size of the request ring buffer in units of >> * machine pages. The value must be a power of 2. >> * >> + * max-requests <uint32_t> >> + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) >> + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) >> + * >> + * The maximum number of concurrent, logical requests that will be >> + * issued by the backend. > > Don''t you mean responses?No, but it is poorly worded. See my reply to Ian C.>> + * max-request-size >> + * Values: <uint32_t> >> + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK * PAGE_SIZE >> + * Maximum Value: BLKIF_MAX_SEGMENTS_PER_REQUEST * PAGE_SIZE >> + * >> + * The maximum amount of data, in bytes, that can be referenced by a >> + * request type that accesses frontend memory (currently BLKIF_OP_READ, >> + * BLKIF_OP_WRITE, or BLKIF_OP_WRITE_BARRIER). > > So just to make sure my math is correct. That would be 1MB right?Assuming a 4K PAGE_SIZE, the maximum I/O possible is 1020KiB or .996MiB. Allowing more segments would require expanding the 8bit nr_segments field, severely complicating backwards compatibility.> >> + * >> *------------------------- Backend Device Properties ------------------------- >> * >> * discard-aligment >> @@ -242,6 +268,33 @@ >> * The size of the frontend allocated request ring buffer in units of >> * machine pages. The value must be a power of 2. >> * >> + * max-requests >> + * Values: <uint32_t> >> + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) >> + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) >> + * >> + * The maximum number of concurrent, logical requests that will be >> + * issued by the frontend. >> + * >> + * Note: A logical request may span multiple ring entries. >> + * >> + * max-request-segments >> + * Values: <uint8_t> >> + * Default Value: BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK >> + * Maximum Value: MIN(255, backend/max-request-segments) > > Um, that looks like its referencing itself? This is the backend section.This is at the end of the frontend section, but the diff doesn''t have enough context to make that obvious. -- Justin
Justin T. Gibbs
2012-Feb-21 18:14 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Feb 21, 2012, at 7:27 AM, Ian Campbell wrote:> On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote: >> o Document the XenBus nodes used in this protocol. >> o Add a state diagram illustrating the roles and responsibilities >> of both the front and backend during startup. >> o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. >> >> No functional changes. >> >> Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > > Acked-by: Ian Campbell <ian.campbell@citrix.com> > > I''ve made some very minor comments below but I think getting the basic > documentation of this stuff in as a baseline to improve and correct over > time is more important than any of them. > > Thanks again for doing this -- it is really valuable!Thanks.>> + * params >> + * Values: string >> + * >> + * A free formatted string providing sufficient information for the >> + * backend driver to open the backing device. (e.g. the path to the >> + * file or block device representing the backing store.) > > The syntax and semantics of params is defined by the particular backend, > rather than being "free formatted" as such. I think it would be worth > saying that explicitly. > > Ian.Perhaps something like this? * params * Values: string * * Data used by the backend driver to locate and configure the backing * device. The format and semantics of this data vary according to the * backing device in use and are outside the scope of this specification. -- Justin
Ian Campbell
2012-Feb-21 19:41 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Tue, 2012-02-21 at 18:14 +0000, Justin T. Gibbs wrote:> On Feb 21, 2012, at 7:27 AM, Ian Campbell wrote: > > > On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote: > >> o Document the XenBus nodes used in this protocol. > >> o Add a state diagram illustrating the roles and responsibilities > >> of both the front and backend during startup. > >> o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. > >> > >> No functional changes. > >> > >> Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > > > > Acked-by: Ian Campbell <ian.campbell@citrix.com> > > > > I''ve made some very minor comments below but I think getting the basic > > documentation of this stuff in as a baseline to improve and correct over > > time is more important than any of them. > > > > Thanks again for doing this -- it is really valuable! > > Thanks. > > >> + * params > >> + * Values: string > >> + * > >> + * A free formatted string providing sufficient information for the > >> + * backend driver to open the backing device. (e.g. the path to the > >> + * file or block device representing the backing store.) > > > > The syntax and semantics of params is defined by the particular backend, > > rather than being "free formatted" as such. I think it would be worth > > saying that explicitly. > > > > Ian. > > Perhaps something like this? > > * params > * Values: string > * > * Data used by the backend driver to locate and configure the backing > * device. The format and semantics of this data vary according to the > * backing device in use and are outside the scope of this specification.Sounds good.> > -- > Justin
Ian Campbell
2012-Feb-21 19:42 UTC
Re: [PATCH 4 of 4 v3] blkif.h: Define and document the request number/size/segments extension
On Tue, 2012-02-21 at 18:03 +0000, Justin T. Gibbs wrote:> On Feb 21, 2012, at 7:38 AM, Ian Campbell wrote: > > > On Mon, 2012-02-20 at 18:07 +0000, Justin T. Gibbs wrote: > >> Note: As of __XEN_INTERFACE_VERSION__ 0x00040201 the definition of > >> BLKIF_MAX_SEGMENTS_PER_REQUEST has changed. Drivers must be updated > >> to, at minimum, use BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, before being > >> recompiled with a __XEN_INTERFACE_VERSION greater than or equal to > >> this value. > >> > >> This extension first appeared in the FreeBSD Operating System. > >> > >> Signed-off-by: Justin T. Gibbs <justing@spectralogic.com> > >> > >> diff -r 09051133e2fe -r a777cbc5f48b xen/include/public/io/blkif.h > >> --- a/xen/include/public/io/blkif.h Mon Feb 20 11:02:53 2012 -0700 > >> +++ b/xen/include/public/io/blkif.h Mon Feb 20 11:03:01 2012 -0700 > >> @@ -145,6 +145,32 @@ > >> * The maximum supported size of the request ring buffer in units of > >> * machine pages. The value must be a power of 2. > >> * > >> + * max-requests <uint32_t> > >> + * Default Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE) > >> + * Maximum Value: BLKIF_MAX_RING_REQUESTS(PAGE_SIZE * max-ring-pages) > >> + * > >> + * The maximum number of concurrent, logical requests that will be > >> + * issued by the backend. > > > > by "issued" do you mean the maximum number that the backend is willing > > to consume/have-in-flight at once? I guess that means issued to the > > underlying storage? (backend doesn''t issue requests to the frontend does > > it? That''s how I first read it, hence my question) > > > > Ian. > > > > I should have used the same language as in the other entries. Does this > make more sense? > > * The maximum number of concurrent, logical requests supported by > * the backend.Makes sense, thanks. Ian.> > -- > Justin
Justin T. Gibbs
2012-Feb-21 20:59 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Feb 21, 2012, at 7:10 AM, Konrad Rzeszutek Wilk wrote:> On Mon, Feb 20, 2012 at 11:07:23AM -0700, Justin T. Gibbs wrote: >> o Document the XenBus nodes used in this protocol. >> o Add a state diagram illustrating the roles and responsibilities >> of both the front and backend during startup. >> o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. >> >> No functional changes. > > Some comments below.…>> + *------------------------- Backend Device Properties ------------------------- >> + * >> + * discard-aligment >> + * Values: <uint32_t> >> + * Default Value: 0 >> + * Notes: 1, 2 >> + * >> + * The offset, in bytes from the beginning of the virtual block device, >> + * to the first, addressable, discard extent on the underlying device. > > You are also missing the rest of the information about it. Please include the > details that the previous comment had.The strategy I have employed in all of this documentation is to treat blkif.h as a formal spec. This means that blkif needs to describe how its terminology maps to industry standards (e.g. discard means trim/unmap) and how these requests can be communicated between front and back driver. It should not, in my opinion, attempt to document information that is managed in other specs (e.g. T10/T13) and could easily become stale in blkif.h. It also should avoid stating information that someone reasonably proficient in writing (storage) device drivers is expected to know (e.g. don''t say you support something that you don''t). With that in mind, I believe that all of the required information about discard is still present in blkif.h, but perhaps presented differently or moved to different sections. Can you be more specific about the information that you feel is missing here and in the other places you noted below?>> + * >> + * discard-secure >> + * Values: 0/1 (boolean) >> + * Default Value: 0 >> + * >> + * A value of "1" indicates that the backend can process BLKIF_OP_DISCARD >> + * requests with the BLKIF_DISCARD_SECURE flag set. > > > That is not very specific to what this does. It just says X will do Y.Did you mean, "That is not very specific to what BLKIF_DISCARD_SECURE does."? BLKIF_DISCARD_SECURE is documented in the comment section for BLKIF_OP_DISCARD. The pertinent text is: "If the BLKIF_DISCARD_SECURE flag is set on the request, all copies of the discarded region on the device must be rendered unrecoverable before the command returns." Is there a compelling reason to document it here instead? The comment block here mirrors the language of all the other feature flags. Do you believe they should be changed in some way too?>> - * or a BLKIF_RSP_EOPNOTSUPP should be returned. >> - * More information about trim/unmap operations at: >> + * This operation is analogous to performing a trim (ATA) or unamp (SCSI), > > unmap.Good catch.> >> + * command on a native device. > > And you might want to mention what the previous comment did - that this is > a hint to the backend.The proposed text already does this: "Indicate to the backend device that a region of storage is no longer in use, and may be discarded at any time without impact to the client." If the backend was required to discard the region, I would have used "shall" or "must", not "may". This is the typical language convention of standards documents. -- Justin
Konrad Rzeszutek Wilk
2012-Feb-21 21:36 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Tue, Feb 21, 2012 at 01:59:15PM -0700, Justin T. Gibbs wrote:> On Feb 21, 2012, at 7:10 AM, Konrad Rzeszutek Wilk wrote: > > > On Mon, Feb 20, 2012 at 11:07:23AM -0700, Justin T. Gibbs wrote: > >> o Document the XenBus nodes used in this protocol. > >> o Add a state diagram illustrating the roles and responsibilities > >> of both the front and backend during startup. > >> o Correct missed BLKIF_OP_TRIM => BLKIF_OP_DISCARD conversion in a comment. > >> > >> No functional changes. > > > > Some comments below. > > … > > >> + *------------------------- Backend Device Properties ------------------------- > >> + * > >> + * discard-aligment > >> + * Values: <uint32_t> > >> + * Default Value: 0 > >> + * Notes: 1, 2 > >> + * > >> + * The offset, in bytes from the beginning of the virtual block device, > >> + * to the first, addressable, discard extent on the underlying device. > > > > You are also missing the rest of the information about it. Please include the > > details that the previous comment had. > > The strategy I have employed in all of this documentation is to > treat blkif.h as a formal spec. This means that blkif needs toOK.> describe how its terminology maps to industry standards (e.g. discard > means trim/unmap) and how these requests can be communicated between > front and back driver. It should not, in my opinion, attempt to > document information that is managed in other specs (e.g. T10/T13) > and could easily become stale in blkif.h. It also should avoid > stating information that someone reasonably proficient in writing > (storage) device drivers is expected to know (e.g. don't say you > support something that you don't).And that is what I hate about specs. They don't provide enough details on the corner cases or what an implentation could look like.> > With that in mind, I believe that all of the required information > about discard is still present in blkif.h, but perhaps presented > differently or moved to different sections. Can you be more > specific about the information that you feel is missing here and > in the other places you noted below?The original statement about how the backend would determine how to expose this information.> > >> + * > >> + * discard-secure > >> + * Values: 0/1 (boolean) > >> + * Default Value: 0 > >> + * > >> + * A value of "1" indicates that the backend can process BLKIF_OP_DISCARD > >> + * requests with the BLKIF_DISCARD_SECURE flag set. > > > > > > That is not very specific to what this does. It just says X will do Y. > > Did you mean, "That is not very specific to what BLKIF_DISCARD_SECURE > does."? BLKIF_DISCARD_SECURE is documented in the comment section > for BLKIF_OP_DISCARD. The pertinent text is: "If the BLKIF_DISCARD_SECURE > flag is set on the request, all copies of the discarded region on > the device must be rendered unrecoverable before the command returns." > Is there a compelling reason to document it here instead?I just think that having more information (even duplicate) is better. But that has the disadvantage that it can get out of sync.> > The comment block here mirrors the language of all the other feature > flags. Do you believe they should be changed in some way too?Well, the desire (mine) is to include as much details (or even more) in the spec so that it can be read as a novel if desired. But I will defer to Ian - if he is OK then this shouldn't gate the acceptance of this patch which is a step forward. .. snip..> >> + * command on a native device. > > > > And you might want to mention what the previous comment did - that this is > > a hint to the backend. > > The proposed text already does this: > > "Indicate to the backend device that a region of storage is no > longer in use, and may be discarded at any time without impact > to the client." > > If the backend was required to discard the region, I would have > used "shall" or "must", not "may". This is the typical language > convention of standards documents.Fair enough.> > -- > Justin_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Campbell
2012-Feb-22 11:14 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Tue, 2012-02-21 at 21:36 +0000, Konrad Rzeszutek Wilk wrote: [...]> > With that in mind, I believe that all of the required information > > about discard is still present in blkif.h, but perhaps presented > > differently or moved to different sections. Can you be more > > specific about the information that you feel is missing here and > > in the other places you noted below? > > The original statement about how the backend would determine how > to expose this information.Please can you quote the actual text of the statements which you think are missing and/or should be retained so we know precisely what you are talking about.> > The comment block here mirrors the language of all the other feature > > flags. Do you believe they should be changed in some way too? > > Well, the desire (mine) is to include as much details (or even more) in the > spec so that it can be read as a novel if desired. > > But I will defer to Ian - if he is OK then this shouldn''t gate > the acceptance of this patch which is a step forward.Perhaps details which aren''t part of the formal spec, e.g. implementation tips, details of various OSes implementation choices etc could be kept elsewhere, e.g. on the wiki? The problem with including these sorts of things in the spec is that you then have to get all pedantic about which statements are normative and which are just quirks or details of a particular valid implementation of the spec etc. Ian.
Konrad Rzeszutek Wilk
2012-Feb-23 18:50 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Wed, Feb 22, 2012 at 11:14:31AM +0000, Ian Campbell wrote:> On Tue, 2012-02-21 at 21:36 +0000, Konrad Rzeszutek Wilk wrote: > [...] > > > With that in mind, I believe that all of the required information > > > about discard is still present in blkif.h, but perhaps presented > > > differently or moved to different sections. Can you be more > > > specific about the information that you feel is missing here and > > > in the other places you noted below? > > > > The original statement about how the backend would determine how > > to expose this information. > > Please can you quote the actual text of the statements which you think > are missing and/or should be retained so we know precisely what you are > talking about.Ah, it is in the Notes section. However the information that is not present in 4) is The discard-alignment parameter indicates how many bytes the beginning of the partition is offset from the internal allocation unit''s natural alignment. Do not confuse this with natural disk alignment offset. [granted we don''t expose the natural disk alignment offset right now, so maybe that comment is useless] and 6): Devices that support discard functionality may internally allocate space using units that are bigger than the logical block size. The discard-granularity parameter indicates the size of the internal allocation unit in bytes if reported by the device. Otherwise the discard-granularity will be set to match the device''s physical block size. It is the minimum size you can discard. [though parts of that comment are there]> > > > The comment block here mirrors the language of all the other feature > > > flags. Do you believe they should be changed in some way too? > > > > Well, the desire (mine) is to include as much details (or even more) in the > > spec so that it can be read as a novel if desired. > > > > But I will defer to Ian - if he is OK then this shouldn''t gate > > the acceptance of this patch which is a step forward. > > Perhaps details which aren''t part of the formal spec, e.g. > implementation tips, details of various OSes implementation choices etc > could be kept elsewhere, e.g. on the wiki?On the Document day when I started documenting the PSE != PAT and the page pinning procedure - I had no idea where to stick that explanation of what goes behind the scene and how to properly do it so Ian Jackson suggested in the header of the function. That is an implementation tip of how to properly do it so you don''t shoot yourself in the foot. Where should something similar for blkback be so it can grow along with the source?> > The problem with including these sorts of things in the spec is that you > then have to get all pedantic about which statements are normative and > which are just quirks or details of a particular valid implementation of > the spec etc.I cheerish when folks are enthuastic to be pedantic about it so at the end the result has all t''s crossed and i''s dotted. Perhaps we should stick in the document a line saying: "BEWARE: PATCHES TO THIS AREA ARE GOING TO BE REVIEWED EXTREMELY CLOSE. SO GET YOUR ASBESTOS UNDERWEAR ON!"> > Ian.
Justin T. Gibbs
2012-Feb-23 22:41 UTC
Re: [PATCH 2 of 4 v3] blkif.h: Provide more complete documentation of the blkif interface
On Feb 23, 2012, at 11:50 AM, Konrad Rzeszutek Wilk wrote:> On Wed, Feb 22, 2012 at 11:14:31AM +0000, Ian Campbell wrote: >> On Tue, 2012-02-21 at 21:36 +0000, Konrad Rzeszutek Wilk wrote: >> [...] >>>> With that in mind, I believe that all of the required information >>>> about discard is still present in blkif.h, but perhaps presented >>>> differently or moved to different sections. Can you be more >>>> specific about the information that you feel is missing here and >>>> in the other places you noted below? >>> >>> The original statement about how the backend would determine how >>> to expose this information. >> >> Please can you quote the actual text of the statements which you think >> are missing and/or should be retained so we know precisely what you are >> talking about. > > Ah, it is in the Notes section. However the information that is not present in 4) is > > The discard-alignment parameter indicates how many bytes > the beginning of the partition is offset from the internal allocation unit''s > natural alignment. Do not confuse this with natural disk alignment offset. > > [granted we don''t expose the natural disk alignment offset right now, so maybe > that comment is useless]I removed the old text because it was imprecise (what is "natural disk alignment" and how is it different from "discard natural alignment"?) and, even if you understand its intended meaning, there is no information provided in the interface to allow for this type of confusion. I think there would be tremendous value in exporting a XenBus node to indicate the physical sector size. I would be happy to propose this feature in a different patch set and clarify the three different allocation sizes that are published. This email did cause me to go review the description for sector-size. I have now changed it to be: * sector-size * Values: <uint32_t> * * The size, in bytes, of the individually addressible data blocks * on the backend device. It used to say "The native sector size, in bytes, of the backend device." which it isn''t. "sector-size"''d access may be emulated.> and 6): > Devices that support discard functionality may > internally allocate space using units that are bigger than the logical block > size. The discard-granularity parameter indicates the size of the internal > allocation unit in bytes if reported by the device. Otherwise the > discard-granularity will be set to match the device''s physical block size. > It is the minimum size you can discard. > > [though parts of that comment are there]Just the last sentence here is not directly replicated in the patch set. To my mind, the definition of this value is clear and it is up to the implementor to publish a valid value or not claim support for this feature. How this is achieved does not belong in the blkif spec.>> Perhaps details which aren''t part of the formal spec, e.g. >> implementation tips, details of various OSes implementation choices etc >> could be kept elsewhere, e.g. on the wiki? > > On the Document day when I started documenting the PSE != PAT and the page > pinning procedure - I had no idea where to stick that explanation of what goes behind > the scene and how to properly do it so Ian Jackson suggested in the header of > the function. That is an implementation tip of how to properly do it > so you don''t shoot yourself in the foot. > > Where should something similar for blkback be so it can grow along with the source?Now that they are documented, the features aren''t an area ripe for foot shooting. The state transitions are, and I would support adding more state diagrams to this header file (e.g. tool stack initiated closes, the impact of the "online" attribute, etc.). -- Justin