Hello all,

I'm a Masters student at UBC and I've been doing some work on XenStore
for the last few months. In doing so, I have come across some
idiosyncrasies and annoyances with it. I'm planning on doing a re-write
of it to see if some of these can be resolved and would like input from
the community.

First of all, I will list some issues I have come across with XenStore:

Transactions:
- Currently the entire TDB database (more on TDB later) is copied on
  each transaction, which is really slow and unneeded
- Interleaving transactions cause EAGAIN to be issued regardless of
  whether the transactions actually conflict or not
- There is no support for nested transactions

Watches:
- Apparently when a domain disconnects and reconnects, its watches
  aren't deleted from XenStore, but nor can it access them again upon
  reconnect
- Watches cannot be set on non-existent directories. While this makes
  sense, it causes performance problems with some devices that require
  a watch placed on a path that hasn't yet been created. The solution
  is to place a watch on / and check each event for the creation of the
  desired path, which can cause excessive amounts of unneeded watches

Code:
- Things that start with t are bad
    - TDB is a bad choice for a backend
    - talloc is a pain to use
- Policy and mechanism are completely tangled
    - This should be separated out, then different policy modules can
      be implemented and trivially enabled (e.g. legacy, Chinese Wall,
      protocol enforcement)
- Modularity -- related to the above comment, but with greater scope
    - The backend should be pluggable (e.g. TDB, in-memory-only store,
      flat file, sqlite, anything else you want)

My intention is to rewrite XenStore using OCaml in order to provide
stronger assertions about the code and the type system. An interface
will be provided so that modules can be written in C and plugged into
the core XenStore code.

I've also noticed some aspects of the XenStore protocol which I think
might be good to look at.
I notice that XenStore has RELEASE and RESUME commands, but according
to the current XenStore documentation (which I have also
modified/cleaned up to reflect an implementation-independent view of
the XenStore API and will follow later in this message) there are no
current users of RESUME in the xen-unstable source. In addition, some
other work done here on a project called Remus has found that
XenStore's RELEASE/RESUME functionality is slow. Remus has in fact
removed XenStore from the suspend/resume process altogether and all
seems to function well. As such, I'm proposing the complete removal of
both RELEASE and RESUME from the XenStore specification.

Additionally, I see a recent modification to XenStore to include a
SET_TARGET command which has plenty of issues. For example, say you
have two domains A and B, both of which have permissions to a node X.
A has full permissions and B only has read permissions. If you set a
target from A to B (or whichever way around it is so that A gets B's
permissions too) and B comes first in the permission list for X, then
A will get B's permissions (read-only) instead of having full
permissions. Thus, SET_TARGET can actually cripple the access of a
domain. This sort of functionality should be able to be implemented
without having a special, dedicated function (through the use of
pluggable policy modules, as mentioned above).

Thanks,

Patrick Colp

Following is a proposed update to the XenStore Protocol Specification.
I have some questions/comments about the protocol, which are the lines
starting with %. I would especially appreciate feedback about those.


XenStore Protocol Specification
-------------------------------

XenStore implements a map between `keys' (which are filename-like
pathnames) and values. Clients may read and write values, watch for
changes, and set permissions to allow or deny access. There is also a
rudimentary transaction system.

% Avoided? Or make it a hard requirement?
% Normally 7-bit ASCII or always?
% Generally add a nul byte? Should they always? Never?
While XenStore and most tools and APIs are capable of dealing with
arbitrary binary data as values, this should generally be avoided.
Instead, data should generally be human-readable for ease of
management and debugging. XenStore is not a high-performance facility
and should be used only for small amounts of control plane data.
Therefore, XenStore values should be 7-bit ASCII text strings
containing bytes 0x20..0x7F only, and should not contain a trailing
nul byte (the APIs used for accessing XenStore generally add a nul
when reading, for the caller's convenience).

Paths are separated by a / and the root path is /, just like in Unix
file systems. A path (<parent>) is a parent of another path (<child>)
if <parent> and <child> are not identical and if <parent> is / (the
root path) or <parent>/ is an initial substring of <child>.

% Conventional to not store values? Should this be a requirement or a non-convention?
If a path exists, all of its parents do too. Every path maps to a
value, which may be empty. It can also have zero or more immediate
children. There is thus no particular distinction between directories
and leaf nodes. However, it is conventional to not store values at
nodes which also have children.

The permitted character set for paths is the ASCII alphanumerics and
the four punctuation characters -/_@ (hyphen slash underscore atsign).
@ should be avoided except to specify special watches (see below).
Doubled slashes and trailing slashes (except to specify the root) are
forbidden. The empty path is also forbidden. Paths longer than 3072
bytes are forbidden; clients specifying relative paths should keep
them to within 2048 bytes (see XENSTORE_*_PATH_MAX in xs_wire.h).

Communication with XenStore is either via sockets or event channel and
shared memory, as specified in io/xs_wire.h.
Each message in either direction has a header formatted as a struct
xsd_sockmsg, which has the following format:

    unsigned 32-bit integer: type
    unsigned 32-bit integer: req_id
    unsigned 32-bit integer: tx_id
    unsigned 32-bit integer: len

After the header, the message contains len bytes of payload. The
payload syntax varies according to the type field.

Generally, each request generates a reply with an identical type,
req_id, and tx_id. However, if an error occurs, a reply will be
returned with type ERROR, and only req_id and tx_id copied from the
request.

A caller who sends several requests may receive the replies in any
order and must use req_id (and tx_id, if applicable) to match up
replies to requests.

% Payload is limited to 4096? Or header + payload?
The payload length (len field of the header) is limited to 4096 bytes
(XENSTORE_PAYLOAD_MAX) in both directions. If a client exceeds the
limit, its XenStore connection will be immediately killed by XenStore,
which is usually catastrophic from the client's point of view. Clients
(particularly domains, which cannot just reconnect) should avoid this.

Due to this limitation, bulk data should not be passed through
XenStore as the performance properties are poor. In addition, this
would violate the intended use of XenStore.


---------- XenStore Protocol Details - Introduction ----------

The payload syntax and semantics of the requests and replies are
described below. In the payload syntax specifications the following
notations are used:

    |           A nul (zero) byte.
    <foo>       A string guaranteed not to contain any nul bytes.
    <foo|>      Binary data (which may contain zero or more nul bytes)
    <foo>|*     Zero or more strings each followed by a trailing nul
    <foo>|+     One or more strings each followed by a trailing nul
    ?           Reserved value (may not contain nuls)
    ??          Reserved value (may contain nuls)

Reserved values for the most part will be empty strings. However, they
exist in order to enable extensions in the future.
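
As a rough illustration of the framing and notation described above,
here is a minimal Python sketch. It is not taken from any XenStore
implementation. The function names are hypothetical, the message type
value used in the usage note is a placeholder rather than a real
XS_* constant, and native byte order is an assumption (the text above
does not state the byte order of the header fields).

```python
import struct

# Header: four unsigned 32-bit integers (type, req_id, tx_id, len).
# Assumption: native byte order; "=" also disables padding.
HEADER = struct.Struct("=IIII")

# Payload limit quoted in the specification text.
XENSTORE_PAYLOAD_MAX = 4096


def pack_message(msg_type, req_id, tx_id, payload):
    """Return header + payload, refusing payloads over the limit."""
    if len(payload) > XENSTORE_PAYLOAD_MAX:
        raise ValueError("payload exceeds XENSTORE_PAYLOAD_MAX")
    return HEADER.pack(msg_type, req_id, tx_id, len(payload)) + payload


def unpack_message(data):
    """Split one complete message into (type, req_id, tx_id, payload)."""
    msg_type, req_id, tx_id, length = HEADER.unpack_from(data)
    if length > XENSTORE_PAYLOAD_MAX:
        raise ValueError("len field exceeds XENSTORE_PAYLOAD_MAX")
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return msg_type, req_id, tx_id, payload


def split_strings(payload):
    """Decode the <foo>|* notation: zero or more nul-terminated strings."""
    if not payload:
        return []
    if not payload.endswith(b"\0"):
        raise ValueError("missing trailing nul")
    # Splitting on nul leaves one empty element after the final nul.
    return payload.split(b"\0")[:-1]
```

For example, pack_message(99, 7, 0, b"a\0b\0") would build a message
whose payload split_strings decodes back into the two strings a and b;
the type value 99 is only a placeholder. Checking the length before
sending matters because, as noted above, exceeding the limit gets the
connection killed.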

Error replies are as follows:

ERROR                                           E<something>|
        Where E<something> is the name of an errno value listed in
        io/xs_wire.h. Note that the string name is transmitted, not a
        numeric value.

Where no reply payload format is specified below, success responses
have the following payload:

                                                OK|

Values commonly included in payloads include:

    <path>
        Specifies a path in the hierarchical key structure. If <path>
        starts with a / it simply represents that path.

        <path> is allowed to not start with /, in which case the
        caller must be a domain (rather than connected via a socket)
        and the path is taken to be relative to /local/domain/<domid>
        (e.g., `x/y' sent by domain 3 would mean
        `/local/domain/3/x/y').

    <domid>
        Integer domid, represented as decimal number 0..65535 (16-bit
        unsigned integer). Parsing errors and values out of range
        generally go undetected. The special DOMID_... values (see
        xen.h) are represented as 16-bit unsigned integers; unless
        otherwise specified it is an error not to specify a real
        domain id.

The following sections give the actual type values, including the
request and reply payloads as applicable.


---------- Database Read, Write, and Permissions Operations ----------

READ                    <path>|                 <value|>
WRITE                   <path>|<value|>
        Store and read the octet string <value> at <path>. WRITE
        creates any missing parent paths with empty values.

MKDIR                   <path>|
        Ensures that the <path> exists, if necessary by creating it
        and any missing parents with empty values. If <path> or any
        parent already exists, its value is left unchanged.

RM                      <path>|
        Ensures that the <path> does not exist, by deleting it and all
        of its children. It is not an error if <path> does not exist,
        but it _is_ an error if <path>'s immediate parent does not
        exist either.

DIRECTORY               <path>|                 <child-leaf-name>|*
        Gives a list of the immediate children of <path> as only the
        leafnames. The resulting children are each named
        <path>/<child-leaf-name>.

GET_PERMS               <path>|                 <perm-as-string>|+
SET_PERMS               <path>|<perm-as-string>|+?
        <perm-as-string> is one of the following

            w<domid>    write only
            r<domid>    read only
            b<domid>    both read and write
            n<domid>    no access

        See http://wiki.xensource.com/xenwiki/XenBus section
        `Permissions' for details of the permissions system.


---------- Watches ----------

WATCH                   <wpath>|<token>|?
        Adds a watch.

        When a path is modified (including path creation, removal,
        contents change or permissions change) this generates an event
        on the changed path. Changes made in transactions cause an
        event only if and when committed. Each occurring event is
        matched against all the watches currently set up and each
        matching watch results in a WATCH_EVENT message (see below).

        The event's path matches the watch's <wpath> if it is <wpath>
        or a child of <wpath>.

        <wpath> can be a path to watch or @<wspecial>. In the latter
        case <wspecial> may have any syntax but it matches (according
        to the rules above) only the following special events which
        are invented by XenStore:

            @introduceDomain    occurs on INTRODUCE
            @releaseDomain      occurs on any domain crash or
                                shutdown, and also on RELEASE and
                                domain destruction

        When a watch is first set up, it is triggered once straight
        away, with the path equal to <wpath>. Watches may be triggered
        spuriously. The tx_id in a WATCH request is ignored.

        Watches are restricted by the permissions system.
        Applications will not be sent a notification for paths that
        they cannot read. However, an application will be sent a
        watch event when a path which it is able to read is deleted,
        even if that leaves only a nonexistent, unreadable parent. A
        notification will be omitted if a node's permissions are
        changed so as to make it unreadable, in which case future
        notifications will also be suppressed (and if the node is
        later made readable, some notifications may have been
        "lost").

WATCH_EVENT                                     <epath>|<token>|
        Unsolicited `reply' generated for matching modification events
        as described above. req_id and tx_id are both 0.

        <epath> is the event's path (i.e. the actual path that was
        modified).
        However, if the event was the recursive removal of a parent
        of the watched path, <epath> is the watched path (rather than
        the actual path which was removed). So <epath> is either the
        watched path or a child of the watched path.

        Iff the watched path was specified as a relative pathname,
        then <epath> will also be relative (with the same base as the
        watched path).

UNWATCH                 <wpath>|<token>|?
        Remove a watch placed on the path <wpath>.


---------- Transactions ----------

TRANSACTION_START       |                       <transid>|
        <transid> is an opaque unsigned 32-bit integer allocated by
        XenStore. After this, the transaction may be referenced by
        using <transid> in the tx_id request header field. It is not
        legal to send a non-0 tx_id in TRANSACTION_START.

TRANSACTION_END         T|
TRANSACTION_END         F|
        tx_id must refer to an existing transaction. After this
        request, the tx_id is no longer valid and may be reused by
        XenStore. If F is sent, the transaction is discarded. If T is
        sent, it is committed. If T is sent but there were
        intervening writes which conflict (meaning only writes or
        other commits which changed paths which were read or written
        in the transaction at hand), then an EAGAIN error is sent as
        the reply.


---------- Domain Management and XenStore Communications ----------

INTRODUCE               <domid>|<mfn>|<evtchn>|?
        Notifies XenStore to communicate with this domain.

        INTRODUCE is used during domain startup, restore, and resume.

        <domid> must be a real domain id (not 0 and not a special
        DOMID_... value). <mfn> must be a machine page in that domain
        represented as an unsigned 32-bit integer. <evtchn> must be an
        unbound event channel in the <domid> domain (likewise
        represented as an unsigned 32-bit integer), on which XenStore
        will call bind_interdomain.

        XenStore prevents the use of INTRODUCE other than by Domain-0.

RELEASE                 <domid>|
        Manually requests that XenStore disconnect from the domain.
        The event channel is unbound at the XenStore end and the
        machine page unmapped.
        If the domain is still running, it won't be able to
        communicate with XenStore. Note that XenStore will in any
        case detect domain destruction and disconnect by itself.

        XenStore prevents the use of RELEASE other than by Domain-0.

GET_DOMAIN_PATH         <domid>|                <path>|
        Returns the domain's base path, as is used for relative
        transactions (i.e. /local/domain/<domid> -- with <domid>
        normalised). The answer will be useless unless <domid> is a
        real domain id.

IS_DOMAIN_INTRODUCED    <domid>|                T| or F|
        Returns T if XenStore is in communication with the domain
        (i.e. if INTRODUCE for the domain has not yet been followed by
        domain destruction or explicit RELEASE).

RESUME                  <domid>|
        Arranges that @releaseDomain events will once more be
        generated when the domain becomes shut down. This might have
        to be used if a domain were to be shut down (generating one
        @releaseDomain) and then subsequently restarted, since the
        state-sensitive algorithm in XenStore will not otherwise send
        further watch event notifications if the domain were to be
        shut down again.

        It is not clear whether this is possible since one would
        normally expect a domain not to be restarted after being shut
        down without being destroyed in the meantime. There are
        currently no users of this request in xen-unstable.

        XenStore prevents the use of RESUME other than by Domain-0.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel