Aggarwal, Vikas (OFT)
2005-May-05 15:18 UTC
[Xen-devel] please help: initialize XEND for my debug-FE/BE.c
Hi,

I have added a debug-Front-end.c and a debug-Back-end.c, and created my debug.py based on blkif.py.

But my front-end still times out, even after calling ctrl_if_register_receiver() to register with the domain controller during module_init.

I think I still did not initialize my debug.py properly. I saw that VBD gets initialized in device_create or create_devices. What's the difference between the two?

Where should I look to initialize the event channel in the domain controller?

My virtual device in the back-end collects the debug data and writes it to the file system, so I don't want it to be initialized at boot time via some config file.

-vikas
Harry Butterworth
2005-May-05 20:37 UTC
Re: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c
To answer your questions: I think (75% confidence) create_devices is used to create devices when a frontend domain is started, whereas device_create is used to create an extra single device on the fly when the frontend domain is already running.

Your frontend is probably timing out waiting to hear about interfaces from xend. To establish interface connections between the frontend and backend with the current xend code you need to implement the following basic flow:

When xend starts, it creates a controller factory for your device class.

When a frontend domain is started, the controller factory is used to create a controller for the frontend domain. Configured devices for your device class are attached to the controller for the frontend domain.

Attaching the device creates the device object, which in turn does a lookup of the "backend interface" object. The backend interface object represents the communication channel between the frontend domain and the backend domain. If the backend interface object does not exist, it is created at this point.

When a backend interface is created, it finds the backend controller object for its backend domain. If the backend controller object does not exist, it is created at this point.

Some of the drivers assume that the backend driver is already up when the backend controller object is created and so do not listen for the driver status up message from the backend. I'm implementing a driver as loadable modules so I need to use this message. In my case, the backend driver sends a driver status up message to xend, which is handled by the backend controller object. The backend controller object notifies all its backend interface objects that the backend driver is up.

When the frontend driver is initialised, it sends a driver status up message to xend. This is handled by the controller object. The controller object notifies all its backend interface objects that the frontend driver is up.

So, a backend interface object gets a driver up/down notification at each end. When a backend interface finds that the drivers at both its ends are up, it can sequence the establishment of the connection:

The backend interface object sends a "be create" message to the backend to create the object representing the interface in the backend.

The backend interface object sends a "fe interface status disconnected" message to the frontend to create the object representing the interface in the frontend.

The frontend allocates a page for the ring interface and sends a "fe interface connect" message to xend, which is forwarded by the controller object to the backend interface object.

xend opens an event channel between the two domains and sends a "be connect" message to the backend, passing the event channel and the address of the page for the ring interface.

The backend responds to the be connect, and the backend interface object sends an "fe interface status connected" message to the frontend, passing the event channel number.

Unloading and reloading the frontend requires that the backend stops writing to the shared page of memory before the frontend frees it for reuse. In my case, the frontend sends a driver status down message, which is handled by the controller object for the frontend. This object notifies all the backend interface objects for that frontend, which then send be disconnect messages and wait for a response before acknowledging back to the controller object. The controller object waits for all backend interfaces to complete before acknowledging the frontend driver status down.
Unloading and reloading the backend requires that the connection is broken and reestablished: the backend stops using the shared pages before sending a driver status down message. The driver status down is handled by the backend controller, which notifies all the connected backend interfaces, which in turn send "fe interface status disconnected" messages to their respective frontend domains. The frontend domains respond with "fe interface connect" messages, trying to reestablish a connection. When the backend driver is reloaded it sends a driver status up, which is propagated to the backend interface objects, which send be create messages again and follow up with be connect messages containing the new shared pages and "fe interface status connected" messages to complete the connection.

This is just the establishment of the connection. There are also messages sent by the devices to create the devices in the backend after the backend interfaces send the be create messages to create the interfaces.

Also, this description covers the internals of xend only. The frontend and backend drivers must handle the messages, map the shared page, initialise the ring interface, allocate IRQs and bind to the event channel as appropriate.

There are a number of people working to make this simpler; in particular, Mike Wray is apparently rewriting xend and Anthony Liguori is working on an alternative set of tools. Also, the grant tables implementation probably means that the above description is out of date. I've not investigated grant tables much yet. I've not been following the checkins to unstable since the Xen summit, so I'll have missed anything not mentioned on the mailing list or IRC.

The current overhead in terms of client code to establish an entity on the xen inter-domain communication "bus" is currently of the order of 1000 statements (counting FE, BE and slice of xend). A better inter-domain communication API could reduce this to fewer than 10 statements. If it's not done by the time I finish the USB work, I will hopefully be allowed to help with this.

Harry.
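As a rough model of the frontend's side of the connect handshake described above (this is only a sketch: the message names, xend_send() helper and constants below are hypothetical stand-ins, not the real Xen control-interface API), the frontend essentially runs a small state machine:

/* Hypothetical model of the frontend connection handshake described
 * above.  None of these names are the real control-interface API. */

#include <stdio.h>

enum fe_msg {                         /* messages exchanged with xend     */
    FE_DRIVER_STATUS_UP,              /* frontend -> xend                 */
    FE_INTERFACE_STATUS_DISCONNECTED, /* xend -> frontend                 */
    FE_INTERFACE_CONNECT,             /* frontend -> xend (ring page)     */
    FE_INTERFACE_STATUS_CONNECTED     /* xend -> frontend (event channel) */
};

enum fe_state { FE_CLOSED, FE_WAITING, FE_CONNECTING, FE_CONNECTED };

struct fe_dev {
    enum fe_state state;
    unsigned long ring_page;          /* page shared with the backend     */
    int           evtchn;             /* event channel bound on connect   */
};

/* Stand-in for sending a control message to xend. */
static void xend_send(enum fe_msg msg, unsigned long arg)
{
    printf("-> xend: msg %d arg %lx\n", msg, arg);
}

/* Called from module_init: announce the driver and wait for xend. */
static void fe_driver_init(struct fe_dev *dev)
{
    dev->state = FE_WAITING;
    xend_send(FE_DRIVER_STATUS_UP, 0);
}

/* Called for each message received back from xend. */
static void fe_handle_msg(struct fe_dev *dev, enum fe_msg msg, unsigned long arg)
{
    switch (msg) {
    case FE_INTERFACE_STATUS_DISCONNECTED:
        /* xend has created the backend interface object: allocate the
         * ring page and ask for a connection. */
        dev->ring_page = 0x1000;      /* would be a real shared page      */
        dev->state     = FE_CONNECTING;
        xend_send(FE_INTERFACE_CONNECT, dev->ring_page);
        break;
    case FE_INTERFACE_STATUS_CONNECTED:
        /* xend has opened the event channel and the backend responded:
         * bind the event channel and start using the ring. */
        dev->evtchn = (int)arg;
        dev->state  = FE_CONNECTED;
        break;
    default:
        break;
    }
}

int main(void)
{
    struct fe_dev dev = { FE_CLOSED, 0, 0 };
    fe_driver_init(&dev);
    fe_handle_msg(&dev, FE_INTERFACE_STATUS_DISCONNECTED, 0);
    fe_handle_msg(&dev, FE_INTERFACE_STATUS_CONNECTED, 7); /* evtchn 7 */
    printf("state %d evtchn %d\n", dev.state, dev.evtchn);
    return 0;
}

A frontend that never sees the "interface status" messages, as in the original question, is typically stuck because xend never created the controller and backend interface objects for its device class.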
On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
> Harry Butterworth wrote:
> > The current overhead in terms of client code to establish an entity on
> > the xen inter-domain communication "bus" is currently of the order of
> > 1000 statements (counting FE, BE and slice of xend). A better
> > inter-domain communication API could reduce this to fewer than 10
> > statements. If it's not done by the time I finish the USB work, I will
> > hopefully be allowed to help with this.
>
> This reminded me you had suggested a different model for inter-domain comms.
> I recently suggested a more socket-like API but it didn't go down well.

What exactly were the issues with the socket-like proposal?

> I agree with you that the event channel model could be improved -
> what kind of comms model do you suggest?

The event-channel and shared memory page are fine as low-level primitives to implement a comms channel between domains on the same physical machine. The problem is that the primitives are unnecessarily low-level from the client's perspective and result in too much per-client code.

The inter-domain communication API should preserve the efficiency of these primitives but provide a higher level API which is more convenient to use.

Another issue with the current API is that, in the future, it is likely (for a number of virtual-iron/fault-tolerant-virtual-machine-like reasons) that it will be useful for the inter-domain communication API to span physical nodes in a cluster. The problem with the current API is that it directly couples the clients to a shared memory implementation with a direct connection between the frontend and backend domains, and the clients would all need to be rewritten if the implementation were to span physical machines or require indirection. Eventually I would expect the effort invested in the clients of the inter-domain API to equal or exceed the effort invested in the hypervisor, in the same way that the Linux device drivers make up the bulk of the Linux kernel code. There is a risk, therefore, that this might become a significant architectural limitation.

So, I think we're looking for a higher-level API which can preserve the current efficient implementation for domains resident on the same physical machine but allows for domains to be separated by a network interface without having to rewrite all the drivers.

The API needs to address the following issues:

Resource discovery --- Discovering the targets of IDC is an inherent requirement.

Dynamic behaviour --- Domains are going to come and go all the time.

Stale communications --- When domains come and go, client protocols must have a way to recover from communications in flight, or potentially in flight, from before the last transition.

Deadlock --- IDC is a shared resource and must not introduce resource deadlock issues, for example when FEs and BEs are arranged symmetrically in reverse across the same interface, or when BEs are stacked and so introduce chains of dependencies.

Security --- There are varying degrees of trust between the domains.

Ease of use --- This is important for developer productivity and also to help ensure the other goals (security/robustness) are actually met.

Efficiency/Performance --- obviously.

I'd need a few days (which I don't have right now) to put together a coherent proposal tailored specifically to xen.
However, it would probably be along the lines of the following:

A buffer abstraction to decouple the IDC API from the memory management implementation:

struct local_buffer_reference;

An endpoint abstraction to represent one end of an IDC connection. It's important that this is done on a per-connection basis rather than having one endpoint per domain for all IDC activity, because it avoids deadlock issues arising from chained, dependent communication.

struct idc_endpoint;

A message abstraction, because some protocols are more efficiently implemented using one-way messages than request-response pairs, particularly when the protocol involves more than two parties.

struct idc_message
{
    ...
    struct local_buffer_reference message_body;
};

/* When a received message is finished with. */

void idc_message_complete( struct idc_message * message );

A request-response transaction abstraction, because most protocols are more easily implemented with these.

struct idc_transaction
{
    ...
    struct local_buffer_reference transaction_parameters;
    struct local_buffer_reference transaction_status;
};

/* Useful to have an error code in addition to status. */

/* When a received transaction is finished with. */

void idc_transaction_complete
( struct idc_transaction * transaction, error_code error );

/* When an initiated transaction completes. Error code also reports
   transport errors when the endpoint disconnects whilst the transaction
   is outstanding. */

error_code idc_transaction_query_error_code
( struct idc_transaction * transaction );

An IDC address abstraction:

struct idc_address;

A mechanism to initiate connection establishment. It can't fail, because the endpoint resource is pre-allocated and create doesn't actually need to establish the connection.

The endpoint calls the registered notification functions as follows:

'appear' when the remote endpoint is discovered, then 'disappear' if it goes away again, or 'connect' if a connection is actually established.

After 'connect', the client can submit messages and transactions.

'disconnect' when the connection is failing; the client must wait for outstanding messages and transactions to complete (successfully or with a transport error) before completing the disconnect callback, and must flush received messages and transactions whilst disconnected.

Then 'connect' if the connection is reestablished, or 'disappear' if the remote endpoint has gone away.

A disconnect, connect cycle guarantees that the remote endpoint also goes through a disconnect, connect cycle.

This API allows multi-pathing clients to make intelligent decisions and provides sufficient guarantees about stale messages and transactions to make a useful foundation.
void idc_endpoint_create
(
    struct idc_endpoint * endpoint,
    struct idc_address address,
    void ( * appear )( struct idc_endpoint * endpoint ),
    void ( * connect )( struct idc_endpoint * endpoint ),
    void ( * disconnect )
        ( struct idc_endpoint * endpoint, struct callback * callback ),
    void ( * disappear )( struct idc_endpoint * endpoint ),
    void ( * handle_message )
        ( struct idc_endpoint * endpoint, struct idc_message * message ),
    void ( * handle_transaction )
        ( struct idc_endpoint * endpoint, struct idc_transaction * transaction )
);

void idc_endpoint_submit_message
( struct idc_endpoint * endpoint, struct idc_message * message );

void idc_endpoint_submit_transaction
( struct idc_endpoint * endpoint, struct idc_transaction * transaction );

idc_endpoint_destroy completes the callback once the endpoint has 'disconnected' and 'disappeared' and the endpoint resource is free for reuse for a different connection.

void idc_endpoint_destroy
( struct idc_endpoint * endpoint, struct callback * callback );

The messages and transaction parameters and status must be of finite length (these quota properties might be parameters of the endpoint resource allocation). Need a mechanism for efficient, arbitrary-length bulk transfer too.

An abstraction for buffers owned by remote domains:

struct remote_buffer_reference;

Can register a local buffer with the IDC to get a remote buffer reference:

struct remote_buffer_reference idc_register_buffer
( struct local_buffer_reference buffer, some kind of resource probably required here );

Remote buffer references may be passed between domains in IDC messages or transaction parameters or transaction status.

Remote buffer references may be forwarded between domains and are usable from any domain.

Once in possession of a remote buffer reference, a domain can transfer data between the remote buffer and a local buffer:

void idc_send_to_remote_buffer
(
    struct remote_buffer_reference remote_buffer,
    struct local_buffer_reference local_buffer,
    struct callback * callback, /* transfer completes asynchronously */
    some kind of resource required here
);

void idc_receive_from_remote_buffer
(
    struct remote_buffer_reference remote_buffer,
    struct local_buffer_reference local_buffer,
    struct callback * callback, /* Again, completes asynchronously */
    some kind of resource required here
);

Can unregister to free a local buffer independent of remote buffer references still knocking around in remote domains (subsequent sends/receives fail):

void idc_unregister_buffer
( probably a pointer to the resource passed on registration );

So, the 1000 statements of establishment code in the current drivers becomes:

Receive an idc address from somewhere (resource discovery is outside the scope of this sketch).

Allocate an IDC endpoint from somewhere (resource management is again outside the scope of this sketch).

Call idc_endpoint_create.

Wait for 'connect' before attempting to use the connection for the device-specific protocol implemented using messages/transactions/remote buffer references.

Call idc_endpoint_destroy and quiesce before unloading the module.

The implementation of the local buffer references and memory management can hide the use of pages which are shared between domains and reference counted, to provide a zero-copy implementation of bulk data transfer and shared page-caches.

I implemented something very similar to this before for a cluster interconnect and it worked very nicely.
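To make the "handful of statements" claim concrete, a frontend client of this API might look roughly like the following. This is a sketch only: it assumes a hypothetical idc.h header declaring the structures and functions above, and the debug_* names and callback_complete() helper are illustrative, not part of the proposal.

/* Hypothetical sketch of a frontend using the proposed IDC API.
 * idc.h, the debug_* names and callback_complete() are illustrative. */

#include "idc.h"

static struct idc_endpoint debug_endpoint;   /* pre-allocated endpoint */
static int debug_connected;

static void debug_appear(struct idc_endpoint *endpoint)    { }
static void debug_disappear(struct idc_endpoint *endpoint) { }

static void debug_connect(struct idc_endpoint *endpoint)
{
    /* Connection established: the device-specific protocol (messages,
     * transactions, remote buffer references) can start now. */
    debug_connected = 1;
}

static void debug_disconnect(struct idc_endpoint *endpoint,
                             struct callback *callback)
{
    /* Wait for outstanding messages/transactions to complete, then
     * acknowledge the disconnect via the assumed completion helper. */
    debug_connected = 0;
    callback_complete(callback);
}

static void debug_handle_message(struct idc_endpoint *endpoint,
                                 struct idc_message *message)
{
    /* ... act on message_body, then hand the message back ... */
    idc_message_complete(message);
}

static void debug_handle_transaction(struct idc_endpoint *endpoint,
                                     struct idc_transaction *transaction)
{
    /* ... read transaction_parameters, fill in transaction_status ... */
    idc_transaction_complete(transaction, 0 /* success */);
}

static void debug_init(struct idc_address address)
{
    /* Address comes from resource discovery; endpoint storage is
     * pre-allocated, so this cannot fail. */
    idc_endpoint_create(&debug_endpoint, address,
                        debug_appear, debug_connect,
                        debug_disconnect, debug_disappear,
                        debug_handle_message, debug_handle_transaction);
}

Module unload would be the reverse: idc_endpoint_destroy(&debug_endpoint, ...) and then quiesce before freeing anything, as in the flow above.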
There are some subtleties to get right about the remote buffer reference implementation and the implications for out-of-order and idempotent bulk data transfers. As I said, it would require a few more days' work to nail down a good API.

Harry.
I like it. To start with, local communication only would be fine. Eventually it would scale neatly to things like remote device access.

I particularly like the abstraction for remote memory - this would be an excellent fit to take advantage of RDMA where available (e.g. a cluster running on an IB fabric).

Cheers,
Mark

On Friday 06 May 2005 13:14, Harry Butterworth wrote:
> [...]
I would be happiest (is Eric Van Hensbergen out there?) if we could move to a model where the basic inter-domain communications is 9p, with all that implies. I can try to say more if there is interest.

Right now, there are multiple types of comms for inter-domain, which all work very differently (consider vbd and network). I am continually dealing with chasing changes in how things work, which can get tough in a non-gcc world. From my point of view, the differences between (e.g.) bsd and linux are very minor compared to linux and plan 9. The biggest effort for me in forward-ports is dealing with gcc-isms in the code, and multiplying that by the number of different devices, adding in the binary structures, turns this into a lot of work.

It would sure be nice if the option were there to have a set of devices which all had the ring-buffer structure of (e.g.) the vbd, with no page loaning etc., and which allowed fairly arbitrary data transfer (i.e. not the current limits of vbd, with assumed 512-byte blocks and assumed upper limits of 4096 bytes etc.).

Plan 9 has shown that in practice a simple open/read/write/close model works fine for any device, and that devices can present a file-system-like interface to the kernel (and hence programs) and it all works. A lot of the binary structure of the current interdomain comms and devices could be eliminated and replaced with a common, simple abstraction usable for all devices. You could get rid of all the different bits of code (block, net, etc.) and replace it with one simple bit of code.

ron
On 5/6/05, Ronald G. Minnich <rminnich@lanl.gov> wrote:
> I would be happiest (is Eric Van Hensbergen out there?) if we could move
> to a model where the basic inter-domain communications is 9p, with all
> that implies. I can try to say more if there is interest.

I've been developing such a model using IBM's research hypervisor - but aside from some minor differences to the underlying transport mechanisms, it should be easy enough to apply to a Xen framework (it was my plan to look into that sometime this summer).

I selected 9P as the protocol because of its limited set of underlying transport requirements (it requires a reliable, in-order pipe and otherwise doesn't make any assumptions), its flexibility (it can be used to share file systems, devices, protocol stacks, communication channels and application interfaces both across partitions and across the network), and the simplicity of its design (and resulting simplicity of implementation).

The initial target of my implementation was special-purpose application kernels (built with a library-OS) running alongside a Linux kernel acting as an I/O host and controller. Our prototype library-OS version of the client library is approximately 2000 lines of code and the Linux file system client (http://v9fs.sf.net) is 4000 lines of code (extra weight added by VFS mapping routines).

This summer we will be looking at improving the efficiency of our transport infrastructure, adding failure detection, recovery, and/or fail-over (including network fail-over), and adding intelligent network transports (we work over Ethernet and shared memory right now, but we want to fully prototype Infiniband and related transports).

Some of the bits (like the v9fs Linux client and the u9fs user-space server) are already available open-source. We are in the process of jumping through the paperwork hoops to open-source the library-OS infrastructure and shared memory transport pieces.

There was a paper presented on v9fs at Freenix 2005. I'd be happy to answer any questions on this work. I was planning on presenting the approach to the Xen community once I had a more solid prototype and some performance numbers, but Ron has let the cat out of the bag. ;)

-eric
Harry Butterworth wrote:
> On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:

Harry, thanks for bringing this to xen-devel discussion.

> The inter-domain communication API should preserve the efficiency of
> these primitives but provide a higher level API which is more convenient
> to use.

We certainly need to simplify the API for the frontends - making it easier to add frontends for new devices and OSs. We also need to build in support for frontend-frontend communication in an efficient way.

> [...]
>
> Call idc_endpoint_destroy and quiesce before unloading module.

quiesce across remote nodes as well?

> The implementation of the local buffer references and memory management
> can hide the use of pages which are shared between domains and reference
> counted to provide a zero copy implementation of bulk data transfer and
> shared page-caches.
>
> I implemented something very similar to this before for a cluster
> interconnect and it worked very nicely. There are some subtleties to
> get right about the remote buffer reference implementation and the
> implications for out-of-order and idempotent bulk data transfers.

All the above looked very sane. How does stuff get out of order, though? We have effectively per-device queues.

> As I said, it would require a few more days' work to nail down a good
> API.

thanks,
Nivedita
I skimmed what I could find immediately on 9P: http://www.cs.bell-labs.com/sys/man/5/INDEX.html and came to the conclusion that what I proposed is lower level and more general (it's also simpler but, given that it's not doing the same thing, that isn't really significant).

You could implement 9P on top of my IDC API proposal---and this might very well be a valid thing to do given that some sort of higher level protocol is required---but, with what I proposed, you could also do things like implement a stack of software I/O virtualization functions and send the bulk data between the top of the stack and the bottom without going via all the components in the middle. My 10 mins of 9P experience indicates that 9P doesn't have anything like that. Or does it---any pointers to good 9P documentation?

Harry
On 5/6/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:
> I skimmed what I could find immediately on 9P:
> http://www.cs.bell-labs.com/sys/man/5/INDEX.html and came to the
> conclusion that what I proposed is lower level and more general (it's
> also simpler but, given that it's not doing the same thing, that isn't
> really significant).

There isn't a great deal of detail on the specifics of the protocol beyond the man pages. I think you'll find the following Plan 9 foundational papers more illustrative of the type of things that it can be used to do:

http://plan9.bell-labs.com/sys/doc/9.pdf
http://plan9.bell-labs.com/sys/doc/names.pdf

and in particular:

http://plan9.bell-labs.com/sys/doc/net/net.pdf

For the security minded:

http://plan9.bell-labs.com/sys/doc/auth.pdf

I believe Ron's original statements were motivated by your earlier statement:

> The event-channel and shared memory page are fine as low-level
> primitives to implement a comms channel between domains on the same
> physical machine. The problem is that the primitives are unnecessarily
> low-level from the client's perspective and result in too much
> per-client code.

This is exactly the sort of thing that the Plan 9 networking model was designed to do. The idea being that the Plan 9 model provides a nice abstract layer with which to communicate AND to organize (the organization is an important feature) the resulting communication channels and endpoints (no matter what the underlying transport is or where the particular endpoints may be). The Plan 9 networking model can be run over named pipes, shared memory, PCI buses, Ethernet, Infiniband, or whatever other flavor of network, with relatively minor requirements on the underlying transport layer.

> You could implement 9P on top of my IDC API proposal

You could implement 9P on top of such an IDC API; however, it seems like it would add unnecessary overhead. 9P can be easily implemented directly on top of the existing event/shared-page mechanisms and then trivially bridged to the network at the I/O partition(s). As far as multi-level protocol stacks and circumventing data paths, I'd hope we could come up with a simpler, more elegant solution. Looking over your earlier proposal it seems like an awful lot of complexity to accomplish a relatively simple task. Perhaps the refined API will demonstrate that simplicity better? I'd be really interested in the security details of your model as well when you finish the more detailed proposal.

> The inter-domain communication API should preserve the efficiency of
> these primitives but provide a higher level API which is more convenient
> to use.

I think this is a great mission statement -- convenience and simplicity are the key to selling such an API. I'd suggest we step through a complete example with actual applications and actual devices that would demonstrate the problem that we are trying to solve. Perhaps Ron and I can pull together an alternate proposal based on such a concrete example.

-eric
On Fri, 2005-05-06 at 19:19 -0500, Eric Van Hensbergen wrote:
> This is exactly the sort of thing that the Plan 9 networking model was
> designed to do. The idea being that the Plan 9 model provides a nice
> abstract layer with which to communicate AND to organize (the
> organization is an important feature)

Yes, the organisation is important; I expect we can learn a lot from 9P here. I'm not trying to address the organisation aspect yet though (my API proposal is lower level), so please let's postpone that aspect of the discussion.

> Looking over your earlier proposal it seems like an awful lot of
> complexity to accomplish a relatively simple task. Perhaps the
> refined API will demonstrate that simplicity better? I'd be really
> interested in the security details of your model as well when you
> finish the more detailed proposal.

I'd need help from security experts but, as an initial stab, if the idc_address and remote_buffer_references are capabilities then I think the security falls out in the wash, since it's impossible to access something unless you have been granted permission.

> I'd suggest we step through a complete example with actual
> applications and actual devices that would demonstrate the problem
> that we are trying to solve. Perhaps Ron and I can pull together an
> alternate proposal based on such a concrete example.

OK, here goes:

A significant difference between Plan 9 and Xen which is relevant to this discussion is that Plan 9 is designed to construct a single shared environment from multiple physical machines, whereas Xen is designed to partition a single physical machine into multiple isolated environments. Arguably, Xen clusters might also partition multiple physical machines into multiple isolated environments with some weird and wonderful cross-machine sharing and replication going on.

The significance of this difference is that in the Xen environment there are many interesting opportunities for optimisations across the virtual machines running on the same physical machine. These optimisations are not relevant to a native Plan 9 system and so (AFAICT with 20 mins experience :-) ) there is no provision for them in 9P.

Take, for example, the case where 1000 virtual machines are booting on the same physical machine from copy-on-write file-systems which share a common ancestry. With my proposal, when the virtual machines read a file into memory, the operation might proceed as follows:

The FE registers a local buffer to receive the file contents and is given a remote buffer reference.

The FE issues a transaction to the BE to read the file, passing the remote buffer reference.

The BE looks up the file using its metadata and happens to find the file contents in its buffer cache. The BE makes an idc_send_to_remote_buffer call, passing a local buffer reference for the file data in its local buffer cache and the remote buffer reference provided by the FE.

The IDC implementation resolves the remote buffer reference to a local buffer reference (since the FE buffer happens to reside on the same physical machine in this example) and makes a call to local_buffer_reference_copy to copy the file data from its buffer cache to the FE buffer.

The local_buffer_reference_copy implementation determines the type of copy based on the types of the source and destination local buffer references which, for the sake of argument, both happen to be of a type backed by reference-counted pages from compatible page pools.
The implementation of local_buffer_reference_copy for that specific combination of buffer types maps the BE pages into the FE address space, incrementing their reference counts, and also unmaps the old FE pages and decrements their reference counts, returning them to the free pool if necessary.

The other 999 virtual machines boot and do the same thing (since the file in question happens to be one which has not diverged), leaving all 1000 virtual machines sharing the same physical memory.

Had the virtual machines been booting on different physical machines then the path through the FE and BE client code would have been identical (so we have met the network transparency goal), but the IDC implementation would have taken an alternative path upon discovering that the remote_buffer_reference was genuinely remote.

Hopefully this explains better where I'm coming from.

Harry.
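A rough sketch of the two ends of that read path, in terms of the API from the earlier message (hypothetical throughout: idc.h, the marshalling helpers, buffer_cache_lookup() and the request layout are illustrative only):

/* Hypothetical sketch of the copy-on-write boot example using the
 * proposed IDC API; all helper names here are illustrative. */

#include "idc.h"

static struct callback read_done_callback;  /* assumed completion object */

/* Frontend: ask the backend to read a file into a locally owned buffer. */
static void fe_read_file(struct idc_endpoint *be, struct idc_transaction *txn,
                         struct local_buffer_reference file_buffer,
                         const char *path)
{
    struct remote_buffer_reference ref;

    /* Register the destination buffer and get a reference that the
     * backend (or any other domain) can use. */
    ref = idc_register_buffer(file_buffer /*, resource */);

    /* Put the path and the remote buffer reference into the transaction
     * parameters (marshalling helper assumed) and submit it. */
    marshal_read_request(txn->transaction_parameters, path, ref);
    idc_endpoint_submit_transaction(be, txn);
}

/* Backend: satisfy the read straight out of its buffer cache. */
static void be_handle_read(struct idc_endpoint *fe,
                           struct idc_transaction *txn)
{
    struct remote_buffer_reference dest;
    const char *path;
    struct local_buffer_reference cached;

    unmarshal_read_request(txn->transaction_parameters, &path, &dest);

    /* The file happens to be resident in the backend's buffer cache. */
    cached = buffer_cache_lookup(path);

    /* If both domains are on the same machine, the IDC layer resolves
     * 'dest' locally and can share the reference-counted cache pages
     * instead of copying; for a remote FE it would go over the wire.
     * The transaction is completed from read_done_callback. */
    idc_send_to_remote_buffer(dest, cached,
                              &read_done_callback /*, resource */);
}

The point of the sketch is that neither function knows or cares whether the page sharing happens; that decision lives entirely inside the local_buffer_reference_copy path described above.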
On 5/7/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:
> I'd need help from security experts but, as an initial stab, if the
> idc_address and remote_buffer_references are capabilities then I think
> the security falls out in the wash since it's impossible to access
> something unless you have been granted permission.

That seems straightforward and clear to me on the local host, but it seems like there might be additional concerns when bridging to the cluster. Of course, those security issues may be embedded in the underlying network transport layer, so maybe it's not as much of a concern.

> A significant difference between Plan 9 and Xen which is relevant to
> this discussion is that Plan 9 is designed to construct a single shared
> environment from multiple physical machines whereas Xen is designed to
> partition a single physical machine into multiple isolated environments.
> Arguably, Xen clusters might also partition multiple physical machines
> into multiple isolated environments with some weird and wonderful
> cross-machine sharing and replication going on.

Yes and no. Plan 9 does provide a coherent mechanism to unify access to the resources of an entire cluster of physical machines -- but it also provides a lot of facilities for partitioning and organizing those resources into private name spaces. It's both of these aspects that Ron and I would look to see leveraged in any sort of future Xen I/O architecture. But this gets more into organizational features, which may be a separate topic.

> The significance of this difference is that in the Xen environment,
> there are many interesting opportunities for optimisations across the
> virtual machines running on the same physical machine. These
> optimisations are not relevant to a native Plan 9 system and so (AFAICT
> with 20 mins experience :-) ) there is no provision for them in 9P.

This is true. The example you step through sounds like an implementation of DSM targeted at a buffer cache. In the past, 9P has not been used to provide such a level of transparent sharing of underlying memory. However, an area that Orran and I have been talking about is exploring the addition of scatter/gather type semantics to the 9P protocol implementations (there's really not that much that has to change in the specification, just some differences in the way the protocol looks on the underlying transport). In other words, there is nothing to prevent read/write calls from having pointers to the page containing the data versus having a copy of the data. This page containing the data could be a copy, or it could be shared as in your example. The cool thing is, since 9P is already set up to be a network protocol, if it did end up having to leave shared memory and go over an external wire, there's already a fairly rich organizational infrastructure in place (with built-in support for security/encryption/digesting).
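As a purely illustrative sketch of that scatter/gather idea (this is not the 9P wire format; the structure and field names are made up for the example), a write-style request could carry a reference to the data page rather than an inline copy of the data:

/* Illustrative only: a Twrite-like request whose payload is a reference
 * to a (possibly shared) page rather than an inline copy of the data. */

struct shared_page_ref {
    unsigned long frame;          /* page holding the data                */
    unsigned int  offset;         /* start of the payload within the page */
    unsigned int  length;         /* payload length                       */
};

struct sg_write_request {
    unsigned short     tag;       /* matches the asynchronous response    */
    unsigned int       fid;       /* which open file/resource to write    */
    unsigned long long file_offset;
    struct shared_page_ref data;  /* out-of-line payload: the transport
                                     may map it or copy it as appropriate */
};

Whether the transport maps the referenced page or copies it would be a property of the transport implementation, not visible in the request itself.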
> Had the virtual machines been booting on different physical machines
> then the path through the FE and BE client code would have been
> identical (so we have met the network transparency goal) but the IDC
> implementation would have taken an alternative path upon discovering
> that the remote_buffer_reference was genuinely remote.

The scenario you walk through sounds really great, and providing mechanisms to manage and recognize shared buffers on the same machine sounds like absolutely the right thing to do. I'm just arguing for a simpler client interface (and I think something akin to the 9P API together with a nice Channel abstraction to manage endpoints would be the right way to go about it). Okay Ron, you got me into this, what are your thoughts?

> Hopefully this explains better where I'm coming from.

This paints a clearer picture of what you were going after, and I think we are on similar tracks. I need to get more engaged in looking at the Xen stuff so I have a better context for some of the problems particular to its environment.

-eric
On Sat, 7 May 2005, Harry Butterworth wrote:
> A significant difference between Plan 9 and Xen which is relevant to
> this discussion is that Plan 9 is designed to construct a single shared
> environment from multiple physical machines whereas Xen is designed to
> partition a single physical machine into multiple isolated environments.

This is not in my view germane. We're proposing using 9p as the low-level interdomain protocol for xen. 9p stands alone from plan 9. Both Eric and I have felt for some time that this would work well, and I'm pretty strongly convinced that it would work better than what we have now for the various devices.

Your 1000 machine example is very interesting, however. I do know that some folks who worked on the IBM hypervisor had a lot of concerns about the page flipping done in the xen network FE/BE drivers, and I wonder if their concerns would apply to your proposed implementation -- I wish we could find a way to hear from them.

thanks

ron
On 7 May 2005, at 17:15, Ronald G. Minnich wrote:
> this is not in my view germane. We're proposing using 9p as the low level
> interdomain protocol for xen. 9p stands alone from plan 9. Both Eric and I
> have felt for some time that this would work well, and I'm pretty strongly
> convinced that it would work better than what we have now for the various
> devices.

Isn't 9p a file transfer protocol, or is that just one aspect of it?

At the very lowest level it doesn't appear so different from what we have now (multiple outstanding request messages, each with a different tag that is echoed in the asynchronous response message); but the next level up seems to be defined in terms of walking a hierarchical file system, obtaining handles on files, and so on. To me that doesn't sound obviously applicable to our inter-driver communications.

-- Keir
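The shared idea at that lowest level is just tag-matched asynchronous request/response framing; a minimal illustrative sketch (not the actual 9P wire encoding, and not the Xen ring layout either) looks something like this:

/* Illustrative framing only: both the existing Xen rings and 9P match an
 * asynchronous response to its request by echoing a small tag. */

#include <stdint.h>

#define MAX_OUTSTANDING 16

struct request  { uint16_t tag; /* ... operation-specific fields ... */ };
struct response { uint16_t tag; /* ... operation-specific fields ... */ };

struct pending {
    struct request req;
    int            in_flight;
    void         (*on_reply)(struct response *rsp);
};

static struct pending table[MAX_OUTSTANDING];

/* Issue a request: the slot index doubles as the tag. */
static int issue(struct request *req, void (*on_reply)(struct response *))
{
    for (int tag = 0; tag < MAX_OUTSTANDING; tag++) {
        if (!table[tag].in_flight) {
            req->tag             = (uint16_t)tag;
            table[tag].req       = *req;
            table[tag].in_flight = 1;
            table[tag].on_reply  = on_reply;
            /* ... put the request on the ring / transport here ... */
            return 0;
        }
    }
    return -1; /* no free tag: too many outstanding requests */
}

/* Handle an asynchronous response by matching its echoed tag. */
static void complete(struct response *rsp)
{
    if (rsp->tag >= MAX_OUTSTANDING)
        return;
    struct pending *p = &table[rsp->tag];
    if (p->in_flight) {
        p->in_flight = 0;
        p->on_reply(rsp);
    }
}

The difference Keir points to is everything layered above this: 9P then defines attach/walk/open/read/write operations over a name hierarchy, whereas the current Xen drivers define ad hoc per-device request formats.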
On Sat, 2005-05-07 at 10:15 -0600, Ronald G. Minnich wrote:
> this is not in my view germane. We're proposing using 9p as the low level
> interdomain protocol for xen. 9p stands alone from plan 9. Both Eric and I
> have felt for some time that this would work well, and I'm pretty strongly
> convinced that it would work better than what we have now for the various
> devices.

I agree it would work well, definitely better than the current code, but I don't think it's right for Xen because it precludes the kind of optimisations that are relevant to Xen but not Plan 9.

Eric mentioned adding scatter/gather type semantics to the 9P protocol, which would get you half-way to what I was proposing and start to look similar to the remote buffer reference concept in that you pass around references to the bulk data instead of the data itself, but you still wouldn't get the ability to manipulate the meta-data of a request in a stack of I/O applications and then do a direct transfer of the bulk data from the source to the target. Of course, if you modify the 9P protocol enough then you can make it a superset of what I was proposing, but I think my API proposal captured the essence of what was required at the low level without complicating the description with the higher-level organisation aspects of the protocol.

> Your 1000 machine example is very interesting, however. I do know that
> some folks who worked on the IBM hypervisor had a lot of concerns about
> the page flipping done in the xen network FE/BE drivers, and I wonder if
> their concerns would apply to your proposed implementation -- I wish we
> could find a way to hear from them.

Rusty and I exchanged some mail about the page flipping and I think Rusty demonstrated that their concerns were unfounded. Perhaps he can comment. In any case, my proposal decouples the IDC API clients from the actual memory management implementation, so different page-flipping behaviours could be evaluated without requiring code changes in the clients.

Harry.
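A hypothetical sketch of the "stack of I/O applications" point, using the earlier proposal's types (idc.h, the block_read layout and the cow_* and marshal helpers are all illustrative): a copy-on-write layer sitting between a frontend and a storage backend can rewrite the request metadata while simply forwarding the frontend's remote buffer reference, so the bulk data never passes through the middle layer.

/* Hypothetical: a snapshot/COW layer between a frontend and a storage
 * backend.  Only metadata is touched here; the buffer reference is
 * forwarded untouched, so the bottom backend transfers the data
 * directly into the frontend's pages (or over the wire if remote). */

#include "idc.h"

struct block_read {
    unsigned long long sector;            /* virtual sector requested     */
    unsigned int       count;
    struct remote_buffer_reference data;  /* where the FE wants the data  */
};

/* Middle layer: remap the virtual sector onto the parent or the delta
 * disk and pass the FE's buffer reference straight down the stack. */
static void cow_handle_read(struct idc_endpoint *lower,
                            struct idc_transaction *lower_txn,
                            struct block_read *req)
{
    struct block_read forwarded = *req;

    /* Only the metadata changes here (assumed mapping lookup)... */
    forwarded.sector = cow_map_sector(req->sector);

    /* ...while the remote buffer reference is forwarded as-is, which is
     * what the "usable from any domain" property above is for. */
    marshal_block_read(lower_txn->transaction_parameters, &forwarded);
    idc_endpoint_submit_transaction(lower, lower_txn);
}

With a message protocol that always carries the data inline, every layer in such a stack would have to receive and resend the payload itself.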
On 5/7/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:> > Isn't 9p a file transfer protocol, or is that just one aspect of it? >It's an easy misconception; 9P is many things - at its core it is the API to all resources (system, device, application, files) in the Plan 9 paradigm - on the local system the system calls translate directly to function calls within the kernel - never being marshaled into a message form. It is also a protocol for remote resource sharing (since 9P is used for more than just files, the network encapsulation of 9P can be used to share all the different resources of the system).> > At the very lowest level it doesn't appear so different from what we have > now (multiple outstanding request messages, each with a different tag > that is echoed in the asynchronous response message); but the next > level up seems to be defined in terms of walking a hierarchical file > system, obtaining handles on files, and so on. To me that doesn't sound > obviously applicable to our inter-driver communications. >The dynamic, stackable, hierarchical organization of 9P resources within Plan 9 may be one of the most important things for virtualization environments to leverage. It creates an easily understood and manipulated metaphor for organizing system, partition, or cluster resources. Within a device implementation, it can create a hierarchy of interfaces exposing core resources as well as extensions that not all clients can take advantage of. One thing which might be useful would be to look at the Plan 9 man page sections on devices: (http://plan9.bell-labs.com/sys/man/3/INDEX.html) and file servers: (http://plan9.bell-labs.com/sys/man/4/INDEX.html) I hate talking about this stuff before we have some sort of prototype within the Xen environment to demonstrate; a lot of the power of this methodology is best experienced rather than pontificated about. Regardless, we will be working towards this over the next few months and it should tie in nicely with our v9fs driver in Linux (which will also likely be ported to K42 and naturally already exists within Plan 9). There has been quite a bit of interest from certain sects within the BSD community in implementing something similar to v9fs, so perhaps we'll have something there by summer's end as well. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
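A small illustration of the "every device is a file server" point: a device's control and data interfaces are just files under its directory, so generic open/read/write is the whole client-side device API. The paths and command string below are invented for the example.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Drive a hypothetical device purely through its file interface:
     * write a command to its ctl file, then read from its data file. */
    int poke_device(char *buf, int buflen)
    {
        int ctl = open("/dev/mydevice/ctl", O_WRONLY);
        if (ctl < 0)
            return -1;
        write(ctl, "reset", strlen("reset"));  /* control ops are just writes */
        close(ctl);

        int data = open("/dev/mydevice/data", O_RDONLY);
        if (data < 0)
            return -1;
        int n = read(data, buf, buflen);       /* data transfer is just a read */
        close(data);
        return n;
    }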
On 5/7/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:> > I agree it would work well, definitely better than the current code, but > I don''t think it''s right for Xen because it precludes the kind of > optimisations that are relevant to Xen but not Plan 9. >I really don''t see how they preclude any optimizations in Xen, particularly with some sort of scatter/gather mechanism in place - the 9P API would be unchanged, just the link-layer would be a bit different (so its not even a protocol change).> > but you still wouldn''t get the ability to manipulate the meta-data of a > request in a stack of I/O applications and then do a direct transfer of the > bulk data from the source to the target. >Not clear why this is the case. It seems like you''d be able to do this sort of thing in any sort of protocol where the data is distinct from the rest of the operation encapsulation. In fact, there are many different forms of synthetic "filter file systems" within Plan 9 that take advantage of just this property (the fact that the bulk data is handled separately from the control data).> > In any case, my proposal decouples the IDC API clients from the actual > memory management implementation so different page-flipping behaviours > could be evaluated without requiring code changes in the clients. >Again, this sounds like a good thing. A nice abstract interface which is valid on all platforms for communicating on byte streams between domains (not just to the controller, but between peer domains as well) is exactly the sort of transport we would want to build 9P on, but it really seemed (from your initial proposal) that you were constructing interfaces for much more than that. Its also not clear to me whether having extensions built into the IDC protocol for doing network transactions would be the right thing to do, it sounds like it would add a lot of complexity to an otherwise simple core mechanism -- and as I stated previously, I think simplicity is the ultimate goal of such an interface. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
If you could go through with 9p the same concrete example I did I think I''d find that useful. Also, I should probably spend another 10 mins on the docs ;-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/7/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:> If you could go through with 9p the same concrete example I did I think > I''d find that useful. Also, I should probably spend another 10 mins on > the docs ;-) >It would be way better if we could step you through a demo in the environment. The documentation for Plan 9 has always been spartan at best -- so getting a good idea of how things work without looking over someones shoulder who is experienced has always been somewhat difficult. I''ve been trying to correct this while pushing some of the bits into Linux - my Freenix v9fs paper spent quite a bit of time talking about how 9P actually works and showing traces for typical file system operations. The OLS paper which I''m supposed to be writing right now covers the same sort of thing for the dynamic private name spaces. The reason why I didn''t post a counter-example was because I didn''t see much difference between our two ideas for the scenario you lay out (except you obviously understand a lot more about some of the details that I haven''t gotten into yet): (from the looks of things, you had already established an authenticated connection between the FE and BE and the top-level had already been established and whatever file system you are using to read file data had already traversed to the file and opened it. A quick summary of how we do the above with 9P: a) establish the connection to the BE (there are various semantics possible here, in my current stuff I pass a reference to the Channel around. Alternatively you could use socket like semantics to connect to the other partition) b) Issue a protocol negotiation packet to establish buffer sizes, protocol version, etc. (t_version) c) Attach to the remote resource, providing authentication information (if necessary) t_attach -- this will also create an initial piece of shared meta-data referencing the root resource of the BE (for devices there may only be a single resource, or one resource such as a block device may have multiple nodes such as partitions, in Plan 9 devices also present different aspects of their interface (ctl, data, stat, etc.) as different nodes in a hierarchical fashion. The reference to different nodes in the tree is called a FID (think of it as a file descriptor) - it contains information about who has attached to the resource and where in the hierarchy they are. A key thing to remember is that in Plan 9, every device is a file server. d) You would then traverse the object tree to the resource you wanted to use (in your case it sounded like a file in a file system, so the metaphor is straightforward). The client would issue a t_walk message to perform this traversal. e) The object would then need to be opened (t_open) with information about what type of operation will be executed (read, write, both) and can include additional information about the type of transactions (append only, exclusive access, etc.) that may be beneficial to managing the underlying resource. The BE could use information cached in the Fid from the attach to check the FE''s permission to be accessing this resource with that mode. In our world, this would result in you holding a Fid pointing to the open object. The Fid is a pointer to meta-data and is considered state on both the FE and the BE. (this has downsides in terms of reliability and the ability to recover sessions or fail over to different BE''s -- one of our summer students will be addressing the reliability problem this summer). 
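To make the shared-Fid state concrete, the per-fid record kept by the BE might look roughly like this (field names are illustrative; the real Plan 9 and v9fs structures differ in detail). The FE holds a matching record, which is why losing either side complicates recovery and fail-over.

    #include <stdint.h>

    /* Qid: the server's stable identity for a node in its hierarchy. */
    struct qid {
        uint8_t  type;      /* plain file, directory, append-only, ... */
        uint32_t version;
        uint64_t path;      /* unique id within this server */
    };

    /* One per fid on the BE: which node the client has attached to or
     * walked to, how it was opened, and any server-private state such as
     * buffer cache hooks. */
    struct fid {
        uint32_t   fid;     /* client-chosen handle from t_attach/t_walk */
        struct qid qid;     /* node in the hierarchy it refers to */
        int        omode;   /* mode from t_open, or -1 if not open */
        uint64_t   offset;  /* implicit position used by read/write */
        void      *aux;     /* server-private per-fid state */
    };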
The FE performs a read operation passing it the necessary bits: ret = read( fd, *buf, count ); This actually would be translated (in collaboration with local meta-data into a t_read mesage) t_read tag fid offset count (where offset is determined by local fid metadata) The BE receives the read request, and based on state information kept in the Fid (basically your metadata), it finds the file contents in the buffer cache. It sends a response packet with a pointer to its local buffer cache entry: r_read tag count *data There are a couple ways we could go when the FE receives the response: a) it could memcopy the data to the user buffer *buf . This is the way things currently work, and isn''t very efficient -- but may be the way to go for the ultra-paranoid who don''t like sharing memory references between partitions. b) We could have registered the memory pointed to by *buf and passed that reference along the path -- but then it probably would just amount to the BE doing the copy rather than the front end. Perhaps this approximates what you were talking about doing? c) As long as the buffers in question (both *buf and the buffer cache entry) were page-aligned, etc. -- we could play clever VM games marking the page as shared RO between the two partitions and alias the virtual memory pointed to by *buf to the shared page. This is very sketchy and high level and I need to delve into all sorts of details -- but the idea would be to use virtual memory as your friend for these sort of shared read-only buffer caches. It would also require careful allocation of buffers of the right size on the right alignment -- but driver writers are used to that sort of thing. To do this sort of thing, we''d need to do the exact same sort of accounting you describe:>The implementation of local_buffer_reference_copy for that specific >combination of buffer types maps the BE pages into the FE address space >incrementing their reference counts and also unmaps the old FE pages and >decrements their reference counts, returning them to the free pool if >necessary.When the FE was done with the BE, it would close the resources (issuing t_clunk on any fids associated with the BE). The above looks complicated, but to a FE writer would be as simple as: channel = dial("net!BE"); /* establish connection */ /* in my current code, channel is passed as an argument to the FE as a boot arg */ root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */ fd = open(root, "/some/path/file", OREAD); ret = read(fd, *buf, sizeof(buf)); close(fd); close(root); close(channel); If you want to get fancy, you could get rid of the root arg to open and use a private name space (after fsmount): bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */ then it would be: fd=open("/mnt/be/some/path/file", OREAD); There''s also all sorts of cool stuff you can do on the domain controller to provision child partitions using dynamic name space and then just exporting the custom fashioned environment using 9P -- but that''s higher level organization stuff again. There''s all sorts of cool tricks you can play with 9P (similar to the stuff that the FUSE and FiST user-space file system packages provide) like copy-on-write file systems, COW block devices, muxed ttys, etc. etc. The reality is that I''m not sure I''d actually want to use a BE to implement a file system, but its quite reasonable to implement a buffer cache that way. 
In all likelihood this would result in the FE opening up a connection (and a single object) on the BE buffer cache, then using different offsets to grab specific blocks from the BE buffer cache using the t_read operation. I''ve described it in terms of a file system, using your example as a basis, but the same sort of thing would be true for a block device or a network connection (with some slightly different semantic rules on the network connection). The main point is to keep things simple for the FE and BE writers, and deal with all the accounting and magic you describe within the infrastructure (no small task). Another difference would involve what would happen if you did have to bridge a cluster network - the 9P network encapsulation is well defined, all you would need to do (at the I/O partition bridging the network) is marshall the data according to the existing protocol spec. For more intelligent networks using RDMA and such things, you could keep the scatter/gather style semantics and send pointers into the RDMA space for buffer references. As I said before, there''s lots of similarities in what we are talking about, I''m just gluing a slightly more abstract interface on top, which has some benefits in some additional organizational and security mechanisms (and a well-established (but not widely used yet) network protocol encapsulation). There are plenty of details I know I''m glossing over, and I''m sure I''ll need lots of help getting things right. I''d have preferred staying quiet until I had my act together a little more, but Orran and Ron convinced me that it was important to let people know the direction I''m planning on exploring. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
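For the buffer-cache usage sketched at the start of this message, the FE would keep one open object and issue t_read at block-sized offsets. In terms of the convenience interface shown earlier, that is just a positional read per block (sketch only; BLKSIZE and the pread-style helper are assumptions):

    #include <unistd.h>

    enum { BLKSIZE = 4096 };

    /* Fetch block 'blkno' from the BE buffer cache over an already-open
     * descriptor: each call becomes
     *     t_read tag fid offset=blkno*BLKSIZE count=BLKSIZE
     * and no per-block walk/open is needed. */
    long read_block(int fd, unsigned long blkno, void *buf)
    {
        return pread(fd, buf, BLKSIZE, (long long)blkno * BLKSIZE);
    }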
Hi Eric, Your thoughts on 9P are all really interesting -- I'd come across the protocol years ago in looking into approaches to remote device/fs access but had a hard time finding details. It's quite interesting to hear a bit more about the approach taken. Having a more accessible inter-domain comms API is clearly a good thing, and extending device channels (in our terminology -- shared memory + event notification) to work across a cluster is something that we've talked about on several occasions at the lab. I do think, though, that as mentioned above there are some concerns with the VMM environment that make this a little trickier. For the general case of inefficient comms between VMs, using the regular IP stack may be okay for many people. The net drivers are being fixed up to special-case local communications. For the more specific cases of FE/BE comms, I think the devil may be in the details more than the current discussion is alluding to. Specifically:> c) As long as the buffers in question (both *buf and the buffer cache > entry) were page-aligned, etc. -- we could play clever VM games > marking the page as shared RO between the two partitions and alias the > virtual memory pointed to by *buf to the shared page. This is very > sketchy and high level and I need to delve into all sorts of details > -- but the idea would be to use virtual memory as your friend for > these sort of shared read-only buffer caches. It would also require > careful allocation of buffers of the right size on the right alignment > -- but driver writers are used to that sort of thing. Most of the good performance that Xen gets off of block and net split devices is specifically because of these clever VM games. Block FEs pass page references down to be mapped directly for DMA. Net devices pass pages into a free pool, and actually exchange physical pages under the feet of the VM as inbound packets are demultiplexed. The grant tables that have recently been added provide separate mechanisms for the mapping and ownership transfer of pages across domains. In addition to these tricks, we make careful use of the timing of event notification in order to batch messages. In the case of the buffer cache that has come up several times in the thread, a cache across domains would potentially need to pass read-only page mappings as CoW in many situations, and a fault handler somewhere would need to bring in a new page to the guest on a write. There are also a pile of complicating cases with regard to cache eviction from a BE domain, migration, and so on that make the accounting really tricky. I think it would be quite good to have a discussion of generalized interdomain comms address the current drivers, as well as a hypothetical buffer cache, as potential cases. Does 9P already have hooks that would allow you to handle this sort of per-application special case? Additionally, I think we get away with a lot in the current drivers from a failure model that excludes transport. The FE or BE can crash, and the two drivers can be written defensively to handle that. How does 9P handle the strangenesses of real distribution? Anyhow, very interesting discussion... looking forward to your thoughts. a. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I quite like some of this. A few more comments below. On Sat, 2005-05-07 at 19:57 -0500, Eric Van Hensbergen wrote:> > In our world, this would result in you holding a Fid pointing to the > open object. The Fid is a pointer to meta-data and is considered > state on both the FE and the BE. (this has downsides in terms of > reliability and the ability to recover sessions or fail over to > different BE''s -- one of our summer students will be addressing the > reliability problem this summer).OK, so this is an area of concern for me. I used the last version of the sketchy API I outlined to create an HA cluster infrastructure. So I had to solve these kind of protocol issues and, whilst it was actually pretty easy starting from scratch, retrofitting a solution to an existing protocol might be challenging, even for a summer student.> > The FE performs a read operation passing it the necessary bits: > ret = read( fd, *buf, count );Here the API is coupling the client to the memory management implementation by assuming that the buffer is mapped into the client''s virtual address space. This is probably likely to be true most of the time so an API at this level will be useful but I''d also like to be able to write I/O applications that manage the data in buffers that are never mapped into the application address space. Also, I''d like to be able to write applications that have clients which use different types of buffers without having to code for each case in my application. This is why my API deals in terms of local_buffer_references which, for the sake of argument, might look like this: struct local_buffer_reference { local_buffer_reference_type type; local_buffer_reference_base base; buffer_reference_offset offset; buffer_reference_length length; }; A local buffer reference of type virtual_address would have a base value equal to the buf pointer above, an offset of zero and a length of count. A local buffer reference for the hidden buffer pages would have a different type, say hidden_buffer_page, the base would be a pointer to a vector of page indices, the offset would be the offset of the start of the buffer into the first page and the length would be the length of the buffer. So, my application can deal with buffers described like that without having to worry about the flavour of memory management backing them. Also, I can change the memory management without changing all the calls to the API, I only have to change where I get buffers from. BTW, this specific abstraction I learnt about from an embedded OS architected by Nik Shalor. He might have got it from somewhere else.> > This actually would be translated (in collaboration with local > meta-data into a t_read mesage) > t_read tag fid offset count (where offset is determined by local fid metadata) > > The BE receives the read request, and based on state information kept > in the Fid (basically your metadata), it finds the file contents in > the buffer cache. It sends a response packet with a pointer to its > local buffer cache entry: > > r_read tag count *data > > There are a couple ways we could go when the FE receives the response: > a) it could memcopy the data to the user buffer *buf . This is the > way things currently work, and isn''t very efficient -- but may be > the way to go for the ultra-paranoid who don''t like sharing memory > references between partitions. 
> > b) We could have registered the memory pointed to by *buf and passed > that reference along the path -- but then it probably would just > amount to the BE doing the copy rather than the front end. Perhaps > this approximates what you were talking about doing?No, ''c'' is closer to what I was sketching out, except that I was proposing a general mechanism that had one code path in the clients even though the underlying implementation could be a, b or c or any other memory management strategy determined at run-time according to the type of the local buffer references involved.> > c) As long as the buffers in question (both *buf and the buffer cache > entry) were page-aligned, etc. -- we could play clever VM games > marking the page as shared RO between the two partitions and alias the > virtual memory pointed to by *buf to the shared page. This is very > sketchy and high level and I need to delve into all sorts of details > -- but the idea would be to use virtual memory as your friend for > these sort of shared read-only buffer caches. It would also require > careful allocation of buffers of the right size on the right alignment > -- but driver writers are used to that sort of thing.Yes, it also requires buffers of the right size and alignment to be used at the receiving end of any network transfers and for the alignment to be preserved across the network even if the transfer starts at a non-zero page offset. You might think that once the data goes over the network you don''t care but it might be received by an application that wants to share pages with another application so, in fact, it does matter. This is just something you have to get right if you want any kind of page referencing technique to work although you can fall back to memcopy for misaligned data if necessary.> The above looks complicated, but to a FE writer would be as simple as: > channel = dial("net!BE"); /* establish connection */ > /* in my current code, channel is passed as an argument to the FE as a > boot arg */ > root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */ > fd = open(root, "/some/path/file", OREAD); > ret = read(fd, *buf, sizeof(buf)); > close(fd); > close(root); > close(channel);So, this is obviously a blocking API. My API was non-blocking because the network latency means that you need a lot of concurrency for high throughput and you don''t necessarily want so many threads. Like AIO. Having a blocking API as well is convenient though.> > If you want to get fancy, you could get rid of the root arg to open > and use a private name space (after fsmount): > bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */ > then it would be: > fd=open("/mnt/be/some/path/file", OREAD); > > There''s also all sorts of cool stuff you can do on the domain > controller to provision child partitions using dynamic name space and > then just exporting the custom fashioned environment using 9P -- but > that''s higher level organization stuff again. There''s all sorts of > cool tricks you can play with 9P (similar to the stuff that the FUSE > and FiST user-space file system packages provide) like copy-on-write > file systems, COW block devices, muxed ttys, etc. etc.I''m definitely going to study the organisational aspects of 9P.> I''ve described it in terms of a file system, using your example as a > basis, but the same sort of thing would be true for a block device or > a network connection (with some slightly different semantic rules on > the network connection). 
The main point is to keep things simple for > the FE and BE writers, and deal with all the accounting and magic you > describe within the infrastructure (no small task).If you use the buffer abstraction I described then you can start with a very simple mm implementation and improve it without having to change the clients.> > Another difference would involve what would happen if you did have to > bridge a cluster network - the 9P network encapsulation is well > defined, all you would need to do (at the I/O partition bridging the > network) is marshall the data according to the existing protocol spec. > For more intelligent networks using RDMA and such things, you could > keep the scatter/gather style semantics and send pointers into the > RDMA space for buffer references.I don''t see this as a difference. My API was explicitly compatible with a networked implementation. One of the thoughts that did occur to me was that a reliance on in-order message delivery (which 9p has) turns out to be quite painful to satisfy when combined with multi-pathing and failover because a link failure results in the remaining links being full of messages that can''t be delivered until the lost messages (which happened to include the message that must be delivered first) have been resent. This is more problematical than you might think because all communications stall for the time taken to _detect_ the lost link and all the remaining links are full so you can''t do the recovery without cleaning up another link. This is not insurmountable but it''s not obviously the best solution either, particularly since, with SMP machines, you aren''t going to want to serialise the concurrent activity so there needn''t be any fundamental in-order requirements at the protocol level.> > As I said before, there''s lots of similarities in what we are talking > about, I''m just gluing a slightly more abstract interface on top, > which has some benefits in some additional organizational and security > mechanisms (and a well-established (but not widely used yet) network > protocol encapsulation).Maybe something like my API underneath and something like the organisational stuff of 9p on top would be good. I''d have to see how the 9p organisational stuff stacks up against the publish and subscribe mechanisms for resource discovery that I''m used to. Also, the infrastructure requirements for building fault-tolerant systems are quite demanding and I''d have to be confident that 9p was up to the task before I''d be happy with it personally.> > There are plenty of details I know I''m glossing over, and I''m sure > I''ll need lots of help getting things right. I''d have preferred > staying quiet until I had my act together a little more, but Orran and > Ron convinced me that it was important to let people know the > direction I''m planning on exploring.Yes, definitely worthwhile. I''d like to see more discussion like this on the xen-devel list. On the one hand, it''s kind of embarrassing to discuss vaporware and half finished ideas but on the other, the opportunity for public comment at an early stage in the process is probably going to save a lot of effort in the long run. -- Harry Butterworth <harry@hebutterworth.freeserve.co.uk> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
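For concreteness, the two local_buffer_reference flavours described earlier in this message could be constructed as below. This is a sketch against the struct as given; the concrete field types, the enum values and the page-vector representation are assumptions.

    #include <stdint.h>

    typedef enum {
        virtual_address,     /* base is a pointer in the caller's address space */
        hidden_buffer_page   /* base points at a vector of page indices */
    } local_buffer_reference_type;

    struct local_buffer_reference {
        local_buffer_reference_type type;
        void    *base;
        uint32_t offset;   /* offset of the data into the first page/byte */
        uint32_t length;
    };

    /* Ordinary mapped memory: base = buf, offset = 0, length = count. */
    struct local_buffer_reference from_virtual(void *buf, uint32_t count)
    {
        struct local_buffer_reference r = { virtual_address, buf, 0, count };
        return r;
    }

    /* Pages never mapped into the application: base is the page vector,
     * offset is where the data starts within the first page. */
    struct local_buffer_reference from_pages(uint32_t *page_vector,
                                             uint32_t offset, uint32_t length)
    {
        struct local_buffer_reference r =
            { hidden_buffer_page, page_vector, offset, length };
        return r;
    }

An I/O application written against this abstraction handles both flavours identically; only the buffer provider knows which kind it handed out.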
On 5/8/05, Andrew Warfield <andrew.warfield@gmail.com> wrote:> Hi Eric, > > Your thoughts on 9P are all really interesting -- I''d come across > the protocol years ago in looking into approaches to remote device/fs > access but had a hard time finding details. It''s quite interesting to > hear a bit more about the approach taken. >There''s a bigger picture that Ron and I have discussed, but there are lots of details that need to be resolved/proven before its worth talking about. The overall goal we are working towards is a simple interface with equivalent or better performance. The generality of the approach we have discussed is appealing, but we are well aware it won''t be adopted unless we can show competitive performance and reliability.> > For the more specific cases of FE/BE comms, I think the devil may > be in the details more than the current discussion is alluding to. >I agree completely, in fact there are likely several devils unique to each target architecture ;) But that''s just the sort of thing that keeps system programming interesting. These sorts of things will only be worked out once we have a prototype which to drive into the various brick walls.> > There are also a pile of complicating cases with regards cache > eviction from a BE domain, migration, and so on that make the > accounting really tricky. I think it would be quite good to have a > discussion of generalized interdomain comms address the current > drivers, as well as a hypothetical buffer cache as potential cases. > Does 9P already have hooks that would allow you to handle this sort of > per-application special case? >Page table and cache manipulation would likely sit below the 9P layer to keep things portable and abstract (as I''ve said earlier, perhaps Harry''s IDC proposal is the right layer to handle such things). 9P as a protocol is quite flexible, but existing implementations are somewhat simplistic and limited in regard to more advanced buffer/page sharing capabilities. We are exploring some of these issues in our exploration of using Plan 9 and 9P in HPC cluster environments (using cluster interconnects) -- in fact, the idea of using 9P between VMM partitions fell out of that exploration.> > Additionally, I think we get away with a lot in the current drivers > from a falure model that excludes transport. The FE or BE can crash, > and the two drivers can be written defensively to handle that. How > does 9P handle the strangenesses of real distribution? >With existing 9P implementations, defensive drivers are the way to go. There have been three previous solutions proposed for failure detection and recovery with 9P. One handled recovery of 9P state between client and server, the other handled reliability of the underlying RPC transport, and the third was a layered file server providing defensive semantics. As I said earlier, we''ll be exploring these more fully (along with looking at fail over) this summer -- specifically in the context of VMMs. 9P isn''t a magic bullet here, and there will be lots of issues that will need to be dealt with either in the layer above it, or below it. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/8/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:> > > > In our world, this would result in you holding a Fid pointing to the > > open object. The Fid is a pointer to meta-data and is considered > > state on both the FE and the BE. (this has downsides in terms of > > reliability and the ability to recover sessions or fail over to > > different BE's -- one of our summer students will be addressing the > > reliability problem this summer). > > OK, so this is an area of concern for me. I used the last version of > the sketchy API I outlined to create an HA cluster infrastructure. So I > had to solve these kind of protocol issues and, whilst it was actually > pretty easy starting from scratch, retrofitting a solution to an > existing protocol might be challenging, even for a summer student. >There are three previous attempts at providing this sort of facility in 9P that the student is going to be basing his work off of. All three worked to varying degrees of effectiveness - but there's no magic bullet here and clients and file servers need to be written defensively to be able to cope with such disruptions in a graceful manner. It's quite likely there will be different semantics for failure recovery depending on the resource.> > > > The FE performs a read operation passing it the necessary bits: > > ret = read( fd, *buf, count ); > > Here the API is coupling the client to the memory management > implementation by assuming that the buffer is mapped into the client's > virtual address space. > > This is probably likely to be true most of the time so an API at this > level will be useful but I'd also like to be able to write I/O > applications that manage the data in buffers that are never mapped into > the application address space. >Well, this was the context of the example (the FE was registering a buffer from its own address space). The existing Plan 9 API doesn't have a good example of how to handle the more abstract buffer handles you describe, but I don't think there's anything in the protocol which would prevent such a utilization. I need to think about this scenario a bit more; could you give an example of how you would use this feature?> Also, I'd like to be able to write applications that have clients which > use different types of buffers without having to code for each case in > my application. >The attempt at portability is admirable, but it just seems to add complexity -- if I want to use the reference, I'll have to make another function call to resolve the buffer. I guess I'm being too narrow-minded, but I just don't have a clear idea of the utility of hidden buffers. I never know who I am supposed to be hiding information from. ;)> > So, my application can deal with buffers described like that without > having to worry about the flavour of memory management backing them. >This is important. In my example I was working on the pretext that the client initiating the read was consuming the data in some way. When that's not the case, the interface is quite different, more like that of our file servers. In those cases, I can easily see passing the data by some more opaque reference (before I had figured scatter/gather buffers would be sufficient -- but perhaps your more abstract representation buys extra flexibility). 
I still hate the idea of having to resolve the abstract_buffer to get at the data, but perhaps that's the cost of efficiency -- I'll have to think about it some more.> Also, I can change the memory management without changing all the calls > to the API, I only have to change where I get buffers from. Again - I agree that this is an important aspect. Perhaps this sort of functionality is best called out separately, with its own interfaces to provide and resolve buffer handles -- it might be worth breaking out on its own. It seems like there would be three types of operations on your proposed struct: abstract_ref = get_ref( *real_data, flags ); /* constructor */ real_data = resolve_ref( *abstract_ref, flags); forget_ref( abstract_ref ); /* destructor */ Lots of details under the hood there (as it should be). flags could help specify things like read-only, cow, etc. Is such an interface sufficient? If I'm being naive here just tell me to shut up and I won't talk about it until I've had the time to look a little deeper into things.> > BTW, this specific abstraction I learnt about from an embedded OS > architected by Nik Shalor. He might have got it from somewhere else. >Any specific paper references we should be looking at? Or is it obvious from a google?> > The above looks complicated, but to a FE writer would be as simple as: > > channel = dial("net!BE"); /* establish connection */ > > /* in my current code, channel is passed as an argument to the FE as a > > boot arg */ > > root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */ > > fd = open(root, "/some/path/file", OREAD); > > ret = read(fd, *buf, sizeof(buf)); > > close(fd); > > close(root); > > close(channel); > > So, this is obviously a blocking API. My API was non-blocking because > the network latency means that you need a lot of concurrency for high > throughput and you don't necessarily want so many threads. Like AIO. > Having a blocking API as well is convenient though. >Yeah, I am betrayed by the simplicity of the existing API. However, I just wanted to point out that there is nothing specifically synchronous in the protocol. I tend to like the simplicity of using threads to deal with asynchronous behaviors, but efficient threads are hard to come by. Async APIs just seem to complicate driver writers' lives, but if this is the preferred methodology such an API could be used with the 9P protocol.> > One of the thoughts that did occur to me was that a reliance on in-order > message delivery (which 9p has) turns out to be quite painful to satisfy >There are certainly issues to be resolved here, but in environments (such as VMM transports) that preserve frame boundaries on messages, the in-order 9P requirements can be relaxed a great deal.> > Yes, definitely worthwhile. I'd like to see more discussion like this > on the xen-devel list. On the one hand, it's kind of embarrassing to > discuss vaporware and half finished ideas but on the other, the > opportunity for public comment at an early stage in the process is > probably going to save a lot of effort in the long run. >I'll try (perhaps with Ron's help) to put together some sort of white paper on our vision. It'd be quite easy to pull together an organizational demonstration of what we are talking about, but working out the performance/reliability/security details will likely take some time. I do like the general idea of building on top of many of the underlying bits you describe. 
I''m not quite sure we''d use all the features (your endpoint definition seems a bit over-engineered for our paradigm), but there are certainly lots of good things to take advantage of. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
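Written out as prototypes, the three-operation interface sketched in the message above might look like this; the length parameter, the flag names and the opaque handle type are all invented for the sketch.

    #include <stddef.h>

    typedef struct abstract_buffer abstract_buffer;   /* opaque handle */

    enum ref_flags {
        REF_READONLY = 1 << 0,
        REF_COW      = 1 << 1,
    };

    /* Constructor: wrap real memory in a handle that can be passed through
     * the infrastructure without the holder touching the data. */
    abstract_buffer *get_ref(void *real_data, size_t len, unsigned flags);

    /* Resolve: expose the data to whoever finally needs to look at it;
     * this is where the infrastructure chooses copy vs map vs page-flip. */
    void *resolve_ref(abstract_buffer *ref, unsigned flags);

    /* Destructor: drop the reference so pages can be unshared or reused. */
    void forget_ref(abstract_buffer *ref);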
On Sun, 2005-05-08 at 11:18 -0500, Eric Van Hensbergen wrote:> > This is probably likely to be true most of the time so an API at this > > level will be useful but I''d also like to be able to write I/O > > applications that manage the data in buffers that are never mapped into > > the application address space. > > > > Well, this was the context of the example (the FE was registering a > buffer from its own address space). The existing Plan 9 API doesn''t > have a good example of how to handle the more abstract buffer handles > you describe, but I don''t think there''s anything in the protocol which > would prevent such a utilization. I need to think about this scenario > a bit more, could you give an example how how you would use this > feature?Read from disk, cache in buffer cache, sendfile to remote client.> > > Also, I''d like to be able to write applications that have clients which > > use different types of buffers without having to code for each case in > > my application. > > > > The attempt at portability is admirable, but it just seems to add > complexity -- if I want to use the reference, I''ll have to make > another functional call to resolve the buffer. I guess I''m being too > narrow minded, but I just don''t have a clear idea of the utility of > hidden buffers. I never know who I am supposed to be hiding > information from. ;)Yourself: it''s harder for a bug to scribble on the data if it''s not mapped into the address space. Also, it eliminates the overhead of page table manipulations which aren''t needed if you don''t want to look at the data.> > Also, I can change the memory management without changing all the calls > > to the API, I only have to change where I get buffers from. > > Again - I agree that this is an important aspect. Perhaps this sort > of functionality is best called out separately with its own interfaces > to provide and resolve buffer handles. It seems like perhaps this > might be worth breaking out into its own. It seems like there would > be three types of operations on your proposed struct: > abstract_ref = get_ref( *real_data, flags ); /* constructor */ > real_data = resolve_ref( *abstract_ref, flags); > forget_ref( abstract_ref ); /* destructor */ > Lots of details under the hood there (as it should be). flags could > help specify things like read-only, cow, etc. Is such an interface > sufficient? If I''m being naive here just tell me to shut up and I''ll > won''t talk about it until I''ve had the time to look a little deeper > into things.You could have something like that or you could make a local_buffer_reference for some memory in the virtual address space and then use the local_buffer_reference_copy function which would get the data into your address space as efficiently as it could. It works out as pretty much the same thing. Both cases require allocation of the virtual address space somehow, if this is an operation that could fail (more physical memory than available address space for example) then that must be expressed in the API somehow; having to allocate the memory up-front is a relatively clean way of doing this. This method also allows you to copy between a local_buffer_reference and the stack and warms the CPU cache up just before you look at the data. Which approach you choose probably depends on the context. I guess you might want both.> > > > > BTW, this specific abstraction I learnt about from an embedded OS > > architected by Nik Shalor. He might have got it from somewhere else. 
> > > > Any specific paper references we should be looking at? Or is it obvious > from a google? Here's a reference for historical interest, but be aware that I'm proposing something different. In particular, these DDRs confuse the function of local and remote buffer references. Page numbered 16 of the following doc, or page 32 according to xpdf. http://www-900.ibm.com/cn/support/library/storage/download/Advanced%20SerialRAID%20Plus%20Adapter%20Technical%20Reference.pdf Harry. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
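As a usage sketch of the alternative Harry describes above - wrap your own memory in a virtual_address reference and let the copy routine choose the cheapest path - with the types repeated from the earlier sketch and local_buffer_reference_copy assumed to be provided by the IDC layer:

    #include <stdint.h>

    typedef enum { virtual_address, hidden_buffer_page } local_buffer_reference_type;

    struct local_buffer_reference {
        local_buffer_reference_type type;
        void    *base;
        uint32_t offset;
        uint32_t length;
    };

    /* Provided by the IDC layer: copy between two references, picking
     * memcpy, page remapping or a specialised method for the type pair. */
    int local_buffer_reference_copy(struct local_buffer_reference dst,
                                    struct local_buffer_reference src);

    /* Consume data held behind any kind of reference without caring what
     * memory backs it. */
    int consume(struct local_buffer_reference src)
    {
        uint8_t buf[512];
        if (src.length > sizeof buf)
            return -1;
        struct local_buffer_reference dst = { virtual_address, buf, 0, src.length };
        int rc = local_buffer_reference_copy(dst, src);
        /* On success buf holds the data and is already warm in the cache. */
        return rc;
    }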
Andrew Warfield wrote:> Hi Eric, > > Your thoughts on 9P are all really interesting -- I''d come across > the protocol years ago in looking into approaches to remote device/fs > access but had a hard time finding details. It''s quite interesting to > hear a bit more about the approach taken. > > Having a more accessible inter-domain comms API is clearly a good > thing, and extending device channels (in our terminology -- shared > memory + event notification) to work across a cluster is something > that we''ve talked about on several occasions at the lab. > > I do think though, that as mentioned above there are some concerns > with the VMM environment that make this a little trickier. For the > general case of inefficient comms between VMs, using the regular IP > stack may be okay for many people. The net drivers are being fixed up > to special-case local communications. > > For the more specific cases of FE/BE comms, I think the devil may > be in the details more than the current discussion is alluding to. > Specifically: > > >>c) As long as the buffers in question (both *buf and the buffer cache >>entry) were page-aligned, etc. -- we could play clever VM games >>marking the page as shared RO between the two partitions and alias the >>virtual memory pointed to by *buf to the shared page. This is very >>sketchy and high level and I need to delve into all sorts of details >>-- but the idea would be to use virtual memory as your friend for >>these sort of shared read-only buffer caches. It would also require >>careful allocation of buffers of the right size on the right alignment >>-- but driver writers are used to that sort of thing. > > > Most of the good performance that Xen gets off of block and net > split devices are specifically because of these clever VM games. > Block FEs pass page references down to be mapped directly for DMA. > Net devices pass pages into a free pool, and actually exchange > physical pages under the feet of the VM as inbound packets are > demultiplexed. The grant tables that have recently been added provide > separate mechanisms for the mapping and ownership transfer of pages > across domains. In addition to these tricks, we make careful use of > timing event notification in order to batch messages.It should be possible to still use the page mapping in the i/o transport. The issue right now is that the i/o interface is very low-level and intimately tangled up with the structs being transported. And with the domain control channel there''s an implicit assumption that ''there can be only one''. This means for example, that domain A using a device with backend in domain B can''t connect directly to domain B, but has to be ''introduced'' by xend. It''d be better if it could connect directly. Something like what Harry proposes should still be able to use page mapping for efficient local comms, but without _requiring_ it. This opens the way for alternative transports, such as network. Rather than going straight for something very high-level, I''d prefer to build up gradually, starting with a more general message transport api that includes analogues to listen/connect/recv/send.> > In the case of the buffer cache that has come up several times in > the thread, a cache across domains would potentially neet to pass read > only page mappings as CoW in many situations, and a fault handler > somewhere would need to bring in a new page to the guest on a write. 
> There are also a pile of complicating cases with regards cache > eviction from a BE domain, migration, and so on that make the > accounting really tricky. I think it would be quite good to have a > discussion of generalized interdomain comms address the current > drivers, as well as a hypothetical buffer cache as potential cases. > Does 9P already have hooks that would allow you to handle this sort of > per-application special case? > > Additionally, I think we get away with a lot in the current drivers > from a falure model that excludes transport. The FE or BE can crash, > and the two drivers can be written defensively to handle that. How > does 9P handle the strangenesses of real distribution? > > Anyhow, very interesting discussion... looking forward to your thoughts. > > a. >Mike _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> It should be possible to still use the page mapping in the i/o transport. > The issue right now is that the i/o interface is very low-level and > intimately tangled up with the structs being transported.I don''t doubt that it is possible. The point I was making is that the current i/o interfaces are low level for a reason, and that generalizing this to a higher-level communications primitive is a non-trivial thing. Just considering the disk and net interfaces, the current device channels each make particular decisions regarding (a) what to copy and what to map, (b) when to send notification to get efficient batching through the scheduler, and most recently (c) which grant mechanism to use to pass pages securely across domains. Having a higher-level API to make all this easier, and especially to reduce the code/complexity required to build new drivers etc is something that will be fantastic to have. I think though that at least some of these underlying issues will need to be exposed for it to be useful. I''m not convinced that reimplementing the sockets API for interdomain communication is a very good solution... The buffer_reference struct that Harry mentioned looks quite interesting as a start though in terms of describing a variety of underlying transports. Do you have a paper reference on that work, Harry? With regards forwarding device channels across a network, I think we can expect application-level involvement for shifting device messages across a cluster. If this is down the road, and it''s certainly something that has been discussed, a device channel is potentially two local shared memory device channels between VMs on local hosts, and a network connection between the physical hosts. Beyond the more complicated error cases that this obviously involves, we can then make this as arbitrarily more complex by discussing HA or security concerns... for the moment though, I think it would be interesting to see how well the existing local host cases can be generalized. ;)> And with the domain control channel there''s an implicit assumption > that ''there can be only one''. This means for example, that domain A > using a device with backend in domain B can''t connect directly to domain B, > but has to be ''introduced'' by xend. It''d be better if it could connect > directly.This is not a flaw with the current implementation -- it''s completely intentional. By forcing control through xend we ensure that there is a single point for control logic, and for managing state. Why do you feel it would be better to provide arbitrary point-to-point comms in a VMM environment that is specifically trying to provide isolation between guests?> Something like what Harry proposes should still be able to use > page mapping for efficient local comms, but without _requiring_ > it. This opens the way for alternative transports, such as network. > > Rather than going straight for something very high-level, I''d prefer > to build up gradually, starting with a more general message transport > api that includes analogues to listen/connect/recv/send.As I said, I''m unconvinced that trying to mimic the sockets API is the right way to go -- I think the communicating parties often want to see and work with batches of messages without having to do extra copies or have event notification made implicit. I think you are completely right about a gradual approach though -- having a generalized host-local device channel would be very interesting to see... 
especially if it could be shown to apply to the existing block, net, usb, and control channels in a simplifying fashion. a. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andrew Warfield wrote:>>It should be possible to still use the page mapping in the i/o transport. >>The issue right now is that the i/o interface is very low-level and >>intimately tangled up with the structs being transported. > > I don''t doubt that it is possible. The point I was making is that the > current i/o interfaces are low level for a reason, and that > generalizing this to a higher-level communications primitive is a > non-trivial thing. Just considering the disk and net interfaces, the > current device channels each make particular decisions regarding (a) > what to copy and what to map, (b) when to send notification to get > efficient batching through the scheduler, and most recently (c) which > grant mechanism to use to pass pages securely across domains.It should be relatively easy to provide these kinds of facilities in a higher-level api.> Having a higher-level API to make all this easier, and especially to > reduce the code/complexity required to build new drivers etc is > something that will be fantastic to have. I think though that at > least some of these underlying issues will need to be exposed for it > to be useful. I''m not convinced that reimplementing the sockets API > for interdomain communication is a very good solution...I wasn''t suggesting exactly the sockets api, but something more like the connect/send and listen/recv logic. Harry''s API is quite like that, with additional higher-level facilities.> The > buffer_reference struct that Harry mentioned looks quite interesting > as a start though in terms of describing a variety of underlying > transports. Do you have a paper reference on that work, Harry? > > With regards forwarding device channels across a network, I think we > can expect application-level involvement for shifting device messages > across a cluster. If this is down the road, and it''s certainly > something that has been discussed, a device channel is potentially two > local shared memory device channels between VMs on local hosts, and a > network connection between the physical hosts. Beyond the more > complicated error cases that this obviously involves, we can then make > this as arbitrarily more complex by discussing HA or security > concerns... for the moment though, I think it would be interesting to > see how well the existing local host cases can be generalized. ;) > > >>And with the domain control channel there''s an implicit assumption >>that ''there can be only one''. This means for example, that domain A >>using a device with backend in domain B can''t connect directly to domain B, >>but has to be ''introduced'' by xend. It''d be better if it could connect >>directly. > > This is not a flaw with the current implementation -- it''s completely > intentional. By forcing control through xend we ensure that there is > a single point for control logic, and for managing state. Why do you > feel it would be better to provide arbitrary point-to-point comms in a > VMM environment that is specifically trying to provide isolation > between guests?OK, so it''s an intentional flaw ;-). One reason is that front-end drivers have to connect to their backends. If they can find out who to connect to and then do it, it simplifies things. Especially when that info is available from a store or registry service as proposed for 3.0. At the moment xend has to exchange messages with the domain to get the device front-end handle and shared page address, and then exchange messages with the back-end so it can create the device and map the page. 
Telling the front-end which back-end to connect to would be much simpler.>>Something like what Harry proposes should still be able to use >>page mapping for efficient local comms, but without _requiring_ >>it. This opens the way for alternative transports, such as network. >> >>Rather than going straight for something very high-level, I'd prefer >>to build up gradually, starting with a more general message transport >>api that includes analogues to listen/connect/recv/send. > > > As I said, I'm unconvinced that trying to mimic the sockets API is the > right way to go -- I think the communicating parties often want to see > and work with batches of messages without having to do extra copies or > have event notification made implicit. Like I said, I wasn't suggesting _exactly_ the sockets api, more the spirit of it. There is an analogue of batching for sockets though: flush.> I think you are completely > right about a gradual approach though -- having a generalized > host-local device channel would be very interesting to see... > especially if it could be shown to apply to the existing block, net, > usb, and control channels in a simplifying fashion. >Just a small matter of programming then :-). Mike _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Tue, 2005-05-10 at 11:09 +0100, Andrew Warfield wrote:

> Just considering the disk and net interfaces, the
> current device channels each make particular decisions regarding (a)
> what to copy and what to map, (b) when to send notification to get
> efficient batching through the scheduler, and most recently (c) which
> grant mechanism to use to pass pages securely across domains.

(a) looks like a property of the local_buffer_reference type. You might
be able to think of a really good way of doing (b) automatically,
independent of any specific client, and (c) is either like (a) or it's a
strategy passed when registering a buffer to get a remote buffer
reference. I'm guessing here though---is there a list of all the
considerations that went into the design of the current behaviour? That
would help in mapping it to any new proposal.

> The
> buffer_reference struct that Harry mentioned looks quite interesting
> as a start though in terms of describing a variety of underlying
> transports. Do you have a paper reference on that work, Harry?

I can't give you a suitable reference, but there's not much more to the
concepts than what I've already written up on this list and I can chip
in with anything I've forgotten.

Here are a few more random details (a code sketch of the registration
calls follows after this message):

Buffer providers can register to get a local_buffer_reference type for
the buffers they will provide. As a minimum, they also have to register
methods to copy data between a buffer of that type and the stack. This
allows a core generic mechanism to implement any copy operation.

Buffer providers can register specialised methods to copy directly
between two specific local buffer types. These methods override the
generic mechanism.

As well as a copy operation which leaves the buffers identical, you
might want a more efficient move operation which leaves the source
undefined (or perhaps all zeroes or unmapped or something).

Some local_buffer_reference types might have wider APIs. For example, an
application might want to mess with and be frugal with individual pages,
in which case it might know that it allocates all its buffers from a
buffer provider that provides buffers of a specific type. Knowing the
type, it can down-cast the local buffer references it owns to get access
to the wider API to do more detailed work.

Reads can often complete when the data is received, without waiting for
separate status. I worked this optimisation into my implementation by
getting notification from a registered buffer when it had been filled.

My disconnects were cluster-wide and were coupled to the lifetime of
remote_buffer_references in the cluster. I didn't think this was
appropriate for Xen, so my sketchy API had per-connection disconnects
and didn't mention any coupling with the remote_buffer_reference
lifetime. This is relevant when it comes to defining what kind of
transport errors might happen and requires further thought.

> for the moment though, I think it would be interesting to
> see how well the existing local host cases can be generalized. ;)

Yep, sounds like a good way to start :-)

> > And with the domain control channel there's an implicit assumption
> > that 'there can be only one'. This means for example, that domain A
> > using a device with backend in domain B can't connect directly to
> > domain B, but has to be 'introduced' by xend. It'd be better if it
> > could connect directly.
>
> This is not a flaw with the current implementation -- it's completely
> intentional. By forcing control through xend we ensure that there is
> a single point for control logic, and for managing state. Why do you
> feel it would be better to provide arbitrary point-to-point comms in a
> VMM environment that is specifically trying to provide isolation
> between guests?

There are two different issues here:

1) Having one Xend might be correct at the moment. The _assumption in
the guests_ that there is only one Xend is technically a minor flaw. If,
for example, the guest got an idc_address for Xend, the guest would be
decoupled from the design choice of one or many Xends.

2) The Xend introductions. The weird aspect about these is that they are
a triangular protocol with phases initiated by xend in two directions,
potentially at the same time. I'm more used to control flow following
resource dependencies, for example: server tells third-party about
resource availability; third-party assigns resources to clients; clients
connect directly to server. This is also triangular, but if you put the
server at the top and bottom it becomes a stack ordered by dependencies.
I think stacks are much less confusing than triangles and have more
easily defined interlocks and less risk of introducing timing windows.
I think it ought to be possible to get the security right for a stack
solution as well as the triangle solution. Maybe a security expert could
provide some help.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
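A rough sketch, in C with invented names, of the buffer-provider
registration described above: providers register copy-to/from-stack
methods so a core mechanism can implement any copy generically, and may
register specialised direct copy or move methods between particular
buffer types which override the generic path. None of these types or
functions exist anywhere; they only illustrate the concepts.

/* Hypothetical sketch of the buffer abstraction described above.  The
   type and function names are invented for illustration only.            */

#include <stddef.h>

typedef unsigned int lbr_type_t;  /* identifies a local buffer type        */

/* A local_buffer_reference: which kind of buffer, plus where in it.       */
struct local_buffer_reference {
    lbr_type_t type;
    void      *handle;            /* provider-specific buffer handle       */
    size_t     offset;
    size_t     length;
};

/* Minimum methods a buffer provider must register: copies between a
   buffer of its type and plain memory ("the stack"), which let a core
   generic mechanism implement any copy by bouncing through an
   intermediate area.                                                      */
struct buffer_provider_ops {
    int (*copy_to_stack)(const struct local_buffer_reference *src,
                         void *dst, size_t len);
    int (*copy_from_stack)(struct local_buffer_reference *dst,
                           const void *src, size_t len);
};

/* Register a provider; returns the new local_buffer_reference type.       */
lbr_type_t buffer_provider_register(const struct buffer_provider_ops *ops);

/* Optional: a specialised direct copy between two specific buffer types,
   overriding the generic stack-bounce mechanism.                          */
typedef int (*lbr_transfer_fn)(const struct local_buffer_reference *src,
                               struct local_buffer_reference *dst,
                               size_t len);
void buffer_register_copy(lbr_type_t src_type, lbr_type_t dst_type,
                          lbr_transfer_fn copy);

/* Optional: a move, which may leave the source undefined (or zeroed or
   unmapped) in exchange for being cheaper than a copy.                    */
void buffer_register_move(lbr_type_t src_type, lbr_type_t dst_type,
                          lbr_transfer_fn move);

/* Core entry point: uses a registered direct method if one exists for the
   (src, dst) type pair, otherwise falls back to the stack bounce.         */
int buffer_copy(const struct local_buffer_reference *src,
                struct local_buffer_reference *dst, size_t len);

The "wider API" point in the message corresponds to an owner that knows
the type field of a reference it holds and can therefore down-cast the
handle to a provider-specific structure.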
Hi Harry,

These two comments seem more about the immediate control tool plans than
about a generic comms mechanism.

> There are two different issues here:

These seem pretty concise as regards the new control tool plans:

> 1) Having one Xend might be correct at the moment. The _assumption in
> the guests_ that there is only one Xend is technically a minor flaw. If,
> for example, the guest got an idc_address for Xend, the guest would be
> decoupled from the design choice of one or many Xends.

I think we are all keen to move in the direction of having no xend but
rather a small set of single purpose daemons that handle specific tasks
and map well on to the emerging security primitives.

> 2) The Xend introductions. The weird aspect about these is that they
> are a triangular protocol with phases initiated by xend in two
> directions, potentially at the same time. I'm more used to control flow
> following resource dependencies, for example: server tells third-party
> about resource availability; third-party assigns resources to clients;
> clients connect directly to server. This is also triangular but if you
> put the server at the top and bottom it becomes a stack ordered by
> dependencies. I think stacks are much less confusing than triangles and
> have more easily defined interlocks and less risk of introducing timing
> windows. I think it ought to be possible to get the security right for
> a stack solution as well as the triangle solution. Maybe a security
> expert could provide some help.

The triangular connection setup in the current code is a little grim and
something that will change very soon. The two relevant bits of this are
the event channel and the shared page address on setting up a new device
channel. Keir added support for unbound event channels quite a while
ago, and the control tools/drivers just haven't evolved to take
advantage of them yet. The store will be used for additional connection
set-up state, such as shared page addresses.

Connection setup this way does map on to the connect/listen semantics
that Mike has been advocating. For example, on a request to add a new
frontend, the backend driver will create a simple state machine for the
new device channel and assign an unbound event channel to it. It will
then move out of this (unbound) "listening" state when the front end
connects to the event channel and sends the first notification. So we
still have something in the middle, but it's a much more simple and
generic thing.

a.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
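A sketch of the backend-side "listening" state machine described in the
last paragraph above. The helper evtchn_alloc_unbound_port() is a
placeholder for whatever the real unbound event channel and store
interfaces turn out to be; the structure and function names are
invented.

/* Illustrative only: the "listening" device channel state machine
   sketched above.  evtchn_alloc_unbound_port() is a placeholder, not a
   real call.                                                              */

typedef enum {
    DEVCHAN_LISTENING,   /* unbound event channel allocated, waiting       */
    DEVCHAN_CONNECTED,   /* frontend bound the channel, first event seen   */
    DEVCHAN_CLOSED
} devchan_state_t;

struct devchan {
    devchan_state_t state;
    int             port;      /* local end of the event channel           */
    unsigned int    frontend;  /* domain id of the peer                    */
};

/* Placeholder: allocate an event channel port that the given remote
   domain is allowed to bind to later.  Returns the port or a negative
   error.                                                                  */
extern int evtchn_alloc_unbound_port(unsigned int remote_domid);

/* Backend: called on a request to add a device for a new frontend.        */
static int devchan_listen(struct devchan *dc, unsigned int frontend_domid)
{
    dc->frontend = frontend_domid;
    dc->port     = evtchn_alloc_unbound_port(frontend_domid);
    if (dc->port < 0)
        return dc->port;
    dc->state = DEVCHAN_LISTENING;
    /* The port (and later the shared page reference) would be advertised
       to the frontend through the store rather than via xend messages.    */
    return 0;
}

/* Backend: event channel notification handler.                            */
static void devchan_event(struct devchan *dc)
{
    if (dc->state == DEVCHAN_LISTENING) {
        /* First notification from the frontend ends the listening state.  */
        dc->state = DEVCHAN_CONNECTED;
        return;
    }
    /* When connected, process ring requests here.                         */
}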
On Tue, 2005-05-10 at 16:26 +0100, Andrew Warfield wrote:

> I think we are all keen to move in the direction of having no xend but
> rather a small set of single purpose daemons that handle specific
> tasks and map well on to the emerging security primitives.

Yep.

> Connection setup this way does map on to the connect/listen semantics
> that Mike has been advocating. For example, on a request to add a new
> frontend, the backend driver will create a simple state machine for
> the new device channel and assign an unbound event channel to it. It
> will then move out of this (unbound) "listening" state when the front
> end connects to the event channel and sends the first notification.

The above description happens to fit inside my endpoint_create call too.

I think the significant constraints in this area are that the choice of
connect/listen semantics must be compatible with the introduction
mechanism and the security requirements.

With my API I assumed a symmetrical connection process where both ends
created an endpoint for the same address and the IDC implementation did
the work of binding them together. I didn't give any thought to the
implications for the introduction mechanism or the security requirements
or consider any other options so I'm not particularly attached to this
aspect of my proposal. I was primarily trying to communicate the buffer
abstractions.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
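To make the symmetry concrete, a sketch of the endpoint_create style of
setup mentioned above, where both domains make the same call for the
same address and the IDC layer does the binding. All names and fields
here are invented; the actual proposal on this list may differ.

/* Hypothetical sketch of a symmetric connection process: both ends call
   idc_endpoint_create() for the same address and the IDC implementation
   binds the two endpoints together.  Nothing here is an existing API.    */

struct idc_address {
    unsigned int peer_domid;   /* the other domain                         */
    unsigned int service;      /* agreed service/interface number          */
};

struct idc_endpoint;           /* opaque, owned by the IDC layer           */

struct idc_callbacks {
    void (*connected)(struct idc_endpoint *ep);     /* both ends present   */
    void (*disconnected)(struct idc_endpoint *ep);  /* peer went away      */
    void (*message)(struct idc_endpoint *ep, const void *msg,
                    unsigned int len);
};

/* Called identically in the frontend and the backend.  The IDC layer is
   responsible for the introduction: binding the event channel and sharing
   whatever memory the transport needs.                                    */
struct idc_endpoint *idc_endpoint_create(const struct idc_address *addr,
                                         const struct idc_callbacks *cb);

void idc_endpoint_destroy(struct idc_endpoint *ep);

Whether this symmetric creation can be made to fit the store-based
introduction mechanism and the security requirements is exactly the open
question raised in the message above.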
Aggarwal, Vikas \(OFT\)
2005-May-12 17:47 UTC
RE: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c
Hi, How can I get my virtual-trace-device into the SXP? Inside the SrvDomainDir.py:SrvDomainDir::op_create the "configstring" has devices for vbd/vif. I did a print on configstring. Since I did not create a file based in blkctl.py for my virtual-trace-device so (blkctl.py)script = xroot.get_block_script() never happens for my trace-device. Hence (XendRoot.py) get_config_value->sxp.child_value will never happen. Please clarify my confusions in area of SXP initializations. I hacked around the above piece of code after I saw (channel.py)(" requestReceived > No device ") in xend-debug.log for my device type. I am still trying to figure out time-out in my FE driver, by looking it messages simultaneously I think "No device" is the culprit for eventual timeout in FE driver. -vikas -------------------------------------------------------- This e-mail, including any attachments, may be confidential, privileged or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system. -----Original Message----- From: Harry Butterworth [mailto:harry@hebutterworth.freeserve.co.uk] Sent: Thursday, May 05, 2005 4:37 PM To: Aggarwal, Vikas (OFT) Cc: xen-devel@lists.xensource.com Subject: Re: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c To answer your questions: I think (75% confidence) create_devices is used to create devices when a frontend domain is started whereas device_create is used to create an extra single device on the fly when the frontend domain is already running. Your frontend is probably timing out waiting to hear about interfaces from xend. To establish interface connections between the frontend and backend with the current xend code you need to implement the following basic flow: When xend starts, it creates a controller factory for your device class. When a frontend domain is started, the controller factory is used to create a controller for the frontend domain. Configured devices for your device class are attached to the controller for the frontend domain. Attaching the device creates the device object which in turn does a lookup of the "backend interface" object. The backend interface object represents the communication channel between the frontend domain and the backend domain. If the backend interface object does not exist it is created at this point. When a backend interface is created, it finds the backend controller object for its backend domain. If the backend controller object does not exist it is created at this point. Some of the drivers assume that the backend driver is already up when the backend controller object is created and so do not listen for the driver status up message from the backend. I''m implementing a driver as loadable modules so I need to use this message. In my case, the backend driver sends a driver status up message to xend which is handled by the backend controller object. The backend controller object notifies all its backend interface objects that the backend driver is up. When the frontend driver is initialised, it sends a driver status up message to xend. This is handled by the controller object. The controller object notifies all its backend interface objects that the frontend driver is up. So, a backend interface object gets a driver up/down notification at each end. 
When a backend interface finds that the drivers at both its ends are up, it can squence the establishment of the connection: The backend interface object sends a "be create" message to the backend to create the object representing the interface in the backend. The backend interface object sends a "fe interface status disconnected" message to the frontend to create the object representing the interface in the frontend. The frontend allocates a page for the ring interface and sends a "fe interface connect" message to xend which is forwarded by the controller object to the backend interface object. xend opens an event channel between the two domains and sends a "be connect" message to the backend passing the event channel and the address of the page for the ring interface. The backend responds to the be connect and the backend interface object sends an fe interface status connected message to the frontend, passing the event channel number. Unloading and reloading the frontend requires that the backend stops writing to the shared page of memory before the frontend frees it for reuse. In my case, the frontend sends a driver status down message, which is handled by the controller object for the frontend. This object notifies all the backend interface objects for that frontend which then send be disconnect messages and wait for a response before acknowledging back to the controller object. The controller object waits for all backend interfaces to complete before acknowledging the frontend driver status down. Unloading and reloading the backend requires that the connection is broken and reestablished: the backend stops using the shared pages before sending a driver status down message. The driver status down is handled by the backend controller which notifies all the connected backend interfaces which in turn send fe interface status disconnected messages to their respective frontend domains. The frontend domains respond with fe interface connect messages, trying to reestablish a connection. When the back-end driver is reloaded it sends a driver status up which is propagated to the backend interface objects which send be create messages again and follow up with be connect messages containing the new shared pages and fe interface status connected messages to complete the connection. This is just the establishment of the connection. There are also messages sent by the devices to create the devices in the backend after the backend interfaces send the be create messages to create the interfaces. Also, this description covers the internals of xend only. The Frontend and backend drivers must handle the messages, map the shared page, initialise the ring interface, allocate IRQs and bind to the event channel as appropriate. There are a number of people working to make this simpler, in particular Mike Wray is apparently rewriting xend and Anthony Liguori is working on an alternative set of tools. Also, the grant tables implementation probably means that the above description is out of date. I''ve not investigated grant tables much yet. I''ve not been following the checkins to unstable since the xen summit so I''ll have missed anything not mentioned on the mailing list or IRC. The current overhead in terms of client code to establish an entity on the xen inter-domain communication "bus" is currently of the order of 1000 statements (counting FE, BE and slice of xend). A better inter-domain communication API could reduce this to fewer than 10 statements. 
If it''s not done by the time I finish the USB work, I will hopefully be allowed to help with this. Harry. On Thu, 2005-05-05 at 11:18 -0400, Aggarwal, Vikas (OFT) wrote:> Hi, > > I have added a debug-Front-end.c & debug-Back-end.c. Also created my > debug.py based on blkif.py > > But my front-end still times out even after sending > ctrl_if_register_receiver() to domain controller during module_init. > > I think I still did not initialize my debug.py properly. I saw VBD > gets initialized in device_create or create_devices. Whats the > difference in the two? > > Where should I look to initialize the event channel in domain > controller? > > My virtual device in back-end collects the debug data and writes tofile> system so I don''t want it to be initialized during boot time viasome> config file. > > -vikas > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Harry Butterworth
2005-May-13 09:08 UTC
RE: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c
You need to modify tools/python/xm/create.py to get your config parameters into the config SXP. You''ll need to have a new controllerfactory for your device class as described in my previous note attached below. You need to create your controllerfactory in tools/python/xen/xend/server/SrvDaemon.py. You also need to modify python/xen/xend/XendDomainInfo.py to add a device handler for your device. The device handler is used to parse the device config in the config SXP and create the devices using your new controller factory. You need to list your messages in tools/python/xen/xend/server/messages.py. You need to do the C->python and python->c translation of your messages in tools/python/xen/lowlevel/xu/xu.c Harry. On Thu, 2005-05-12 at 13:47 -0400, Aggarwal, Vikas (OFT) wrote:> Hi, > How can I get my virtual-trace-device into the SXP? > Inside the SrvDomainDir.py:SrvDomainDir::op_create the "configstring" > has devices for vbd/vif. I did a print on configstring. > > Since I did not create a file based in blkctl.py for my > virtual-trace-device so (blkctl.py)script = xroot.get_block_script() > never happens for my trace-device. Hence (XendRoot.py) > get_config_value->sxp.child_value will never happen. > > Please clarify my confusions in area of SXP initializations. I hacked > around the above piece of code after I saw (channel.py)(" > requestReceived > No device ") in xend-debug.log for my device type. > > I am still trying to figure out time-out in my FE driver, by looking it > messages simultaneously I think "No device" is the culprit for eventual > timeout in FE driver. > > -vikas > > > > > -------------------------------------------------------- > This e-mail, including any attachments, may be confidential, privileged or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system. > > > -----Original Message----- > > From: Harry Butterworth [mailto:harry@hebutterworth.freeserve.co.uk] > Sent: Thursday, May 05, 2005 4:37 PM > To: Aggarwal, Vikas (OFT) > Cc: xen-devel@lists.xensource.com > Subject: Re: [Xen-devel] please help: initialize XEND for my > debug-FE/BE.c > > To answer your questions: > > I think (75% confidence) create_devices is used to create devices when a > frontend domain is started whereas device_create is used to create an > extra single device on the fly when the frontend domain is already > running. > > Your frontend is probably timing out waiting to hear about interfaces > from xend. To establish interface connections between the frontend and > backend with the current xend code you need to implement the following > basic flow: > > When xend starts, it creates a controller factory for your device class. > > When a frontend domain is started, the controller factory is used to > create a controller for the frontend domain. > > Configured devices for your device class are attached to the controller > for the frontend domain. > > Attaching the device creates the device object which in turn does a > lookup of the "backend interface" object. The backend interface object > represents the communication channel between the frontend domain and the > backend domain. If the backend interface object does not exist it is > created at this point. 
When a backend interface is created, it finds > the backend controller object for its backend domain. If the backend > controller object does not exist it is created at this point. > > Some of the drivers assume that the backend driver is already up when > the backend controller object is created and so do not listen for the > driver status up message from the backend. I''m implementing a driver as > loadable modules so I need to use this message. > > In my case, the backend driver sends a driver status up message to xend > which is handled by the backend controller object. The backend > controller object notifies all its backend interface objects that the > backend driver is up. > > When the frontend driver is initialised, it sends a driver status up > message to xend. This is handled by the controller object. The > controller object notifies all its backend interface objects that the > frontend driver is up. > > So, a backend interface object gets a driver up/down notification at > each end. When a backend interface finds that the drivers at both its > ends are up, it can squence the establishment of the connection: > > The backend interface object sends a "be create" message to the backend > to create the object representing the interface in the backend. > > The backend interface object sends a "fe interface status disconnected" > message to the frontend to create the object representing the interface > in the frontend. > > The frontend allocates a page for the ring interface and sends a "fe > interface connect" message to xend which is forwarded by the controller > object to the backend interface object. > > xend opens an event channel between the two domains and sends a "be > connect" message to the backend passing the event channel and the > address of the page for the ring interface. > > The backend responds to the be connect and the backend interface object > sends an fe interface status connected message to the frontend, passing > the event channel number. > > Unloading and reloading the frontend requires that the backend stops > writing to the shared page of memory before the frontend frees it for > reuse. In my case, the frontend sends a driver status down message, > which is handled by the controller object for the frontend. This object > notifies all the backend interface objects for that frontend which then > send be disconnect messages and wait for a response before acknowledging > back to the controller object. The controller object waits for all > backend interfaces to complete before acknowledging the frontend driver > status down. > > Unloading and reloading the backend requires that the connection is > broken and reestablished: the backend stops using the shared pages > before sending a driver status down message. The driver status down is > handled by the backend controller which notifies all the connected > backend interfaces which in turn send fe interface status disconnected > messages to their respective frontend domains. The frontend domains > respond with fe interface connect messages, trying to reestablish a > connection. When the back-end driver is reloaded it sends a driver > status up which is propagated to the backend interface objects which > send be create messages again and follow up with be connect messages > containing the new shared pages and fe interface status connected > messages to complete the connection. > > This is just the establishment of the connection. 
There are also > messages sent by the devices to create the devices in the backend after > the backend interfaces send the be create messages to create the > interfaces. > > Also, this description covers the internals of xend only. The Frontend > and backend drivers must handle the messages, map the shared page, > initialise the ring interface, allocate IRQs and bind to the event > channel as appropriate. > > There are a number of people working to make this simpler, in particular > Mike Wray is apparently rewriting xend and Anthony Liguori is working on > an alternative set of tools. > > Also, the grant tables implementation probably means that the above > description is out of date. I''ve not investigated grant tables much > yet. > > I''ve not been following the checkins to unstable since the xen summit so > I''ll have missed anything not mentioned on the mailing list or IRC. > > The current overhead in terms of client code to establish an entity on > the xen inter-domain communication "bus" is currently of the order of > 1000 statements (counting FE, BE and slice of xend). A better > inter-domain communication API could reduce this to fewer than 10 > statements. If it''s not done by the time I finish the USB work, I will > hopefully be allowed to help with this. > > Harry. > > On Thu, 2005-05-05 at 11:18 -0400, Aggarwal, Vikas (OFT) wrote: > > Hi, > > > > I have added a debug-Front-end.c & debug-Back-end.c. Also created my > > debug.py based on blkif.py > > > > But my front-end still times out even after sending > > ctrl_if_register_receiver() to domain controller during module_init. > > > > I think I still did not initialize my debug.py properly. I saw VBD > > gets initialized in device_create or create_devices. Whats the > > difference in the two? > > > > Where should I look to initialize the event channel in domain > > controller? > > > > My virtual device in back-end collects the debug data and writes to > file > > system so I don''t want it to be initialized during boot time via > some > > config file. > > > > -vikas > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
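To illustrate the message-definition side of the steps listed earlier in
this message (the structures that would be listed in messages.py and
translated between C and Python in xu.c), here is a hypothetical set of
payloads for a new "debug" device class, modelled loosely on the block
interface messages. None of these command codes or types exist in any
Xen header; they are purely illustrative.

/* Hypothetical control message payloads for a new "debug" device class.
   The command codes and fields are invented by analogy with the block
   interface messages; they do not exist in any Xen header.                */

typedef unsigned int u32;

/* Subtypes of a (hypothetical) CMSG_DEBUGIF_FE message type, frontend.    */
#define CMSG_DEBUGIF_FE_INTERFACE_STATUS_CHANGED  0
#define CMSG_DEBUGIF_FE_DRIVER_STATUS_CHANGED     1
#define CMSG_DEBUGIF_FE_INTERFACE_CONNECT         2

/* Subtypes of a (hypothetical) CMSG_DEBUGIF_BE message type, backend.     */
#define CMSG_DEBUGIF_BE_CREATE                    0
#define CMSG_DEBUGIF_BE_CONNECT                   1
#define CMSG_DEBUGIF_BE_DISCONNECT                2

typedef struct {
    u32 handle;       /* which debug interface                             */
    u32 status;       /* disconnected / connected                          */
    u32 evtchn;       /* event channel, valid once connected               */
} debugif_fe_interface_status_changed_t;

typedef struct {
    u32 handle;
    u32 shmem_frame;  /* frame of the ring page, passed on fe connect      */
} debugif_fe_interface_connect_t;

typedef struct {
    u32 domid;        /* frontend domain                                   */
    u32 handle;
    u32 status;       /* filled in by the backend response                 */
} debugif_be_create_t;

typedef struct {
    u32 domid;
    u32 handle;
    u32 shmem_frame;  /* shared ring page                                  */
    u32 evtchn;       /* event channel opened by xend                      */
    u32 status;
} debugif_be_connect_t;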