Aggarwal, Vikas (OFT)
2005-May-05 15:18 UTC
[Xen-devel] please help: initialize XEND for my debug-FE/BE.c
Hi,

I have added a debug-Front-end.c and a debug-Back-end.c, and created my debug.py based on blkif.py.

But my front-end still times out, even after calling ctrl_if_register_receiver() to register with the domain controller during module_init.

I think I still did not initialize my debug.py properly. I saw that VBD gets initialized in device_create or create_devices. What's the difference between the two?

Where should I look to initialize the event channel in the domain controller?

My virtual device in the back-end collects the debug data and writes it to the file system, so I don't want it to be initialized at boot time via some config file.

-vikas
Harry Butterworth
2005-May-05 20:37 UTC
Re: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c
To answer your questions: I think (75% confidence) create_devices is used to create devices when a frontend domain is started, whereas device_create is used to create an extra single device on the fly when the frontend domain is already running.

Your frontend is probably timing out waiting to hear about interfaces from xend. To establish interface connections between the frontend and backend with the current xend code you need to implement the following basic flow:

When xend starts, it creates a controller factory for your device class.

When a frontend domain is started, the controller factory is used to create a controller for the frontend domain. Configured devices for your device class are attached to the controller for the frontend domain.

Attaching the device creates the device object, which in turn does a lookup of the "backend interface" object. The backend interface object represents the communication channel between the frontend domain and the backend domain. If the backend interface object does not exist, it is created at this point.

When a backend interface is created, it finds the backend controller object for its backend domain. If the backend controller object does not exist, it is created at this point.

Some of the drivers assume that the backend driver is already up when the backend controller object is created and so do not listen for the driver status up message from the backend. I'm implementing a driver as loadable modules so I need to use this message. In my case, the backend driver sends a driver status up message to xend, which is handled by the backend controller object. The backend controller object notifies all its backend interface objects that the backend driver is up.

When the frontend driver is initialised, it sends a driver status up message to xend. This is handled by the controller object. The controller object notifies all its backend interface objects that the frontend driver is up.

So, a backend interface object gets a driver up/down notification at each end. When a backend interface finds that the drivers at both its ends are up, it can sequence the establishment of the connection:

The backend interface object sends a "be create" message to the backend to create the object representing the interface in the backend.

The backend interface object sends a "fe interface status disconnected" message to the frontend to create the object representing the interface in the frontend.

The frontend allocates a page for the ring interface and sends a "fe interface connect" message to xend, which is forwarded by the controller object to the backend interface object.

xend opens an event channel between the two domains and sends a "be connect" message to the backend, passing the event channel and the address of the page for the ring interface.

The backend responds to the be connect, and the backend interface object sends an "fe interface status connected" message to the frontend, passing the event channel number.

Unloading and reloading the frontend requires that the backend stops writing to the shared page of memory before the frontend frees it for reuse. In my case, the frontend sends a driver status down message, which is handled by the controller object for the frontend. This object notifies all the backend interface objects for that frontend, which then send be disconnect messages and wait for a response before acknowledging back to the controller object. The controller object waits for all backend interfaces to complete before acknowledging the frontend driver status down.
Unloading and reloading the backend requires that the connection is broken and reestablished: the backend stops using the shared pages before sending a driver status down message. The driver status down is handled by the backend controller, which notifies all the connected backend interfaces, which in turn send "fe interface status disconnected" messages to their respective frontend domains. The frontend domains respond with "fe interface connect" messages, trying to reestablish a connection. When the backend driver is reloaded it sends a driver status up, which is propagated to the backend interface objects, which send be create messages again and follow up with be connect messages containing the new shared pages and "fe interface status connected" messages to complete the connection.

This is just the establishment of the connection. There are also messages sent by the devices to create the devices in the backend after the backend interfaces send the be create messages to create the interfaces.

Also, this description covers the internals of xend only. The frontend and backend drivers must handle the messages, map the shared page, initialise the ring interface, allocate IRQs and bind to the event channel as appropriate.

There are a number of people working to make this simpler; in particular, Mike Wray is apparently rewriting xend and Anthony Liguori is working on an alternative set of tools. Also, the grant tables implementation probably means that the above description is out of date. I've not investigated grant tables much yet. I've not been following the checkins to unstable since the Xen summit, so I'll have missed anything not mentioned on the mailing list or IRC.

The current overhead in terms of client code to establish an entity on the xen inter-domain communication "bus" is currently of the order of 1000 statements (counting FE, BE and slice of xend). A better inter-domain communication API could reduce this to fewer than 10 statements. If it's not done by the time I finish the USB work, I will hopefully be allowed to help with this.

Harry.
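As a rough model of the frontend's side of the connect handshake described above (this is only a sketch: the message names, xend_send() helper and constants below are hypothetical stand-ins, not the real Xen control-interface API), the frontend essentially runs a small state machine:

/* Hypothetical model of the frontend connection handshake described
 * above.  None of these names are the real control-interface API. */

#include <stdio.h>

enum fe_msg {                         /* messages exchanged with xend     */
    FE_DRIVER_STATUS_UP,              /* frontend -> xend                 */
    FE_INTERFACE_STATUS_DISCONNECTED, /* xend -> frontend                 */
    FE_INTERFACE_CONNECT,             /* frontend -> xend (ring page)     */
    FE_INTERFACE_STATUS_CONNECTED     /* xend -> frontend (event channel) */
};

enum fe_state { FE_CLOSED, FE_WAITING, FE_CONNECTING, FE_CONNECTED };

struct fe_dev {
    enum fe_state state;
    unsigned long ring_page;          /* page shared with the backend     */
    int           evtchn;             /* event channel bound on connect   */
};

/* Stand-in for sending a control message to xend. */
static void xend_send(enum fe_msg msg, unsigned long arg)
{
    printf("-> xend: msg %d arg %lx\n", msg, arg);
}

/* Called from module_init: announce the driver and wait for xend. */
static void fe_driver_init(struct fe_dev *dev)
{
    dev->state = FE_WAITING;
    xend_send(FE_DRIVER_STATUS_UP, 0);
}

/* Called for each message received back from xend. */
static void fe_handle_msg(struct fe_dev *dev, enum fe_msg msg, unsigned long arg)
{
    switch (msg) {
    case FE_INTERFACE_STATUS_DISCONNECTED:
        /* xend has created the backend interface object: allocate the
         * ring page and ask for a connection. */
        dev->ring_page = 0x1000;      /* would be a real shared page      */
        dev->state     = FE_CONNECTING;
        xend_send(FE_INTERFACE_CONNECT, dev->ring_page);
        break;
    case FE_INTERFACE_STATUS_CONNECTED:
        /* xend has opened the event channel and the backend responded:
         * bind the event channel and start using the ring. */
        dev->evtchn = (int)arg;
        dev->state  = FE_CONNECTED;
        break;
    default:
        break;
    }
}

int main(void)
{
    struct fe_dev dev = { FE_CLOSED, 0, 0 };
    fe_driver_init(&dev);
    fe_handle_msg(&dev, FE_INTERFACE_STATUS_DISCONNECTED, 0);
    fe_handle_msg(&dev, FE_INTERFACE_STATUS_CONNECTED, 7); /* evtchn 7 */
    printf("state %d evtchn %d\n", dev.state, dev.evtchn);
    return 0;
}

A frontend that never sees the "interface status" messages, as in the original question, is typically stuck because xend never created the controller and backend interface objects for its device class.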
On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
> Harry Butterworth wrote:
> > The current overhead in terms of client code to establish an entity on
> > the xen inter-domain communication "bus" is currently of the order of
> > 1000 statements (counting FE, BE and slice of xend). A better
> > inter-domain communication API could reduce this to fewer than 10
> > statements. If it's not done by the time I finish the USB work, I will
> > hopefully be allowed to help with this.
>
> This reminded me you had suggested a different model for inter-domain comms.
> I recently suggested a more socket-like API but it didn't go down well.

What exactly were the issues with the socket-like proposal?

> I agree with you that the event channel model could be improved -
> what kind of comms model do you suggest?

The event-channel and shared memory page are fine as low-level primitives to implement a comms channel between domains on the same physical machine. The problem is that the primitives are unnecessarily low-level from the client's perspective and result in too much per-client code.

The inter-domain communication API should preserve the efficiency of these primitives but provide a higher level API which is more convenient to use.

Another issue with the current API is that, in the future, it is likely (for a number of virtual-iron/fault-tolerant-virtual-machine-like reasons) that it will be useful for the inter-domain communication API to span physical nodes in a cluster. The problem with the current API is that it directly couples the clients to a shared memory implementation with a direct connection between the frontend and backend domains, and the clients would all need to be rewritten if the implementation were to span physical machines or require indirection. Eventually I would expect the effort invested in the clients of the inter-domain API to equal or exceed the effort invested in the hypervisor, in the same way that the Linux device drivers make up the bulk of the Linux kernel code. There is a risk, therefore, that this might become a significant architectural limitation.

So, I think we're looking for a higher-level API which can preserve the current efficient implementation for domains resident on the same physical machine but allows for domains to be separated by a network interface without having to rewrite all the drivers.

The API needs to address the following issues:

Resource discovery --- Discovering the targets of IDC is an inherent requirement.

Dynamic behaviour --- Domains are going to come and go all the time.

Stale communications --- When domains come and go, client protocols must have a way to recover from communications in flight, or potentially in flight, from before the last transition.

Deadlock --- IDC is a shared resource and must not introduce resource deadlock issues, for example when FEs and BEs are arranged symmetrically in reverse across the same interface, or when BEs are stacked and so introduce chains of dependencies.

Security --- There are varying degrees of trust between the domains.

Ease of use --- This is important for developer productivity and also to help ensure the other goals (security/robustness) are actually met.

Efficiency/Performance --- obviously.

I'd need a few days (which I don't have right now) to put together a coherent proposal tailored specifically to xen.
However, it would probably be along the lines of the following:

A buffer abstraction to decouple the IDC API from the memory management implementation:

struct local_buffer_reference;

An endpoint abstraction to represent one end of an IDC connection. It's important that this is done on a per-connection basis rather than having one endpoint per domain for all IDC activity, because it avoids deadlock issues arising from chained, dependent communication.

struct idc_endpoint;

A message abstraction, because some protocols are more efficiently implemented using one-way messages than request-response pairs, particularly when the protocol involves more than two parties.

struct idc_message
{
    ...
    struct local_buffer_reference message_body;
};

/* When a received message is finished with. */

void idc_message_complete( struct idc_message * message );

A request-response transaction abstraction, because most protocols are more easily implemented with these.

struct idc_transaction
{
    ...
    struct local_buffer_reference transaction_parameters;
    struct local_buffer_reference transaction_status;
};

/* Useful to have an error code in addition to status. */

/* When a received transaction is finished with. */

void idc_transaction_complete
( struct idc_transaction * transaction, error_code error );

/* When an initiated transaction completes. Error code also reports
   transport errors when the endpoint disconnects whilst the transaction
   is outstanding. */

error_code idc_transaction_query_error_code
( struct idc_transaction * transaction );

An IDC address abstraction:

struct idc_address;

A mechanism to initiate connection establishment. It can't fail, because the endpoint resource is pre-allocated and create doesn't actually need to establish the connection.

The endpoint calls the registered notification functions as follows:

'appear' when the remote endpoint is discovered, then 'disappear' if it goes away again, or 'connect' if a connection is actually established.

After 'connect', the client can submit messages and transactions.

'disconnect' when the connection is failing; the client must wait for outstanding messages and transactions to complete (successfully or with a transport error) before completing the disconnect callback, and must flush received messages and transactions whilst disconnected.

Then 'connect' if the connection is reestablished, or 'disappear' if the remote endpoint has gone away.

A disconnect, connect cycle guarantees that the remote endpoint also goes through a disconnect, connect cycle.

This API allows multi-pathing clients to make intelligent decisions and provides sufficient guarantees about stale messages and transactions to make a useful foundation.
void idc_endpoint_create
(
    struct idc_endpoint * endpoint,
    struct idc_address address,
    void ( * appear )( struct idc_endpoint * endpoint ),
    void ( * connect )( struct idc_endpoint * endpoint ),
    void ( * disconnect )
        ( struct idc_endpoint * endpoint, struct callback * callback ),
    void ( * disappear )( struct idc_endpoint * endpoint ),
    void ( * handle_message )
        ( struct idc_endpoint * endpoint, struct idc_message * message ),
    void ( * handle_transaction )
        ( struct idc_endpoint * endpoint, struct idc_transaction * transaction )
);

void idc_endpoint_submit_message
( struct idc_endpoint * endpoint, struct idc_message * message );

void idc_endpoint_submit_transaction
( struct idc_endpoint * endpoint, struct idc_transaction * transaction );

idc_endpoint_destroy completes the callback once the endpoint has 'disconnected' and 'disappeared' and the endpoint resource is free for reuse for a different connection.

void idc_endpoint_destroy
( struct idc_endpoint * endpoint, struct callback * callback );

The messages and transaction parameters and status must be of finite length (these quota properties might be parameters of the endpoint resource allocation). Need a mechanism for efficient, arbitrary-length bulk transfer too.

An abstraction for buffers owned by remote domains:

struct remote_buffer_reference;

Can register a local buffer with the IDC to get a remote buffer reference:

struct remote_buffer_reference idc_register_buffer
( struct local_buffer_reference buffer, some kind of resource probably required here );

Remote buffer references may be passed between domains in IDC messages or transaction parameters or transaction status.

Remote buffer references may be forwarded between domains and are usable from any domain.

Once in possession of a remote buffer reference, a domain can transfer data between the remote buffer and a local buffer:

void idc_send_to_remote_buffer
(
    struct remote_buffer_reference remote_buffer,
    struct local_buffer_reference local_buffer,
    struct callback * callback, /* transfer completes asynchronously */
    some kind of resource required here
);

void idc_receive_from_remote_buffer
(
    struct remote_buffer_reference remote_buffer,
    struct local_buffer_reference local_buffer,
    struct callback * callback, /* Again, completes asynchronously */
    some kind of resource required here
);

Can unregister to free a local buffer independent of remote buffer references still knocking around in remote domains (subsequent sends/receives fail):

void idc_unregister_buffer
( probably a pointer to the resource passed on registration );

So, the 1000 statements of establishment code in the current drivers becomes:

Receive an idc address from somewhere (resource discovery is outside the scope of this sketch).

Allocate an IDC endpoint from somewhere (resource management is again outside the scope of this sketch).

Call idc_endpoint_create.

Wait for 'connect' before attempting to use the connection for the device-specific protocol implemented using messages/transactions/remote buffer references.

Call idc_endpoint_destroy and quiesce before unloading the module.

The implementation of the local buffer references and memory management can hide the use of pages which are shared between domains and reference counted, to provide a zero-copy implementation of bulk data transfer and shared page-caches.

I implemented something very similar to this before for a cluster interconnect and it worked very nicely.
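To make the "handful of statements" claim concrete, a frontend client of this API might look roughly like the following. This is a sketch only: it assumes a hypothetical idc.h header declaring the structures and functions above, and the debug_* names and callback_complete() helper are illustrative, not part of the proposal.

/* Hypothetical sketch of a frontend using the proposed IDC API.
 * idc.h, the debug_* names and callback_complete() are illustrative. */

#include "idc.h"

static struct idc_endpoint debug_endpoint;   /* pre-allocated endpoint */
static int debug_connected;

static void debug_appear(struct idc_endpoint *endpoint)    { }
static void debug_disappear(struct idc_endpoint *endpoint) { }

static void debug_connect(struct idc_endpoint *endpoint)
{
    /* Connection established: the device-specific protocol (messages,
     * transactions, remote buffer references) can start now. */
    debug_connected = 1;
}

static void debug_disconnect(struct idc_endpoint *endpoint,
                             struct callback *callback)
{
    /* Wait for outstanding messages/transactions to complete, then
     * acknowledge the disconnect via the assumed completion helper. */
    debug_connected = 0;
    callback_complete(callback);
}

static void debug_handle_message(struct idc_endpoint *endpoint,
                                 struct idc_message *message)
{
    /* ... act on message_body, then hand the message back ... */
    idc_message_complete(message);
}

static void debug_handle_transaction(struct idc_endpoint *endpoint,
                                     struct idc_transaction *transaction)
{
    /* ... read transaction_parameters, fill in transaction_status ... */
    idc_transaction_complete(transaction, 0 /* success */);
}

static void debug_init(struct idc_address address)
{
    /* Address comes from resource discovery; endpoint storage is
     * pre-allocated, so this cannot fail. */
    idc_endpoint_create(&debug_endpoint, address,
                        debug_appear, debug_connect,
                        debug_disconnect, debug_disappear,
                        debug_handle_message, debug_handle_transaction);
}

Module unload would be the reverse: idc_endpoint_destroy(&debug_endpoint, ...) and then quiesce before freeing anything, as in the flow above.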
There are some subtleties to get right about the remote buffer reference implementation and the implications for out-of-order and idempotent bulk data transfers. As I said, it would require a few more days' work to nail down a good API.

Harry.
I like it. To start with, local communication only would be fine. Eventually it would scale neatly to things like remote device access.

I particularly like the abstraction for remote memory - this would be an excellent fit to take advantage of RDMA where available (e.g. a cluster running on an IB fabric).

Cheers,
Mark

On Friday 06 May 2005 13:14, Harry Butterworth wrote:
> [...]
I would be happiest (is Eric Van Hensbergen out there?) if we could move to a model where the basic inter-domain communications is 9p, with all that implies. I can try to say more if there is interest.

Right now, there are multiple types of comms for inter-domain, which all work very differently (consider vbd and network). I am continually dealing with chasing changes in how things work, which can get tough in a non-gcc world. From my point of view, the differences between (e.g.) bsd and linux are very minor compared to linux and plan 9. The biggest effort for me in forward-ports is dealing with gcc-isms in the code, and multiplying that by the number of different devices, adding in the binary structures, turns this into a lot of work.

It would sure be nice if the option were there to have a set of devices which all had the ring-buffer structure of (e.g.) the vbd, with no page loaning etc., and which allowed fairly arbitrary data transfer (i.e. not the current limits of vbd, with assumed 512-byte blocks and assumed upper limits of 4096 bytes etc.).

Plan 9 has shown that in practice a simple open/read/write/close model works fine for any device, and that devices can present a file-system-like interface to the kernel (and hence programs) and it all works. A lot of the binary structure of the current interdomain comms and devices could be eliminated and replaced with a common, simple abstraction usable for all devices. You could get rid of all the different bits of code (block, net, etc.) and replace it with one simple bit of code.

ron
On 5/6/05, Ronald G. Minnich <rminnich@lanl.gov> wrote:
> I would be happiest (is Eric Van Hensbergen out there?) if we could move
> to a model where the basic inter-domain communications is 9p, with all
> that implies. I can try to say more if there is interest.

I've been developing such a model using IBM's research hypervisor - but aside from some minor differences to the underlying transport mechanisms, it should be easy enough to apply to a Xen framework (it was my plan to look into that sometime this summer).

I selected 9P as the protocol because of its limited set of underlying transport requirements (it requires a reliable, in-order pipe and otherwise doesn't make any assumptions), its flexibility (it can be used to share file systems, devices, protocol stacks, communication channels and application interfaces both across partitions and across the network), and the simplicity of its design (and resulting simplicity of implementation).

The initial target of my implementation was special-purpose application kernels (built with a library-OS) running alongside a Linux kernel acting as an I/O host and controller. Our prototype library-OS version of the client library is approximately 2000 lines of code and the Linux file system client (http://v9fs.sf.net) is 4000 lines of code (extra weight added by VFS mapping routines).

This summer we will be looking at improving the efficiency of our transport infrastructure, adding failure detection, recovery, and/or fail-over (including network fail-over), and adding intelligent network transports (we work over Ethernet and shared memory right now, but we want to fully prototype Infiniband and related transports).

Some of the bits (like the v9fs Linux client and the u9fs user-space server) are already available open-source. We are in the process of jumping through the paperwork hoops to open-source the library-OS infrastructure and shared memory transport pieces.

There was a paper presented on v9fs at Freenix 2005. I'd be happy to answer any questions on this work. I was planning on presenting the approach to the Xen community once I had a more solid prototype and some performance numbers, but Ron has let the cat out of the bag. ;)

-eric
Harry Butterworth wrote:
> On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:

Harry, thanks for bringing this to xen-devel discussion.

> The inter-domain communication API should preserve the efficiency of
> these primitives but provide a higher level API which is more convenient
> to use.

We certainly need to simplify the API for the frontends - making it easier to add frontends for new devices and OSs. We also need to build in support for frontend-frontend communication in an efficient way.

> [...]
>
> Call idc_endpoint_destroy and quiesce before unloading module.

quiesce across remote nodes as well?

> The implementation of the local buffer references and memory management
> can hide the use of pages which are shared between domains and reference
> counted to provide a zero copy implementation of bulk data transfer and
> shared page-caches.
>
> I implemented something very similar to this before for a cluster
> interconnect and it worked very nicely. There are some subtleties to
> get right about the remote buffer reference implementation and the
> implications for out-of-order and idempotent bulk data transfers.

All the above looked very sane. How does stuff get out of order, though? We have effectively per-device queues.

> As I said, it would require a few more days' work to nail down a good
> API.

thanks,
Nivedita
I skimmed what I could find immediately on 9P: http://www.cs.bell-labs.com/sys/man/5/INDEX.html and came to the conclusion that what I proposed is lower level and more general (it's also simpler but, given that it's not doing the same thing, that isn't really significant).

You could implement 9P on top of my IDC API proposal---and this might very well be a valid thing to do given that some sort of higher level protocol is required---but, with what I proposed, you could also do things like implement a stack of software I/O virtualization functions and send the bulk data between the top of the stack and the bottom without going via all the components in the middle. My 10 mins of 9P experience indicates that 9P doesn't have anything like that. Or does it---any pointers to good 9P documentation?

Harry
On 5/6/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:
> I skimmed what I could find immediately on 9P:
> http://www.cs.bell-labs.com/sys/man/5/INDEX.html and came to the
> conclusion that what I proposed is lower level and more general (it's
> also simpler but, given that it's not doing the same thing, that isn't
> really significant).

There isn't a great deal of detail on the specifics of the protocol beyond the man pages. I think you'll find the following Plan 9 foundational papers more illustrative of the type of things that it can be used to do:

http://plan9.bell-labs.com/sys/doc/9.pdf
http://plan9.bell-labs.com/sys/doc/names.pdf

and in particular:

http://plan9.bell-labs.com/sys/doc/net/net.pdf

For the security minded:

http://plan9.bell-labs.com/sys/doc/auth.pdf

I believe Ron's original statements were motivated by your earlier statement:

> The event-channel and shared memory page are fine as low-level
> primitives to implement a comms channel between domains on the same
> physical machine. The problem is that the primitives are unnecessarily
> low-level from the client's perspective and result in too much
> per-client code.

This is exactly the sort of thing that the Plan 9 networking model was designed to do. The idea being that the Plan 9 model provides a nice abstract layer with which to communicate AND to organize (the organization is an important feature) the resulting communication channels and endpoints (no matter what the underlying transport is or where the particular endpoints may be). The Plan 9 networking model can be run over named pipes, shared memory, PCI buses, Ethernet, Infiniband, or whatever other flavor of network, with relatively minor requirements on the underlying transport layer.

> You could implement 9P on top of my IDC API proposal

You could implement 9P on top of such an IDC API; however, it seems like it would add unnecessary overhead. 9P can be easily implemented directly on top of the existing event/shared-page mechanisms and then trivially bridged to the network at the I/O partition(s). As far as multi-level protocol stacks and circumventing data paths, I'd hope we could come up with a simpler, more elegant solution. Looking over your earlier proposal it seems like an awful lot of complexity to accomplish a relatively simple task. Perhaps the refined API will demonstrate that simplicity better? I'd be really interested in the security details of your model as well when you finish the more detailed proposal.

> The inter-domain communication API should preserve the efficiency of
> these primitives but provide a higher level API which is more convenient
> to use.

I think this is a great mission statement -- convenience and simplicity are the key to selling such an API. I'd suggest we step through a complete example with actual applications and actual devices that would demonstrate the problem that we are trying to solve. Perhaps Ron and I can pull together an alternate proposal based on such a concrete example.

-eric
On Fri, 2005-05-06 at 19:19 -0500, Eric Van Hensbergen wrote:
> This is exactly the sort of thing that the Plan 9 networking model was
> designed to do. The idea being that the Plan 9 model provides a nice
> abstract layer with which to communicate AND to organize (the
> organization is an important feature)

Yes, the organisation is important; I expect we can learn a lot from 9P here. I'm not trying to address the organisation aspect yet though (my API proposal is lower level), so please let's postpone that aspect of the discussion.

> Looking over your earlier proposal it seems like an awful lot of
> complexity to accomplish a relatively simple task. Perhaps the
> refined API will demonstrate that simplicity better? I'd be really
> interested in the security details of your model as well when you
> finish the more detailed proposal.

I'd need help from security experts but, as an initial stab, if the idc_address and remote_buffer_references are capabilities then I think the security falls out in the wash, since it's impossible to access something unless you have been granted permission.

> I'd suggest we step through a complete example with actual
> applications and actual devices that would demonstrate the problem
> that we are trying to solve. Perhaps Ron and I can pull together an
> alternate proposal based on such a concrete example.

OK, here goes:

A significant difference between Plan 9 and Xen which is relevant to this discussion is that Plan 9 is designed to construct a single shared environment from multiple physical machines, whereas Xen is designed to partition a single physical machine into multiple isolated environments. Arguably, Xen clusters might also partition multiple physical machines into multiple isolated environments with some weird and wonderful cross-machine sharing and replication going on.

The significance of this difference is that in the Xen environment there are many interesting opportunities for optimisations across the virtual machines running on the same physical machine. These optimisations are not relevant to a native Plan 9 system and so (AFAICT with 20 mins experience :-) ) there is no provision for them in 9P.

Take, for example, the case where 1000 virtual machines are booting on the same physical machine from copy-on-write file-systems which share a common ancestry. With my proposal, when the virtual machines read a file into memory, the operation might proceed as follows:

The FE registers a local buffer to receive the file contents and is given a remote buffer reference.

The FE issues a transaction to the BE to read the file, passing the remote buffer reference.

The BE looks up the file using its metadata and happens to find the file contents in its buffer cache. The BE makes an idc_send_to_remote_buffer call, passing a local buffer reference for the file data in its local buffer cache and the remote buffer reference provided by the FE.

The IDC implementation resolves the remote buffer reference to a local buffer reference (since the FE buffer happens to reside on the same physical machine in this example) and makes a call to local_buffer_reference_copy to copy the file data from its buffer cache to the FE buffer.

The local_buffer_reference_copy implementation determines the type of copy based on the types of the source and destination local buffer references which, for the sake of argument, both happen to be of a type backed by reference-counted pages from compatible page pools.
The implementation of local_buffer_reference_copy for that specific combination of buffer types maps the BE pages into the FE address space, incrementing their reference counts, and also unmaps the old FE pages and decrements their reference counts, returning them to the free pool if necessary.

The other 999 virtual machines boot and do the same thing (since the file in question happens to be one which has not diverged), leaving all 1000 virtual machines sharing the same physical memory.

Had the virtual machines been booting on different physical machines then the path through the FE and BE client code would have been identical (so we have met the network transparency goal), but the IDC implementation would have taken an alternative path upon discovering that the remote_buffer_reference was genuinely remote.

Hopefully this explains better where I'm coming from.

Harry.
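A rough sketch of the two ends of that read path, in terms of the API from the earlier message (hypothetical throughout: idc.h, the marshalling helpers, buffer_cache_lookup() and the request layout are illustrative only):

/* Hypothetical sketch of the copy-on-write boot example using the
 * proposed IDC API; all helper names here are illustrative. */

#include "idc.h"

static struct callback read_done_callback;  /* assumed completion object */

/* Frontend: ask the backend to read a file into a locally owned buffer. */
static void fe_read_file(struct idc_endpoint *be, struct idc_transaction *txn,
                         struct local_buffer_reference file_buffer,
                         const char *path)
{
    struct remote_buffer_reference ref;

    /* Register the destination buffer and get a reference that the
     * backend (or any other domain) can use. */
    ref = idc_register_buffer(file_buffer /*, resource */);

    /* Put the path and the remote buffer reference into the transaction
     * parameters (marshalling helper assumed) and submit it. */
    marshal_read_request(txn->transaction_parameters, path, ref);
    idc_endpoint_submit_transaction(be, txn);
}

/* Backend: satisfy the read straight out of its buffer cache. */
static void be_handle_read(struct idc_endpoint *fe,
                           struct idc_transaction *txn)
{
    struct remote_buffer_reference dest;
    const char *path;
    struct local_buffer_reference cached;

    unmarshal_read_request(txn->transaction_parameters, &path, &dest);

    /* The file happens to be resident in the backend's buffer cache. */
    cached = buffer_cache_lookup(path);

    /* If both domains are on the same machine, the IDC layer resolves
     * 'dest' locally and can share the reference-counted cache pages
     * instead of copying; for a remote FE it would go over the wire.
     * The transaction is completed from read_done_callback. */
    idc_send_to_remote_buffer(dest, cached,
                              &read_done_callback /*, resource */);
}

The point of the sketch is that neither function knows or cares whether the page sharing happens; that decision lives entirely inside the local_buffer_reference_copy path described above.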
On 5/7/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:
> I'd need help from security experts but, as an initial stab, if the
> idc_address and remote_buffer_references are capabilities then I think
> the security falls out in the wash since it's impossible to access
> something unless you have been granted permission.

That seems straightforward and clear to me on the local host, but it seems like there might be additional concerns when bridging to the cluster. Of course, those security issues may be embedded in the underlying network transport layer, so maybe it's not as much of a concern.

> A significant difference between Plan 9 and Xen which is relevant to
> this discussion is that Plan 9 is designed to construct a single shared
> environment from multiple physical machines whereas Xen is designed to
> partition a single physical machine into multiple isolated environments.
> Arguably, Xen clusters might also partition multiple physical machines
> into multiple isolated environments with some weird and wonderful
> cross-machine sharing and replication going on.

Yes and no. Plan 9 does provide a coherent mechanism to unify access to the resources of an entire cluster of physical machines -- but it also provides a lot of facilities for partitioning and organizing those resources into private name spaces. It's both of these aspects that Ron and I would look to see leveraged in any sort of future Xen I/O architecture. But this gets more into organizational features, which may be a separate topic.

> The significance of this difference is that in the Xen environment,
> there are many interesting opportunities for optimisations across the
> virtual machines running on the same physical machine. These
> optimisations are not relevant to a native Plan 9 system and so (AFAICT
> with 20 mins experience :-) ) there is no provision for them in 9P.

This is true. The example you step through sounds like an implementation of DSM targeted at a buffer cache. In the past, 9P has not been used to provide such a level of transparent sharing of underlying memory. However, an area that Orran and I have been talking about is exploring the addition of scatter/gather type semantics to the 9P protocol implementations (there's really not that much that has to change in the specification, just some differences in the way the protocol looks on the underlying transport). In other words, there is nothing to prevent read/write calls from having pointers to the page containing the data versus having a copy of the data. This page containing the data could be a copy, or it could be shared as in your example. The cool thing is, since 9P is already set up to be a network protocol, if it did end up having to leave shared memory and go over an external wire, there's already a fairly rich organizational infrastructure in place (with built-in support for security/encryption/digesting).
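As a purely illustrative sketch of that scatter/gather idea (this is not the 9P wire format; the structure and field names are made up for the example), a write-style request could carry a reference to the data page rather than an inline copy of the data:

/* Illustrative only: a Twrite-like request whose payload is a reference
 * to a (possibly shared) page rather than an inline copy of the data. */

struct shared_page_ref {
    unsigned long frame;          /* page holding the data                */
    unsigned int  offset;         /* start of the payload within the page */
    unsigned int  length;         /* payload length                       */
};

struct sg_write_request {
    unsigned short     tag;       /* matches the asynchronous response    */
    unsigned int       fid;       /* which open file/resource to write    */
    unsigned long long file_offset;
    struct shared_page_ref data;  /* out-of-line payload: the transport
                                     may map it or copy it as appropriate */
};

Whether the transport maps the referenced page or copies it would be a property of the transport implementation, not visible in the request itself.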
> Had the virtual machines been booting on different physical machines
> then the path through the FE and BE client code would have been
> identical (so we have met the network transparency goal) but the IDC
> implementation would have taken an alternative path upon discovering
> that the remote_buffer_reference was genuinely remote.

The scenario you walk through sounds really great, and providing mechanisms to manage and recognize shared buffers on the same machine sounds like absolutely the right thing to do. I'm just arguing for a simpler client interface (and I think something akin to the 9P API together with a nice Channel abstraction to manage endpoints would be the right way to go about it). Okay Ron, you got me into this, what are your thoughts?

> Hopefully this explains better where I'm coming from.

This paints a clearer picture of what you were going after, and I think we are on similar tracks. I need to get more engaged in looking at the Xen stuff so I have a better context for some of the problems particular to its environment.

-eric
On Sat, 7 May 2005, Harry Butterworth wrote:
> A significant difference between Plan 9 and Xen which is relevant to
> this discussion is that Plan 9 is designed to construct a single shared
> environment from multiple physical machines whereas Xen is designed to
> partition a single physical machine into multiple isolated environments.

This is not in my view germane. We're proposing using 9p as the low-level interdomain protocol for xen. 9p stands alone from plan 9. Both Eric and I have felt for some time that this would work well, and I'm pretty strongly convinced that it would work better than what we have now for the various devices.

Your 1000 machine example is very interesting, however. I do know that some folks who worked on the IBM hypervisor had a lot of concerns about the page flipping done in the xen network FE/BE drivers, and I wonder if their concerns would apply to your proposed implementation -- I wish we could find a way to hear from them.

thanks

ron
On 7 May 2005, at 17:15, Ronald G. Minnich wrote:
> this is not in my view germane. We're proposing using 9p as the low level
> interdomain protocol for xen. 9p stands alone from plan 9. Both Eric and I
> have felt for some time that this would work well, and I'm pretty strongly
> convinced that it would work better than what we have now for the various
> devices.

Isn't 9p a file transfer protocol, or is that just one aspect of it?

At the very lowest level it doesn't appear so different from what we have now (multiple outstanding request messages, each with a different tag that is echoed in the asynchronous response message); but the next level up seems to be defined in terms of walking a hierarchical file system, obtaining handles on files, and so on. To me that doesn't sound obviously applicable to our inter-driver communications.

-- Keir
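The shared idea at that lowest level is just tag-matched asynchronous request/response framing; a minimal illustrative sketch (not the actual 9P wire encoding, and not the Xen ring layout either) looks something like this:

/* Illustrative framing only: both the existing Xen rings and 9P match an
 * asynchronous response to its request by echoing a small tag. */

#include <stdint.h>

#define MAX_OUTSTANDING 16

struct request  { uint16_t tag; /* ... operation-specific fields ... */ };
struct response { uint16_t tag; /* ... operation-specific fields ... */ };

struct pending {
    struct request req;
    int            in_flight;
    void         (*on_reply)(struct response *rsp);
};

static struct pending table[MAX_OUTSTANDING];

/* Issue a request: the slot index doubles as the tag. */
static int issue(struct request *req, void (*on_reply)(struct response *))
{
    for (int tag = 0; tag < MAX_OUTSTANDING; tag++) {
        if (!table[tag].in_flight) {
            req->tag             = (uint16_t)tag;
            table[tag].req       = *req;
            table[tag].in_flight = 1;
            table[tag].on_reply  = on_reply;
            /* ... put the request on the ring / transport here ... */
            return 0;
        }
    }
    return -1; /* no free tag: too many outstanding requests */
}

/* Handle an asynchronous response by matching its echoed tag. */
static void complete(struct response *rsp)
{
    if (rsp->tag >= MAX_OUTSTANDING)
        return;
    struct pending *p = &table[rsp->tag];
    if (p->in_flight) {
        p->in_flight = 0;
        p->on_reply(rsp);
    }
}

The difference Keir points to is everything layered above this: 9P then defines attach/walk/open/read/write operations over a name hierarchy, whereas the current Xen drivers define ad hoc per-device request formats.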
On Sat, 2005-05-07 at 10:15 -0600, Ronald G. Minnich wrote:
> this is not in my view germane. We're proposing using 9p as the low level
> interdomain protocol for xen. 9p stands alone from plan 9. Both Eric and I
> have felt for some time that this would work well, and I'm pretty strongly
> convinced that it would work better than what we have now for the various
> devices.

I agree it would work well, definitely better than the current code, but I don't think it's right for Xen because it precludes the kind of optimisations that are relevant to Xen but not Plan 9.

Eric mentioned adding scatter/gather type semantics to the 9P protocol, which would get you half-way to what I was proposing and start to look similar to the remote buffer reference concept in that you pass around references to the bulk data instead of the data itself, but you still wouldn't get the ability to manipulate the meta-data of a request in a stack of I/O applications and then do a direct transfer of the bulk data from the source to the target. Of course, if you modify the 9P protocol enough then you can make it a superset of what I was proposing, but I think my API proposal captured the essence of what was required at the low level without complicating the description with the higher-level organisation aspects of the protocol.

> Your 1000 machine example is very interesting, however. I do know that
> some folks who worked on the IBM hypervisor had a lot of concerns about
> the page flipping done in the xen network FE/BE drivers, and I wonder if
> their concerns would apply to your proposed implementation -- I wish we
> could find a way to hear from them.

Rusty and I exchanged some mail about the page flipping and I think Rusty demonstrated that their concerns were unfounded. Perhaps he can comment. In any case, my proposal decouples the IDC API clients from the actual memory management implementation, so different page-flipping behaviours could be evaluated without requiring code changes in the clients.

Harry.
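A hypothetical sketch of the "stack of I/O applications" point, using the earlier proposal's types (idc.h, the block_read layout and the cow_* and marshal helpers are all illustrative): a copy-on-write layer sitting between a frontend and a storage backend can rewrite the request metadata while simply forwarding the frontend's remote buffer reference, so the bulk data never passes through the middle layer.

/* Hypothetical: a snapshot/COW layer between a frontend and a storage
 * backend.  Only metadata is touched here; the buffer reference is
 * forwarded untouched, so the bottom backend transfers the data
 * directly into the frontend's pages (or over the wire if remote). */

#include "idc.h"

struct block_read {
    unsigned long long sector;            /* virtual sector requested     */
    unsigned int       count;
    struct remote_buffer_reference data;  /* where the FE wants the data  */
};

/* Middle layer: remap the virtual sector onto the parent or the delta
 * disk and pass the FE's buffer reference straight down the stack. */
static void cow_handle_read(struct idc_endpoint *lower,
                            struct idc_transaction *lower_txn,
                            struct block_read *req)
{
    struct block_read forwarded = *req;

    /* Only the metadata changes here (assumed mapping lookup)... */
    forwarded.sector = cow_map_sector(req->sector);

    /* ...while the remote buffer reference is forwarded as-is, which is
     * what the "usable from any domain" property above is for. */
    marshal_block_read(lower_txn->transaction_parameters, &forwarded);
    idc_endpoint_submit_transaction(lower, lower_txn);
}

With a message protocol that always carries the data inline, every layer in such a stack would have to receive and resend the payload itself.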
On 5/7/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:> > Isn't 9p a file transfer protocol, or is that just one aspect of it? >It's an easy misconception; 9P is many things - at its core it is the API to all resources (system, device, application, files) in the Plan 9 paradigm - on the local system the system calls translate directly to function calls within the kernel - never being marshaled into a message form. It is also a protocol for remote resource sharing (since 9P is used for more than just files, the network encapsulation of 9P can be used to share all the different resources of the system).> > At the very lowest level it doesn't appear so different from what we have > now (multiple outstanding request messages, each with a different tag > that is echoed in the asynchronous response message); but the next > level up seems to be defined in terms of walking a hierarchical file > system, obtaining handles on files, and so on. To me that doesn't sound > obviously applicable to our inter-driver communications. >The dynamic, stackable, hierarchical organization of 9P resources within Plan 9 may be one of the most important things for virtualization environments to leverage. It creates an easily understood and manipulated metaphor for organizing system, partition, or cluster resources. Within a device implementation, it can create a hierarchy of interfaces exposing core resources as well as extensions that not all clients can take advantage of. One thing which might be useful would be to look at the Plan 9 man page sections on devices: (http://plan9.bell-labs.com/sys/man/3/INDEX.html) and file servers: (http://plan9.bell-labs.com/sys/man/4/INDEX.html) I hate talking about this stuff before we have some sort of prototype within the Xen environment to demonstrate; a lot of the power of this methodology is best experienced rather than pontificated about. Regardless, we will be working towards this over the next few months and it should tie in nicely with our v9fs driver in Linux (which will also likely be ported to K42 and naturally already exists within Plan 9). There has been quite a bit of interest from certain sects within the BSD community in implementing something similar to v9fs, so perhaps we'll have something there by summer's end as well. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
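A small illustration of the "every device is a file server" point: a device's control and data interfaces are just files under its directory, so generic open/read/write is the whole client-side device API. The paths and command string below are invented for the example.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Drive a hypothetical device purely through its file interface:
     * write a command to its ctl file, then read from its data file. */
    int poke_device(char *buf, int buflen)
    {
        int ctl = open("/dev/mydevice/ctl", O_WRONLY);
        if (ctl < 0)
            return -1;
        write(ctl, "reset", strlen("reset"));  /* control ops are just writes */
        close(ctl);

        int data = open("/dev/mydevice/data", O_RDONLY);
        if (data < 0)
            return -1;
        int n = read(data, buf, buflen);       /* data transfer is just a read */
        close(data);
        return n;
    }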
On 5/7/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:> > I agree it would work well, definitely better than the current code, but > I don''t think it''s right for Xen because it precludes the kind of > optimisations that are relevant to Xen but not Plan 9. >I really don''t see how they preclude any optimizations in Xen, particularly with some sort of scatter/gather mechanism in place - the 9P API would be unchanged, just the link-layer would be a bit different (so its not even a protocol change).> > but you still wouldn''t get the ability to manipulate the meta-data of a > request in a stack of I/O applications and then do a direct transfer of the > bulk data from the source to the target. >Not clear why this is the case. It seems like you''d be able to do this sort of thing in any sort of protocol where the data is distinct from the rest of the operation encapsulation. In fact, there are many different forms of synthetic "filter file systems" within Plan 9 that take advantage of just this property (the fact that the bulk data is handled separately from the control data).> > In any case, my proposal decouples the IDC API clients from the actual > memory management implementation so different page-flipping behaviours > could be evaluated without requiring code changes in the clients. >Again, this sounds like a good thing. A nice abstract interface which is valid on all platforms for communicating on byte streams between domains (not just to the controller, but between peer domains as well) is exactly the sort of transport we would want to build 9P on, but it really seemed (from your initial proposal) that you were constructing interfaces for much more than that. Its also not clear to me whether having extensions built into the IDC protocol for doing network transactions would be the right thing to do, it sounds like it would add a lot of complexity to an otherwise simple core mechanism -- and as I stated previously, I think simplicity is the ultimate goal of such an interface. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
If you could go through with 9p the same concrete example I did I think I''d find that useful. Also, I should probably spend another 10 mins on the docs ;-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/7/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:> If you could go through with 9p the same concrete example I did I think > I''d find that useful. Also, I should probably spend another 10 mins on > the docs ;-) >It would be way better if we could step you through a demo in the environment. The documentation for Plan 9 has always been spartan at best -- so getting a good idea of how things work without looking over someones shoulder who is experienced has always been somewhat difficult. I''ve been trying to correct this while pushing some of the bits into Linux - my Freenix v9fs paper spent quite a bit of time talking about how 9P actually works and showing traces for typical file system operations. The OLS paper which I''m supposed to be writing right now covers the same sort of thing for the dynamic private name spaces. The reason why I didn''t post a counter-example was because I didn''t see much difference between our two ideas for the scenario you lay out (except you obviously understand a lot more about some of the details that I haven''t gotten into yet): (from the looks of things, you had already established an authenticated connection between the FE and BE and the top-level had already been established and whatever file system you are using to read file data had already traversed to the file and opened it. A quick summary of how we do the above with 9P: a) establish the connection to the BE (there are various semantics possible here, in my current stuff I pass a reference to the Channel around. Alternatively you could use socket like semantics to connect to the other partition) b) Issue a protocol negotiation packet to establish buffer sizes, protocol version, etc. (t_version) c) Attach to the remote resource, providing authentication information (if necessary) t_attach -- this will also create an initial piece of shared meta-data referencing the root resource of the BE (for devices there may only be a single resource, or one resource such as a block device may have multiple nodes such as partitions, in Plan 9 devices also present different aspects of their interface (ctl, data, stat, etc.) as different nodes in a hierarchical fashion. The reference to different nodes in the tree is called a FID (think of it as a file descriptor) - it contains information about who has attached to the resource and where in the hierarchy they are. A key thing to remember is that in Plan 9, every device is a file server. d) You would then traverse the object tree to the resource you wanted to use (in your case it sounded like a file in a file system, so the metaphor is straightforward). The client would issue a t_walk message to perform this traversal. e) The object would then need to be opened (t_open) with information about what type of operation will be executed (read, write, both) and can include additional information about the type of transactions (append only, exclusive access, etc.) that may be beneficial to managing the underlying resource. The BE could use information cached in the Fid from the attach to check the FE''s permission to be accessing this resource with that mode. In our world, this would result in you holding a Fid pointing to the open object. The Fid is a pointer to meta-data and is considered state on both the FE and the BE. (this has downsides in terms of reliability and the ability to recover sessions or fail over to different BE''s -- one of our summer students will be addressing the reliability problem this summer). 
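To make the shared-Fid state concrete, the per-fid record kept by the BE might look roughly like this (field names are illustrative; the real Plan 9 and v9fs structures differ in detail). The FE holds a matching record, which is why losing either side complicates recovery and fail-over.

    #include <stdint.h>

    /* Qid: the server's stable identity for a node in its hierarchy. */
    struct qid {
        uint8_t  type;      /* plain file, directory, append-only, ... */
        uint32_t version;
        uint64_t path;      /* unique id within this server */
    };

    /* One per fid on the BE: which node the client has attached to or
     * walked to, how it was opened, and any server-private state such as
     * buffer cache hooks. */
    struct fid {
        uint32_t   fid;     /* client-chosen handle from t_attach/t_walk */
        struct qid qid;     /* node in the hierarchy it refers to */
        int        omode;   /* mode from t_open, or -1 if not open */
        uint64_t   offset;  /* implicit position used by read/write */
        void      *aux;     /* server-private per-fid state */
    };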
The FE performs a read operation passing it the necessary bits: ret = read( fd, *buf, count ); This actually would be translated (in collaboration with local meta-data into a t_read mesage) t_read tag fid offset count (where offset is determined by local fid metadata) The BE receives the read request, and based on state information kept in the Fid (basically your metadata), it finds the file contents in the buffer cache. It sends a response packet with a pointer to its local buffer cache entry: r_read tag count *data There are a couple ways we could go when the FE receives the response: a) it could memcopy the data to the user buffer *buf . This is the way things currently work, and isn''t very efficient -- but may be the way to go for the ultra-paranoid who don''t like sharing memory references between partitions. b) We could have registered the memory pointed to by *buf and passed that reference along the path -- but then it probably would just amount to the BE doing the copy rather than the front end. Perhaps this approximates what you were talking about doing? c) As long as the buffers in question (both *buf and the buffer cache entry) were page-aligned, etc. -- we could play clever VM games marking the page as shared RO between the two partitions and alias the virtual memory pointed to by *buf to the shared page. This is very sketchy and high level and I need to delve into all sorts of details -- but the idea would be to use virtual memory as your friend for these sort of shared read-only buffer caches. It would also require careful allocation of buffers of the right size on the right alignment -- but driver writers are used to that sort of thing. To do this sort of thing, we''d need to do the exact same sort of accounting you describe:>The implementation of local_buffer_reference_copy for that specific >combination of buffer types maps the BE pages into the FE address space >incrementing their reference counts and also unmaps the old FE pages and >decrements their reference counts, returning them to the free pool if >necessary.When the FE was done with the BE, it would close the resources (issuing t_clunk on any fids associated with the BE). The above looks complicated, but to a FE writer would be as simple as: channel = dial("net!BE"); /* establish connection */ /* in my current code, channel is passed as an argument to the FE as a boot arg */ root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */ fd = open(root, "/some/path/file", OREAD); ret = read(fd, *buf, sizeof(buf)); close(fd); close(root); close(channel); If you want to get fancy, you could get rid of the root arg to open and use a private name space (after fsmount): bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */ then it would be: fd=open("/mnt/be/some/path/file", OREAD); There''s also all sorts of cool stuff you can do on the domain controller to provision child partitions using dynamic name space and then just exporting the custom fashioned environment using 9P -- but that''s higher level organization stuff again. There''s all sorts of cool tricks you can play with 9P (similar to the stuff that the FUSE and FiST user-space file system packages provide) like copy-on-write file systems, COW block devices, muxed ttys, etc. etc. The reality is that I''m not sure I''d actually want to use a BE to implement a file system, but its quite reasonable to implement a buffer cache that way. 
In all likelihood this would result in the FE opening up a connection (and a single object) on the BE buffer cache, then using different offsets to grab specific blocks from the BE buffer cache using the t_read operation. I''ve described it in terms of a file system, using your example as a basis, but the same sort of thing would be true for a block device or a network connection (with some slightly different semantic rules on the network connection). The main point is to keep things simple for the FE and BE writers, and deal with all the accounting and magic you describe within the infrastructure (no small task). Another difference would involve what would happen if you did have to bridge a cluster network - the 9P network encapsulation is well defined, all you would need to do (at the I/O partition bridging the network) is marshall the data according to the existing protocol spec. For more intelligent networks using RDMA and such things, you could keep the scatter/gather style semantics and send pointers into the RDMA space for buffer references. As I said before, there''s lots of similarities in what we are talking about, I''m just gluing a slightly more abstract interface on top, which has some benefits in some additional organizational and security mechanisms (and a well-established (but not widely used yet) network protocol encapsulation). There are plenty of details I know I''m glossing over, and I''m sure I''ll need lots of help getting things right. I''d have preferred staying quiet until I had my act together a little more, but Orran and Ron convinced me that it was important to let people know the direction I''m planning on exploring. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
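For the buffer-cache usage sketched at the start of this message, the FE would keep one open object and issue t_read at block-sized offsets. In terms of the convenience interface shown earlier, that is just a positional read per block (sketch only; BLKSIZE and the pread-style helper are assumptions):

    #include <unistd.h>

    enum { BLKSIZE = 4096 };

    /* Fetch block 'blkno' from the BE buffer cache over an already-open
     * descriptor: each call becomes
     *     t_read tag fid offset=blkno*BLKSIZE count=BLKSIZE
     * and no per-block walk/open is needed. */
    long read_block(int fd, unsigned long blkno, void *buf)
    {
        return pread(fd, buf, BLKSIZE, (long long)blkno * BLKSIZE);
    }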
Hi Eric, Your thoughts on 9P are all really interesting -- I'd come across the protocol years ago in looking into approaches to remote device/fs access but had a hard time finding details. It's quite interesting to hear a bit more about the approach taken. Having a more accessible inter-domain comms API is clearly a good thing, and extending device channels (in our terminology -- shared memory + event notification) to work across a cluster is something that we've talked about on several occasions at the lab. I do think, though, that as mentioned above there are some concerns with the VMM environment that make this a little trickier. For the general case of inefficient comms between VMs, using the regular IP stack may be okay for many people. The net drivers are being fixed up to special-case local communications. For the more specific cases of FE/BE comms, I think the devil may be in the details more than the current discussion is alluding to. Specifically:> c) As long as the buffers in question (both *buf and the buffer cache > entry) were page-aligned, etc. -- we could play clever VM games > marking the page as shared RO between the two partitions and alias the > virtual memory pointed to by *buf to the shared page. This is very > sketchy and high level and I need to delve into all sorts of details > -- but the idea would be to use virtual memory as your friend for > these sort of shared read-only buffer caches. It would also require > careful allocation of buffers of the right size on the right alignment > -- but driver writers are used to that sort of thing. Most of the good performance that Xen gets off of block and net split devices is specifically because of these clever VM games. Block FEs pass page references down to be mapped directly for DMA. Net devices pass pages into a free pool, and actually exchange physical pages under the feet of the VM as inbound packets are demultiplexed. The grant tables that have recently been added provide separate mechanisms for the mapping and ownership transfer of pages across domains. In addition to these tricks, we make careful use of the timing of event notification in order to batch messages. In the case of the buffer cache that has come up several times in the thread, a cache across domains would potentially need to pass read-only page mappings as CoW in many situations, and a fault handler somewhere would need to bring in a new page to the guest on a write. There are also a pile of complicating cases with regard to cache eviction from a BE domain, migration, and so on that make the accounting really tricky. I think it would be quite good to have a discussion of generalized interdomain comms address the current drivers, as well as a hypothetical buffer cache, as potential cases. Does 9P already have hooks that would allow you to handle this sort of per-application special case? Additionally, I think we get away with a lot in the current drivers from a failure model that excludes transport. The FE or BE can crash, and the two drivers can be written defensively to handle that. How does 9P handle the strangenesses of real distribution? Anyhow, very interesting discussion... looking forward to your thoughts. a. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I quite like some of this. A few more comments below. On Sat, 2005-05-07 at 19:57 -0500, Eric Van Hensbergen wrote:> > In our world, this would result in you holding a Fid pointing to the > open object. The Fid is a pointer to meta-data and is considered > state on both the FE and the BE. (this has downsides in terms of > reliability and the ability to recover sessions or fail over to > different BE''s -- one of our summer students will be addressing the > reliability problem this summer).OK, so this is an area of concern for me. I used the last version of the sketchy API I outlined to create an HA cluster infrastructure. So I had to solve these kind of protocol issues and, whilst it was actually pretty easy starting from scratch, retrofitting a solution to an existing protocol might be challenging, even for a summer student.> > The FE performs a read operation passing it the necessary bits: > ret = read( fd, *buf, count );Here the API is coupling the client to the memory management implementation by assuming that the buffer is mapped into the client''s virtual address space. This is probably likely to be true most of the time so an API at this level will be useful but I''d also like to be able to write I/O applications that manage the data in buffers that are never mapped into the application address space. Also, I''d like to be able to write applications that have clients which use different types of buffers without having to code for each case in my application. This is why my API deals in terms of local_buffer_references which, for the sake of argument, might look like this: struct local_buffer_reference { local_buffer_reference_type type; local_buffer_reference_base base; buffer_reference_offset offset; buffer_reference_length length; }; A local buffer reference of type virtual_address would have a base value equal to the buf pointer above, an offset of zero and a length of count. A local buffer reference for the hidden buffer pages would have a different type, say hidden_buffer_page, the base would be a pointer to a vector of page indices, the offset would be the offset of the start of the buffer into the first page and the length would be the length of the buffer. So, my application can deal with buffers described like that without having to worry about the flavour of memory management backing them. Also, I can change the memory management without changing all the calls to the API, I only have to change where I get buffers from. BTW, this specific abstraction I learnt about from an embedded OS architected by Nik Shalor. He might have got it from somewhere else.> > This actually would be translated (in collaboration with local > meta-data into a t_read mesage) > t_read tag fid offset count (where offset is determined by local fid metadata) > > The BE receives the read request, and based on state information kept > in the Fid (basically your metadata), it finds the file contents in > the buffer cache. It sends a response packet with a pointer to its > local buffer cache entry: > > r_read tag count *data > > There are a couple ways we could go when the FE receives the response: > a) it could memcopy the data to the user buffer *buf . This is the > way things currently work, and isn''t very efficient -- but may be > the way to go for the ultra-paranoid who don''t like sharing memory > references between partitions. 
> > b) We could have registered the memory pointed to by *buf and passed > that reference along the path -- but then it probably would just > amount to the BE doing the copy rather than the front end. Perhaps > this approximates what you were talking about doing?No, ''c'' is closer to what I was sketching out, except that I was proposing a general mechanism that had one code path in the clients even though the underlying implementation could be a, b or c or any other memory management strategy determined at run-time according to the type of the local buffer references involved.> > c) As long as the buffers in question (both *buf and the buffer cache > entry) were page-aligned, etc. -- we could play clever VM games > marking the page as shared RO between the two partitions and alias the > virtual memory pointed to by *buf to the shared page. This is very > sketchy and high level and I need to delve into all sorts of details > -- but the idea would be to use virtual memory as your friend for > these sort of shared read-only buffer caches. It would also require > careful allocation of buffers of the right size on the right alignment > -- but driver writers are used to that sort of thing.Yes, it also requires buffers of the right size and alignment to be used at the receiving end of any network transfers and for the alignment to be preserved across the network even if the transfer starts at a non-zero page offset. You might think that once the data goes over the network you don''t care but it might be received by an application that wants to share pages with another application so, in fact, it does matter. This is just something you have to get right if you want any kind of page referencing technique to work although you can fall back to memcopy for misaligned data if necessary.> The above looks complicated, but to a FE writer would be as simple as: > channel = dial("net!BE"); /* establish connection */ > /* in my current code, channel is passed as an argument to the FE as a > boot arg */ > root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */ > fd = open(root, "/some/path/file", OREAD); > ret = read(fd, *buf, sizeof(buf)); > close(fd); > close(root); > close(channel);So, this is obviously a blocking API. My API was non-blocking because the network latency means that you need a lot of concurrency for high throughput and you don''t necessarily want so many threads. Like AIO. Having a blocking API as well is convenient though.> > If you want to get fancy, you could get rid of the root arg to open > and use a private name space (after fsmount): > bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */ > then it would be: > fd=open("/mnt/be/some/path/file", OREAD); > > There''s also all sorts of cool stuff you can do on the domain > controller to provision child partitions using dynamic name space and > then just exporting the custom fashioned environment using 9P -- but > that''s higher level organization stuff again. There''s all sorts of > cool tricks you can play with 9P (similar to the stuff that the FUSE > and FiST user-space file system packages provide) like copy-on-write > file systems, COW block devices, muxed ttys, etc. etc.I''m definitely going to study the organisational aspects of 9P.> I''ve described it in terms of a file system, using your example as a > basis, but the same sort of thing would be true for a block device or > a network connection (with some slightly different semantic rules on > the network connection). 
The main point is to keep things simple for > the FE and BE writers, and deal with all the accounting and magic you > describe within the infrastructure (no small task).If you use the buffer abstraction I described then you can start with a very simple mm implementation and improve it without having to change the clients.> > Another difference would involve what would happen if you did have to > bridge a cluster network - the 9P network encapsulation is well > defined, all you would need to do (at the I/O partition bridging the > network) is marshall the data according to the existing protocol spec. > For more intelligent networks using RDMA and such things, you could > keep the scatter/gather style semantics and send pointers into the > RDMA space for buffer references.I don''t see this as a difference. My API was explicitly compatible with a networked implementation. One of the thoughts that did occur to me was that a reliance on in-order message delivery (which 9p has) turns out to be quite painful to satisfy when combined with multi-pathing and failover because a link failure results in the remaining links being full of messages that can''t be delivered until the lost messages (which happened to include the message that must be delivered first) have been resent. This is more problematical than you might think because all communications stall for the time taken to _detect_ the lost link and all the remaining links are full so you can''t do the recovery without cleaning up another link. This is not insurmountable but it''s not obviously the best solution either, particularly since, with SMP machines, you aren''t going to want to serialise the concurrent activity so there needn''t be any fundamental in-order requirements at the protocol level.> > As I said before, there''s lots of similarities in what we are talking > about, I''m just gluing a slightly more abstract interface on top, > which has some benefits in some additional organizational and security > mechanisms (and a well-established (but not widely used yet) network > protocol encapsulation).Maybe something like my API underneath and something like the organisational stuff of 9p on top would be good. I''d have to see how the 9p organisational stuff stacks up against the publish and subscribe mechanisms for resource discovery that I''m used to. Also, the infrastructure requirements for building fault-tolerant systems are quite demanding and I''d have to be confident that 9p was up to the task before I''d be happy with it personally.> > There are plenty of details I know I''m glossing over, and I''m sure > I''ll need lots of help getting things right. I''d have preferred > staying quiet until I had my act together a little more, but Orran and > Ron convinced me that it was important to let people know the > direction I''m planning on exploring.Yes, definitely worthwhile. I''d like to see more discussion like this on the xen-devel list. On the one hand, it''s kind of embarrassing to discuss vaporware and half finished ideas but on the other, the opportunity for public comment at an early stage in the process is probably going to save a lot of effort in the long run. -- Harry Butterworth <harry@hebutterworth.freeserve.co.uk> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
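For concreteness, the two local_buffer_reference flavours described earlier in this message could be constructed as below. This is a sketch against the struct as given; the concrete field types, the enum values and the page-vector representation are assumptions.

    #include <stdint.h>

    typedef enum {
        virtual_address,     /* base is a pointer in the caller's address space */
        hidden_buffer_page   /* base points at a vector of page indices */
    } local_buffer_reference_type;

    struct local_buffer_reference {
        local_buffer_reference_type type;
        void    *base;
        uint32_t offset;   /* offset of the data into the first page/byte */
        uint32_t length;
    };

    /* Ordinary mapped memory: base = buf, offset = 0, length = count. */
    struct local_buffer_reference from_virtual(void *buf, uint32_t count)
    {
        struct local_buffer_reference r = { virtual_address, buf, 0, count };
        return r;
    }

    /* Pages never mapped into the application: base is the page vector,
     * offset is where the data starts within the first page. */
    struct local_buffer_reference from_pages(uint32_t *page_vector,
                                             uint32_t offset, uint32_t length)
    {
        struct local_buffer_reference r =
            { hidden_buffer_page, page_vector, offset, length };
        return r;
    }

An I/O application written against this abstraction handles both flavours identically; only the buffer provider knows which kind it handed out.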
On 5/8/05, Andrew Warfield <andrew.warfield@gmail.com> wrote:> Hi Eric, > > Your thoughts on 9P are all really interesting -- I''d come across > the protocol years ago in looking into approaches to remote device/fs > access but had a hard time finding details. It''s quite interesting to > hear a bit more about the approach taken. >There''s a bigger picture that Ron and I have discussed, but there are lots of details that need to be resolved/proven before its worth talking about. The overall goal we are working towards is a simple interface with equivalent or better performance. The generality of the approach we have discussed is appealing, but we are well aware it won''t be adopted unless we can show competitive performance and reliability.> > For the more specific cases of FE/BE comms, I think the devil may > be in the details more than the current discussion is alluding to. >I agree completely, in fact there are likely several devils unique to each target architecture ;) But that''s just the sort of thing that keeps system programming interesting. These sorts of things will only be worked out once we have a prototype which to drive into the various brick walls.> > There are also a pile of complicating cases with regards cache > eviction from a BE domain, migration, and so on that make the > accounting really tricky. I think it would be quite good to have a > discussion of generalized interdomain comms address the current > drivers, as well as a hypothetical buffer cache as potential cases. > Does 9P already have hooks that would allow you to handle this sort of > per-application special case? >Page table and cache manipulation would likely sit below the 9P layer to keep things portable and abstract (as I''ve said earlier, perhaps Harry''s IDC proposal is the right layer to handle such things). 9P as a protocol is quite flexible, but existing implementations are somewhat simplistic and limited in regard to more advanced buffer/page sharing capabilities. We are exploring some of these issues in our exploration of using Plan 9 and 9P in HPC cluster environments (using cluster interconnects) -- in fact, the idea of using 9P between VMM partitions fell out of that exploration.> > Additionally, I think we get away with a lot in the current drivers > from a falure model that excludes transport. The FE or BE can crash, > and the two drivers can be written defensively to handle that. How > does 9P handle the strangenesses of real distribution? >With existing 9P implementations, defensive drivers are the way to go. There have been three previous solutions proposed for failure detection and recovery with 9P. One handled recovery of 9P state between client and server, the other handled reliability of the underlying RPC transport, and the third was a layered file server providing defensive semantics. As I said earlier, we''ll be exploring these more fully (along with looking at fail over) this summer -- specifically in the context of VMMs. 9P isn''t a magic bullet here, and there will be lots of issues that will need to be dealt with either in the layer above it, or below it. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 5/8/05, Harry Butterworth <harry@hebutterworth.freeserve.co.uk> wrote:> > > > In our world, this would result in you holding a Fid pointing to the > > open object. The Fid is a pointer to meta-data and is considered > > state on both the FE and the BE. (this has downsides in terms of > > reliability and the ability to recover sessions or fail over to > > different BE's -- one of our summer students will be addressing the > > reliability problem this summer). > > OK, so this is an area of concern for me. I used the last version of > the sketchy API I outlined to create an HA cluster infrastructure. So I > had to solve these kind of protocol issues and, whilst it was actually > pretty easy starting from scratch, retrofitting a solution to an > existing protocol might be challenging, even for a summer student. >There are three previous attempts at providing this sort of facility in 9P that the student is going to be basing his work off of. All three worked to varying degrees of effectiveness - but there's no magic bullet here and clients and file servers need to be written defensively to be able to cope with such disruptions in a graceful manner. It's quite likely there will be different semantics for failure recovery depending on the resource.> > > > The FE performs a read operation passing it the necessary bits: > > ret = read( fd, *buf, count ); > > Here the API is coupling the client to the memory management > implementation by assuming that the buffer is mapped into the client's > virtual address space. > > This is probably likely to be true most of the time so an API at this > level will be useful but I'd also like to be able to write I/O > applications that manage the data in buffers that are never mapped into > the application address space. >Well, this was the context of the example (the FE was registering a buffer from its own address space). The existing Plan 9 API doesn't have a good example of how to handle the more abstract buffer handles you describe, but I don't think there's anything in the protocol which would prevent such a utilization. I need to think about this scenario a bit more; could you give an example of how you would use this feature?> Also, I'd like to be able to write applications that have clients which > use different types of buffers without having to code for each case in > my application. >The attempt at portability is admirable, but it just seems to add complexity -- if I want to use the reference, I'll have to make another function call to resolve the buffer. I guess I'm being too narrow-minded, but I just don't have a clear idea of the utility of hidden buffers. I never know who I am supposed to be hiding information from. ;)> > So, my application can deal with buffers described like that without > having to worry about the flavour of memory management backing them. >This is important. In my example I was working on the pretext that the client initiating the read was consuming the data in some way. When that's not the case, the interface is quite different, more like that of our file servers. In those cases, I can easily see passing the data by some more opaque reference (before I had figured scatter/gather buffers would be sufficient -- but perhaps your more abstract representation buys extra flexibility). 
I still hate the idea of having to resolve the abstract_buffer to get at the data, but perhaps that's the cost of efficiency -- I'll have to think about it some more.> Also, I can change the memory management without changing all the calls > to the API, I only have to change where I get buffers from. Again - I agree that this is an important aspect. Perhaps this sort of functionality is best called out separately, with its own interfaces to provide and resolve buffer handles -- it might be worth breaking out on its own. It seems like there would be three types of operations on your proposed struct: abstract_ref = get_ref( *real_data, flags ); /* constructor */ real_data = resolve_ref( *abstract_ref, flags); forget_ref( abstract_ref ); /* destructor */ Lots of details under the hood there (as it should be). flags could help specify things like read-only, cow, etc. Is such an interface sufficient? If I'm being naive here just tell me to shut up and I won't talk about it until I've had the time to look a little deeper into things.> > BTW, this specific abstraction I learnt about from an embedded OS > architected by Nik Shalor. He might have got it from somewhere else. >Any specific paper references we should be looking at? Or is it obvious from a google?> > The above looks complicated, but to a FE writer would be as simple as: > > channel = dial("net!BE"); /* establish connection */ > > /* in my current code, channel is passed as an argument to the FE as a > > boot arg */ > > root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */ > > fd = open(root, "/some/path/file", OREAD); > > ret = read(fd, *buf, sizeof(buf)); > > close(fd); > > close(root); > > close(channel); > > So, this is obviously a blocking API. My API was non-blocking because > the network latency means that you need a lot of concurrency for high > throughput and you don't necessarily want so many threads. Like AIO. > Having a blocking API as well is convenient though. >Yeah, I am betrayed by the simplicity of the existing API. However, I just wanted to point out that there is nothing specifically synchronous in the protocol. I tend to like the simplicity of using threads to deal with asynchronous behaviors, but efficient threads are hard to come by. Async APIs just seem to complicate driver writers' lives, but if this is the preferred methodology such an API could be used with the 9P protocol.> > One of the thoughts that did occur to me was that a reliance on in-order > message delivery (which 9p has) turns out to be quite painful to satisfy >There are certainly issues to be resolved here, but in environments (such as VMM transports) that preserve frame boundaries on messages, the in-order 9P requirements can be relaxed a great deal.> > Yes, definitely worthwhile. I'd like to see more discussion like this > on the xen-devel list. On the one hand, it's kind of embarrassing to > discuss vaporware and half finished ideas but on the other, the > opportunity for public comment at an early stage in the process is > probably going to save a lot of effort in the long run. >I'll try (perhaps with Ron's help) to put together some sort of white paper on our vision. It'd be quite easy to pull together an organizational demonstration of what we are talking about, but working out the performance/reliability/security details will likely take some time. I do like the general idea of building on top of many of the underlying bits you describe. 
I''m not quite sure we''d use all the features (your endpoint definition seems a bit over-engineered for our paradigm), but there are certainly lots of good things to take advantage of. -eric _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
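Written out as prototypes, the three-operation interface sketched in the message above might look like this; the length parameter, the flag names and the opaque handle type are all invented for the sketch.

    #include <stddef.h>

    typedef struct abstract_buffer abstract_buffer;   /* opaque handle */

    enum ref_flags {
        REF_READONLY = 1 << 0,
        REF_COW      = 1 << 1,
    };

    /* Constructor: wrap real memory in a handle that can be passed through
     * the infrastructure without the holder touching the data. */
    abstract_buffer *get_ref(void *real_data, size_t len, unsigned flags);

    /* Resolve: expose the data to whoever finally needs to look at it;
     * this is where the infrastructure chooses copy vs map vs page-flip. */
    void *resolve_ref(abstract_buffer *ref, unsigned flags);

    /* Destructor: drop the reference so pages can be unshared or reused. */
    void forget_ref(abstract_buffer *ref);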
On Sun, 2005-05-08 at 11:18 -0500, Eric Van Hensbergen wrote:> > This is probably likely to be true most of the time so an API at this > > level will be useful but I''d also like to be able to write I/O > > applications that manage the data in buffers that are never mapped into > > the application address space. > > > > Well, this was the context of the example (the FE was registering a > buffer from its own address space). The existing Plan 9 API doesn''t > have a good example of how to handle the more abstract buffer handles > you describe, but I don''t think there''s anything in the protocol which > would prevent such a utilization. I need to think about this scenario > a bit more, could you give an example how how you would use this > feature?Read from disk, cache in buffer cache, sendfile to remote client.> > > Also, I''d like to be able to write applications that have clients which > > use different types of buffers without having to code for each case in > > my application. > > > > The attempt at portability is admirable, but it just seems to add > complexity -- if I want to use the reference, I''ll have to make > another functional call to resolve the buffer. I guess I''m being too > narrow minded, but I just don''t have a clear idea of the utility of > hidden buffers. I never know who I am supposed to be hiding > information from. ;)Yourself: it''s harder for a bug to scribble on the data if it''s not mapped into the address space. Also, it eliminates the overhead of page table manipulations which aren''t needed if you don''t want to look at the data.> > Also, I can change the memory management without changing all the calls > > to the API, I only have to change where I get buffers from. > > Again - I agree that this is an important aspect. Perhaps this sort > of functionality is best called out separately with its own interfaces > to provide and resolve buffer handles. It seems like perhaps this > might be worth breaking out into its own. It seems like there would > be three types of operations on your proposed struct: > abstract_ref = get_ref( *real_data, flags ); /* constructor */ > real_data = resolve_ref( *abstract_ref, flags); > forget_ref( abstract_ref ); /* destructor */ > Lots of details under the hood there (as it should be). flags could > help specify things like read-only, cow, etc. Is such an interface > sufficient? If I''m being naive here just tell me to shut up and I''ll > won''t talk about it until I''ve had the time to look a little deeper > into things.You could have something like that or you could make a local_buffer_reference for some memory in the virtual address space and then use the local_buffer_reference_copy function which would get the data into your address space as efficiently as it could. It works out as pretty much the same thing. Both cases require allocation of the virtual address space somehow, if this is an operation that could fail (more physical memory than available address space for example) then that must be expressed in the API somehow; having to allocate the memory up-front is a relatively clean way of doing this. This method also allows you to copy between a local_buffer_reference and the stack and warms the CPU cache up just before you look at the data. Which approach you choose probably depends on the context. I guess you might want both.> > > > > BTW, this specific abstraction I learnt about from an embedded OS > > architected by Nik Shalor. He might have got it from somewhere else. 
> > > > Any specific paper references we should be looking at? Or is it obvious > from a google? Here's a reference for historical interest, but be aware that I'm proposing something different. In particular, these DDRs confuse the function of local and remote buffer references. Page numbered 16 of the following doc, or page 32 according to xpdf. http://www-900.ibm.com/cn/support/library/storage/download/Advanced%20SerialRAID%20Plus%20Adapter%20Technical%20Reference.pdf Harry. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
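As a usage sketch of the alternative Harry describes above - wrap your own memory in a virtual_address reference and let the copy routine choose the cheapest path - with the types repeated from the earlier sketch and local_buffer_reference_copy assumed to be provided by the IDC layer:

    #include <stdint.h>

    typedef enum { virtual_address, hidden_buffer_page } local_buffer_reference_type;

    struct local_buffer_reference {
        local_buffer_reference_type type;
        void    *base;
        uint32_t offset;
        uint32_t length;
    };

    /* Provided by the IDC layer: copy between two references, picking
     * memcpy, page remapping or a specialised method for the type pair. */
    int local_buffer_reference_copy(struct local_buffer_reference dst,
                                    struct local_buffer_reference src);

    /* Consume data held behind any kind of reference without caring what
     * memory backs it. */
    int consume(struct local_buffer_reference src)
    {
        uint8_t buf[512];
        if (src.length > sizeof buf)
            return -1;
        struct local_buffer_reference dst = { virtual_address, buf, 0, src.length };
        int rc = local_buffer_reference_copy(dst, src);
        /* On success buf holds the data and is already warm in the cache. */
        return rc;
    }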
Andrew Warfield wrote:> Hi Eric, > > Your thoughts on 9P are all really interesting -- I''d come across > the protocol years ago in looking into approaches to remote device/fs > access but had a hard time finding details. It''s quite interesting to > hear a bit more about the approach taken. > > Having a more accessible inter-domain comms API is clearly a good > thing, and extending device channels (in our terminology -- shared > memory + event notification) to work across a cluster is something > that we''ve talked about on several occasions at the lab. > > I do think though, that as mentioned above there are some concerns > with the VMM environment that make this a little trickier. For the > general case of inefficient comms between VMs, using the regular IP > stack may be okay for many people. The net drivers are being fixed up > to special-case local communications. > > For the more specific cases of FE/BE comms, I think the devil may > be in the details more than the current discussion is alluding to. > Specifically: > > >>c) As long as the buffers in question (both *buf and the buffer cache >>entry) were page-aligned, etc. -- we could play clever VM games >>marking the page as shared RO between the two partitions and alias the >>virtual memory pointed to by *buf to the shared page. This is very >>sketchy and high level and I need to delve into all sorts of details >>-- but the idea would be to use virtual memory as your friend for >>these sort of shared read-only buffer caches. It would also require >>careful allocation of buffers of the right size on the right alignment >>-- but driver writers are used to that sort of thing. > > > Most of the good performance that Xen gets off of block and net > split devices are specifically because of these clever VM games. > Block FEs pass page references down to be mapped directly for DMA. > Net devices pass pages into a free pool, and actually exchange > physical pages under the feet of the VM as inbound packets are > demultiplexed. The grant tables that have recently been added provide > separate mechanisms for the mapping and ownership transfer of pages > across domains. In addition to these tricks, we make careful use of > timing event notification in order to batch messages.It should be possible to still use the page mapping in the i/o transport. The issue right now is that the i/o interface is very low-level and intimately tangled up with the structs being transported. And with the domain control channel there''s an implicit assumption that ''there can be only one''. This means for example, that domain A using a device with backend in domain B can''t connect directly to domain B, but has to be ''introduced'' by xend. It''d be better if it could connect directly. Something like what Harry proposes should still be able to use page mapping for efficient local comms, but without _requiring_ it. This opens the way for alternative transports, such as network. Rather than going straight for something very high-level, I''d prefer to build up gradually, starting with a more general message transport api that includes analogues to listen/connect/recv/send.> > In the case of the buffer cache that has come up several times in > the thread, a cache across domains would potentially neet to pass read > only page mappings as CoW in many situations, and a fault handler > somewhere would need to bring in a new page to the guest on a write. 
> There are also a pile of complicating cases with regards cache > eviction from a BE domain, migration, and so on that make the > accounting really tricky. I think it would be quite good to have a > discussion of generalized interdomain comms address the current > drivers, as well as a hypothetical buffer cache as potential cases. > Does 9P already have hooks that would allow you to handle this sort of > per-application special case? > > Additionally, I think we get away with a lot in the current drivers > from a falure model that excludes transport. The FE or BE can crash, > and the two drivers can be written defensively to handle that. How > does 9P handle the strangenesses of real distribution? > > Anyhow, very interesting discussion... looking forward to your thoughts. > > a. >Mike _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> It should be possible to still use the page mapping in the i/o transport. > The issue right now is that the i/o interface is very low-level and > intimately tangled up with the structs being transported.I don''t doubt that it is possible. The point I was making is that the current i/o interfaces are low level for a reason, and that generalizing this to a higher-level communications primitive is a non-trivial thing. Just considering the disk and net interfaces, the current device channels each make particular decisions regarding (a) what to copy and what to map, (b) when to send notification to get efficient batching through the scheduler, and most recently (c) which grant mechanism to use to pass pages securely across domains. Having a higher-level API to make all this easier, and especially to reduce the code/complexity required to build new drivers etc is something that will be fantastic to have. I think though that at least some of these underlying issues will need to be exposed for it to be useful. I''m not convinced that reimplementing the sockets API for interdomain communication is a very good solution... The buffer_reference struct that Harry mentioned looks quite interesting as a start though in terms of describing a variety of underlying transports. Do you have a paper reference on that work, Harry? With regards forwarding device channels across a network, I think we can expect application-level involvement for shifting device messages across a cluster. If this is down the road, and it''s certainly something that has been discussed, a device channel is potentially two local shared memory device channels between VMs on local hosts, and a network connection between the physical hosts. Beyond the more complicated error cases that this obviously involves, we can then make this as arbitrarily more complex by discussing HA or security concerns... for the moment though, I think it would be interesting to see how well the existing local host cases can be generalized. ;)> And with the domain control channel there''s an implicit assumption > that ''there can be only one''. This means for example, that domain A > using a device with backend in domain B can''t connect directly to domain B, > but has to be ''introduced'' by xend. It''d be better if it could connect > directly.This is not a flaw with the current implementation -- it''s completely intentional. By forcing control through xend we ensure that there is a single point for control logic, and for managing state. Why do you feel it would be better to provide arbitrary point-to-point comms in a VMM environment that is specifically trying to provide isolation between guests?> Something like what Harry proposes should still be able to use > page mapping for efficient local comms, but without _requiring_ > it. This opens the way for alternative transports, such as network. > > Rather than going straight for something very high-level, I''d prefer > to build up gradually, starting with a more general message transport > api that includes analogues to listen/connect/recv/send.As I said, I''m unconvinced that trying to mimic the sockets API is the right way to go -- I think the communicating parties often want to see and work with batches of messages without having to do extra copies or have event notification made implicit. I think you are completely right about a gradual approach though -- having a generalized host-local device channel would be very interesting to see... 
especially if it could be shown to apply to the existing block, net, usb, and control channels in a simplifying fashion. a. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andrew Warfield wrote:>>It should be possible to still use the page mapping in the i/o transport. >>The issue right now is that the i/o interface is very low-level and >>intimately tangled up with the structs being transported. > > I don''t doubt that it is possible. The point I was making is that the > current i/o interfaces are low level for a reason, and that > generalizing this to a higher-level communications primitive is a > non-trivial thing. Just considering the disk and net interfaces, the > current device channels each make particular decisions regarding (a) > what to copy and what to map, (b) when to send notification to get > efficient batching through the scheduler, and most recently (c) which > grant mechanism to use to pass pages securely across domains.It should be relatively easy to provide these kinds of facilities in a higher-level api.> Having a higher-level API to make all this easier, and especially to > reduce the code/complexity required to build new drivers etc is > something that will be fantastic to have. I think though that at > least some of these underlying issues will need to be exposed for it > to be useful. I''m not convinced that reimplementing the sockets API > for interdomain communication is a very good solution...I wasn''t suggesting exactly the sockets api, but something more like the connect/send and listen/recv logic. Harry''s API is quite like that, with additional higher-level facilities.> The > buffer_reference struct that Harry mentioned looks quite interesting > as a start though in terms of describing a variety of underlying > transports. Do you have a paper reference on that work, Harry? > > With regards forwarding device channels across a network, I think we > can expect application-level involvement for shifting device messages > across a cluster. If this is down the road, and it''s certainly > something that has been discussed, a device channel is potentially two > local shared memory device channels between VMs on local hosts, and a > network connection between the physical hosts. Beyond the more > complicated error cases that this obviously involves, we can then make > this as arbitrarily more complex by discussing HA or security > concerns... for the moment though, I think it would be interesting to > see how well the existing local host cases can be generalized. ;) > > >>And with the domain control channel there''s an implicit assumption >>that ''there can be only one''. This means for example, that domain A >>using a device with backend in domain B can''t connect directly to domain B, >>but has to be ''introduced'' by xend. It''d be better if it could connect >>directly. > > This is not a flaw with the current implementation -- it''s completely > intentional. By forcing control through xend we ensure that there is > a single point for control logic, and for managing state. Why do you > feel it would be better to provide arbitrary point-to-point comms in a > VMM environment that is specifically trying to provide isolation > between guests?OK, so it''s an intentional flaw ;-). One reason is that front-end drivers have to connect to their backends. If they can find out who to connect to and then do it, it simplifies things. Especially when that info is available from a store or registry service as proposed for 3.0. At the moment xend has to exchange messages with the domain to get the device front-end handle and shared page address, and then exchange messages with the back-end so it can create the device and map the page. 
Telling the front-end which back-end to connect to would be much simpler.>>Something like what Harry proposes should still be able to use >>page mapping for efficient local comms, but without _requiring_ >>it. This opens the way for alternative transports, such as network. >> >>Rather than going straight for something very high-level, I'd prefer >>to build up gradually, starting with a more general message transport >>api that includes analogues to listen/connect/recv/send. > > > As I said, I'm unconvinced that trying to mimic the sockets API is the > right way to go -- I think the communicating parties often want to see > and work with batches of messages without having to do extra copies or > have event notification made implicit. Like I said, I wasn't suggesting _exactly_ the sockets api, more the spirit of it. There is an analogue of batching for sockets though: flush.> I think you are completely > right about a gradual approach though -- having a generalized > host-local device channel would be very interesting to see... > especially if it could be shown to apply to the existing block, net, > usb, and control channels in a simplifying fashion. >Just a small matter of programming then :-). Mike _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Tue, 2005-05-10 at 11:09 +0100, Andrew Warfield wrote:

> Just considering the disk and net interfaces, the
> current device channels each make particular decisions regarding (a)
> what to copy and what to map, (b) when to send notification to get
> efficient batching through the scheduler, and most recently (c) which
> grant mechanism to use to pass pages securely across domains.

(a) looks like a property of the local_buffer_reference type. You might
be able to think of a really good way of doing (b) automatically,
independent of any specific client, and (c) is either like (a) or it's a
strategy passed when registering a buffer to get a remote buffer
reference. I'm guessing here though---is there a list of all the
considerations that went into the design of the current behaviour? That
would help in mapping it to any new proposal.

> The
> buffer_reference struct that Harry mentioned looks quite interesting
> as a start though in terms of describing a variety of underlying
> transports. Do you have a paper reference on that work, Harry?

I can't give you a suitable reference, but there's not much more to the
concepts than what I've already written up on this list and I can chip
in with anything I've forgotten.

Here are a few more random details (a code sketch of the registration
calls follows after this message):

Buffer providers can register to get a local_buffer_reference type for
the buffers they will provide. As a minimum, they also have to register
methods to copy data between a buffer of that type and the stack. This
allows a core generic mechanism to implement any copy operation.

Buffer providers can register specialised methods to copy directly
between two specific local buffer types. These methods override the
generic mechanism.

As well as a copy operation which leaves the buffers identical, you
might want a more efficient move operation which leaves the source
undefined (or perhaps all zeroes or unmapped or something).

Some local_buffer_reference types might have wider APIs. For example, an
application might want to mess with and be frugal with individual pages,
in which case it might know that it allocates all its buffers from a
buffer provider that provides buffers of a specific type. Knowing the
type, it can down-cast the local buffer references it owns to get access
to the wider API to do more detailed work.

Reads can often complete when the data is received, without waiting for
separate status. I worked this optimisation into my implementation by
getting notification from a registered buffer when it had been filled.

My disconnects were cluster-wide and were coupled to the lifetime of
remote_buffer_references in the cluster. I didn't think this was
appropriate for Xen, so my sketchy API had per-connection disconnects
and didn't mention any coupling with the remote_buffer_reference
lifetime. This is relevant when it comes to defining what kind of
transport errors might happen and requires further thought.

> for the moment though, I think it would be interesting to
> see how well the existing local host cases can be generalized. ;)

Yep, sounds like a good way to start :-)

> > And with the domain control channel there's an implicit assumption
> > that 'there can be only one'. This means for example, that domain A
> > using a device with backend in domain B can't connect directly to
> > domain B, but has to be 'introduced' by xend. It'd be better if it
> > could connect directly.
>
> This is not a flaw with the current implementation -- it's completely
> intentional. By forcing control through xend we ensure that there is
> a single point for control logic, and for managing state. Why do you
> feel it would be better to provide arbitrary point-to-point comms in a
> VMM environment that is specifically trying to provide isolation
> between guests?

There are two different issues here:

1) Having one Xend might be correct at the moment. The _assumption in
the guests_ that there is only one Xend is technically a minor flaw. If,
for example, the guest got an idc_address for Xend, the guest would be
decoupled from the design choice of one or many Xends.

2) The Xend introductions. The weird aspect about these is that they are
a triangular protocol with phases initiated by xend in two directions,
potentially at the same time. I'm more used to control flow following
resource dependencies, for example: server tells third-party about
resource availability; third-party assigns resources to clients; clients
connect directly to server. This is also triangular, but if you put the
server at the top and bottom it becomes a stack ordered by dependencies.
I think stacks are much less confusing than triangles and have more
easily defined interlocks and less risk of introducing timing windows.
I think it ought to be possible to get the security right for a stack
solution as well as the triangle solution. Maybe a security expert could
provide some help.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
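A rough sketch, in C with invented names, of the buffer-provider
registration described above: providers register copy-to/from-stack
methods so a core mechanism can implement any copy generically, and may
register specialised direct copy or move methods between particular
buffer types which override the generic path. None of these types or
functions exist anywhere; they only illustrate the concepts.

/* Hypothetical sketch of the buffer abstraction described above.  The
   type and function names are invented for illustration only.            */

#include <stddef.h>

typedef unsigned int lbr_type_t;  /* identifies a local buffer type        */

/* A local_buffer_reference: which kind of buffer, plus where in it.       */
struct local_buffer_reference {
    lbr_type_t type;
    void      *handle;            /* provider-specific buffer handle       */
    size_t     offset;
    size_t     length;
};

/* Minimum methods a buffer provider must register: copies between a
   buffer of its type and plain memory ("the stack"), which let a core
   generic mechanism implement any copy by bouncing through an
   intermediate area.                                                      */
struct buffer_provider_ops {
    int (*copy_to_stack)(const struct local_buffer_reference *src,
                         void *dst, size_t len);
    int (*copy_from_stack)(struct local_buffer_reference *dst,
                           const void *src, size_t len);
};

/* Register a provider; returns the new local_buffer_reference type.       */
lbr_type_t buffer_provider_register(const struct buffer_provider_ops *ops);

/* Optional: a specialised direct copy between two specific buffer types,
   overriding the generic stack-bounce mechanism.                          */
typedef int (*lbr_transfer_fn)(const struct local_buffer_reference *src,
                               struct local_buffer_reference *dst,
                               size_t len);
void buffer_register_copy(lbr_type_t src_type, lbr_type_t dst_type,
                          lbr_transfer_fn copy);

/* Optional: a move, which may leave the source undefined (or zeroed or
   unmapped) in exchange for being cheaper than a copy.                    */
void buffer_register_move(lbr_type_t src_type, lbr_type_t dst_type,
                          lbr_transfer_fn move);

/* Core entry point: uses a registered direct method if one exists for the
   (src, dst) type pair, otherwise falls back to the stack bounce.         */
int buffer_copy(const struct local_buffer_reference *src,
                struct local_buffer_reference *dst, size_t len);

The "wider API" point in the message corresponds to an owner that knows
the type field of a reference it holds and can therefore down-cast the
handle to a provider-specific structure.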
Hi Harry,

These two comments seem more about the immediate control tool plans than
about a generic comms mechanism.

> There are two different issues here:

These seem pretty concise as regards the new control tool plans:

> 1) Having one Xend might be correct at the moment. The _assumption in
> the guests_ that there is only one Xend is technically a minor flaw. If,
> for example, the guest got an idc_address for Xend, the guest would be
> decoupled from the design choice of one or many Xends.

I think we are all keen to move in the direction of having no xend but
rather a small set of single purpose daemons that handle specific tasks
and map well on to the emerging security primitives.

> 2) The Xend introductions. The weird aspect about these is that they
> are a triangular protocol with phases initiated by xend in two
> directions, potentially at the same time. I'm more used to control flow
> following resource dependencies, for example: server tells third-party
> about resource availability; third-party assigns resources to clients;
> clients connect directly to server. This is also triangular but if you
> put the server at the top and bottom it becomes a stack ordered by
> dependencies. I think stacks are much less confusing than triangles and
> have more easily defined interlocks and less risk of introducing timing
> windows. I think it ought to be possible to get the security right for
> a stack solution as well as the triangle solution. Maybe a security
> expert could provide some help.

The triangular connection setup in the current code is a little grim and
something that will change very soon. The two relevant bits of this are
the event channel and the shared page address on setting up a new device
channel. Keir added support for unbound event channels quite a while
ago, and the control tools/drivers just haven't evolved to take
advantage of them yet. The store will be used for additional connection
set-up state, such as shared page addresses.

Connection setup this way does map on to the connect/listen semantics
that Mike has been advocating. For example, on a request to add a new
frontend, the backend driver will create a simple state machine for the
new device channel and assign an unbound event channel to it. It will
then move out of this (unbound) "listening" state when the front end
connects to the event channel and sends the first notification. So we
still have something in the middle, but it's a much more simple and
generic thing.

a.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
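A sketch of the backend-side "listening" state machine described in the
last paragraph above. The helper evtchn_alloc_unbound_port() is a
placeholder for whatever the real unbound event channel and store
interfaces turn out to be; the structure and function names are
invented.

/* Illustrative only: the "listening" device channel state machine
   sketched above.  evtchn_alloc_unbound_port() is a placeholder, not a
   real call.                                                              */

typedef enum {
    DEVCHAN_LISTENING,   /* unbound event channel allocated, waiting       */
    DEVCHAN_CONNECTED,   /* frontend bound the channel, first event seen   */
    DEVCHAN_CLOSED
} devchan_state_t;

struct devchan {
    devchan_state_t state;
    int             port;      /* local end of the event channel           */
    unsigned int    frontend;  /* domain id of the peer                    */
};

/* Placeholder: allocate an event channel port that the given remote
   domain is allowed to bind to later.  Returns the port or a negative
   error.                                                                  */
extern int evtchn_alloc_unbound_port(unsigned int remote_domid);

/* Backend: called on a request to add a device for a new frontend.        */
static int devchan_listen(struct devchan *dc, unsigned int frontend_domid)
{
    dc->frontend = frontend_domid;
    dc->port     = evtchn_alloc_unbound_port(frontend_domid);
    if (dc->port < 0)
        return dc->port;
    dc->state = DEVCHAN_LISTENING;
    /* The port (and later the shared page reference) would be advertised
       to the frontend through the store rather than via xend messages.    */
    return 0;
}

/* Backend: event channel notification handler.                            */
static void devchan_event(struct devchan *dc)
{
    if (dc->state == DEVCHAN_LISTENING) {
        /* First notification from the frontend ends the listening state.  */
        dc->state = DEVCHAN_CONNECTED;
        return;
    }
    /* When connected, process ring requests here.                         */
}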
On Tue, 2005-05-10 at 16:26 +0100, Andrew Warfield wrote:

> I think we are all keen to move in the direction of having no xend but
> rather a small set of single purpose daemons that handle specific
> tasks and map well on to the emerging security primitives.

Yep.

> Connection setup this way does map on to the connect/listen semantics
> that Mike has been advocating. For example, on a request to add a new
> frontend, the backend driver will create a simple state machine for
> the new device channel and assign an unbound event channel to it. It
> will then move out of this (unbound) "listening" state when the front
> end connects to the event channel and sends the first notification.

The above description happens to fit inside my endpoint_create call too.

I think the significant constraints in this area are that the choice of
connect/listen semantics must be compatible with the introduction
mechanism and the security requirements.

With my API I assumed a symmetrical connection process where both ends
created an endpoint for the same address and the IDC implementation did
the work of binding them together. I didn't give any thought to the
implications for the introduction mechanism or the security requirements
or consider any other options so I'm not particularly attached to this
aspect of my proposal. I was primarily trying to communicate the buffer
abstractions.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
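To make the symmetry concrete, a sketch of the endpoint_create style of
setup mentioned above, where both domains make the same call for the
same address and the IDC layer does the binding. All names and fields
here are invented; the actual proposal on this list may differ.

/* Hypothetical sketch of a symmetric connection process: both ends call
   idc_endpoint_create() for the same address and the IDC implementation
   binds the two endpoints together.  Nothing here is an existing API.    */

struct idc_address {
    unsigned int peer_domid;   /* the other domain                         */
    unsigned int service;      /* agreed service/interface number          */
};

struct idc_endpoint;           /* opaque, owned by the IDC layer           */

struct idc_callbacks {
    void (*connected)(struct idc_endpoint *ep);     /* both ends present   */
    void (*disconnected)(struct idc_endpoint *ep);  /* peer went away      */
    void (*message)(struct idc_endpoint *ep, const void *msg,
                    unsigned int len);
};

/* Called identically in the frontend and the backend.  The IDC layer is
   responsible for the introduction: binding the event channel and sharing
   whatever memory the transport needs.                                    */
struct idc_endpoint *idc_endpoint_create(const struct idc_address *addr,
                                         const struct idc_callbacks *cb);

void idc_endpoint_destroy(struct idc_endpoint *ep);

Whether this symmetric creation can be made to fit the store-based
introduction mechanism and the security requirements is exactly the open
question raised in the message above.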
Aggarwal, Vikas \(OFT\)
2005-May-12 17:47 UTC
RE: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c
Hi, How can I get my virtual-trace-device into the SXP? Inside the SrvDomainDir.py:SrvDomainDir::op_create the "configstring" has devices for vbd/vif. I did a print on configstring. Since I did not create a file based in blkctl.py for my virtual-trace-device so (blkctl.py)script = xroot.get_block_script() never happens for my trace-device. Hence (XendRoot.py) get_config_value->sxp.child_value will never happen. Please clarify my confusions in area of SXP initializations. I hacked around the above piece of code after I saw (channel.py)(" requestReceived > No device ") in xend-debug.log for my device type. I am still trying to figure out time-out in my FE driver, by looking it messages simultaneously I think "No device" is the culprit for eventual timeout in FE driver. -vikas -------------------------------------------------------- This e-mail, including any attachments, may be confidential, privileged or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system. -----Original Message----- From: Harry Butterworth [mailto:harry@hebutterworth.freeserve.co.uk] Sent: Thursday, May 05, 2005 4:37 PM To: Aggarwal, Vikas (OFT) Cc: xen-devel@lists.xensource.com Subject: Re: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c To answer your questions: I think (75% confidence) create_devices is used to create devices when a frontend domain is started whereas device_create is used to create an extra single device on the fly when the frontend domain is already running. Your frontend is probably timing out waiting to hear about interfaces from xend. To establish interface connections between the frontend and backend with the current xend code you need to implement the following basic flow: When xend starts, it creates a controller factory for your device class. When a frontend domain is started, the controller factory is used to create a controller for the frontend domain. Configured devices for your device class are attached to the controller for the frontend domain. Attaching the device creates the device object which in turn does a lookup of the "backend interface" object. The backend interface object represents the communication channel between the frontend domain and the backend domain. If the backend interface object does not exist it is created at this point. When a backend interface is created, it finds the backend controller object for its backend domain. If the backend controller object does not exist it is created at this point. Some of the drivers assume that the backend driver is already up when the backend controller object is created and so do not listen for the driver status up message from the backend. I''m implementing a driver as loadable modules so I need to use this message. In my case, the backend driver sends a driver status up message to xend which is handled by the backend controller object. The backend controller object notifies all its backend interface objects that the backend driver is up. When the frontend driver is initialised, it sends a driver status up message to xend. This is handled by the controller object. The controller object notifies all its backend interface objects that the frontend driver is up. So, a backend interface object gets a driver up/down notification at each end. 
When a backend interface finds that the drivers at both its ends are up, it can squence the establishment of the connection: The backend interface object sends a "be create" message to the backend to create the object representing the interface in the backend. The backend interface object sends a "fe interface status disconnected" message to the frontend to create the object representing the interface in the frontend. The frontend allocates a page for the ring interface and sends a "fe interface connect" message to xend which is forwarded by the controller object to the backend interface object. xend opens an event channel between the two domains and sends a "be connect" message to the backend passing the event channel and the address of the page for the ring interface. The backend responds to the be connect and the backend interface object sends an fe interface status connected message to the frontend, passing the event channel number. Unloading and reloading the frontend requires that the backend stops writing to the shared page of memory before the frontend frees it for reuse. In my case, the frontend sends a driver status down message, which is handled by the controller object for the frontend. This object notifies all the backend interface objects for that frontend which then send be disconnect messages and wait for a response before acknowledging back to the controller object. The controller object waits for all backend interfaces to complete before acknowledging the frontend driver status down. Unloading and reloading the backend requires that the connection is broken and reestablished: the backend stops using the shared pages before sending a driver status down message. The driver status down is handled by the backend controller which notifies all the connected backend interfaces which in turn send fe interface status disconnected messages to their respective frontend domains. The frontend domains respond with fe interface connect messages, trying to reestablish a connection. When the back-end driver is reloaded it sends a driver status up which is propagated to the backend interface objects which send be create messages again and follow up with be connect messages containing the new shared pages and fe interface status connected messages to complete the connection. This is just the establishment of the connection. There are also messages sent by the devices to create the devices in the backend after the backend interfaces send the be create messages to create the interfaces. Also, this description covers the internals of xend only. The Frontend and backend drivers must handle the messages, map the shared page, initialise the ring interface, allocate IRQs and bind to the event channel as appropriate. There are a number of people working to make this simpler, in particular Mike Wray is apparently rewriting xend and Anthony Liguori is working on an alternative set of tools. Also, the grant tables implementation probably means that the above description is out of date. I''ve not investigated grant tables much yet. I''ve not been following the checkins to unstable since the xen summit so I''ll have missed anything not mentioned on the mailing list or IRC. The current overhead in terms of client code to establish an entity on the xen inter-domain communication "bus" is currently of the order of 1000 statements (counting FE, BE and slice of xend). A better inter-domain communication API could reduce this to fewer than 10 statements. 
If it''s not done by the time I finish the USB work, I will hopefully be allowed to help with this. Harry. On Thu, 2005-05-05 at 11:18 -0400, Aggarwal, Vikas (OFT) wrote:> Hi, > > I have added a debug-Front-end.c & debug-Back-end.c. Also created my > debug.py based on blkif.py > > But my front-end still times out even after sending > ctrl_if_register_receiver() to domain controller during module_init. > > I think I still did not initialize my debug.py properly. I saw VBD > gets initialized in device_create or create_devices. Whats the > difference in the two? > > Where should I look to initialize the event channel in domain > controller? > > My virtual device in back-end collects the debug data and writes tofile> system so I don''t want it to be initialized during boot time viasome> config file. > > -vikas > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Harry Butterworth
2005-May-13 09:08 UTC
RE: [Xen-devel] please help: initialize XEND for my debug-FE/BE.c
You need to modify tools/python/xm/create.py to get your config parameters into the config SXP. You''ll need to have a new controllerfactory for your device class as described in my previous note attached below. You need to create your controllerfactory in tools/python/xen/xend/server/SrvDaemon.py. You also need to modify python/xen/xend/XendDomainInfo.py to add a device handler for your device. The device handler is used to parse the device config in the config SXP and create the devices using your new controller factory. You need to list your messages in tools/python/xen/xend/server/messages.py. You need to do the C->python and python->c translation of your messages in tools/python/xen/lowlevel/xu/xu.c Harry. On Thu, 2005-05-12 at 13:47 -0400, Aggarwal, Vikas (OFT) wrote:> Hi, > How can I get my virtual-trace-device into the SXP? > Inside the SrvDomainDir.py:SrvDomainDir::op_create the "configstring" > has devices for vbd/vif. I did a print on configstring. > > Since I did not create a file based in blkctl.py for my > virtual-trace-device so (blkctl.py)script = xroot.get_block_script() > never happens for my trace-device. Hence (XendRoot.py) > get_config_value->sxp.child_value will never happen. > > Please clarify my confusions in area of SXP initializations. I hacked > around the above piece of code after I saw (channel.py)(" > requestReceived > No device ") in xend-debug.log for my device type. > > I am still trying to figure out time-out in my FE driver, by looking it > messages simultaneously I think "No device" is the culprit for eventual > timeout in FE driver. > > -vikas > > > > > -------------------------------------------------------- > This e-mail, including any attachments, may be confidential, privileged or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system. > > > -----Original Message----- > > From: Harry Butterworth [mailto:harry@hebutterworth.freeserve.co.uk] > Sent: Thursday, May 05, 2005 4:37 PM > To: Aggarwal, Vikas (OFT) > Cc: xen-devel@lists.xensource.com > Subject: Re: [Xen-devel] please help: initialize XEND for my > debug-FE/BE.c > > To answer your questions: > > I think (75% confidence) create_devices is used to create devices when a > frontend domain is started whereas device_create is used to create an > extra single device on the fly when the frontend domain is already > running. > > Your frontend is probably timing out waiting to hear about interfaces > from xend. To establish interface connections between the frontend and > backend with the current xend code you need to implement the following > basic flow: > > When xend starts, it creates a controller factory for your device class. > > When a frontend domain is started, the controller factory is used to > create a controller for the frontend domain. > > Configured devices for your device class are attached to the controller > for the frontend domain. > > Attaching the device creates the device object which in turn does a > lookup of the "backend interface" object. The backend interface object > represents the communication channel between the frontend domain and the > backend domain. If the backend interface object does not exist it is > created at this point. 
When a backend interface is created, it finds > the backend controller object for its backend domain. If the backend > controller object does not exist it is created at this point. > > Some of the drivers assume that the backend driver is already up when > the backend controller object is created and so do not listen for the > driver status up message from the backend. I''m implementing a driver as > loadable modules so I need to use this message. > > In my case, the backend driver sends a driver status up message to xend > which is handled by the backend controller object. The backend > controller object notifies all its backend interface objects that the > backend driver is up. > > When the frontend driver is initialised, it sends a driver status up > message to xend. This is handled by the controller object. The > controller object notifies all its backend interface objects that the > frontend driver is up. > > So, a backend interface object gets a driver up/down notification at > each end. When a backend interface finds that the drivers at both its > ends are up, it can squence the establishment of the connection: > > The backend interface object sends a "be create" message to the backend > to create the object representing the interface in the backend. > > The backend interface object sends a "fe interface status disconnected" > message to the frontend to create the object representing the interface > in the frontend. > > The frontend allocates a page for the ring interface and sends a "fe > interface connect" message to xend which is forwarded by the controller > object to the backend interface object. > > xend opens an event channel between the two domains and sends a "be > connect" message to the backend passing the event channel and the > address of the page for the ring interface. > > The backend responds to the be connect and the backend interface object > sends an fe interface status connected message to the frontend, passing > the event channel number. > > Unloading and reloading the frontend requires that the backend stops > writing to the shared page of memory before the frontend frees it for > reuse. In my case, the frontend sends a driver status down message, > which is handled by the controller object for the frontend. This object > notifies all the backend interface objects for that frontend which then > send be disconnect messages and wait for a response before acknowledging > back to the controller object. The controller object waits for all > backend interfaces to complete before acknowledging the frontend driver > status down. > > Unloading and reloading the backend requires that the connection is > broken and reestablished: the backend stops using the shared pages > before sending a driver status down message. The driver status down is > handled by the backend controller which notifies all the connected > backend interfaces which in turn send fe interface status disconnected > messages to their respective frontend domains. The frontend domains > respond with fe interface connect messages, trying to reestablish a > connection. When the back-end driver is reloaded it sends a driver > status up which is propagated to the backend interface objects which > send be create messages again and follow up with be connect messages > containing the new shared pages and fe interface status connected > messages to complete the connection. > > This is just the establishment of the connection. 
There are also > messages sent by the devices to create the devices in the backend after > the backend interfaces send the be create messages to create the > interfaces. > > Also, this description covers the internals of xend only. The Frontend > and backend drivers must handle the messages, map the shared page, > initialise the ring interface, allocate IRQs and bind to the event > channel as appropriate. > > There are a number of people working to make this simpler, in particular > Mike Wray is apparently rewriting xend and Anthony Liguori is working on > an alternative set of tools. > > Also, the grant tables implementation probably means that the above > description is out of date. I''ve not investigated grant tables much > yet. > > I''ve not been following the checkins to unstable since the xen summit so > I''ll have missed anything not mentioned on the mailing list or IRC. > > The current overhead in terms of client code to establish an entity on > the xen inter-domain communication "bus" is currently of the order of > 1000 statements (counting FE, BE and slice of xend). A better > inter-domain communication API could reduce this to fewer than 10 > statements. If it''s not done by the time I finish the USB work, I will > hopefully be allowed to help with this. > > Harry. > > On Thu, 2005-05-05 at 11:18 -0400, Aggarwal, Vikas (OFT) wrote: > > Hi, > > > > I have added a debug-Front-end.c & debug-Back-end.c. Also created my > > debug.py based on blkif.py > > > > But my front-end still times out even after sending > > ctrl_if_register_receiver() to domain controller during module_init. > > > > I think I still did not initialize my debug.py properly. I saw VBD > > gets initialized in device_create or create_devices. Whats the > > difference in the two? > > > > Where should I look to initialize the event channel in domain > > controller? > > > > My virtual device in back-end collects the debug data and writes to > file > > system so I don''t want it to be initialized during boot time via > some > > config file. > > > > -vikas > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
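To illustrate the message-definition side of the steps listed earlier in
this message (the structures that would be listed in messages.py and
translated between C and Python in xu.c), here is a hypothetical set of
payloads for a new "debug" device class, modelled loosely on the block
interface messages. None of these command codes or types exist in any
Xen header; they are purely illustrative.

/* Hypothetical control message payloads for a new "debug" device class.
   The command codes and fields are invented by analogy with the block
   interface messages; they do not exist in any Xen header.                */

typedef unsigned int u32;

/* Subtypes of a (hypothetical) CMSG_DEBUGIF_FE message type, frontend.    */
#define CMSG_DEBUGIF_FE_INTERFACE_STATUS_CHANGED  0
#define CMSG_DEBUGIF_FE_DRIVER_STATUS_CHANGED     1
#define CMSG_DEBUGIF_FE_INTERFACE_CONNECT         2

/* Subtypes of a (hypothetical) CMSG_DEBUGIF_BE message type, backend.     */
#define CMSG_DEBUGIF_BE_CREATE                    0
#define CMSG_DEBUGIF_BE_CONNECT                   1
#define CMSG_DEBUGIF_BE_DISCONNECT                2

typedef struct {
    u32 handle;       /* which debug interface                             */
    u32 status;       /* disconnected / connected                          */
    u32 evtchn;       /* event channel, valid once connected               */
} debugif_fe_interface_status_changed_t;

typedef struct {
    u32 handle;
    u32 shmem_frame;  /* frame of the ring page, passed on fe connect      */
} debugif_fe_interface_connect_t;

typedef struct {
    u32 domid;        /* frontend domain                                   */
    u32 handle;
    u32 status;       /* filled in by the backend response                 */
} debugif_be_create_t;

typedef struct {
    u32 domid;
    u32 handle;
    u32 shmem_frame;  /* shared ring page                                  */
    u32 evtchn;       /* event channel opened by xend                      */
    u32 status;
} debugif_be_connect_t;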