RFC: This is a draft proposal for the user-level API for feeds.  (This
does not describe changelogs in general.)

Feeds would generally be used for two things: creating audit logs, and
driving a database watching for filesystem changes.

[Attachment: feed_api.pdf (application/pdf, 85313 bytes):
http://lists.lustre.org/pipermail/lustre-devel/attachments/20080125/a5b52ad2/attachment-0004.pdf]
On Jan 25, 2008 12:37 -0800, Nathaniel Rutman wrote:
> This is a draft proposal for the user-level API for feeds.  (This
> does not describe changelogs in general.)
>
> Feeds would generally be used for two things: creating audit logs, and
> driving a database watching for filesystem changes.

2.1.1
The type-specific data struct looks awfully like an MDS_REINT record...
It would be highly convenient if it were exactly the same.  That would
make it possible, for example, to implement a mechanism like the ZFS
"send" and "receive" functionality (at the Lustre level) to clone one
filesystem onto another by "simply" taking the feed from the parent
filesystem and driving it directly into the batch reintegration mechanism
being planned for the client-side metadata cache.

I'm not familiar with all of the details of the ZFS "send" structures,
but my understanding is that these are generated as changelogs from a
particular snapshot, and each record contains enough information to make
the target filesystem an exact clone of the current one, including file
offset+length for "write" commands so that a subset of a large file can
be sent instead of the whole thing.  Generating the feed against a
snapshot allows it to "reduce" operations that were originally done as
multiple discrete steps (e.g. small writes that change a large part of a
file, or the creation and subsequent removal of files after the
reference snapshot).

Is there a benefit to having the clientname as an ASCII string, instead
of the more compact NID value?  The NID could be expanded in userspace
via a library call if needed, which avoids server overhead when it
isn't needed.

2.1.2
One aspect of the design that is troubling is the guarantee that a
feed will be persistent once created.  It seems entirely probable that
some feed would be set up for a particular task, the task completed, and
the userspace consumer then stopped without the feed being destroyed and
never restarted again.  This would result in boundless growth of the
feed "backlog", as there is no longer a consumer.

2.1.3
I'm assuming that the actual kernel implementation of the feed stream
will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
the consumer, instead of having it e.g. busy-wait on the feed size?
There is a wide variety of services that already function in a similar
way (e.g. ftp and http servers), and having them process their requests
efficiently is important.

Also, the requirement that a process be privileged to start a feed
is a bit unfortunate.  I can imagine that it isn't possible to start a
_persistent_ feed (i.e. one that lives after the death of the
application), but it should be possible to have a transient one.  A
simple use case would be integration into the Linux inotify/dnotify
mechanism (and equivalents for OS X, Solaris) for desktop updates,
Spotlight on OS X, Google Desktop search, etc.  It would of course only
be possible to receive a feed for files that a particular user already
had access to.

For applications like backup/sync it is also desirable that the operator
not need full system privileges in order to start the backup.  I suppose
unprivileged access might be possible by having the privileged feed be
sent to a secondary userspace process like the dbus-daemon on Linux...
This also implies that the feed needs to be filterable for a given user.

For consumer feed restart, how does the consumer know where the first
uncancelled entry begins?  Assuming this is a linear stream of records,
the file offsets can become very large quite quickly.  A mechanism like
SEEK_DATA would be useful, as would adding some parameters to the
llapi_audit_getinfo() data structure to return the first and available
record offsets (see the sketch below).  Also, there is the risk of
2^64-byte offset overflow if this is presented as a regular file to
userspace.  It would make more sense to present this as a FIFO or socket.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
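For concreteness, the kind of getinfo extension being suggested might
look something like the sketch below.  llapi_audit_getinfo() is the call
named in the draft, but this struct layout and all field names are
hypothetical:

    #include <stdint.h>

    /* Hypothetical result structure for llapi_audit_getinfo(); the
     * first/available fields are the ones proposed above, and the
     * exact layout is illustrative only. */
    struct audit_feed_info {
            uint64_t afi_first_offset; /* offset of first uncancelled record */
            uint64_t afi_next_offset;  /* offset where the next record lands */
            uint64_t afi_avail_recs;   /* number of records ready to read */
    };

    /* Assumed signature: fill *info for the feed open on 'fd';
     * return 0 on success or a negative errno. */
    int llapi_audit_getinfo(int fd, struct audit_feed_info *info);

With something like this, a restarted consumer could seek directly to
afi_first_offset instead of scanning forward from offset zero.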
Andreas Dilger wrote:
> On Jan 25, 2008 12:37 -0800, Nathaniel Rutman wrote:
>> This is a draft proposal for the user-level API for feeds.  (This
>> does not describe changelogs in general.)
>>
>> Feeds would generally be used for two things: creating audit logs, and
>> driving a database watching for filesystem changes.
>
> 2.1.1
> The type-specific data struct looks awfully like an MDS_REINT record...
> It would be highly convenient if it were exactly the same.  That would
> make it possible, for example, to implement a mechanism like the ZFS
> "send" and "receive" functionality (at the Lustre level) to clone one
> filesystem onto another by "simply" taking the feed from the parent
> filesystem and driving it directly into the batch reintegration
> mechanism being planned for the client-side metadata cache.

That's where I took it from.  You're right, I should include all the
MDS_REINT fields.

> Is there a benefit to having the clientname as an ASCII string, instead
> of the more compact NID value?  The NID could be expanded in userspace
> via a library call if needed, which avoids server overhead when it
> isn't needed.

Good point.  We need a translator to human-readable form anyhow; it may
as well decode the NID too.

> 2.1.2
> One aspect of the design that is troubling is the guarantee that a
> feed will be persistent once created.  It seems entirely probable that
> some feed would be set up for a particular task, the task completed,
> and the userspace consumer then stopped without the feed being
> destroyed and never restarted again.  This would result in boundless
> growth of the feed "backlog", as there is no longer a consumer.

Here is where the abort_timeout would come in handy.  Maybe I should
default that to some large value, or instead have a default abort_size
that assumes the consumer is dead when the log grows beyond some number
of unconsumed entries.

> 2.1.3
> I'm assuming that the actual kernel implementation of the feed stream
> will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
> the consumer, instead of having it e.g. busy-wait on the feed size?
> There is a wide variety of services that already function in a similar
> way (e.g. ftp and http servers), and having them process their requests
> efficiently is important.

Consumers would generally do a blocking wait (not a busy wait) on the
file descriptor, or use select(2) or poll(2).

> Also, the requirement that a process be privileged to start a feed
> is a bit unfortunate.  I can imagine that it isn't possible to start a
> _persistent_ feed (i.e. one that lives after the death of the
> application), but it should be possible to have a transient one.  A
> simple use case would be integration into the Linux inotify/dnotify
> mechanism (and equivalents for OS X, Solaris) for desktop updates,
> Spotlight on OS X, Google Desktop search, etc.  It would of course only
> be possible to receive a feed for files that a particular user already
> had access to.

The point is security - you don't want joe user to be able to log what
every other user is doing to the filesystem.  One might argue, however,
that since you're doing this on the server anyhow (not a client), the
server itself should be secured and we needn't bother here...

> For applications like backup/sync it is also desirable that the
> operator not need full system privileges in order to start the backup.
> I suppose unprivileged access might be possible by having the
> privileged feed be sent to a secondary userspace process like the
> dbus-daemon on Linux...  This also implies that the feed needs to be
> filterable for a given user.
>
> For consumer feed restart, how does the consumer know where the first
> uncancelled entry begins?  Assuming this is a linear stream of records,
> the file offsets can become very large quite quickly.  A mechanism like
> SEEK_DATA would be useful, as would adding some parameters to the
> llapi_audit_getinfo() data structure to return the first and available
> record offsets.  Also, there is the risk of 2^64-byte offset overflow
> if this is presented as a regular file to userspace.  It would make
> more sense to present this as a FIFO or socket.

The consumer doesn't know; the feed does.  It has retained all
uncanceled entries persistently, so it just starts playing back from the
first uncanceled one.  The consumers were given sequence numbers in each
log entry; it is up to them to ignore repeated records that they already
processed (but did not cancel from the feed).

Ah yes, I see what you are saying now; it's not really a file whose
beginning you can see at any point - the beginning disappears as entries
are consumed.  So yes, a FIFO.  That implies a single consumer per FIFO,
but I think that's fine.  We'll restrict ourselves to the AC_ONESHOT
case and drop AC_BATCH, which I was unsure was useful anyhow.  And yes,
getinfo returning the maximum number of available records would be
useful too.  I'll still use the next read() as an indicator that the
previous batch of records read can now be canceled (see the sketch
below).
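A minimal consumer loop under the semantics just described might look
like the following sketch.  The record layout, buffer handling, and all
names here are illustrative rather than taken from the draft; the only
behaviours assumed from the discussion are the per-entry sequence
numbers, replay from the first uncanceled entry on restart, and read()
implicitly canceling the previously returned batch:

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Illustrative record header; the discussion only promises that
     * each entry carries a sequence number. */
    struct feed_rec {
            uint64_t fr_seqno;  /* monotonically increasing sequence number */
            uint32_t fr_len;    /* total record length, header included */
            char     fr_data[]; /* type-specific payload */
    };

    static void consume_feed(int fd, uint64_t last_seen)
    {
            char buf[65536];
            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            for (;;) {
                    /* Block until the kernel says records are ready,
                     * rather than busy-waiting on the feed size. */
                    if (poll(&pfd, 1, -1) <= 0)
                            break;

                    /* Under the proposed semantics this read() also
                     * cancels the batch returned by the previous one. */
                    ssize_t count = read(fd, buf, sizeof(buf));
                    if (count <= 0)
                            break;

                    ssize_t off = 0;
                    while (off + (ssize_t)sizeof(struct feed_rec) <= count) {
                            struct feed_rec *rec =
                                    (struct feed_rec *)(buf + off);

                            /* After a restart the feed replays from the
                             * first uncanceled entry, so skip records
                             * already processed in a previous run. */
                            if (rec->fr_seqno > last_seen) {
                                    /* process_record(rec); */
                                    last_seen = rec->fr_seqno;
                            }
                            if (rec->fr_len < sizeof(struct feed_rec))
                                    break;  /* guard against a bad record */
                            off += rec->fr_len;
                    }
            }
    }

Note that last_seen is the consumer's own durable state (e.g. stored in
its database); the feed itself never needs to know how far any given
consumer has read.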
Nathan,

> 2.1.1
> The type-specific data struct looks awfully like an MDS_REINT record...
> It would be highly convenient if it were exactly the same.  That would
> make it possible, for example, to implement a mechanism like the ZFS
> "send" and "receive" functionality (at the Lustre level) [...]

Didn't we rule this out in Moscow?

> Is there a benefit to having the clientname as an ASCII string, instead
> of the more compact NID value?  The NID could be expanded in userspace
> via a library call if needed, which avoids server overhead when it
> isn't needed.

Yes (compact wire representation - the lower layers already have it).
No (interop is MUCH easier with strings).

> One aspect of the design that is troubling is the guarantee that a
> feed will be persistent once created.  [...] This would result in
> boundless growth of the feed "backlog", as there is no longer a
> consumer.

Needs a good answer.

> I'm assuming that the actual kernel implementation of the feed stream
> will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
> the consumer, instead of having it e.g. busy-wait on the feed size?
> [...]

Good point.

> Also, the requirement that a process be privileged to start a feed
> is a bit unfortunate.  I can imagine that it isn't possible to start a
> _persistent_ feed (i.e. one that lives after the death of the
> application), but it should be possible to have a transient one.

I wouldn't be tempted to relax the privilege required to do
_anything_at_all_ with a feed until the security issues are _completely_
understood.

> A simple use case would be integration into the Linux inotify/dnotify
> mechanism (and equivalents for OS X, Solaris) for desktop updates,
> Spotlight on OS X, Google Desktop search, etc.  It would of course only
> be possible to receive a feed for files that a particular user already
> had access to.

Until you've really thought through the security implications, a
statement as seemingly obvious as this can't be trusted.  Security
issues are profoundly devious.

> For applications like backup/sync it is also desirable that the
> operator not need full system privileges in order to start the backup.
> [...] This also implies that the feed needs to be filterable for a
> given user.

Again - this must be thought through _completely_ before relaxing
constraints.

> For consumer feed restart, how does the consumer know where the first
> uncancelled entry begins?  [...] Also, there is the risk of 2^64-byte
> offset overflow if this is presented as a regular file to userspace.
> It would make more sense to present this as a FIFO or socket.

(BTW, please check my figures in the following - it's too easy to be out
by an order of magnitude...)

2^64 is about 16384 petabytes, so not that many orders of magnitude
bigger than the whole filesystems envisaged for the near future.  Can a
feed include the actual data?  If so, then this could be a real
limitation (say, in the next decade).  However, it would take 54 years
to push 2^64 bytes as a single stream through a 10GByte/sec network, and
even with a future 1TByte/sec network (wow - imagine that) it would
still take 6 months.  So it's not a limitation for a single stream FTTB.

But must a feed necessarily be a single stream?  Will the bandwidth at
which a feed can be created never exceed the capacity of a single pipe?
Can we envisage use cases for a clustered feed receiver?  Could that
ever include another Lustre filesystem?

Cheers, Eric
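For the record, Eric's figures do hold up, reading "10GByte/sec" and
"1TByte/sec" as binary units:

    2^64 bytes = 2^14 PiB = 16384 petabytes
    2^64 B / (10 * 2^30 B/s) = 2^34/10 s =~ 1.7e9 s =~ 54 years
    2^64 B / 2^40 B/s        = 2^24 s    =~ 1.7e7 s =~ 6.4 months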
>> 2.1.2
>> One aspect of the design that is troubling is the guarantee that a
>> feed will be persistent once created.  [...] This would result in
>> boundless growth of the feed "backlog", as there is no longer a
>> consumer.
>
> Here is where the abort_timeout would come in handy.  Maybe I should
> default that to some large value, or instead have a default abort_size
> that assumes the consumer is dead when the log grows beyond some number
> of unconsumed entries.

There are many feeds for which incurring ENOSPC is the right answer.
For example, searches have to be exact, and perhaps re-scanning the
filesystem is not an option.  The only reason I know of that you might
want to truncate changelogs forcefully is non-returning disconnected
clients or proxies.  So there might be two refcounts on a feed (one for
essential and one for less essential users) to accomplish this, but
having refcounts may make it hard to track which consumers have consumed
(see the sketch below).

>> 2.1.3
>> I'm assuming that the actual kernel implementation of the feed stream
>> will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
>> the consumer, instead of having it e.g. busy-wait on the feed size?
>> [...]
>
> Consumers would generally do a blocking wait (not a busy wait) on the
> file descriptor, or use select(2) or poll(2).

>> Also, the requirement that a process be privileged to start a feed
>> is a bit unfortunate.  [...] It would of course only be possible to
>> receive a feed for files that a particular user already had access to.
>
> The point is security - you don't want joe user to be able to log what
> every other user is doing to the filesystem.  One might argue, however,
> that since you're doing this on the server anyhow (not a client), the
> server itself should be secured and we needn't bother here...

>> For applications like backup/sync it is also desirable that the
>> operator not need full system privileges in order to start the backup.
>> [...] This also implies that the feed needs to be filterable for a
>> given user.

The kerberos user should have FID access privileges to use a feed.
This is unrelated to the uid.

>> For consumer feed restart, how does the consumer know where the first
>> uncancelled entry begins?

Usually the replicator reports this (e.g. the search engine says "last
digested feed entry was ....", similar for other replicators).

- Peter -
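The two-refcount idea might be sketched as below; everything here,
including the names and the truncation rule, is hypothetical, and it
deliberately leaves open the consumer-tracking problem Peter mentions:

    #include <stdint.h>

    /* Hypothetical per-feed accounting for the two consumer classes
     * described above: essential consumers (e.g. exact search indexes)
     * must never lose records, so the feed runs to ENOSPC rather than
     * dropping entries; expendable consumers (e.g. non-returning
     * disconnected clients) may have their backlog truncated. */
    struct feed_refs {
            uint32_t fr_essential;  /* consumers that must see every record */
            uint32_t fr_expendable; /* consumers whose backlog may be cut */
    };

    /* The backlog may be truncated only when no essential consumer
     * still holds a reference to the feed. */
    static inline int feed_may_truncate(const struct feed_refs *refs)
    {
            return refs->fr_essential == 0;
    }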