RFC: This is a draft proposal for the user-level API for feeds.  (This
does not describe changelogs in general.)

Feeds would generally be used for two things: creating audit logs, and
driving a database watching for filesystem changes.

[Attachment: feed_api.pdf (application/pdf, 85313 bytes):
http://lists.lustre.org/pipermail/lustre-devel/attachments/20080125/a5b52ad2/attachment-0004.pdf]
On Jan 25, 2008 12:37 -0800, Nathaniel Rutman wrote:
> This is a draft proposal for the user-level API for feeds.  (This
> does not describe changelogs in general.)
>
> Feeds would generally be used for two things: creating audit logs, and
> driving a database watching for filesystem changes.

2.1.1
The type-specific data struct looks awfully like an MDS_REINT record...
It would be highly convenient if it were exactly the same.  That would
make it possible, for example, to implement a mechanism like the ZFS
"send" and "receive" functionality (at the Lustre level) to clone one
filesystem onto another by "simply" taking the feed from the parent
filesystem and driving it directly into the batch reintegration mechanism
being planned for the client-side metadata cache.

I'm not familiar with all of the details of the ZFS "send" structures,
but my understanding is that these are generated as changelogs from a
particular snapshot, and each record contains enough information to make
the target filesystem an exact clone of the current one, including file
offset+length for "write" commands so that a subset of a large file can
be sent instead of the whole thing.  Generating the feed against a
snapshot allows it to "reduce" operations that were originally done as
multiple discrete steps (e.g. small writes that change a large part of a
file, or the creation and subsequent removal of files after the
reference snapshot).

Is there a benefit to having the clientname as an ASCII string, instead
of the more compact NID value?  The NID could be expanded in userspace
via a library call if needed, which avoids server overhead when it
isn't needed.

2.1.2
One aspect of the design that is troubling is the guarantee that a
feed will be persistent once created.  It seems entirely probable that
some feed would be set up for a particular task, the task completed, and
the userspace consumer then stopped without the feed being destroyed and
never restarted again.  This would result in boundless growth of the
feed "backlog", as there is no longer a consumer.

2.1.3
I'm assuming that the actual kernel implementation of the feed stream
will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
the consumer, instead of having it e.g. busy-wait on the feed size?
There is a wide variety of services that already function in a similar
way (e.g. ftp and http servers), and having them process their requests
efficiently is important.

Also, the requirement that a process be privileged to start a feed
is a bit unfortunate.  I can imagine that it isn't possible to start a
_persistent_ feed (i.e. one that lives after the death of the
application), but it should be possible to have a transient one.  A
simple use case would be integration into the Linux inotify/dnotify
mechanism (and equivalents for OS X, Solaris) for desktop updates,
Spotlight on OS X, Google Desktop search, etc.  It would of course only
be possible to receive a feed for files that a particular user already
had access to.

For applications like backup/sync it is also desirable that the operator
not need full system privileges in order to start the backup.  I suppose
unprivileged access might be possible by having the privileged feed be
sent to a secondary userspace process like the dbus-daemon on Linux...
This also implies that the feed needs to be filterable for a given user.

For consumer feed restart, how does the consumer know where the first
uncancelled entry begins?  Assuming this is a linear stream of records,
the file offsets can become very large quite quickly.  A mechanism like
SEEK_DATA would be useful, as would adding some parameters to the
llapi_audit_getinfo() data structure to return the first and available
record offsets (see the sketch below).  Also, there is the risk of
2^64-byte offset overflow if this is presented as a regular file to
userspace.  It would make more sense to present this as a FIFO or socket.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
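For concreteness, the kind of getinfo extension being suggested might
look something like the sketch below.  llapi_audit_getinfo() is the call
named in the draft, but this struct layout and all field names are
hypothetical:

    #include <stdint.h>

    /* Hypothetical result structure for llapi_audit_getinfo(); the
     * first/available fields are the ones proposed above, and the
     * exact layout is illustrative only. */
    struct audit_feed_info {
            uint64_t afi_first_offset; /* offset of first uncancelled record */
            uint64_t afi_next_offset;  /* offset where the next record lands */
            uint64_t afi_avail_recs;   /* number of records ready to read */
    };

    /* Assumed signature: fill *info for the feed open on 'fd';
     * return 0 on success or a negative errno. */
    int llapi_audit_getinfo(int fd, struct audit_feed_info *info);

With something like this, a restarted consumer could seek directly to
afi_first_offset instead of scanning forward from offset zero.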
Andreas Dilger wrote:
> On Jan 25, 2008 12:37 -0800, Nathaniel Rutman wrote:
>> This is a draft proposal for the user-level API for feeds.  (This
>> does not describe changelogs in general.)
>>
>> Feeds would generally be used for two things: creating audit logs, and
>> driving a database watching for filesystem changes.
>
> 2.1.1
> The type-specific data struct looks awfully like an MDS_REINT record...
> It would be highly convenient if it were exactly the same.  That would
> make it possible, for example, to implement a mechanism like the ZFS
> "send" and "receive" functionality (at the Lustre level) to clone one
> filesystem onto another by "simply" taking the feed from the parent
> filesystem and driving it directly into the batch reintegration
> mechanism being planned for the client-side metadata cache.

That's where I took it from.  You're right, I should include all the
MDS_REINT fields.

> Is there a benefit to having the clientname as an ASCII string, instead
> of the more compact NID value?  The NID could be expanded in userspace
> via a library call if needed, which avoids server overhead when it
> isn't needed.

Good point.  We need a translator to human-readable form anyhow; it may
as well decode the NID too.

> 2.1.2
> One aspect of the design that is troubling is the guarantee that a
> feed will be persistent once created.  It seems entirely probable that
> some feed would be set up for a particular task, the task completed,
> and the userspace consumer then stopped without the feed being
> destroyed and never restarted again.  This would result in boundless
> growth of the feed "backlog", as there is no longer a consumer.

Here is where the abort_timeout would come in handy.  Maybe I should
default that to some large value, or instead have a default abort_size
that assumes the consumer is dead when the log grows beyond some number
of unconsumed entries.

> 2.1.3
> I'm assuming that the actual kernel implementation of the feed stream
> will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
> the consumer, instead of having it e.g. busy-wait on the feed size?
> There is a wide variety of services that already function in a similar
> way (e.g. ftp and http servers), and having them process their requests
> efficiently is important.

Consumers would generally do a blocking wait (not a busy wait) on the
file descriptor, or use select(2) or poll(2).

> Also, the requirement that a process be privileged to start a feed
> is a bit unfortunate.  I can imagine that it isn't possible to start a
> _persistent_ feed (i.e. one that lives after the death of the
> application), but it should be possible to have a transient one.  A
> simple use case would be integration into the Linux inotify/dnotify
> mechanism (and equivalents for OS X, Solaris) for desktop updates,
> Spotlight on OS X, Google Desktop search, etc.  It would of course only
> be possible to receive a feed for files that a particular user already
> had access to.

The point is security - you don't want joe user to be able to log what
every other user is doing to the filesystem.  One might argue, however,
that since you're doing this on the server anyhow (not a client), the
server itself should be secured and we needn't bother here...

> For applications like backup/sync it is also desirable that the
> operator not need full system privileges in order to start the backup.
> I suppose unprivileged access might be possible by having the
> privileged feed be sent to a secondary userspace process like the
> dbus-daemon on Linux...  This also implies that the feed needs to be
> filterable for a given user.
>
> For consumer feed restart, how does the consumer know where the first
> uncancelled entry begins?  Assuming this is a linear stream of records,
> the file offsets can become very large quite quickly.  A mechanism like
> SEEK_DATA would be useful, as would adding some parameters to the
> llapi_audit_getinfo() data structure to return the first and available
> record offsets.  Also, there is the risk of 2^64-byte offset overflow
> if this is presented as a regular file to userspace.  It would make
> more sense to present this as a FIFO or socket.

The consumer doesn't know; the feed does.  It has retained all
uncanceled entries persistently, so it just starts playing back from the
first uncanceled one.  The consumers were given sequence numbers in each
log entry; it is up to them to ignore repeated records that they already
processed (but did not cancel from the feed).

Ah yes, I see what you are saying now; it's not really a file whose
beginning you can see at any point - the beginning disappears as entries
are consumed.  So yes, a FIFO.  That implies a single consumer per FIFO,
but I think that's fine.  We'll restrict ourselves to the AC_ONESHOT
case and drop AC_BATCH, which I was unsure was useful anyhow.  And yes,
getinfo returning the maximum number of available records would be
useful too.  I'll still use the next read() as an indicator that the
previous batch of records read can now be canceled (see the sketch
below).
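A minimal consumer loop under the semantics just described might look
like the following sketch.  The record layout, buffer handling, and all
names here are illustrative rather than taken from the draft; the only
behaviours assumed from the discussion are the per-entry sequence
numbers, replay from the first uncanceled entry on restart, and read()
implicitly canceling the previously returned batch:

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Illustrative record header; the discussion only promises that
     * each entry carries a sequence number. */
    struct feed_rec {
            uint64_t fr_seqno;  /* monotonically increasing sequence number */
            uint32_t fr_len;    /* total record length, header included */
            char     fr_data[]; /* type-specific payload */
    };

    static void consume_feed(int fd, uint64_t last_seen)
    {
            char buf[65536];
            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            for (;;) {
                    /* Block until the kernel says records are ready,
                     * rather than busy-waiting on the feed size. */
                    if (poll(&pfd, 1, -1) <= 0)
                            break;

                    /* Under the proposed semantics this read() also
                     * cancels the batch returned by the previous one. */
                    ssize_t count = read(fd, buf, sizeof(buf));
                    if (count <= 0)
                            break;

                    ssize_t off = 0;
                    while (off + (ssize_t)sizeof(struct feed_rec) <= count) {
                            struct feed_rec *rec =
                                    (struct feed_rec *)(buf + off);

                            /* After a restart the feed replays from the
                             * first uncanceled entry, so skip records
                             * already processed in a previous run. */
                            if (rec->fr_seqno > last_seen) {
                                    /* process_record(rec); */
                                    last_seen = rec->fr_seqno;
                            }
                            if (rec->fr_len < sizeof(struct feed_rec))
                                    break;  /* guard against a bad record */
                            off += rec->fr_len;
                    }
            }
    }

Note that last_seen is the consumer's own durable state (e.g. stored in
its database); the feed itself never needs to know how far any given
consumer has read.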
Nathan,

> 2.1.1
> The type-specific data struct looks awfully like an MDS_REINT record...
> It would be highly convenient if it were exactly the same.  That would
> make it possible, for example, to implement a mechanism like the ZFS
> "send" and "receive" functionality (at the Lustre level) [...]

Didn't we rule this out in Moscow?

> Is there a benefit to having the clientname as an ASCII string, instead
> of the more compact NID value?  The NID could be expanded in userspace
> via a library call if needed, which avoids server overhead when it
> isn't needed.

Yes (compact wire representation - the lower layers already have it).
No (interop is MUCH easier with strings).

> One aspect of the design that is troubling is the guarantee that a
> feed will be persistent once created.  [...] This would result in
> boundless growth of the feed "backlog", as there is no longer a
> consumer.

Needs a good answer.

> I'm assuming that the actual kernel implementation of the feed stream
> will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
> the consumer, instead of having it e.g. busy-wait on the feed size?
> [...]

Good point.

> Also, the requirement that a process be privileged to start a feed
> is a bit unfortunate.  I can imagine that it isn't possible to start a
> _persistent_ feed (i.e. one that lives after the death of the
> application), but it should be possible to have a transient one.

I wouldn't be tempted to relax the privilege required to do
_anything_at_all_ with a feed until the security issues are _completely_
understood.

> A simple use case would be integration into the Linux inotify/dnotify
> mechanism (and equivalents for OS X, Solaris) for desktop updates,
> Spotlight on OS X, Google Desktop search, etc.  It would of course only
> be possible to receive a feed for files that a particular user already
> had access to.

Until you've really thought through the security implications, a
statement as seemingly obvious as this can't be trusted.  Security
issues are profoundly devious.

> For applications like backup/sync it is also desirable that the
> operator not need full system privileges in order to start the backup.
> [...] This also implies that the feed needs to be filterable for a
> given user.

Again - this must be thought through _completely_ before relaxing
constraints.

> For consumer feed restart, how does the consumer know where the first
> uncancelled entry begins?  [...] Also, there is the risk of 2^64-byte
> offset overflow if this is presented as a regular file to userspace.
> It would make more sense to present this as a FIFO or socket.

(BTW, please check my figures in the following - it's too easy to be out
by an order of magnitude...)

2^64 is about 16384 petabytes, so not that many orders of magnitude
bigger than the whole filesystems envisaged for the near future.  Can a
feed include the actual data?  If so, then this could be a real
limitation (say, in the next decade).  However, it would take 54 years
to push 2^64 bytes as a single stream through a 10GByte/sec network, and
even with a future 1TByte/sec network (wow - imagine that) it would
still take 6 months.  So it's not a limitation for a single stream FTTB.

But must a feed necessarily be a single stream?  Will the bandwidth at
which a feed can be created never exceed the capacity of a single pipe?
Can we envisage use cases for a clustered feed receiver?  Could that
ever include another Lustre filesystem?

Cheers, Eric
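For the record, Eric's figures do hold up, reading "10GByte/sec" and
"1TByte/sec" as binary units:

    2^64 bytes = 2^14 PiB = 16384 petabytes
    2^64 B / (10 * 2^30 B/s) = 2^34/10 s =~ 1.7e9 s =~ 54 years
    2^64 B / 2^40 B/s        = 2^24 s    =~ 1.7e7 s =~ 6.4 months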
>> 2.1.2
>> One aspect of the design that is troubling is the guarantee that a
>> feed will be persistent once created.  [...] This would result in
>> boundless growth of the feed "backlog", as there is no longer a
>> consumer.
>
> Here is where the abort_timeout would come in handy.  Maybe I should
> default that to some large value, or instead have a default abort_size
> that assumes the consumer is dead when the log grows beyond some number
> of unconsumed entries.

There are many feeds for which incurring ENOSPC is the right answer.
For example, searches have to be exact, and perhaps re-scanning the
filesystem is not an option.  The only reason I know of that you might
want to truncate changelogs forcefully is non-returning disconnected
clients or proxies.  So there might be two refcounts on a feed (one for
essential and one for less essential users) to accomplish this, but
having refcounts may make it hard to track which consumers have consumed
(see the sketch below).

>> 2.1.3
>> I'm assuming that the actual kernel implementation of the feed stream
>> will allow "poll" mechanisms (sys_poll, sys_epoll, etc.) to notify
>> the consumer, instead of having it e.g. busy-wait on the feed size?
>> [...]
>
> Consumers would generally do a blocking wait (not a busy wait) on the
> file descriptor, or use select(2) or poll(2).

>> Also, the requirement that a process be privileged to start a feed
>> is a bit unfortunate.  [...] It would of course only be possible to
>> receive a feed for files that a particular user already had access to.
>
> The point is security - you don't want joe user to be able to log what
> every other user is doing to the filesystem.  One might argue, however,
> that since you're doing this on the server anyhow (not a client), the
> server itself should be secured and we needn't bother here...

>> For applications like backup/sync it is also desirable that the
>> operator not need full system privileges in order to start the backup.
>> [...] This also implies that the feed needs to be filterable for a
>> given user.

The kerberos user should have FID access privileges to use a feed.
This is unrelated to the uid.

>> For consumer feed restart, how does the consumer know where the first
>> uncancelled entry begins?

Usually the replicator reports this (e.g. the search engine says "last
digested feed entry was ....", similar for other replicators).

- Peter -
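The two-refcount idea might be sketched as below; everything here,
including the names and the truncation rule, is hypothetical, and it
deliberately leaves open the consumer-tracking problem Peter mentions:

    #include <stdint.h>

    /* Hypothetical per-feed accounting for the two consumer classes
     * described above: essential consumers (e.g. exact search indexes)
     * must never lose records, so the feed runs to ENOSPC rather than
     * dropping entries; expendable consumers (e.g. non-returning
     * disconnected clients) may have their backlog truncated. */
    struct feed_refs {
            uint32_t fr_essential;  /* consumers that must see every record */
            uint32_t fr_expendable; /* consumers whose backlog may be cut */
    };

    /* The backlog may be truncated only when no essential consumer
     * still holds a reference to the feed. */
    static inline int feed_may_truncate(const struct feed_refs *refs)
    {
            return refs->fr_essential == 0;
    }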