adilger@clusterfs.com
2006-Dec-15 17:30 UTC
[Lustre-devel] [Bug 11287] server i/o call path cleanup
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11287
adilger@clusterfs.com changed:
What                  |Removed                       |Added
----------------------------------------------------------------------------
Attachment #9054 Flag |review?(adilger@clusterfs.com)|
(From update of attachment 9054)
0.2 Was there consideration given to the fact that the current interface can
submit a large discontiguous IO request with a single wait? The
generic_file_write() interface will only accept a single buffer+offset. In
newer kernels the generic_file_{read,write}() interfaces are simple wrappers
around other exported methods like generic_file_aio_write_nolock() that
would allow us to submit multiple discontiguous DIRECT_IO requests (one per
call) and then wait on them in parallel, while also avoiding serialization
on the inode lock (depending instead on DLM extent locking and/or ldiskfs
range locking for coherency). This also avoids the need for a journal
callback, because we retain control over i_sem and can start a properly
nested transaction inside i_sem.
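A minimal sketch of the shape this could take, assuming the 2.6.x-era kiocb
interfaces; the chunk structure and list are hypothetical, and whether a
given submission actually returns -EIOCBQUEUED (rather than completing
synchronously) depends on how the kiocb is set up:

    /* one kiocb per discontiguous chunk; hypothetical structure */
    struct dio_chunk {
            struct list_head list;
            struct iovec     iov;
            loff_t           offset;
            struct kiocb     iocb;
            ssize_t          rc;
    };

    struct dio_chunk *chunk;

    /* submit every chunk before waiting on any of them */
    list_for_each_entry(chunk, &chunks, list) {
            init_sync_kiocb(&chunk->iocb, filp);
            chunk->iocb.ki_pos = chunk->offset;
            chunk->rc = generic_file_aio_write_nolock(&chunk->iocb,
                                        &chunk->iov, 1, chunk->offset);
    }
    /* reap anything still in flight */
    list_for_each_entry(chunk, &chunks, list)
            if (chunk->rc == -EIOCBQUEUED)
                    chunk->rc = wait_on_sync_kiocb(&chunk->iocb);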
0.3.1 There is already a /proc tunable for the size at which a file is kept
in the cache for reads (if that functionality hadn't been removed because it
had a race condition, though I think HP still has it). That is
/proc/fs/lustre/obdfilter/*/readcache_max_filesize.
This is the file size (in bytes) below which reads will be kept in cache (on
the assumption that small files are re-read more often than large ones, and
trying to cache large files is futile). I think there should be a similar
tunable for writes (file size in bytes, possibly defaulting to 1MB) below
which file writes will go into the cache and try to be merged or "group
allocated" later. This would be a 64-bit value.
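To make the intent concrete, the write-side check could mirror the existing
readcache logic; fo_writecache_max_filesize here is an assumed new field,
named by analogy with fo_readcache_max_filesize:

    /* hypothetical tunable, analogous to fo_readcache_max_filesize */
    int drop_after_commit =
            i_size_read(inode) > filter->fo_writecache_max_filesize;

    /* if drop_after_commit is set, release the cached pages once the
     * write has committed instead of keeping them for later merging */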
0.4.4 I think there ARE already known changes to the locking. Not the DLM
locking, but rather the VFS inode locking. In the current write path we lock
the inode only during allocation and then do the IO separately, ensuring
that concurrent writers on that file are blocked for only the minimal amount
of time. This is mentioned in the introduction, but it is also appropriate
to list here. Using generic_file_aio_write_nolock() would allow us to get
close to this same model (maybe even exactly the same, if we do the aio wait
after dropping i_sem).
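In outline the ordering would be as below; the helpers are stand-ins for the
submit/wait loops sketched under 0.2, and the transaction calls are
placeholders for the real fsfilt/journal interfaces:

    down(&inode->i_sem);                       /* i_mutex on newer kernels */
    handle = start_write_transaction(inode);   /* placeholder */
    submit_chunks(filp, &chunks);              /* per the 0.2 sketch */
    stop_transaction(handle);                  /* placeholder */
    up(&inode->i_sem);
    /* concurrent writers are no longer blocked while we wait */
    wait_chunks(&chunks);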
0.5 Does your design for rniobufs_pages_split include the cleanup that we
previously discussed in Beijing, namely that the OST code would no longer
split the rniobuf into separate page chunks, and would instead pass it down
in possibly-larger contiguous niobufs so that the obdfilter code doesn't
have to reassemble them? This is useful for e.g. knowing how large an IO
request can be for each call to generic_file{_aio}_write{_nolock} and when
calculating the transaction size.
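For reference, the wire niobuf already carries a byte length, so nothing in
the protocol forces page-sized fragments; the obdfilter code just needs to
stop assuming them:

    /* existing wire structure; "len" is in bytes and can already
     * describe a contiguous extent spanning many pages */
    struct niobuf_remote {
            __u64 offset;
            __u32 len;
            __u32 flags;
    };

Each such niobuf would then map onto one generic_file{_aio}_write{_nolock}
call and one term in the transaction-size estimate.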
0.5 (this section should have sub-section numbers to make the individual
points easier to reference) We only need to do the MM mapping once per
thread, at startup/buffer allocation time. In the O_DIRECT case the IO will
be done before the call returns, so the buffers can be re-used, and in the
buffered IO case the VFS/VM will copy the data, so again we can re-use them
when done.
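For example, something along these lines at thread startup; t_pages,
t_kaddr and npages are assumed per-thread fields, not existing code:

    /* map the bulk buffer pages once and reuse the mapping for every
     * request handled by this thread */
    thread->t_kaddr = vmap(thread->t_pages, npages, VM_MAP, PAGE_KERNEL);
    if (thread->t_kaddr == NULL)
            RETURN(-ENOMEM);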
For the rw flags, why wouldn't we use O_WRONLY for the write case?
The whole "pages = pages + 1" checking can go away if obdfilter
doesn''t
fragment the niobufs, as described above.
For post-process() it is necessary to change the UID/GID of the file
_before_ the write, in order to have the quota code work properly. I don't
think we need the setattr afterward if we don't need to change the size
anymore.
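That is, roughly the following; the obdo fields are the existing ones, but
the write helper is only a stand-in:

    /* set ownership first so quota is charged to the right user/group */
    inode->i_uid = oa->o_uid;
    inode->i_gid = oa->o_gid;
    rc = do_direct_write(filp, &chunks);    /* per the 0.2 sketch */
    /* no setattr afterward needed if the size is untouched */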
0.6.2 In particular, fsync() from the client may result in an OBD_SYNC RPC
to the server if the writes were handled without sync (this should be
knowable on the client if the returned transno > last_committed).
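On the client side this could be a simple check against the import's
committed transno; imp_peer_committed_transno is the existing import field,
while the sync wrapper is hypothetical:

    /* skip the sync RPC when the server already committed the writes */
    if (transno > imp->imp_peer_committed_transno)
            rc = ost_sync_rpc(exp, oa);     /* hypothetical wrapper */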
0.7 If we create our own transaction before doing the write (as we do now)
then we should be certain that the transaction is large enough to hold all
the metadata changes. The only issue is ext3_direct_IO() possibly dropping
the transaction handle, but we can patch this internally.
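For illustration, the shape of that path; the credit count is a placeholder,
not the real ldiskfs formula:

    /* reserve credits for the whole batch up front, then write under
     * this handle; ext3_direct_IO() dropping current->journal_info is
     * the case we would patch internally */
    handle_t *handle = ext3_journal_start(inode, credits);
    if (IS_ERR(handle))
            RETURN(PTR_ERR(handle));
    rc = do_direct_write(filp, &chunks);    /* per the 0.2 sketch */
    ext3_journal_stop(handle);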