adilger@clusterfs.com
2006-Dec-15 17:30 UTC
[Lustre-devel] [Bug 11287] server i/o call path cleanup
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11287

adilger@clusterfs.com changed:

           What              |Removed                       |Added
----------------------------------------------------------------------------
 Attachment #9054 Flag       |review?(adilger@clusterfs.com)|

(From update of attachment 9054)

0.2 Was there consideration given to the fact that the current interface can submit a large discontiguous IO request with a single wait? The generic_file_write() interface will only accept a single buffer+offset. In newer kernels the generic_file_{read,write}() interfaces are simple wrappers around other exported methods (like generic_file_aio_write_nolock()) that would allow us to submit multiple discontiguous DIRECT_IO requests (one for each call) and then wait on them in parallel, and also avoid serialization from the inode lock (depending instead on DLM extent locking and/or ldiskfs range locking for coherency); a rough sketch of such per-extent submission follows section 0.5 below. This also avoids the need for a journal callback, because we retain control over i_sem and can start a proper nested transaction inside i_sem.

0.3.1 There is already a /proc tunable for the size at which a file is kept in the cache for reads (if that functionality hadn't been removed because it had a race condition, though I think HP still has it). That is /proc/fs/lustre/obdfilter/*/readcache_max_filesize: the file size (in bytes) below which reads will be kept in cache, on the assumption that small files are re-read more often than large ones, and that trying to cache large files is futile. I think there should be a similar tunable for writes (a file size in bytes, possibly defaulting to 1MB) below which file writes will go into the cache and try to be merged or "group allocated" later. This would be a 64-bit value.

0.4.4 I think there ARE already known changes to the locking. Not the DLM locking, but rather the VFS inode locking. In the current write path we lock the inode only during allocation and then do the IO separately, ensuring that concurrent writers on that file are blocked only for the minimal amount of time. This is mentioned in the introduction, but is also appropriate to list here. Using generic_file_aio_write_nolock() would allow us to get closer to this same model (maybe even exactly the same, if we do the aio wait after dropping i_sem).

0.5 Does your design for rniobufs_pages_split include the cleanup that we previously discussed in Beijing - namely, that the OST code would no longer split the rniobuf into separate page chunks, and would instead pass it down in possibly-larger contiguous niobufs so that the obdfilter code doesn't have to reassemble them? This is useful for e.g. knowing how large an IO request can be for each call to generic_file{_aio}_write{_nolock} and when calculating the transaction size.

0.5 (should have sub-section numbers to more easily identify) We only need to do the MM mapping one time per thread, at startup/buffer allocation time. In the O_DIRECT case the IO will be done before the call returns, so the buffers can be re-used, and in the buffered IO case the VFS/VM will copy the pages and again we can re-use them when done. For the rw flags, why wouldn't we use O_WRONLY for the write case? The whole "pages = pages + 1" checking can go away if obdfilter doesn't fragment the niobufs, as described above. For post-process() it is necessary to change the UID/GID of the file _before_ the write, in order to have the quota code work properly. I don't think we need the setattr afterward if we don't need to change the size anymore.
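To make the 0.2/0.4.4 suggestion concrete, here is a minimal sketch (not the actual patch) of one per-extent O_DIRECT submission, assuming a ~2.6.18 kernel (the generic_file_aio_write_nolock() signature changed in 2.6.19) and a hypothetical filter_direct_write_extent() helper; a sync kiocb is used for brevity, so actually overlapping the extents in flight would need real (non-sync) kiocbs:

#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/uio.h>

/* submit one discontiguous extent; the caller loops over the niobufs,
 * calling this once per contiguous chunk, with no i_sem held --
 * coherency comes from DLM extent locks and/or ldiskfs range locks */
static ssize_t filter_direct_write_extent(struct file *file, void *buf,
                                          size_t count, loff_t pos)
{
        struct kiocb kiocb;
        struct iovec iov = { .iov_base = buf, .iov_len = count };
        ssize_t rc;

        init_sync_kiocb(&kiocb, file);
        kiocb.ki_pos = pos;

        /* _nolock: unlike generic_file_write(), this does not take
         * i_sem, so concurrent writers are not serialized here */
        rc = generic_file_aio_write_nolock(&kiocb, &iov, 1, &kiocb.ki_pos);
        if (rc == -EIOCBQUEUED)
                rc = wait_on_sync_kiocb(&kiocb);
        return rc;
}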
0.6.2 In particular, fsync() from the client may result in an OBD_SYNC RPC on the server if the writes were handled without sync (this should be knowable on the client if the returned transno > last_committed).

0.7 If we create our own transaction before doing the write (as we do now) then we should be certain that the transaction is large enough to hold all of the metadata changes. The only issue is ext3_direct_IO() possibly dropping the transaction handle, but we can patch this internally. A rough sketch of sizing such a transaction is below.
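As an illustration of the 0.7 point (a sketch only, assuming ldiskfs/ext3 with indirect-block mapping; filter_start_write_trans() and the exact credit formula here are hypothetical, not the actual fsfilt code), the handle would be started with deliberately pessimistic credits so it never runs out of space inside ext3_direct_IO():

#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>

/* start one journal handle big enough for the whole (possibly
 * multi-extent) write request of nblocks filesystem blocks */
static handle_t *filter_start_write_trans(struct inode *inode,
                                          unsigned int nblocks)
{
        int credits = nblocks                            /* data blocks */
                    + 3 + nblocks / EXT3_ADDR_PER_BLOCK(inode->i_sb)
                                                         /* indirect blocks */
                    + 2                     /* block bitmap + group desc */
                    + 1                     /* inode */
                    + 1;                    /* superblock */

        /* if the handle were too small, ext3 would have to extend or
         * restart it in the middle of the IO, which is exactly what
         * we want to avoid */
        return ext3_journal_start(inode, credits);
}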