Hi,

I am using b_eff_io to measure the performance of ROMIO over Lustre version 1.6.7.1. I am using the new ADIO Lustre driver and saw that performance is very low. The reason is that the write bandwidth is calculated after a call to fsync().

After some investigation, I saw that even when the file is empty, the fsync() takes 10 ms. If there is more than one process, the fsync() calls seem to be serialized. The time is 80 ms for 8 processes:

salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
filename=/mnt/romio/FILE
First sync (proc 0): 0.005534
03: sync : 0.019168 (err=0)
07: sync : 0.028794 (err=0)
01: sync : 0.038586 (err=0)
05: sync : 0.048467 (err=0)
02: sync : 0.058380 (err=0)
00: sync : 0.068205 (err=0)
04: sync : 0.078027 (err=0)
06: sync : 0.087960 (err=0)

The same program on an NFS file gives less than 5 microseconds for the same fsync() calls on 8 processes:

salloc -n 8 -N 1 mpirun time-fsync -f FILE
filename=FILE
First sync (proc 0): 0.000004
06: sync : 0.000004 (err=0)
04: sync : 0.000004 (err=0)
00: sync : 0.000002 (err=0)
03: sync : 0.000003 (err=0)
02: sync : 0.000004 (err=0)
01: sync : 0.000004 (err=0)
05: sync : 0.000004 (err=0)
07: sync : 0.000004 (err=0)

1) Is this behaviour normal for Lustre?
2) Is it possible to configure something to make this fsync() run better?

================ source of time_fsync.c ====================================
#include "mpi.h"
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    double t1;
    char *opt_filename = NULL;
    int mynod, fd, err;
    int ch;                      // getopt() returns int, so ch must be int, not char

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynod);

    while ((ch = getopt(argc, argv, "f:")) != EOF) {
        switch (ch) {
        case 'f':
            opt_filename = strdup(optarg);
            if (mynod == 0)
                printf("filename=%s\n", opt_filename);
            break;
        }
    }

    // Proc 0 opens/creates the file
    if (mynod == 0) {
        fd = open(opt_filename, O_RDWR | O_CREAT, 0666);

        t1 = MPI_Wtime();
        fsync(fd);
        printf("First sync (proc 0): %.6f\n", MPI_Wtime() - t1);

        close(fd);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    fd = open(opt_filename, O_RDWR);
    MPI_Barrier(MPI_COMM_WORLD);

    t1 = MPI_Wtime();
    err = fsync(fd);
    printf("%.2d: sync : %.6f (err=%d)\n", mynod, MPI_Wtime() - t1, err);

    close(fd);
    MPI_Finalize();

    return 0;
}
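For reference, I build it with the usual MPI compiler wrapper, something like this (the exact wrapper name and flags may differ per installation):

mpicc -o time-fsync time_fsync.c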
On 2009-12-04, at 03:24, pascal.deveze at bull.net wrote:
> I am using b_eff_io to measure the performance of ROMIO over Lustre
> version 1.6.7.1. I am using the new ADIO Lustre driver and saw that
> performance is very low. The reason is that the write bandwidth is
> calculated after a call to fsync().
>
> After some investigation, I saw that even when the file is empty,
> the fsync() takes 10 ms. If there is more than one process, the
> fsync() calls seem to be serialized. The time is 80 ms for 8
> processes:
>
> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
> filename=/mnt/romio/FILE
> First sync (proc 0): 0.005534
> 03: sync : 0.019168 (err=0)
> 07: sync : 0.028794 (err=0)
> 01: sync : 0.038586 (err=0)
> 05: sync : 0.048467 (err=0)
> 02: sync : 0.058380 (err=0)
> 00: sync : 0.068205 (err=0)
> 04: sync : 0.078027 (err=0)
> 06: sync : 0.087960 (err=0)

Very strange.

> 1) Is this behaviour normal for Lustre?

Not AFAIK. For proper data consistency, Lustre is not only flushing the cache for the file descriptor (which is empty in this case), but is also sending a SYNC RPC to the MDS to ensure that the metadata for this file is persistent on disk. From reading the code, a regular sys_fsync() _should_ only cause an MDS SYNC RPC, while sys_fdatasync() will also cause an OSS SYNC RPC for each stripe. That said, I'm not 100% sure the kernel has this right until the very latest kernels (i.e. 2.6.32).

I'm not sure of the exact semantics of fsync() in NFS, whether it is essentially a no-op when there is no dirty data in cache, because the writes themselves are always synchronous and there is no need to do anything on the server.

The Lustre RPCs _should_ all be happening in parallel, from looking at your program below, but it is possible that they are not arriving _quite_ at the same time on the server, and this is forcing an extra transaction commit for each RPC. The times are about right - 10 ms to do a seek on a disk, so this looks like about a single seek for each RPC.

> 2) Is it possible to configure something to make this fsync() run
> better?

Some filesystems (e.g. Reiser4) have the dubious optimization of disabling fsync() altogether, because it slows down applications too much, but if applications are calling fsync() it is generally for a good reason (though, I admit, not always).

As for legitimately optimizing this, there are a few options.

First, the RPCs, and the corresponding file operations on the servers, should happen in parallel, and I'm not sure why at least most of them are not being aggregated into the same transaction. Getting debug logs from the servers and looking into why they are not grouped into a single transaction should identify what is causing the serialization.

Secondly, in Lustre 1.8 with Version Based Recovery, it would be possible for the MDS and OSS to determine if the file being fsync'd has any uncommitted changes, and if not then not do anything at all. With an fsync() (as opposed to a filesystem-wide "sync"), the client should send the FID or object ID to the server to identify the file being fsync'd. With VBR there is a version stored on each inode that contains the transaction number in which it was last modified. If the inode version is older than the filesystem's last_committed transaction number, then it is already on stable storage and nothing needs to be done.
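In rough pseudocode, that check is just a comparison; here is a minimal sketch, with the caveat that all of the names in it are illustrative, not actual Lustre symbols:

#include <stdint.h>

/* Sketch of the VBR-based fsync short-circuit described above.
 * inode_version is the transaction number in which the inode was
 * last modified; last_committed is the filesystem's most recently
 * committed transaction number.  If the inode's transaction has
 * already committed, the file is on stable storage and fsync()
 * can return without doing anything. */
static int fsync_can_be_noop(uint64_t inode_version,
                             uint64_t last_committed)
{
        return inode_version <= last_committed;
}

The interesting work is not the comparison itself, but plumbing the FID or object ID through the SYNC RPC so that the server knows which inode version to look at.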
However, that is just my 5-minute investigation and there may be some hole in that logic.

> ================ source of time_fsync.c =============================
> [ ... program source quoted in full above ... ]

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
[ ... ]

>> 1.6.7.1. I am using the new ADIO Lustre driver and saw that
>> performance is very low. The reason is that the write bandwidth
>> is calculated after a call to fsync().

This is storage systems FAQ #1: committed IOP/s performance is not the same as streaming buffered performance.

>> After some investigation, I saw that even when the file is empty,
>> the fsync() takes 10 ms.

That's pretty obvious: [ ... ]

> [ ... ] 10 ms to do a seek on a disk, so this looks like about
> a single seek for each RPC.

>> If there is more than one process, the fsync() calls seem to
>> be serialized. The time is 80 ms for 8 processes: [ ... ]

>> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
>> filename=/mnt/romio/FILE
>> First sync (proc 0): 0.005534
>> 03: sync : 0.019168 (err=0)
>> 07: sync : 0.028794 (err=0)
>> 01: sync : 0.038586 (err=0)
>> 05: sync : 0.048467 (err=0)
>> 02: sync : 0.058380 (err=0)
>> 00: sync : 0.068205 (err=0)
>> 04: sync : 0.078027 (err=0)
>> 06: sync : 0.087960 (err=0)

> Very strange.

Not necessarily -- IIRC Lustre metadata are not "striped". The file is empty, and nothing is written to it, so there should be no traffic to the OSTs, only to the currently active MDT.

>> The same program on an NFS file gives less than 5
>> microseconds for the same fsync() calls on 8 processes: [ ... ]

The NFS mount options are likely "wrong". Note that you are using 'fsync' and not 'fdatasync', and perhaps 'noatime' is set differently between Lustre and NFS.

>> 2) Is it possible to configure something to make this fsync()
>> run better?

Well, a beginner text on storage systems, file systems, and transactions would be a start, so at least there would be a basic understanding of why 10 ms per 'fsync' is probably right and 5 us per 8 'fsync's is probably wrong.

> Some filesystems (e.g. Reiser4) have the dubious optimization
> of disabling fsync() altogether, because it slows down
> applications too much, but if applications are calling fsync()
> it is generally for a good reason (though, I admit, not
> always).

More broadly, the problem usually is that applications don't even issue 'fsync', and very few people seem to have read any beginner text on storage systems, file systems, and transactions, or understand why it matters and when; this is one aspect of the "userspace sucks" issue. Some links that AndreasD probably knows well:

http://sandeen.net/wordpress/?p=34
http://sandeen.net/wordpress/?p=42
http://mjg59.livejournal.com/108257.html
http://tribulaciones.org/2009/03/is-ext4-unsafe/
http://lwn.net/SubscriberLink/322823/e6979f02e5a73feb/
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
http://loupgaroublond.blogspot.com/2009/03/anecdote-about-why-doing-wrong-thing-is.html
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45

> As for legitimately optimizing this, there are a few options.

Perhaps the best option is to use low-latency storage media. That's the only really good way to get IOP/s up in a sustained way. IIRC AndreasD has repeatedly recommended using good SSDs if one wants fast MDTs (but writing an SSD page can be slow). Battery-backed RAM also seems important. [ ... ]

> Secondly, in Lustre 1.8 with Version Based Recovery, it would be
> possible for the MDS and OSS to determine if the file being
> fsync'd has any uncommitted changes, and if not then not do
> anything at all.

I suspect that there is an important difference here between 'fsync' and 'fdatasync'. [ ... ]
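An easy way to test that suspicion would be a variant of the original program that times both calls back to back on the same empty file. A minimal sketch (untested, with argument handling stripped to the bare minimum; pass the filename as the first argument):

#include "mpi.h"
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    double t1;
    int mynod, fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynod);

    /* Every process opens the same (empty) file. */
    fd = open(argv[1], O_RDWR | O_CREAT, 0666);

    /* Time fsync(): per AndreasD's reading of the code, on Lustre
     * this should only cause an MDS SYNC RPC. */
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    fsync(fd);
    printf("%.2d: fsync     : %.6f\n", mynod, MPI_Wtime() - t1);

    /* Time fdatasync(): this should also cause an OSS SYNC RPC
     * for each stripe. */
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    fdatasync(fd);
    printf("%.2d: fdatasync : %.6f\n", mynod, MPI_Wtime() - t1);

    close(fd);
    MPI_Finalize();
    return 0;
}

If that reading is right, the two timings should diverge on Lustre, while on NFS both should stay near-instant when there is no dirty data.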