Hi,

I am using b_eff_io to measure the performance of ROMIO over Lustre version 1.6.7.1. I am using the new ADIO Lustre driver and saw that performance is very low. The reason is that the write bandwidth is calculated after a call to fsync().

After some investigation, I saw that even when the file is empty, the fsync() takes 10 ms. If there is more than one process, the fsync() calls seem to be serialized. The time is 80 ms for 8 processes:

salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
filename=/mnt/romio/FILE
First sync (proc 0): 0.005534
03: sync : 0.019168 (err=0)
07: sync : 0.028794 (err=0)
01: sync : 0.038586 (err=0)
05: sync : 0.048467 (err=0)
02: sync : 0.058380 (err=0)
00: sync : 0.068205 (err=0)
04: sync : 0.078027 (err=0)
06: sync : 0.087960 (err=0)

The same program on an NFS file gives less than 5 microseconds for the same fsync() calls on 8 processes:

salloc -n 8 -N 1 mpirun time-fsync -f FILE
filename=FILE
First sync (proc 0): 0.000004
06: sync : 0.000004 (err=0)
04: sync : 0.000004 (err=0)
00: sync : 0.000002 (err=0)
03: sync : 0.000003 (err=0)
02: sync : 0.000004 (err=0)
01: sync : 0.000004 (err=0)
05: sync : 0.000004 (err=0)
07: sync : 0.000004 (err=0)

1) Is this behaviour normal for Lustre?
2) Is it possible to configure something to make this fsync() run better?

================ source of time_fsync.c ====================================
#include "mpi.h"
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    double t1;
    char *opt_filename = NULL;
    int mynod, fd, err;
    int ch;                      // getopt() returns int, so ch must be int, not char

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynod);

    while ((ch = getopt(argc, argv, "f:")) != EOF) {
        switch (ch) {
        case 'f':
            opt_filename = strdup(optarg);
            if (mynod == 0)
                printf("filename=%s\n", opt_filename);
            break;
        }
    }

    // Proc 0 opens/creates the file
    if (mynod == 0) {
        fd = open(opt_filename, O_RDWR | O_CREAT, 0666);

        t1 = MPI_Wtime();
        fsync(fd);
        printf("First sync (proc 0): %.6f\n", MPI_Wtime() - t1);

        close(fd);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    fd = open(opt_filename, O_RDWR);
    MPI_Barrier(MPI_COMM_WORLD);

    t1 = MPI_Wtime();
    err = fsync(fd);
    printf("%.2d: sync : %.6f (err=%d)\n", mynod, MPI_Wtime() - t1, err);

    close(fd);
    MPI_Finalize();

    return 0;
}
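For reference, I build it with the usual MPI compiler wrapper, something like this (the exact wrapper name and flags may differ per installation):

mpicc -o time-fsync time_fsync.c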
On 2009-12-04, at 03:24, pascal.deveze at bull.net wrote:
> I am using b_eff_io to measure the performance of ROMIO over Lustre
> version 1.6.7.1. I am using the new ADIO Lustre driver and saw that
> performance is very low. The reason is that the write bandwidth is
> calculated after a call to fsync().
>
> After some investigation, I saw that even when the file is empty,
> the fsync() takes 10 ms. If there is more than one process, the
> fsync() calls seem to be serialized. The time is 80 ms for 8
> processes:
>
> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
> filename=/mnt/romio/FILE
> First sync (proc 0): 0.005534
> 03: sync : 0.019168 (err=0)
> 07: sync : 0.028794 (err=0)
> 01: sync : 0.038586 (err=0)
> 05: sync : 0.048467 (err=0)
> 02: sync : 0.058380 (err=0)
> 00: sync : 0.068205 (err=0)
> 04: sync : 0.078027 (err=0)
> 06: sync : 0.087960 (err=0)

Very strange.

> 1) Is this behaviour normal for Lustre?

Not AFAIK. For proper data consistency, Lustre is not only flushing the cache for the file descriptor (which is empty in this case), but is also sending a SYNC RPC to the MDS to ensure that the metadata for this file is persistent on disk. From reading the code, a regular sys_fsync() _should_ only cause an MDS SYNC RPC, while sys_fdatasync() will also cause an OSS SYNC RPC for each stripe. That said, I'm not 100% sure the kernel has this right until the very latest kernels (i.e. 2.6.32).

I'm not sure of the exact semantics of fsync() in NFS, whether it is essentially a no-op when there is no dirty data in cache, because the writes themselves are always synchronous and there is no need to do anything on the server.

The Lustre RPCs _should_ all be happening in parallel, from looking at your program below, but it is possible that they are not arriving _quite_ at the same time on the server, and this is forcing an extra transaction commit for each RPC. The times are about right - 10 ms to do a seek on a disk, so this looks like about a single seek for each RPC.

> 2) Is it possible to configure something to make this fsync() run
> better?

Some filesystems (e.g. Reiser4) have the dubious optimization of disabling fsync() altogether, because it slows down applications too much, but if applications are calling fsync() it is generally for a good reason (though, I admit, not always).

As for legitimately optimizing this, there are a few options.

First, the RPCs, and the corresponding file operations on the servers, should happen in parallel, and I'm not sure why at least most of them are not being aggregated into the same transaction. Getting debug logs from the servers and looking into why they are not grouped into a single transaction should identify what is causing the serialization.

Secondly, in Lustre 1.8 with Version Based Recovery, it would be possible for the MDS and OSS to determine if the file being fsync'd has any uncommitted changes, and if not then not do anything at all. With an fsync() (as opposed to a filesystem-wide "sync"), the client should send the FID or object ID to the server to identify the file being fsync'd. With VBR there is a version stored on each inode that contains the transaction number in which it was last modified. If the inode version is older than the filesystem's last_committed transaction number, then it is already on stable storage and nothing needs to be done.
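In rough pseudocode, that check is just a comparison; here is a minimal sketch, with the caveat that all of the names in it are illustrative, not actual Lustre symbols:

#include <stdint.h>

/* Sketch of the VBR-based fsync short-circuit described above.
 * inode_version is the transaction number in which the inode was
 * last modified; last_committed is the filesystem's most recently
 * committed transaction number.  If the inode's transaction has
 * already committed, the file is on stable storage and fsync()
 * can return without doing anything. */
static int fsync_can_be_noop(uint64_t inode_version,
                             uint64_t last_committed)
{
        return inode_version <= last_committed;
}

The interesting work is not the comparison itself, but plumbing the FID or object ID through the SYNC RPC so that the server knows which inode version to look at.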
However, that is just my 5-minute investigation and there may be some hole in that logic.

> ================ source of time_fsync.c =============================
> [ ... program source quoted in full above ... ]

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
[ ... ]

>> 1.6.7.1. I am using the new ADIO Lustre driver and saw that
>> performance is very low. The reason is that the write bandwidth
>> is calculated after a call to fsync().

This is storage systems FAQ #1: committed IOP/s performance is not the same as streaming buffered performance.

>> After some investigation, I saw that even when the file is empty,
>> the fsync() takes 10 ms.

That's pretty obvious: [ ... ]

> [ ... ] 10 ms to do a seek on a disk, so this looks like about
> a single seek for each RPC.

>> If there is more than one process, the fsync() calls seem to
>> be serialized. The time is 80 ms for 8 processes: [ ... ]

>> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
>> filename=/mnt/romio/FILE
>> First sync (proc 0): 0.005534
>> 03: sync : 0.019168 (err=0)
>> 07: sync : 0.028794 (err=0)
>> 01: sync : 0.038586 (err=0)
>> 05: sync : 0.048467 (err=0)
>> 02: sync : 0.058380 (err=0)
>> 00: sync : 0.068205 (err=0)
>> 04: sync : 0.078027 (err=0)
>> 06: sync : 0.087960 (err=0)

> Very strange.

Not necessarily -- IIRC Lustre metadata are not "striped". The file is empty, and nothing is written to it, so there should be no traffic to the OSTs, only to the currently active MDT.

>> The same program on an NFS file gives less than 5
>> microseconds for the same fsync() calls on 8 processes: [ ... ]

The NFS mount options are likely "wrong". Note that you are using 'fsync' and not 'fdatasync', and perhaps 'noatime' is set differently between Lustre and NFS.

>> 2) Is it possible to configure something to make this fsync()
>> run better?

Well, a beginner text on storage systems, file systems, and transactions would be a start, so at least there would be a basic understanding of why 10 ms per 'fsync' is probably right and 5 us per 8 'fsync's is probably wrong.

> Some filesystems (e.g. Reiser4) have the dubious optimization
> of disabling fsync() altogether, because it slows down
> applications too much, but if applications are calling fsync()
> it is generally for a good reason (though, I admit, not
> always).

More broadly, the problem usually is that applications don't even issue 'fsync', and very few people seem to have read any beginner text on storage systems, file systems, and transactions, or understand why it matters and when; this is one aspect of the "userspace sucks" issue. Some links that AndreasD probably knows well:

http://sandeen.net/wordpress/?p=34
http://sandeen.net/wordpress/?p=42
http://mjg59.livejournal.com/108257.html
http://tribulaciones.org/2009/03/is-ext4-unsafe/
http://lwn.net/SubscriberLink/322823/e6979f02e5a73feb/
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
http://loupgaroublond.blogspot.com/2009/03/anecdote-about-why-doing-wrong-thing-is.html
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45

> As for legitimately optimizing this, there are a few options.

Perhaps the best option is to use low-latency storage media. That's the only really good way to get IOP/s up in a sustained way. IIRC AndreasD has repeatedly recommended using good SSDs if one wants fast MDTs (but writing an SSD page can be slow). Battery-backed RAM also seems important. [ ... ]

> Secondly, in Lustre 1.8 with Version Based Recovery, it would be
> possible for the MDS and OSS to determine if the file being
> fsync'd has any uncommitted changes, and if not then not do
> anything at all.

I suspect that there is an important difference here between 'fsync' and 'fdatasync'. [ ... ]
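An easy way to test that suspicion would be a variant of the original program that times both calls back to back on the same empty file. A minimal sketch (untested, with argument handling stripped to the bare minimum; pass the filename as the first argument):

#include "mpi.h"
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    double t1;
    int mynod, fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynod);

    /* Every process opens the same (empty) file. */
    fd = open(argv[1], O_RDWR | O_CREAT, 0666);

    /* Time fsync(): per AndreasD's reading of the code, on Lustre
     * this should only cause an MDS SYNC RPC. */
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    fsync(fd);
    printf("%.2d: fsync     : %.6f\n", mynod, MPI_Wtime() - t1);

    /* Time fdatasync(): this should also cause an OSS SYNC RPC
     * for each stripe. */
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    fdatasync(fd);
    printf("%.2d: fdatasync : %.6f\n", mynod, MPI_Wtime() - t1);

    close(fd);
    MPI_Finalize();
    return 0;
}

If that reading is right, the two timings should diverge on Lustre, while on NFS both should stay near-instant when there is no dirty data.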