From past experience, chiefly with the SGI, I had thought there would be
more involved than simply adding O_DIRECT to the flags when opening my
file. Is this not true?

Marty

-----Original Message-----
From: Andreas Dilger [mailto:adilger@clusterfs.com]
Sent: Thursday, March 16, 2006 4:39 AM
To: Barnaby, Marty L
Cc: lustre-discuss@clusterfs.com; Naegle, John H; Klundt, Ruth
Subject: Re: [Lustre-discuss] more info on O_DIRECT

> [...]
> Lustre O_DIRECT is available for linux 2.4 clients.  It has not yet
> been implemented for 2.6 clients (not much demand so far), but I
> believe there is a patch available.  In cases like this, where you are
> doing large, well-aligned IOs and are not going to reuse the file
> contents, O_DIRECT is definitely appropriate.
>
> Cheers, Andreas
> From: lustre-discuss-bounces@clusterfs.com
> [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Andreas Dilger
> Sent: 16 March 2006 11:39
> To: Barnaby, Marty L
> Cc: lustre-discuss@clusterfs.com; Naegle, John H; Klundt, Ruth
> Subject: Re: [Lustre-discuss] more info on O_DIRECT
>
> Lustre O_DIRECT is available for linux 2.4 clients.  It has not yet
> been implemented for 2.6 clients (not much demand so far), but I
> believe there is a patch available.  In cases like this, where you are
> doing large, well-aligned IOs and are not going to reuse the file
> contents, O_DIRECT is definitely appropriate.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

We supplied at least one draft version of the Lustre Direct I/O client
code for 2.6 kernels to CFS, which was largely sympathetic to the 2.4
implementation. It appeared to do all the right things.

- Tom

---
Tomas Hancock, Hewlett Packard, Galway, Ireland
+353-91-754765
On Mar 16, 2006 07:20 -0700, Barnaby, Marty L wrote:
> From past experience, chiefly with the SGI, I had thought there would be
> more involved than simply adding O_DIRECT to the flags when opening my
> file. Is this not true?

Beyond opening (or fcntl?) with O_DIRECT, the read/write buffers must be
aligned on PAGE_SIZE boundaries (this should happen automatically for any
allocations >= 4kB, I think) and the file offset + count must also be on
PAGE_SIZE boundaries.  If you use at least multiples of 64kB you are safe
on all platforms, though of course multiples of 1MB are better for Lustre.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
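[A minimal sketch of the pattern described above, assuming a Linux client
where Lustre supports O_DIRECT; the file path and the 1 MB transfer size
are placeholders, not taken from mpscp.]

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 1 << 20;   /* 1 MB: a PAGE_SIZE multiple, better for Lustre */
    void *buf;

    /* O_DIRECT requires the user buffer to be page-aligned. */
    if (posix_memalign(&buf, 4096, blk) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* Placeholder path; any file on a Lustre mount would do. */
    int fd = open("/scratch5/somefile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open O_DIRECT");
        return 1;
    }

    /* Offset and count stay on blk (hence PAGE_SIZE) boundaries.
     * Stop at the first short read so the offset never becomes unaligned. */
    off_t offset = 0;
    ssize_t n;
    while ((n = pread(fd, buf, blk, offset)) == (ssize_t)blk)
        offset += n;

    close(fd);
    free(buf);
    return n < 0 ? 1 : 0;
}
```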
On Mar 15, 2006 10:43 -0700, Barnaby, Marty L wrote:
> However, I get extremely poor scaling when I distribute the reading of
> the 100 GB file, in 20 GB chunks, over four separate nodes. I have a
> script to do this (which I eventually plan to make available to users
> as a parallel cp); for now it checks the operations by performing them
> first in series, then in the background, to achieve parallelism. As you
> can see from what I am verbosely echoing, I use ssh -k -x to distribute
> the jobs from one login node:
>
> rslogin06:~> rs2rs.chk /scratch5/mlbarna/datum/mpp.6OST /dev/null
> -rw-r--r-- 1 mlbarna mlbarna 104857600000 Jan 12 09:53
> /scratch5/mlbarna/datum/mpp.6OST
> /usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
> -Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
> /usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
> -Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.6e+02 seconds (3.9e+05 Kbytes/s)
> /usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
> -Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
> /usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
> /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)

These are the operations in series?  I assume (since -X isn't mentioned
in the mpscp man page) that -Xs={size},o={offset} and you want each
client to copy a subset of the file?  It is confusing that the app would
report 100GB copied in that case, instead of {size}.  Also, given that
the time for each copy is the same as the single-client 100GB copy time,
it seems you are copying the full 100GB in each process.

> /usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
> -Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> /usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
> -Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> /usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
> -Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> /usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
> /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)

If the above is true, then we are in fact copying 400GB in 400s, and the
reported rates can be added, so 1.04GB/s, which is ~2.5x the single-client
rate of 390MB/s.  So, not 4x the single-client rate, but there could be
other factors involved, such as limited disk bandwidth on the server side
for a 6-stripe file (160MB/s/OST is pretty good).

> I have been told that the readahead size configured for these nodes is
> 80 GB, which would explain where I am losing time doing extra loading.

I'd hope the readahead is more like 40MB (the default) or 80MB, and not
80GB, since I doubt the client nodes have 80GB of RAM available just for
readahead?  Please check /proc/fs/lustre/llite/*/max_readahead_mb.

> With this mpscp utility, I have had success in the past with the
> implementation of SGI's O_DIRECT for XFS, speeding up this type of
> well-behaved, large-blocksize, parallel reading by about 20%.  At some
> point in the future, I expect the owners of these rsnetX nodes to make
> adjustments to the readahead, maybe by turning it off altogether.
> However, if some mode of O_DIRECT is available for this operation, I
> could take control of my own needs.

Lustre O_DIRECT is available for linux 2.4 clients.  It has not yet been
implemented for 2.6 clients (not much demand so far), but I believe there
is a patch available.  In cases like this, where you are doing large,
well-aligned IOs and are not going to reuse the file contents, O_DIRECT
is definitely appropriate.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
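[For comparison, a minimal sketch of the per-client behaviour assumed in
the reply above: read only {size} bytes starting at {offset} and report
only that amount. This illustrates the intended -Xs/-Xo semantics and is
not mpscp's actual code; the 8 MB I/O size mirrors the -b8388608 option
used elsewhere in the thread.]

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64   /* 64-bit offsets for >100 GB files */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy `size` bytes starting at `offset` from `path` to /dev/null,
 * returning the number of bytes actually copied so the caller can
 * report {size} rather than the whole file length. */
static long long copy_chunk(const char *path, long long offset, long long size)
{
    const size_t blk = 8 * 1024 * 1024;    /* 8 MB, as with -b8388608 */
    char *buf = malloc(blk);
    long long done = 0;

    int in = open(path, O_RDONLY);
    int out = open("/dev/null", O_WRONLY);
    if (!buf || in < 0 || out < 0)
        return -1;

    while (done < size) {
        size_t want = (size - done) < (long long)blk ? (size_t)(size - done) : blk;
        ssize_t n = pread(in, buf, want, (off_t)(offset + done));
        if (n <= 0)                        /* EOF or error: stop at the chunk end */
            break;
        if (write(out, buf, (size_t)n) != n)
            break;
        done += n;
    }

    close(in);
    close(out);
    free(buf);
    return done;
}

int main(void)
{
    /* The second chunk from the transcript: -Xs=26214400000,o=26214400000 */
    long long copied = copy_chunk("/scratch5/mlbarna/datum/mpp.6OST",
                                  26214400000LL, 26214400000LL);
    printf("%lld bytes copied\n", copied);
    return copied < 0;
}
```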
My interest in O_DIRECT is primarily in sourcing large, contiguous,
disjoint chunks of a huge file, distributed across multiple
high-performance network nodes, to achieve a parallel-ftp type of
operation that might perform at many GB/s.
My parallel ftp application is called MPSCP:
http://www.sandia.gov/MPSCP/mpscp_design.htm
I've augmented it to do a cp (actually even leveraging sendfile on the
sourcing side; a sketch of that kind of sendfile loop follows the
transcript below), and it can take as optional arguments a data-access
blocksize as well as the size and offset of a chunk within a large file.
Here is an example of cp-to-/dev/null performance for a 20 GB file, a
100 GB file, and the first 20 GB of the 100 GB file, all stored on a
six-wide LFS stripe:
mlbarna@rsnet06:~> /projects/sio/exe/mpscp -b8388608
/scratch4/mlbarna/datum/wt.6OST /dev/null
20000538624 bytes copied in 48 seconds (4.1e+05 Kbytes/s)
mlbarna@rsnet07:~> /projects/sio/exe/mpscp
/scratch5/mlbarna/datum/mpp.6OST /dev/null
/projects/sio/exe/mpscp -b8388608 /scratch5/mlbarna/datum/mpp.6OST
/dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.9e+05 Kbytes/s)
These I consider reasonable byte-rates for the single node case.
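[A minimal sketch of the kind of sendfile-based sourcing loop mentioned
before the transcript, assuming an already-connected socket descriptor;
this is illustrative only and not mpscp's actual code.]

```c
#define _FILE_OFFSET_BITS 64
#include <sys/sendfile.h>
#include <fcntl.h>
#include <unistd.h>

/* Stream `count` bytes of `path`, starting at `start`, into the already
 * connected socket `sock` using sendfile(), so the file data never passes
 * through a user-space buffer on the sourcing side. */
static long long source_with_sendfile(int sock, const char *path,
                                      off_t start, long long count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t off = start;            /* sendfile() advances this for us */
    long long sent = 0;

    while (sent < count) {
        size_t want = (count - sent > 1 << 20) ? (1 << 20)
                                               : (size_t)(count - sent);
        ssize_t n = sendfile(sock, fd, &off, want);
        if (n <= 0)               /* error, or EOF before the chunk ended */
            break;
        sent += n;
    }

    close(fd);
    return sent;
}
```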
However, I get extremely poor scaling when I distribute the reading of
the 100 GB file, in 20 GB chunks, over four separate nodes. I have a
script to do this (which I eventually plan to make available to users as
a parallel cp); for now it checks the operations by performing them first
in series, then in the background, to achieve parallelism. As you can see
from what I am verbosely echoing, I use ssh -k -x to distribute the jobs
from one login node:
rslogin06:~> rs2rs.chk /scratch5/mlbarna/datum/mpp.6OST /dev/null
-rw-r--r-- 1 mlbarna mlbarna 104857600000 Jan 12 09:53
/scratch5/mlbarna/datum/mpp.6OST
/usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
-Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
-Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.6e+02 seconds (3.9e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
-Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
/scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
-Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
/usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
-Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
/usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
-Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
/usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
/scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
Wed Mar 15 09:41:19 MST 2006
This shows that, though I get some parallel scaling, it's somewhere
around 25%. I have been told that the readahead size configured for
these nodes is 80 GB, which would explain where I am losing time doing
extra loading, since I am only reading 26 GB chunks. I've experimented
with chunks as big as 80 GB, but haven't had success.
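[As a concrete way to confirm the readahead limit in question, here is a
minimal sketch that reads the max_readahead_mb proc files Andreas points
to in his reply above; the path pattern comes from that reply, everything
else is illustrative.]

```c
#include <glob.h>
#include <stdio.h>

// Print the client readahead limit from each Lustre llite mount, using the
// proc path mentioned above: /proc/fs/lustre/llite/*/max_readahead_mb
int main(void)
{
    glob_t g;
    if (glob("/proc/fs/lustre/llite/*/max_readahead_mb", 0, NULL, &g) != 0) {
        fprintf(stderr, "no llite max_readahead_mb entries found\n");
        return 1;
    }

    for (size_t i = 0; i < g.gl_pathc; i++) {
        char line[64] = "";
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (f != NULL) {
            if (fgets(line, sizeof line, f) != NULL)
                printf("%s: %s", g.gl_pathv[i], line);  // value is in MB
            fclose(f);
        }
    }

    globfree(&g);
    return 0;
}
```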
With this mpscp utility, I have had success in the past with the
implementation of SGI's O_DIRECT for XFS, speeding up this type of
well-behaved, large-blocksize, parallel reading by about 20%. At some
point in the future, I expect the owners of these rsnetX nodes to make
adjustments to the readahead, maybe by turning it off altogether.
However, if some mode of O_DIRECT is available for this operation, I
could take control of my own needs.
Marty Barnaby
Sandia National Laboratories
-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com
[mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Wei-keng Liao
Sent: Tuesday, March 14, 2006 2:22 PM
To: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] more info on O_DIRECT
I am looking for more information about using O_DIRECT. According to
Lustre manual 1.4.6, page 50, it says
"For more information about the pros and cons of using Direct I/O
with
Lustre, see Performance Concepts."
Where can I find this "Performance Concepts" document?
Wei-keng
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss