From past experience, chiefly with the SGI, I had thought there would be
more involved than simply adding O_DIRECT to the flags when opening my
file. Is this not true?

Marty

-----Original Message-----
From: Andreas Dilger [mailto:adilger@clusterfs.com]
Sent: Thursday, March 16, 2006 4:39 AM
To: Barnaby, Marty L
Cc: lustre-discuss@clusterfs.com; Naegle, John H; Klundt, Ruth
Subject: Re: [Lustre-discuss] more info on O_DIRECT

> [...]
> Lustre O_DIRECT is available for linux 2.4 clients.  It has not yet
> been implemented for 2.6 clients (not much demand so far), but I
> believe there is a patch available.  In cases like this, where you are
> doing large, well-aligned IOs and are not going to reuse the file
> contents, O_DIRECT is definitely appropriate.
>
> Cheers, Andreas
> From: lustre-discuss-bounces@clusterfs.com
> [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Andreas Dilger
> Sent: 16 March 2006 11:39
> To: Barnaby, Marty L
> Cc: lustre-discuss@clusterfs.com; Naegle, John H; Klundt, Ruth
> Subject: Re: [Lustre-discuss] more info on O_DIRECT
>
> Lustre O_DIRECT is available for linux 2.4 clients.  It has not yet
> been implemented for 2.6 clients (not much demand so far), but I
> believe there is a patch available.  In cases like this, where you are
> doing large, well-aligned IOs and are not going to reuse the file
> contents, O_DIRECT is definitely appropriate.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

We supplied at least one draft version of the Lustre Direct I/O client
code for 2.6 kernels to CFS, which was largely sympathetic to the 2.4
implementation. It appeared to do all the right things.

- Tom

---
Tomas Hancock, Hewlett Packard, Galway, Ireland
+353-91-754765
On Mar 16, 2006 07:20 -0700, Barnaby, Marty L wrote:
> From past experience, chiefly with the SGI, I had thought there would be
> more involved than simply adding O_DIRECT to the flags when opening my
> file. Is this not true?

Beyond opening (or fcntl?) with O_DIRECT, the read/write buffers must be
aligned on PAGE_SIZE boundaries (this should happen automatically for any
allocations >= 4kB, I think) and the file offset + count must also be on
PAGE_SIZE boundaries.  If you use at least multiples of 64kB you are safe
on all platforms, though of course multiples of 1MB are better for Lustre.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
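[A minimal sketch of the pattern described above, assuming a Linux client
where Lustre supports O_DIRECT; the file path and the 1 MB transfer size
are placeholders, not taken from mpscp.]

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 1 << 20;   /* 1 MB: a PAGE_SIZE multiple, better for Lustre */
    void *buf;

    /* O_DIRECT requires the user buffer to be page-aligned. */
    if (posix_memalign(&buf, 4096, blk) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* Placeholder path; any file on a Lustre mount would do. */
    int fd = open("/scratch5/somefile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open O_DIRECT");
        return 1;
    }

    /* Offset and count stay on blk (hence PAGE_SIZE) boundaries.
     * Stop at the first short read so the offset never becomes unaligned. */
    off_t offset = 0;
    ssize_t n;
    while ((n = pread(fd, buf, blk, offset)) == (ssize_t)blk)
        offset += n;

    close(fd);
    free(buf);
    return n < 0 ? 1 : 0;
}
```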
On Mar 15, 2006 10:43 -0700, Barnaby, Marty L wrote:
> However, I get extremely poor scaling when I distribute the reading of
> the 100 GB file, in 20 GB chunks, over four separate nodes. I have a
> script to do this (which I eventually plan to make available to users
> as a parallel cp); for now it checks the operations by performing them
> first in series, then in the background, to achieve parallelism. As you
> can see from what I am verbosely echoing, I use ssh -k -x to distribute
> the jobs from one login node:
>
> rslogin06:~> rs2rs.chk /scratch5/mlbarna/datum/mpp.6OST /dev/null
> -rw-r--r-- 1 mlbarna mlbarna 104857600000 Jan 12 09:53
> /scratch5/mlbarna/datum/mpp.6OST
> /usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
> -Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
> /usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
> -Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.6e+02 seconds (3.9e+05 Kbytes/s)
> /usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
> -Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
> /usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
> /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)

These are the operations in series?  I assume (since -X isn't mentioned
in the mpscp man page) that -Xs={size},o={offset} and you want each
client to copy a subset of the file?  It is confusing that the app would
report 100GB copied in that case, instead of {size}.  Also, given that
the time for each copy is the same as the single-client 100GB copy time,
it seems you are copying the full 100GB in each process.

> /usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
> -Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> /usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
> -Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> /usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
> -Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
> /usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
> /scratch5/mlbarna/datum/mpp.6OST /dev/null
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
> 104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)

If the above is true, then we are in fact copying 400GB in 400s, and the
reported rates can be added, so 1.04GB/s, which is ~2.5x the single-client
rate of 390MB/s.  So, not 4x the single-client rate, but there could be
other factors involved, such as limited disk bandwidth on the server side
for a 6-stripe file (160MB/s/OST is pretty good).

> I have been told that the readahead size configured for these nodes is
> 80 GB, which would explain where I am losing time doing extra loading.

I'd hope the readahead is more like 40MB (the default) or 80MB, and not
80GB, since I doubt the client nodes have 80GB of RAM available just for
readahead?  Please check /proc/fs/lustre/llite/*/max_readahead_mb.

> With this mpscp utility, I have had success in the past with the
> implementation of SGI's O_DIRECT for XFS, speeding up this type of
> well-behaved, large-blocksize, parallel reading by about 20%.  At some
> point in the future, I expect the owners of these rsnetX nodes to make
> adjustments to the readahead, maybe by turning it off altogether.
> However, if some mode of O_DIRECT is available for this operation, I
> could take control of my own needs.

Lustre O_DIRECT is available for linux 2.4 clients.  It has not yet been
implemented for 2.6 clients (not much demand so far), but I believe there
is a patch available.  In cases like this, where you are doing large,
well-aligned IOs and are not going to reuse the file contents, O_DIRECT
is definitely appropriate.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
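[For comparison, a minimal sketch of the per-client behaviour assumed in
the reply above: read only {size} bytes starting at {offset} and report
only that amount. This illustrates the intended -Xs/-Xo semantics and is
not mpscp's actual code; the 8 MB I/O size mirrors the -b8388608 option
used elsewhere in the thread.]

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64   /* 64-bit offsets for >100 GB files */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy `size` bytes starting at `offset` from `path` to /dev/null,
 * returning the number of bytes actually copied so the caller can
 * report {size} rather than the whole file length. */
static long long copy_chunk(const char *path, long long offset, long long size)
{
    const size_t blk = 8 * 1024 * 1024;    /* 8 MB, as with -b8388608 */
    char *buf = malloc(blk);
    long long done = 0;

    int in = open(path, O_RDONLY);
    int out = open("/dev/null", O_WRONLY);
    if (!buf || in < 0 || out < 0)
        return -1;

    while (done < size) {
        size_t want = (size - done) < (long long)blk ? (size_t)(size - done) : blk;
        ssize_t n = pread(in, buf, want, (off_t)(offset + done));
        if (n <= 0)                        /* EOF or error: stop at the chunk end */
            break;
        if (write(out, buf, (size_t)n) != n)
            break;
        done += n;
    }

    close(in);
    close(out);
    free(buf);
    return done;
}

int main(void)
{
    /* The second chunk from the transcript: -Xs=26214400000,o=26214400000 */
    long long copied = copy_chunk("/scratch5/mlbarna/datum/mpp.6OST",
                                  26214400000LL, 26214400000LL);
    printf("%lld bytes copied\n", copied);
    return copied < 0;
}
```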
My interest in O_DIRECT is primarily in sourcing large, contiguous,
disjoint chunks of a huge file, distributed across multiple
high-performance network nodes, to achieve a parallel-ftp type of
operation that might perform at many GB/s.
My parallel ftp application is called MPSCP:
http://www.sandia.gov/MPSCP/mpscp_design.htm
I've augmented it to do a cp (actually even leveraging sendfile on the
sourcing side; a sketch of that kind of sendfile loop follows the
transcript below), and it can take as optional arguments a data-access
blocksize as well as the size and offset of a chunk within a large file.
Here is an example of cp-to-/dev/null performance for a 20 GB file, a
100 GB file, and the first 20 GB of the 100 GB file, all stored on a
six-wide LFS stripe:
mlbarna@rsnet06:~> /projects/sio/exe/mpscp -b8388608
/scratch4/mlbarna/datum/wt.6OST /dev/null
20000538624 bytes copied in 48 seconds (4.1e+05 Kbytes/s)
mlbarna@rsnet07:~> /projects/sio/exe/mpscp
/scratch5/mlbarna/datum/mpp.6OST /dev/null
/projects/sio/exe/mpscp -b8388608 /scratch5/mlbarna/datum/mpp.6OST
/dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.9e+05 Kbytes/s)
These I consider reasonable byte-rates for the single node case.
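[A minimal sketch of the kind of sendfile-based sourcing loop mentioned
before the transcript, assuming an already-connected socket descriptor;
this is illustrative only and not mpscp's actual code.]

```c
#define _FILE_OFFSET_BITS 64
#include <sys/sendfile.h>
#include <fcntl.h>
#include <unistd.h>

/* Stream `count` bytes of `path`, starting at `start`, into the already
 * connected socket `sock` using sendfile(), so the file data never passes
 * through a user-space buffer on the sourcing side. */
static long long source_with_sendfile(int sock, const char *path,
                                      off_t start, long long count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t off = start;            /* sendfile() advances this for us */
    long long sent = 0;

    while (sent < count) {
        size_t want = (count - sent > 1 << 20) ? (1 << 20)
                                               : (size_t)(count - sent);
        ssize_t n = sendfile(sock, fd, &off, want);
        if (n <= 0)               /* error, or EOF before the chunk ended */
            break;
        sent += n;
    }

    close(fd);
    return sent;
}
```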
However, I get extremely poor scaling when I distribute the reading of
the 100 GB file, in 20 GB chunks, over four separate nodes. I have a
script to do this (which I eventually plan to make available to users as
a parallel cp); for now it checks the operations by performing them first
in series, then in the background, to achieve parallelism. As you can see
from what I am verbosely echoing, I use ssh -k -x to distribute the jobs
from one login node:
rslogin06:~> rs2rs.chk /scratch5/mlbarna/datum/mpp.6OST /dev/null
-rw-r--r-- 1 mlbarna mlbarna 104857600000 Jan 12 09:53
/scratch5/mlbarna/datum/mpp.6OST
/usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
-Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
-Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.6e+02 seconds (3.9e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
-Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
/scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 2.7e+02 seconds (3.8e+05 Kbytes/s)
/usr/local/bin/ssh -k -x rsnet05 /projects/sio/exe/mpscp
-Xs=26214400000,o=0 /scratch5/mlbarna/datum/mpp.6OST /dev/null
/usr/local/bin/ssh -k -x rsnet06 /projects/sio/exe/mpscp
-Xs=26214400000,o=26214400000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
/usr/local/bin/ssh -k -x rsnet07 /projects/sio/exe/mpscp
-Xs=26214400000,o=52428800000 /scratch5/mlbarna/datum/mpp.6OST /dev/null
/usr/local/bin/ssh -k -x rsnet08 /projects/sio/exe/mpscp -Xo=78643200000
/scratch5/mlbarna/datum/mpp.6OST /dev/null
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
104857600000 bytes copied in 4e+02 seconds (2.6e+05 Kbytes/s)
Wed Mar 15 09:41:19 MST 2006
This shows that, though I get some parallel scaling, it's somewhere
around 25%. I have been told that the readahead size configured for
these nodes is 80 GB, which would explain where I am losing time doing
extra loading, since I am only reading 26 GB chunks. I've experimented
with chunks as big as 80 GB, but haven't had success.
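[As a concrete way to confirm the readahead limit in question, here is a
minimal sketch that reads the max_readahead_mb proc files Andreas points
to in his reply above; the path pattern comes from that reply, everything
else is illustrative.]

```c
#include <glob.h>
#include <stdio.h>

// Print the client readahead limit from each Lustre llite mount, using the
// proc path mentioned above: /proc/fs/lustre/llite/*/max_readahead_mb
int main(void)
{
    glob_t g;
    if (glob("/proc/fs/lustre/llite/*/max_readahead_mb", 0, NULL, &g) != 0) {
        fprintf(stderr, "no llite max_readahead_mb entries found\n");
        return 1;
    }

    for (size_t i = 0; i < g.gl_pathc; i++) {
        char line[64] = "";
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (f != NULL) {
            if (fgets(line, sizeof line, f) != NULL)
                printf("%s: %s", g.gl_pathv[i], line);  // value is in MB
            fclose(f);
        }
    }

    globfree(&g);
    return 0;
}
```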
With this mpscp utility, I have had success in the past with the
implementation of SGI's O_DIRECT for XFS, speeding up this type of
well-behaved, large-blocksize, parallel reading by about 20%. At some
point in the future, I expect the owners of these rsnetX nodes to make
adjustments to the readahead, maybe by turning it off altogether.
However, if some mode of O_DIRECT is available for this operation, I
could take control of my own needs.
Marty Barnaby
Sandia National Laboratories
-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com
[mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Wei-keng Liao
Sent: Tuesday, March 14, 2006 2:22 PM
To: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] more info on O_DIRECT
I am looking for more information about using O_DIRECT. According to
Lustre manual 1.4.6, page 50, it says
"For more information about the pros and cons of using Direct I/O
with
Lustre, see Performance Concepts."
Where can I find this "Performance Concepts" document?
Wei-keng
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss