jungdoo yoon
2008-Apr-29 03:21 UTC
[Lustre-discuss] Client is not accesible when OSS/OST server is down
When the OSS/OST server is down, the Lustre client is not accessible by any means (ssh, telnet). I guess it is hung on the Lustre partition. OSS/OST HA (heartbeat) is not available yet.

Is there any mount option or other way to prevent this problem? I'm using the 'flock' Lustre mount option.

--
Jungdoo Yoon
Sun Microsystems, Inc.
Brian J. Murrell
2008-Apr-29 14:53 UTC
[Lustre-discuss] Client is not accesible when OSS/OST server is down
On Tue, 2008-04-29 at 12:21 +0900, jungdoo yoon wrote:
> When OSS/OST server is down, lustre client is
> not accesible by any means ( ssh, telnet ).
> I guess it is hung over lustre partition.

Unless you are using Lustre for your root and/or usr filesystem and/or for swap, Lustre should not hang a machine completely.

When a machine is hung as you describe, we generally want to see a stack trace from all of the processes. Given that you cannot access the machine once it's in this state, you will need to get stack traces from the console. You can do this by sending a BREAK-t (a serial BREAK followed by "t", i.e. SysRq-t). You will want to be able to capture the output from the serial console, as there will be a lot of it.

b.
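The BREAK-t request above only produces stack traces if SysRq is enabled on the client before it hangs. The following is a minimal sketch for checking that in advance. It assumes the usual Linux `kernel.sysrq` semantics (0 = off, 1 = everything, otherwise a bitmask in which the value 8 covers process dumps), which may differ on older kernels; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: does a given kernel.sysrq value permit SysRq-t
# (dump of all task states) from the keyboard or serial console?
# Assumes the bitmask interpretation of kernel.sysrq.
sysrq_allows_task_dump() {
    local value="$1"
    # 1 means "all SysRq functions enabled".
    [ "$value" -eq 1 ] && return 0
    # Otherwise treat the value as a bitmask; 8 enables process dumps.
    [ $(( value & 8 )) -ne 0 ]
}

# Usage on a live client (read-only check):
#   sysrq_allows_task_dump "$(cat /proc/sys/kernel/sysrq)" || echo "SysRq-t disabled"
```

If the check fails, SysRq would have to be enabled (for example through the `kernel.sysrq` sysctl) while the machine is still reachable.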
Kilian CAVALOTTI
2008-Apr-29 16:05 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
Hi Brian,

On Tuesday 29 April 2008 07:53:01 am Brian J. Murrell wrote:
> Unless you are using Lustre for your root and/or usr filesystem
> and/or for swap, Lustre should not hang a machine completely.

I was precisely wondering if it was possible to use a file residing on a Lustre filesystem as a swap file. I tried the basic steps without any success.

On a regular ext3 fs, no problem:

  /tmp # dd if=/dev/zero of=./swapfile bs=1024 count=1024
  10240+0 records in
  10240+0 records out
  /tmp # mkswap ./swapfile
  Setting up swapspace version 1, size = 104853 kB
  /tmp # swapon -a ./swapfile
  /tmp # swapon -s
  Filename        Type        Size     Used  Priority
  /dev/sda3       partition   4096564  204   -1
  /tmp/swapfile   file        102392   0     -2

But on a Lustre mount:

  # cd /scratch
  /scratch # grep /scratch /proc/mounts
  10.10.50.2@o2ib:/scratch /scratch lustre rw 0 0
  /scratch # dd if=/dev/zero of=./swapfile bs=1024 count=1024
  10240+0 records in
  10240+0 records out
  /scratch # mkswap ./swapfile
  Setting up swapspace version 1, size = 104853 kB
  /scratch # swapon -a ./swapfile
  swapon: ./swapfile: Invalid argument

Is that expected?

Thanks,
--
Kilian
David_Kewley at Dell.com
2008-Apr-29 19:53 UTC
[Lustre-discuss] Client is not accesible when OSS/OST server is down
Perhaps the machine is not hung completely, but is only unable to support logins.

There can be an issue if the login process attempts to access Lustre, e.g. because the home directory is on Lustre, or perhaps when a directory on Lustre is early in your $PATH.

I've seen logins and existing shell sessions routinely hang in my setting when Lustre servers encountered severe problems, while home directories were on Lustre and Lustre was mounted on the login node.

I'm sure there are details there that a Lustre expert could fill in; maybe there are some fail-soft mechanisms that are designed to prevent hangs by returning appropriate error codes. So this may be more an issue of the login mechanisms being unable to recover when attempts to access an expected file or directory give some particular I/O error.

David
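One way to check for the $PATH case described above is to list the search-path entries that sit on the Lustre mount. This is only a sketch: the function name is hypothetical, `/scratch` is the mount point that appears elsewhere in this thread, and `/scratch/tools/bin` is a made-up example directory.

```shell
#!/bin/bash
# Hypothetical helper: list the entries of a PATH-style string that live on
# or under a given mount point, so they can be moved later in $PATH.
path_entries_under() {
    local mnt="${1%/}" path_var="${2:-$PATH}" entry
    local IFS=':'
    # Unquoted on purpose: split the string on ':' into individual entries.
    for entry in $path_var; do
        case "$entry" in
            "$mnt"|"$mnt"/*) printf '%s\n' "$entry" ;;
        esac
    done
}

# Example with an explicit path string; omit the second argument to use $PATH.
path_entries_under /scratch "/usr/bin:/scratch/tools/bin:/bin"
```

An entry printed here is searched on every command lookup that reaches it, so a shell can block on it while the servers are unavailable.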
Brian J. Murrell
2008-Apr-29 20:02 UTC
[Lustre-discuss] Client is not accesible when OSS/OST server is down
On Tue, 2008-04-29 at 14:53 -0500, David_Kewley at Dell.com wrote:
> Perhaps the machine is not hung completely, but is only unable to
> support logins.

I had assumed the OP was including "root" logins and already established shell sessions in their attempts.

> There can be an issue if the login process attempts to access Lustre,
> e.g. because the home directory is on Lustre, or perhaps when a
> directory on Lustre is early in your $PATH.

Indeed, you are right David, for non-root users.

> I'm sure there are details there that a Lustre expert could fill in;
> maybe there are some fail-soft mechanisms that should are designed to
> prevent hangs by returning appropriate error codes.

Well, that is failout vs. failover. But you have to choose one. There is no way Lustre could try to determine which read()/write()s should "fail soft" vs. block waiting for a recovery.

> So this may be more
> an issue of the login mechanisms being unable to recover when attempts
> to access an expected file or directory give some particular I/O error.

Indeed. I should not have assumed that the OP had already determined whether this was the case.

b.
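The failout/failover choice mentioned above is made per OST on the server side, not with a client mount option. As a sketch only, assuming the `failover.mode` parameter of 1.6-era `mkfs.lustre`, the helper below prints (and does not run) a format command for an OST in failout mode. The helper name and `/dev/sdX` are hypothetical placeholders; the fsname and NID are borrowed from the mount line quoted in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print the mkfs.lustre invocation for an OST that
# returns I/O errors to clients (failout) instead of blocking until
# recovery (failover). Nothing is executed; check the option against the
# manual for your Lustre version before using it.
failout_mkfs_cmd() {
    local fsname="$1" mgsnid="$2" device="$3"
    printf 'mkfs.lustre --fsname=%s --ost --mgsnode=%s --param="failover.mode=failout" %s\n' \
        "$fsname" "$mgsnid" "$device"
}

failout_mkfs_cmd scratch 10.10.50.2@o2ib /dev/sdX
```

With failout, applications see errors while the OST is down, so the trade-off is hung processes versus failed I/O.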
Brian J. Murrell
2008-Apr-29 20:27 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On Tue, 2008-04-29 at 09:05 -0700, Kilian CAVALOTTI wrote:
> Hi Brian,

Hi Kilian,

> I was precisely wondering if it was possible to use a file residing on a
> Lustre filesystem as a swap file.

There certainly has been some historic use and testing of swap on Lustre. I'm not sure what its current state is, TBH. The last bug in bugzilla I can find about swap and Lustre is bug 5794/5749/5188, filed over two years ago (but updated just recently), in reference to some of the regression testing we do with swap on Lustre.

It seems the final outcome is that it has been flagged (just today, in fact) as a 1.6.6 blocker. The timing is absolutely uncanny!

> # cd /scratch
> /scratch # grep /scratch /proc/mounts
> 10.10.50.2@o2ib:/scratch /scratch lustre rw 0 0
> /scratch # dd if=/dev/zero of=./swapfile bs=1024 count=1024
> 10240+0 records in
> 10240+0 records out
> /scratch # mkswap ./swapfile
> Setting up swapspace version 1, size = 104853 kB
> /scratch # swapon -a ./swapfile
> swapon: ./swapfile: Invalid argument
>
> Is that expected?

I would think not, no. There is some reference in bug 5188 that we only test it for "real/remote OSTs" and "non-TCP" interconnects. I wonder if either of those matches your situation. You might file a bug about it, including the usual data -- syslog from the client you did the swapon on, for example. A (-1) debug log from the client right at the time of the swapon would probably be useful.

b.
Robin Humble
2008-Apr-30 04:11 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On Tue, Apr 29, 2008 at 04:27:42PM -0400, Brian J. Murrell wrote:
>On Tue, 2008-04-29 at 09:05 -0700, Kilian CAVALOTTI wrote:
>> /scratch # swapon -a ./swapfile
>> swapon: ./swapfile: Invalid argument
>>
>> Is that expected?
>
>I would think not, no. There is some reference in bug 5188 that we only
>test it for "real/remote OSTs" and "non-TCP" interconnects. I wonder if
>either of those match your situation. You might file a bug about it,
>including the usual data -- syslog from the client you did the swapon on
>for example. A (-1) debug log from the client right at the time of the
>swapon would probably be useful.

when doing the 'swapon', if you look in dmesg you should see:

  kernel: swapon: swapfile has holes

even though the file on Lustre isn't sparse.

I was looking into this last week, and I think this message is because Lustre (at least up to 1.6.4.3) doesn't appear to implement bmap() ->

  lustre/llite/rw26.c:289:        .bmap           = NULL

and the swapfile code in RedHat kernels and in 2.6.23 (mm/swapfile.c, setup_swap_extents()) expects bmap() to be present, and assumes the file is sparse otherwise :-/

we too are very interested in swap on Lustre using o2ib, so I'm happy to hear it'll be in 1.6.6.

cheers,
robin
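To confirm from userspace that a swapfile really has no holes, so that a "swapfile has holes" message points at the missing bmap() rather than at the file, one can compare the file's apparent size with its allocated space. A minimal sketch, assuming GNU `stat`; the function names are hypothetical, and on a network filesystem the reported block count should be treated as a hint only.

```shell
#!/bin/bash
# Hypothetical helpers: decide whether a file looks sparse by comparing its
# apparent size with the space actually allocated to it.

# Pure arithmetic: size in bytes, allocated blocks, bytes per block.
looks_sparse() {
    local size="$1" blocks="$2" block_bytes="$3"
    [ $(( blocks * block_bytes )) -lt "$size" ]
}

# Same check for a real file, using GNU stat
# (%s = size in bytes, %b = allocated blocks, %B = bytes per block).
file_looks_sparse() {
    local info
    info=$(stat -c '%s %b %B' -- "$1") || return 2
    # Split the three numbers into positional parameters.
    set -- $info
    looks_sparse "$1" "$2" "$3"
}

# Usage:
#   file_looks_sparse /scratch/swapfile && echo "file has holes"
```

A file written in full with `dd if=/dev/zero` should normally not be reported as sparse by this check.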
Serge Ryabchun
2008-Apr-30 09:17 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
2008/4/30, Robin Humble <rjh+lustre at cita.utoronto.ca>:
> we too are very interested in swap on Lustre using o2ib, so I'm happy
> to hear it'll be in 1.6.6.

I have used swap via loop on Lustre, over tcp and o2ib, for more than 2 years.

cat /etc/rc.d/rc.swap

  #!/bin/bash

  # Per-host swap file on the Lustre mount
  SWAP=/mnt/adm/swap/${HOSTNAME}.swap

  if [ ! -f "$SWAP" ]; then
      # Create a 4 GiB, fully written (non-sparse) file
      dd if=/dev/zero of="$SWAP" bs=1M count=4096
  else
      # File already exists: read its first 1 MiB
      dd if="$SWAP" of=/dev/null count=1 bs=1M
  fi

  # Attach the file to a loop device and swap on that device
  losetup /dev/loop0 "$SWAP"
  mkswap /dev/loop0 && swapon /dev/loop0

--
Serge Ryabchun <serge.ryabchun at gmail.com>
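A slightly more defensive variant of the script above can be sketched as follows. It is not Serge's script: the `run` helper, the `DRY_RUN` switch, and the use of `losetup -f --show` (pick the first free loop device instead of hard-coding /dev/loop0) are all assumptions added for illustration, and the read of an existing file's first megabyte is left out.

```shell
#!/bin/bash
# Sketch: loop-backed swap on a shared filesystem, with a dry-run mode.
# Hypothetical names: run, setup_loop_swap, DRY_RUN.

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_loop_swap() {
    local swapfile="$1" size_mb="${2:-4096}" loopdev
    if [ ! -f "$swapfile" ]; then
        # Write the file in full so that it contains no holes.
        run dd if=/dev/zero of="$swapfile" bs=1M count="$size_mb" || return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        loopdev=/dev/loopN      # placeholder; the real name comes from losetup
        run losetup -f --show "$swapfile"
    else
        # -f picks the first unused loop device, --show prints its name.
        loopdev=$(losetup -f --show "$swapfile") || return 1
    fi
    run mkswap "$loopdev" && run swapon "$loopdev"
}

# Usage (as root):  setup_loop_swap "/mnt/adm/swap/${HOSTNAME}.swap"
# Preview only:     DRY_RUN=1 setup_loop_swap "/mnt/adm/swap/${HOSTNAME}.swap"
```

The dry-run mode makes it possible to review the commands on a node before letting the script touch loop devices or swap.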
Robin Humble
2008-May-01 04:22 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
Hi Serge,

On Wed, Apr 30, 2008 at 12:17:03PM +0300, Serge Ryabchun wrote:
>2008/4/30, Robin Humble <rjh+lustre at cita.utoronto.ca>:
>> we too are very interested in swap on Lustre using o2ib, so I'm happy
>> to hear it'll be in 1.6.6.
>I used swap with loop on lustre over tcp and o2ib more then 2 years.

thanks for that.

I tried that with o2ib last week, and pretty much as soon as I use any swap the node deadlocks hard in the low memory situation and has to be reset. possibly it's more to do with how loop.c uses aops and allocates memory than it is to do with Lustre, but either way, sadly, it's not a solution.

swap to iSER (backed by a file on Lustre on another server) has worked and survived harsh testing for me in the past, as has swap over iSCSI (TCP) when the clients have Peter Zijlstra's VM deadlock prevention patches applied. presumably swap to SRP would work too. there are now software targets available for all of these protocols, which is nice.

swap directly or via loop to Lustre has never been successful for me. possibly the loop approach might work if a variant of the Zijlstra patches was applied to the loop code. a working swap to native Lustre would be much better and faster though.

on a similar subject, loop on Lustre does work fine for making 'local' filesystems on nodes, which can sustain local disk metadata rates without any load on the Lustre MDS. we use that here and it works well... +/- some VM problems in RedHat kernels under low memory situations that forced us to move to 2.6.23 patchless Lustre clients.

cheers,
robin
Serge Ryabchun
2008-May-01 06:22 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
Hi Robin,

2008/5/1, Robin Humble <rjh+lustre at cita.utoronto.ca>:
> I tried that with o2ib last week and as pretty much as soon as I use
> any swap the node deadlocks hard in the low memory situation and has to
> be reset.
> possibly it's more to do with how loop.c uses aops and alloc's mem than
> it is to do with Lustre, but either way, sadly, it's not a solution.
>
> swap to iSER (backed by a file on Lustre on another server) has worked
> and survived harsh testing for me in the past, as has swap over iSCSI
> (TCP) when the clients have Peter Zijlstra's VM deadlock prevention
> patches applied. presumably swap to SRP would work too. there are now
> software targets available for all of these protocols now which is
> nice.

Hm, in one of our projects the X-terminals use swap on NFS with Peter's patch. But in tests, loop on a file on NFS without the VM patch was very stable. Each X-terminal has 16 MB of memory.

And yes, I use swap on Lustre to back a tmpfs:

  mount -t tmpfs -o size=64G tmpfs /tmp

--
Serge Ryabchun <serge.ryabchun at gmail.com>
Andreas Dilger
2008-May-02 18:25 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On Apr 29, 2008 09:05 -0700, Kilian CAVALOTTI wrote:
> I was precisely wondering if it was possible to use a file residing on a
> Lustre filesystem as a swap file. I tried the basic steps without any
> success.
>
> But on a Lustre mount:
>
> /scratch # mkswap ./swapfile
> Setting up swapspace version 1, size = 104853 kB
> /scratch # swapon -a ./swapfile
> swapon: ./swapfile: Invalid argument

Note that in 1.6.4+ there is an interface to export a block device more directly from Lustre instead of using the loopback driver on top of the client filesystem. This is the "llite_loop" module and is configured like:

  lctl blockdev_attach {loopback_filename} {blockdev_filename}

where {loopback_filename} is the file that should be turned into a block device (it can be sparse if desired) and {blockdev_filename} is the full filename that the new block device should be created at.

To clean up the device use:

  lctl blockdev_detach {blockdev_filename}

Note that using this block device for swap hasn't been very successful in our testing so far, but we also haven't done a great deal of real-world testing, only "allocate a ton of RAM and dirty it all as fast as possible", which isn't a very realistic usage. Feedback would be welcome.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
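Going by the description above, the backing file comes first and the block device path second. The helper below is a sketch that prints the attach command in that order and refuses an obviously swapped pair. The helper name, the `/dev/` check, and the example paths are assumptions added for illustration; nothing is executed.

```shell
#!/bin/bash
# Hypothetical helper: print the lctl command that attaches a file on Lustre
# as a block device, in the order described above (file first, device second).
lloop_attach_cmd() {
    local backing_file="$1" blockdev="$2"
    case "$backing_file" in
        /dev/*)
            # Assumption: a backing file under /dev means swapped arguments.
            echo "first argument should be the file on Lustre, not a device" >&2
            return 1 ;;
    esac
    printf 'lctl blockdev_attach %s %s\n' "$backing_file" "$blockdev"
}

lloop_attach_cmd /scratch/swapfile /dev/lloop0
```

The matching cleanup would be `lctl blockdev_detach` with the same device path.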
Robin Humble
2008-May-09 16:57 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On Fri, May 02, 2008 at 11:25:03AM -0700, Andreas Dilger wrote:
>Note that in 1.6.4+ there is an interface to export a block device
>more directly from Lustre instead of using the loopback driver on top
>of the client filesystem. This is the "llite_loop" module and is
>configured like:
>
>  lctl blockdev_attach {loopback_filename} {blockdev_filename}
>
>where {loopback_filename} is the file that should be turned into a
>block device (it can be sparse if desired) and {blockdev_filename}
>is the full filename that the new block device should be created at.

cool! I was wondering what that module was for.

I'm trying to use it like:

  lctl blockdev_attach /dev/lloop0 /some/file/on/lustre

but then dd to /dev/lloop0 seems to go at ~kB/s whereas dd to the file goes at >= 100MB/s. am I doing something wrong?

cheers,
robin
Andreas Dilger
2008-May-09 19:08 UTC
[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)
On May 09, 2008 12:57 -0400, Robin Humble wrote:
> cool! I was wondering that that module was for.
> I'm trying to use it like:
>   lctl blockdev_attach /dev/lloop0 /some/file/on/lustre
> but then dd to /dev/lloop0 seems to go at ~kB/s wheras dd to the file
> goes at >= 100MB/s. am I doing something wrong?

Hmm, the code is virtually identical to the loop driver, so this is a bit strange. Could you grab the IO stats at the OST in /proc/fs/lustre/obdfilter/{ostname}/brw_stats (use "lfs getstripe" to determine which OST the file lives on) to see what kind of RPCs are being sent to the server? It might be that the lloop driver isn't setting up an elevator properly, or is just sending the RPCs 4kB at a time...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
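To see why 4 kB RPCs would matter, it helps to count how many requests the same write needs at different RPC sizes. The sketch below does that arithmetic and also builds the brw_stats path named above. The helper names and the OST name in the usage comment are hypothetical, and the 1 MiB figure is an assumption used for comparison, not something stated in the thread.

```shell
#!/bin/bash
# Hypothetical helpers for the diagnosis suggested above.

# Path of the per-OST I/O statistics file named in the message.
brw_stats_path() {
    printf '/proc/fs/lustre/obdfilter/%s/brw_stats\n' "$1"
}

# Number of RPCs needed to move total_bytes at rpc_bytes per RPC (rounded up).
rpc_count() {
    local total_bytes="$1" rpc_bytes="$2"
    echo $(( (total_bytes + rpc_bytes - 1) / rpc_bytes ))
}

total=$(( 100 * 1024 * 1024 ))      # 100 MiB written with dd
echo "4 KiB RPCs: $(rpc_count "$total" 4096)"
echo "1 MiB RPCs: $(rpc_count "$total" 1048576)"

# Usage on the OSS (OST name is a placeholder):
#   cat "$(brw_stats_path scratch-OST0000)"
```

A factor of 256 in request count for the same data would be consistent with a large throughput gap, which is what the brw_stats histogram should confirm or rule out.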