thr3ads.net - Lustre discuss - [Lustre-discuss] Client is not accesible when OSS/OST server is down [Apr 2008]

jungdoo yoon

2008-Apr-29 03:21 UTC

[Lustre-discuss] Client is not accesible when OSS/OST server is down

When an OSS/OST server is down, the Lustre client is
not accessible by any means (ssh, telnet).
I guess it is hung on the Lustre partition.

OSS/OST HA (heartbeat) is not available yet.

Is there any mount option or any other way to prevent this
problem? I'm using the 'flock' Lustre mount option.


-- 
Jungdoo Yoon
Sun Microsystems, Inc.
KR
Phone x85014/+82221935 014
Mobile 011-9889-2159
Email Jungdoo.Yoon at Sun.COM

Brian J. Murrell

2008-Apr-29 14:53 UTC

[Lustre-discuss] Client is not accesible when OSS/OST server is down

On Tue, 2008-04-29 at 12:21 +0900, jungdoo yoon wrote:
> When OSS/OST server is down, lustre client is
> not accesible by any means ( ssh, telnet ).
> I guess it is hung over lustre partition.

Unless you are using Lustre for your root and/or usr filesystem and/or
for swap, Lustre should not hang a machine completely.

When a machine is hung as you describe, we generally want to see a stack
trace from all of the processes.  Given that you cannot access the
machine once it's in this state, you will need to get stack traces from
the console.  You can do this by sending a BREAK followed by t (the
serial-console form of SysRq-t).  You will want to capture the output
from the serial console, as there will be a lot of it.
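
The BREAK-t sequence only produces a task dump if the kernel's magic SysRq facility permits it. A minimal sketch of checking that on a client before the next hang, assuming a Linux kernel where `kernel.sysrq` is 0 (disabled), 1 (everything enabled) or a bitmask in which the value 8 enables process dumps. The function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel.sysrq value permits the
# SysRq-t task dump (1 = all functions, bit value 8 = process dumps).
sysrq_allows_task_dump() {
    val=$1
    [ "$val" -eq 1 ] || [ $(( val & 8 )) -ne 0 ]
}

# Assumed use on a client, as root, while it is still reachable:
#   sysrq_allows_task_dump "$(cat /proc/sys/kernel/sysrq)" \
#       || echo 1 > /proc/sys/kernel/sysrq
# On a node that still has a working shell, the same dump can be requested with:
#   echo t > /proc/sysrq-trigger
```

The dump goes to the kernel log, so a serial console capture is still needed when the node cannot be logged into.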

b.

Kilian CAVALOTTI

2008-Apr-29 16:05 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

Hi Brian,

On Tuesday 29 April 2008 07:53:01 am Brian J. Murrell wrote:
> Unless you are using Lustre for your root and/or usr filesystem
> and/or for swap, Lustre should not hang a machine completely.

I was precisely wondering if it was possible to use a file residing on a 
Lustre filesystem as a swap file. I tried the basic steps without any 
success.

On a regular ext3 fs, no problem:

/tmp # dd if=/dev/zero of=./swapfile  bs=1024 count=1024
10240+0 records in
10240+0 records out
/tmp # mkswap ./swapfile
Setting up swapspace version 1, size = 104853 kB
/tmp # swapon -a ./swapfile
/tmp # swapon -s
Filename                       Type            Size    Used    Priority
/dev/sda3                      partition       4096564 204     -1
/tmp/swapfile                  file            102392  0       -2

But on a Lustre mount:

# cd /scratch
/scratch # grep /scratch /proc/mounts
10.10.50.2@o2ib:/scratch /scratch lustre rw 0 0
/scratch # dd if=/dev/zero of=./swapfile  bs=1024 count=1024
10240+0 records in
10240+0 records out
/scratch # mkswap ./swapfile
Setting up swapspace version 1, size = 104853 kB
/scratch # swapon -a ./swapfile
swapon: ./swapfile: Invalid argument

Is that expected?

Thanks,
-- 
Kilian

David_Kewley at Dell.com

2008-Apr-29 19:53 UTC

[Lustre-discuss] Client is not accesible when OSS/OST server is down

Perhaps the machine is not hung completely, but is only unable to
support logins.

There can be an issue if the login process attempts to access Lustre,
e.g. because the home directory is on Lustre, or perhaps when a
directory on Lustre is early in your $PATH.  I've seen logins and
existing shell sessions routinely hang in my setting when Lustre servers
encountered severe problems, while home directories were on Lustre and
Lustre was mounted on the login node.

I'm sure there are details there that a Lustre expert could fill in;
maybe there are some fail-soft mechanisms that are designed to
prevent hangs by returning appropriate error codes.  So this may be more
an issue of the login mechanisms being unable to recover when attempts
to access an expected file or directory give some particular I/O error.
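
One way to see whether a login would touch Lustre is to list which $PATH entries (or the home directory) sit under the Lustre mount point. A minimal sketch using only string matching, so it never touches the possibly hung filesystem itself. The function name and the /scratch mount point are illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical helper: print every entry of a colon-separated list
# (such as $PATH) that lies at or below the given mount point.
# Runs in a subshell so the IFS and globbing changes do not leak out.
entries_under_mount() (
    mnt=${1%/}                  # drop one trailing slash, if any
    IFS=:
    set -f                      # no glob expansion while splitting
    for dir in $2; do
        case "$dir/" in
            "$mnt"/*) printf '%s\n' "$dir" ;;
        esac
    done
)

# Assumed use:
#   entries_under_mount /scratch "$PATH"
#   entries_under_mount /scratch "$HOME"
```

Any output for root's environment would suggest that even root logins can block while the servers are down.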

David

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian J. Murrell
> Sent: Tuesday, April 29, 2008 4:53 AM
> To: lustre
> Subject: Re: [Lustre-discuss] Client is not accesible when OSS/OST
> server is down
>
> On Tue, 2008-04-29 at 12:21 +0900, jungdoo yoon wrote:
> >
> >  When  OSS/OST server is down, lustre client is
> > not accesible by any means ( ssh, telnet ).
> > I guess it is hung over lustre partition.
>
> Unless you are using Lustre for your root and/or usr filesystem and/or
> for swap, Lustre should not hang a machine completely.
>
> When a machine is hung as you describe, we generally want to see a
> stack trace from all of the processes.  Given that you cannot access the
> machine once it's in this state, you will need to get stack traces
> from the console.  You can do this by sending a BREAK-t.  You will want
> to be able to capture the output from the serial console as there will
> be much.
>
> b.

Brian J. Murrell

2008-Apr-29 20:02 UTC

[Lustre-discuss] Client is not accesible when OSS/OST server is down

On Tue, 2008-04-29 at 14:53 -0500, David_Kewley at Dell.com wrote:
> Perhaps the machine is not hung completely, but is only unable to
> support logins.

I had assumed OP was including "root" logins in their attempts and
already established shell sessions.

> There can be an issue if the login process attempts to access Lustre,
> e.g. because the home directory is on Lustre, or perhaps when a
> directory on Lustre is early in your $PATH.

Indeed, you are right David, for non-root users.

> I'm sure there are details there that a Lustre expert could fill in;
> maybe there are some fail-soft mechanisms that are designed to
> prevent hangs by returning appropriate error codes.

Well, that is failout vs. failover.  But you have to choose one.  There
is no way Lustre could try to determine which read()/write()s should
"fail soft" vs. block waiting for a recovery.

> So this may be more
> an issue of the login mechanisms being unable to recover when attempts
> to access an expected file or directory give some particular I/O error.

Indeed.  I should not have ruled out that the OP had determined this was
or was not the case.
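
The failout-versus-failover choice mentioned above is a per-OST setting. In failout mode, clients receive an I/O error when the OST is unreachable instead of blocking until recovery. A hedged sketch, assuming the `failover.mode=failout` parameter form described in the Lustre 1.6 manual applies to your version; the helper name and device path are placeholders, and the helper only prints the command so it can be reviewed before anything is changed:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the tunefs.lustre command that
# would switch one OST to failout mode. $1 is the OST block device.
print_failout_cmd() {
    ost_dev=$1
    printf 'tunefs.lustre --param failover.mode=failout %s\n' "$ost_dev"
}

# Assumed use on the OSS, once per OST device:
#   print_failout_cmd /dev/sdX
```

With failout, applications on the client see errors rather than hangs, so they need to cope with failed reads and writes.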

b.

Brian J. Murrell

2008-Apr-29 20:27 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

On Tue, 2008-04-29 at 09:05 -0700, Kilian CAVALOTTI wrote:
> Hi Brian,

Hi Kilian,

> I was precisely wondering if it was possible to use a file residing on a
> Lustre filesystem as a swap file.

There certainly has been some historic use and testing of swap on
Lustre.  I'm not sure what its current state is TBH.  The last bug in
bugzilla I can find about swap and Lustre is bug 5794/5749/5188 filed
over two years ago (however updated just recently) in reference to some
of the regression testing we do with swap on Lustre.

It seems the final outcome is that it has been flagged (just today in
fact) as a 1.6.6 blocker.  The timing is absolutely uncanny!

> # cd /scratch
> /scratch # grep /scratch /proc/mounts
> 10.10.50.2@o2ib:/scratch /scratch lustre rw 0 0
> /scratch # dd if=/dev/zero of=./swapfile  bs=1024 count=1024
> 10240+0 records in
> 10240+0 records out
> /scratch # mkswap ./swapfile
> Setting up swapspace version 1, size = 104853 kB
> /scratch # swapon -a ./swapfile
> swapon: ./swapfile: Invalid argument
>
> Is that expected?

I would think not, no.  There is some reference in bug 5188 that we only
test it for "real/remote OSTs" and "non-TCP" interconnects.  I wonder if
either of those match your situation.  You might file a bug about it,
including the usual data -- syslog from the client you did the swapon on
for example.  A (-1) debug log from the client right at the time of the
swapon would probably be useful.
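
A minimal sketch of capturing such a debug log around a single command, assuming the `lctl clear` and `lctl dk` (debug_kernel) subcommands are available on the client. The function name, the LCTL override and the file paths are illustrative, not part of Lustre:

```shell
#!/bin/sh
# Hypothetical helper: clear the Lustre kernel debug buffer, run one
# command, then dump the buffer to a file. Setting LCTL to another
# program (for example echo) gives a dry run.
capture_lustre_debug() {
    out=$1; shift
    lctl=${LCTL:-lctl}
    "$lctl" clear || return 1        # drop old debug entries
    status=0
    "$@" || status=$?                # the command under test may fail; keep going
    "$lctl" dk "$out" || return 1    # write the kernel debug buffer to $out
    return "$status"
}

# Assumed use on the client, as root, after enabling all debug flags
# (on 1.6-era clients: echo -1 > /proc/sys/lnet/debug):
#   capture_lustre_debug /tmp/swapon-debug.log swapon /scratch/swapfile
```

The function returns the exit status of the traced command, so a failing swapon is still reported as a failure after the log is written.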

b.

Robin Humble

2008-Apr-30 04:11 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

On Tue, Apr 29, 2008 at 04:27:42PM -0400, Brian J. Murrell wrote:
>On Tue, 2008-04-29 at 09:05 -0700, Kilian CAVALOTTI wrote:
>> I was precisely wondering if it was possible to use a file residing on a
>> Lustre filesystem as a swap file.
>
>There certainly has been some historic use and testing of swap on
>Lustre.  I'm not sure what it's current state is TBH.  The last bug in
>bugzilla I can find about swap and Lustre is bug 5794/5749/5188 filed
>over two years ago (however updated just recently) in reference to some
>of our regression testing we do with swap on Lustre.
>
>It seems the final outcome is that it has been flagged (just today in
>fact) as a 1.6.6 blocker.  The timing is absolutely uncanny!
>
>> # cd /scratch
>> /scratch # grep /scratch /proc/mounts
>> 10.10.50.2@o2ib:/scratch /scratch lustre rw 0 0
>> /scratch # dd if=/dev/zero of=./swapfile  bs=1024 count=1024
>> 10240+0 records in
>> 10240+0 records out
>> /scratch # mkswap ./swapfile
>> Setting up swapspace version 1, size = 104853 kB
>> /scratch # swapon -a ./swapfile
>> swapon: ./swapfile: Invalid argument
>>
>> Is that expected?
>
>I would think not, no.  There is some reference in bug 5188 that we only
>test it for "real/remote OSTs" and "non-TCP" interconnects.  I wonder if
>either of those match your situation.  You might file a bug about it,
>including the usual data -- syslog from the client you did the swapon on
>for example.  A (-1) debug log from the client right at the time of the
>swapon would probably be useful.

when doing the 'swapon' if you look in dmesg you should see:
  kernel: swapon: swapfile has holes
even though the file on Lustre isn't sparse.

I was looking into this last week, and I think this message is because
Lustre (at least up to 1.6.4.3) doesn't appear to implement bmap() ->
  lustre/llite/rw26.c:289:        .bmap           = NULL
and the swapfile code in RedHat kernels and 2.6.23 mm/swapfile.c
setup_swap_extents() expects bmap() to be present, and assumes the file
is sparse otherwise :-/
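
The "swapfile has holes" message can be cross-checked from userspace: a file with holes has fewer allocated bytes than its size implies. A minimal sketch, assuming GNU stat (`%s` size in bytes, `%b` allocated blocks, `%B` bytes per block). The function names are hypothetical, and block counts reported through a network filesystem client should be treated as a rough check only:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the allocated bytes (blocks * block size)
# are fewer than the file size, i.e. the file appears to have holes.
looks_sparse() {
    size=$1 blocks=$2 block_size=$3
    [ $(( blocks * block_size )) -lt "$size" ]
}

# Wrapper that reads the three numbers from GNU stat for a real file.
file_looks_sparse() {
    info=$(stat -c '%s %b %B' "$1") || return 2
    set -- $info
    looks_sparse "$1" "$2" "$3"
}

# Assumed use:
#   file_looks_sparse /scratch/swapfile && echo "file appears sparse"
```

If this reports a fully allocated file while swapon still complains about holes, that is consistent with the missing bmap() explanation above rather than with a genuinely sparse file.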

we too are very interested in swap on Lustre using o2ib, so I'm happy
to hear it'll be in 1.6.6.

cheers,
robin

Serge Ryabchun

2008-Apr-30 09:17 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

2008/4/30, Robin Humble <rjh+lustre at cita.utoronto.ca>:
>  we too are very interested in swap on Lustre using o2ib, so I'm happy
>  to hear it'll be in 1.6.6.

I have used swap via loop on Lustre over tcp and o2ib for more than 2 years.

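cat /etc/rc.d/rc.swap
#!/bin/bash

# Per-host swap file on the Lustre mount
SWAP=/mnt/adm/swap/${HOSTNAME}.swap

if [ ! -f "$SWAP" ]; then
   # No swap file yet: create a 4 GiB zero-filled (non-sparse) file
   dd if=/dev/zero of="$SWAP" bs=1M count=4096
else
   # File already exists: read its first 1 MiB
   dd if="$SWAP" of=/dev/null count=1 bs=1M
fi

# Attach the file to a loop device and swap on that device
losetup /dev/loop0 "$SWAP"
mkswap /dev/loop0 && swapon /dev/loop0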

-- 
Serge Ryabchun <serge.ryabchun at gmail.com>

Robin Humble

2008-May-01 04:22 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

Hi Serge,

On Wed, Apr 30, 2008 at 12:17:03PM +0300, Serge Ryabchun wrote:
>2008/4/30, Robin Humble <rjh+lustre at cita.utoronto.ca>:
>>  we too are very interested in swap on Lustre using o2ib, so I'm happy
>>  to hear it'll be in 1.6.6.
>I used swap with loop on lustre over tcp and o2ib more then 2 years.

thanks for that.

I tried that with o2ib last week and pretty much as soon as I use
any swap the node deadlocks hard in the low memory situation and has to
be reset.
possibly it's more to do with how loop.c uses aops and alloc's mem than
it is to do with Lustre, but either way, sadly, it's not a solution.

swap to iSER (backed by a file on Lustre on another server) has worked
and survived harsh testing for me in the past, as has swap over iSCSI
(TCP) when the clients have Peter Zijlstra's VM deadlock prevention
patches applied. presumably swap to SRP would work too. there are
software targets available for all of these protocols now, which is
nice.

swap directly or via loop to Lustre has never been successful for me.
possibly the loop approach might work if a variant of the Zijlstra
patches was applied to the loop code.
a working swap to native Lustre would be much better and faster though.

on a similar subject, loop on Lustre does work fine for making 'local'
filesystems on nodes, which can sustain local disk metadata rates
without any load on the Lustre MDS.
we use that here and it works well... +/- some VM problems in RedHat
kernels under low memory situations that forced us to move to 2.6.23
patchless Lustre clients.

cheers,
robin

>cat /etc/rc.d/rc.swap
>#!/bin/bash
>
>SWAP=/mnt/adm/swap/${HOSTNAME}.swap
>
>if [ ! -f $SWAP ]; then
>   dd if=/dev/zero of=$SWAP bs=1M count=4096
>else
>   dd if=$SWAP of=/dev/null count=1 bs=1M
>fi
>
>losetup /dev/loop0 $SWAP
>mkswap /dev/loop0 && swapon /dev/loop0
>
>-- 
>Serge Ryabchun <serge.ryabchun at gmail.com>

Serge Ryabchun

2008-May-01 06:22 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

Hi Robin,

2008/5/1, Robin Humble <rjh+lustre at cita.utoronto.ca>:
>  I tried that with o2ib last week and as pretty much as soon as I use
>  any swap the node deadlocks hard in the low memory situation and has to
>  be reset.
>  possibly it's more to do with how loop.c uses aops and alloc's mem than
>  it is to do with Lustre, but either way, sadly, it's not a solution.
>
>  swap to iSER (backed by a file on Lustre on another server) has worked
>  and survived harsh testing for me in the past, as has swap over iSCSI
>  (TCP) when the clients have Peter Zijlstra's VM deadlock prevention
>  patches applied. presumably swap to SRP would work too. there are now
>  software targets available for all of these protocols now which is
>  nice.

Hm, in one of our projects the X-terminals use swap on NFS with Peter's
patch. But in tests, loop on a file on NFS without the VM patch was very
stable. The X-terminal has 16MB of memory.

And yes, I use swap on Lustre to back tmpfs.

mount -t tmpfs -o size=64G tmpfs /tmp

-- 
Serge Ryabchun <serge.ryabchun at gmail.com>

Andreas Dilger

2008-May-02 18:25 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

On Apr 29, 2008  09:05 -0700, Kilian CAVALOTTI wrote:
> On Tuesday 29 April 2008 07:53:01 am Brian J. Murrell wrote:
> > Unless you are using Lustre for your root and/or usr filesystem
> > and/or for swap, Lustre should not hang a machine completely.
> 
> I was precisely wondering if it was possible to use a file residing on a 
> Lustre filesystem as a swap file. I tried the basic steps without any 
> success.
> 
> On a regular ext3 fs, no problem:
> 
> /tmp # dd if=/dev/zero of=./swapfile  bs=1024 count=1024
> 10240+0 records in
> 10240+0 records out
> /tmp # mkswap ./swapfile
> Setting up swapspace version 1, size = 104853 kB
> /tmp # swapon -a ./swapfile
> /tmp # swapon -s
> Filename                       Type            Size    Used    Priority
> /dev/sda3                      partition       4096564 204     -1
> /tmp/swapfile                  file            102392  0       -2
> 
> But on a Lustre mount:
> 
> # cd /scratch
> /scratch # grep /scratch /proc/mounts
> 10.10.50.2@o2ib:/scratch /scratch lustre rw 0 0
> /scratch # dd if=/dev/zero of=./swapfile  bs=1024 count=1024
> 10240+0 records in
> 10240+0 records out
> /scratch # mkswap ./swapfile
> Setting up swapspace version 1, size = 104853 kB
> /scratch # swapon -a ./swapfile
> swapon: ./swapfile: Invalid argument

Note that in 1.6.4+ there is an interface to export a block device
more directly from Lustre instead of using the loopback driver on top
of the client filesystem.  This is the "llite_loop" module and is
configured like:

	lctl blockdev_attach {loopback_filename} {blockdev_filename}

where {loopback_filename} is the file that should be turned into a
block device (it can be sparse if desired) and {blockdev_filename}
is the full filename that the new block device should be created at.
To clean up the device use:

	lctl blockdev_detach {blockdev_filename}

Note that using this block device for swap hasn't been very successful
in our testing so far, but we also haven't done a great deal of real
world testing, only "allocate a ton of RAM and dirty it all as fast
as possible", which isn't a very realistic usage.  Feedback would
be welcome.
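
A minimal sketch of how the attach and detach commands above might be combined with mkswap and swapon for such testing. The argument order (backing file first, then device node) follows the description above; the `run` wrapper, the function names, the DRY_RUN variable and the example paths are assumptions for illustration:

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Attach a Lustre file as a block device and enable swap on it.
# $1 = backing file on Lustre, $2 = block device node to create.
attach_lustre_swap() {
    run lctl blockdev_attach "$1" "$2" &&
    run mkswap "$2" &&
    run swapon "$2"
}

# Disable swap on the device, then detach it. $1 = block device node.
detach_lustre_swap() {
    run swapoff "$1" &&
    run lctl blockdev_detach "$1"
}

# Assumed use (print the commands first, then run them as root):
#   DRY_RUN=1 attach_lustre_swap /scratch/swap/node1.img /dev/lloop0
```

Given the caveat above about swap on this device, this is best tried on a node that can be reset without consequence.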

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Robin Humble

2008-May-09 16:57 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

On Fri, May 02, 2008 at 11:25:03AM -0700, Andreas Dilger wrote:
>On Apr 29, 2008  09:05 -0700, Kilian CAVALOTTI wrote:
>> /scratch # swapon -a ./swapfile
>> swapon: ./swapfile: Invalid argument
>
>Note that in 1.6.4+ there is an interface to export a block device
>more directly from Lustre instead of using the loopback driver on top
>of the client filesystem.  This is the "llite_loop" module and is
>configured like:
>
>	lctl blockdev_attach {loopback_filename} {blockdev_filename}
>
>where {loopback_filename} is the file that should be turned into a
>block device (it can be sparse if desired) and {blockdev_filename}
>is the full filename that the new block device should be created at.

cool! I was wondering what that module was for.
I'm trying to use it like:
  lctl blockdev_attach /dev/lloop0 /some/file/on/lustre
but then dd to /dev/lloop0 seems to go at ~kB/s whereas dd to the file
goes at >= 100MB/s. am I doing something wrong?

cheers,
robin

>To clean up the device use:
>
>	lctl blockdev_detach {blockdev_filename}
>
>Note that using this block device for swap hasn't been very successful
>in our testing so far, but we also haven't done a great deal of real
>world testing only "allocate a ton of RAM and dirty it all as fast
>as possible", which isn't a very realistic usage.  Feedback would
>be welcome.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.

Andreas Dilger

2008-May-09 19:08 UTC

[Lustre-discuss] Swap on Lustre (was: Client is not accesible when OSS/OST server is down)

On May 09, 2008  12:57 -0400, Robin Humble wrote:
> On Fri, May 02, 2008 at 11:25:03AM -0700, Andreas Dilger wrote:
> >Note that in 1.6.4+ there is an interface to export a block device
> >more directly from Lustre instead of using the loopback driver on top
> >of the client filesystem.  This is the "llite_loop" module and is
> >configured like:
> >
> >	lctl blockdev_attach {loopback_filename} {blockdev_filename}
> >
> >where {loopback_filename} is the file that should be turned into a
> >block device (it can be sparse if desired) and {blockdev_filename}
> >is the full filename that the new block device should be created at.
> 
> cool! I was wondering that that module was for.
> I'm trying to use it like:
>   lctl blockdev_attach /dev/lloop0 /some/file/on/lustre
> but then dd to /dev/lloop0 seems to go at ~kB/s wheras dd to the file
> goes at >= 100MB/s. am I doing something wrong?

Hmm, the code is virtually identical to the loop driver, so this is
a bit strange.  Could you grab the IO stats at the OST in
/proc/fs/lustre/obdfilter/{ostname}/brw_stats (use "lfs getstripe"
to determine which OST the file lives on) to see what kind of RPCs
are being sent to the server?

It might be that the lloop driver isn't setting up an elevator properly,
or is just sending the RPCs 4kB at a time...
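
A minimal sketch of locating the right brw_stats file, assuming the usual OST naming of filesystem name plus "-OST" and a four-digit hexadecimal index, and that `lfs getstripe` reports the OST index (obdidx) in decimal. The helper name and the "scratch" filesystem name are placeholders:

```shell
#!/bin/sh
# Hypothetical helper: build the brw_stats path for one OST.
# $1 = Lustre filesystem name, $2 = decimal OST index.
brw_stats_path() {
    printf '/proc/fs/lustre/obdfilter/%s-OST%04x/brw_stats\n' "$1" "$2"
}

# Assumed workflow:
#   lfs getstripe /some/file/on/lustre     # on the client: note the obdidx
#   cat "$(brw_stats_path scratch 3)"      # on the OSS serving that OST
```

Comparing the RPC size histogram before and after a dd to the lloop device would show whether the writes arrive as many small RPCs.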

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
