we are (happily) using read-only root-on-Lustre in production with
oneSIS, but have noticed something odd...

if a root-on-Lustre client node has been up for more than 10 or 12 hours
then it survives an MDS failure/failover/reboot event(*), but if the
client is newly rebooted and has been up for less than this time, then
it doesn't successfully reconnect after an MDS event and the node is
~dead.

by trial and error I've also found that if I rsync /lib64, /bin, and
/sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
and symlink the rest of the dirs to Lustre, then the node sails through
MDS events. leaving out any one of the dirs/steps leads to a dead node.
so it looks like the Lustre kernel's recovery process is somehow tied
to userspace via apps in /bin and /sbin?

I can reproduce the weird 10-12hr behaviour at will by changing the
clock on nodes in a toy Lustre test setup. ie.
 - servers and client all have the correct time
 - reboot client node
 - stop ntpd everywhere
 - use 'date --set ...' to set all clocks to be X hours in the future
 - cause an MDS event(*)
 - wait for recovery to complete
 - if X <= ~10 to 12 then the client will be dead

it's no big deal to put those 3 dirs into ramdisk as they're really
small (and the part-on-ramdisk model is nice and flexible too), so
we'll probably move to running this way anyway, but I'm still curious
as to why a kernel-only system like Lustre a) cares about userspace at
all during recovery, and b) why it has a 10-12hr timescale :-)

changing the contents of /proc/sys/lnet/upcall to some path stat'able
without Lustre being up doesn't change anything.

BTW, OSS reboot/failover is handled fine with root on Lustre, as are
regular (non-root-on-Lustre) clients - this behaviour seems to be
limited to MDS/MGS failure when all/almost-all of the OS is on Lustre.

our setup is patchless 1.6.4.3 clients, 1.6.6 servers, rhel5.2/5.3
x86_64, but the behaviour seems the same with much newer Lustre too,
eg. patched b_release_1_8_0.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

(*) umount mdt and mgs, lustre_rmmod, wait 10 mins, mount them again
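for the archives, the clock-skew reproducer above can be sketched as a
script. everything here is illustrative - hostnames, service names, and
the mdt/mgs device and mount paths are placeholders for whatever a given
test cluster uses, and the MDS event is the same umount/rmmod/remount
sequence described in (*):

```shell
#!/bin/sh
# Sketch of the reproduction recipe; all names/paths are placeholders.
MDS=fox3 CLIENT=fox1 SKEW_HOURS=6

# 1. start with synchronised clocks, then reboot the client so it is "young"
ssh "$CLIENT" reboot && sleep 300

# 2. stop ntpd everywhere and jump all clocks X hours into the future
for h in "$MDS" "$CLIENT"; do
    ssh "$h" "service ntpd stop; date --set '+${SKEW_HOURS} hours'"
done

# 3. cause an MDS event: umount mdt+mgs, unload modules, wait, remount
ssh "$MDS" "umount /mnt/mdt /mnt/mgs && lustre_rmmod; sleep 600; \
    mount -t lustre /dev/md0 /mnt/mgs && mount -t lustre /dev/md1 /mnt/mdt"

# 4. watch recovery; with SKEW_HOURS <= ~10-12 the client never comes back
ssh "$MDS" cat /proc/fs/lustre/mds/*-MDT0000/recovery_status
```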
On Wed, Apr 29, 2009 at 10:39:20AM -0400, Robin Humble wrote:
> we are (happily) using read-only root-on-Lustre in production with
> oneSIS, but have noticed something odd...
>
> if a root-on-Lustre client node has been up for more than 10 or 12 hours
> then it survives an MDS failure/failover/reboot event(*), but if the
> client is newly rebooted and has been up for less than this time, then
> it doesn't successfully reconnect after an MDS event and the node is
> ~dead.
>
> by trial and error I've also found that if I rsync /lib64, /bin, and
> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
> and symlink the rest of the dirs to Lustre, then the node sails through
> MDS events. leaving out any one of the dirs/steps leads to a dead node.
> so it looks like the Lustre kernel's recovery process is somehow tied
> to userspace via apps in /bin and /sbin?

Now that's interesting... What distro are you using? I have been toying
with the idea of modifying the Debian initramfs-tools boot ramdisk to
include busybox and dropbear-ssh in order to debug these kinds of
root-network-filesystem bugs. In my case, I'm running AFS as the root
filesystem, and I have 'afsd' in the ramdisk, started at boot. I'm
wondering if the necessary Lustre binaries could be placed in the
initrd as well. It would be nice if various distros could work 'out of
the box' with read-only network filesystems.
On Apr 29, 2009 10:39 -0400, Robin Humble wrote:
> we are (happily) using read-only root-on-Lustre in production with
> oneSIS, but have noticed something odd...
>
> if a root-on-Lustre client node has been up for more than 10 or 12 hours
> then it survives an MDS failure/failover/reboot event(*), but if the
> client is newly rebooted and has been up for less than this time, then
> it doesn't successfully reconnect after an MDS event and the node is
> ~dead.
>
> by trial and error I've also found that if I rsync /lib64, /bin, and
> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
> and symlink the rest of the dirs to Lustre, then the node sails through
> MDS events. leaving out any one of the dirs/steps leads to a dead node.
> so it looks like the Lustre kernel's recovery process is somehow tied
> to userspace via apps in /bin and /sbin?
>
> I can reproduce the weird 10-12hr behaviour at will by changing the
> clock on nodes in a toy Lustre test setup. ie.
>  - servers and client all have the correct time
>  - reboot client node
>  - stop ntpd everywhere
>  - use 'date --set ...' to set all clocks to be X hours in the future
>  - cause an MDS event(*)
>  - wait for recovery to complete
>  - if X <= ~10 to 12 then the client will be dead

This shouldn't really happen. We of course test failover with client
uptimes a lot less than 10-12h without problems, though not with root
filesystems on Lustre. Providing any MDS console messages that are
unique to a failing short-lived client vs. a long-lived client might
point us in the right direction.

One of the few things that is time-dependent on the client is the
DLM lock LRU list. Idle locks will expire from the client cache
over time. You can force a flush of the client's MDS lock cache with:

# check how many metadata locks the client currently has
client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size

client# echo clear > /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size

The MGC shouldn't be the culprit since it only holds a single lock
that never expires.

> it's no big deal to put those 3 dirs into ramdisk as they're really
> small (and the part-on-ramdisk model is nice and flexible too), so
> we'll probably move to running this way anyway, but I'm still curious
> as to why a kernel-only system like Lustre a) cares about userspace at
> all during recovery, and b) why it has a 10-12hr timescale :-)

It would be good to know the root cause of this problem, as it may
expose a defect in another part of the code.

> changing the contents of /proc/sys/lnet/upcall to some path stat'able
> without Lustre being up doesn't change anything.

There are no longer any upcalls needed on the client for recovery, and
having the upcall inside Lustre when Lustre itself is not accessible is
always a bad idea.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Wed, Apr 29, 2009 at 02:48:34PM -0600, Andreas Dilger wrote:
>On Apr 29, 2009 10:39 -0400, Robin Humble wrote:
>> we are (happily) using read-only root-on-Lustre in production with
>> oneSIS, but have noticed something odd...
>>
>> if a root-on-Lustre client node has been up for more than 10 or 12 hours
>> then it survives an MDS failure/failover/reboot event(*), but if the
>> client is newly rebooted and has been up for less than this time, then
>> it doesn't successfully reconnect after an MDS event and the node is
>> ~dead.
>>
>> by trial and error I've also found that if I rsync /lib64, /bin, and
>> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
>> and symlink the rest of the dirs to Lustre, then the node sails through
>> MDS events. leaving out any one of the dirs/steps leads to a dead node.
>> so it looks like the Lustre kernel's recovery process is somehow tied
>> to userspace via apps in /bin and /sbin?
>>
>> I can reproduce the weird 10-12hr behaviour at will by changing the
>> clock on nodes in a toy Lustre test setup. ie.
>>  - servers and client all have the correct time
>>  - reboot client node
>>  - stop ntpd everywhere
>>  - use 'date --set ...' to set all clocks to be X hours in the future
>>  - cause an MDS event(*)
>>  - wait for recovery to complete
>>  - if X <= ~10 to 12 then the client will be dead
>
>This shouldn't really happen. We of course test failover with client
>uptimes a lot less than 10-12h without problems, though not with root
>filesystems on Lustre. Providing any MDS console messages that are
>unique to a failing short-lived client vs. a long-lived client might
>point us in the right direction.

sadly, there's not much there... our consoles lose characters over
IPMI, so the below are from syslog.

fox3 (.213) is the mds/mgs, fox4,6 (.214,.216) are oss's, and fox1
(.211) is the client. mds/oss's are 1.6.6, the client is 1.6.4.3.
success (setting clocks forward):

Apr 30 22:58:31 fox3 kernel: Lustre: MGS MGS started
Apr 30 22:58:31 fox3 kernel: Lustre: Server MGS on device /dev/md0 has started
...
Apr 30 22:58:31 fox3 kernel: Lustre: Enabling user_xattr
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service test-MDT0000, 1 recoverable clients, last_transno 180390470883
Apr 30 22:58:31 fox3 kernel: Lustre: MDT test-MDT0000 now serving test-MDT0000_UUID (test-MDT0000/50ae1644-01d5-bc1a-4e65-954d53f99b50), but will be in recovery for at least 5:00, or until 1 client reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/test-MDT0000/recovery_status.
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) test-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 30 22:58:31 fox3 kernel: Lustre: test-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0000_UUID
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0001_UUID
Apr 30 22:58:31 fox3 kernel: Lustre: Server test-MDT0000 on device /dev/md1 has started
Apr 30 22:58:32 fox6 kernel: Lustre: 11544:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0x18a8172888292c35 to 0xfdb564f8289dd54
Apr 30 22:58:32 fox6 kernel: but is still in recovery
Apr 30 22:58:32 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:32 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:38 fox4 kernel: Lustre: 11583:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0x18a8172888292c0b to 0xfdb564f8289dd69
Apr 30 22:58:38 fox4 kernel: but is still in recovery
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox1 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:47 fox1 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox3 kernel: Lustre: 8690:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) test-MDT0000: starting recovery timer
Apr 30 22:58:47 fox1 kernel: LustreError: 724:0:(client.c:1750:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff81025ec35800 x1960/t180390441407 o101->test-MDT0000_UUID@10.8.30.213@o2ib:12 lens 440/768 ref 2 fl Complete:RP/4/0 rc 301/301
Apr 30 22:58:47 fox1 kernel: LustreError: 724:0:(client.c:1750:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff81025dd4c400 x5132/t180390443405 o101->test-MDT0000_UUID@10.8.30.213@o2ib:12 lens 440/768 ref 2 fl Complete:RP/4/0 rc 301/301
Apr 30 22:58:47 fox3 kernel: Lustre: test-MDT0000: sending delayed replies to recovered clients
Apr 30 22:58:47 fox3 kernel: Lustre: test-MDT0000: recovery complete: rc 0
Apr 30 22:58:47 fox1 kernel: Lustre: test-MDT0000-mdc-ffff81025a2b7800: Connection restored to service test-MDT0000 using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox6 kernel: Lustre: test-OST0000: received MDS connection from 10.8.30.213@o2ib
Apr 30 22:58:47 fox3 kernel: Lustre: MDS test-MDT0000: test-OST0000_UUID now active, resetting orphans
Apr 30 22:58:47 fox4 kernel: Lustre: test-OST0001: received MDS connection from 10.8.30.213@o2ib
Apr 30 22:58:47 fox3 kernel: Lustre: MDS test-MDT0000: test-OST0001_UUID now active, resetting orphans
Apr 30 23:03:07 fox3 kernel: Lustre: MGS: haven't heard from client 83d64a6f-bf4f-6d1f-231f-a3e0b08b9ab7 (at 10.8.30.211@o2ib) in 243 seconds. I think it's dead, and I am evicting it.

where the last line above confuses me - that looks like something
failed, but the node is completely happy :-/

# cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1241136707
recovery_duration: 0
completed_clients: 1/1
replayed_requests: 0
last_transno: 180390470883

fail (node recently booted):

Apr 30 23:40:43 fox3 kernel: Lustre: MGS MGS started
Apr 30 23:40:43 fox3 kernel: Lustre: Server MGS on device /dev/md0 has started
...
Apr 30 23:40:44 fox3 kernel: Lustre: Enabling user_xattr
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service test-MDT0000, 1 recoverable clients, last_transno 180390483549
Apr 30 23:40:44 fox3 kernel: Lustre: MDT test-MDT0000 now serving test-MDT0000_UUID (test-MDT0000/50ae1644-01d5-bc1a-4e65-954d53f99b50), but will be in recovery for at least 5:00, or until 1 client reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/test-MDT0000/recovery_status.
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) test-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 30 23:40:44 fox3 kernel: Lustre: test-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0000_UUID
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0001_UUID
Apr 30 23:40:44 fox3 kernel: Lustre: Server test-MDT0000 on device /dev/md1 has started
Apr 30 23:40:52 fox3 kernel: Lustre: 10671:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) test-MDT0000: starting recovery timer
Apr 30 23:41:02 fox6 kernel: Lustre: 11544:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0xfdb564f8289dd54 to 0xc63759a4fedd9a8
Apr 30 23:41:02 fox6 kernel: but is still in recovery
Apr 30 23:41:02 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 23:41:02 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
...

# cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1241098852
time_remaining: 278
connected_clients: 1/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 180390483550

...
Apr 30 23:42:23 fox4 kernel: Lustre: 11583:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0xfdb564f8289dd69 to 0xc63759a4fedd9bd
Apr 30 23:42:23 fox4 kernel: but is still in recovery
Apr 30 23:42:23 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 23:42:23 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 23:45:52 fox3 kernel: LustreError: 0:0:(ldlm_lib.c:1161:target_recovery_expired()) test-MDT0000: recovery timed out, aborting

# cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1241098852
time_remaining: 0
connected_clients: 1/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 180390483550

and 'RECOVERING' doesn't stop anytime soon in this case. only when I
reset the node (or maybe wait 10+ hours for something to time out and
evict) does it go to COMPLETE. I'm not sure, but I think with the newer
Lustre versions that I tried, recovery does complete in a timely manner
and go COMPLETE, but the client is still dead.

>One of the few things that is time-dependent on the client is the
>DLM lock LRU list. Idle locks will expire from the client cache
>over time. You can force a flush of the client's MDS lock cache with:

cool - thanks - makes sense.

># check how many metadata locks the client currently has
>client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
>
>client# echo clear > /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
>
>The MGC shouldn't be the culprit since it only holds a single lock
>that never expires.

darn, doesn't seem to help :-/
I also did a clear into all /proc/fs/lustre/ldlm/namespaces/*/lru_size,
and that didn't seem to change anything.

cheers,
robin

>> it's no big deal to put those 3 dirs into ramdisk as they're really
>> small (and the part-on-ramdisk model is nice and flexible too), so
>> we'll probably move to running this way anyway, but I'm still curious
>> as to why a kernel-only system like Lustre a) cares about userspace at
>> all during recovery, and b) why it has a 10-12hr timescale :-)
>
>It would be good to know the root cause of this problem, as it may
>expose a defect in another part of the code.
>
>> changing the contents of /proc/sys/lnet/upcall to some path stat'able
>> without Lustre being up doesn't change anything.
>
>There are no longer any upcalls needed on the client for recovery, and
>having the upcall inside Lustre when Lustre itself is not accessible is
>always a bad idea.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
>
>_______________________________________________
>Lustre-discuss mailing list
>Lustre-discuss@lists.lustre.org
>http://lists.lustre.org/mailman/listinfo/lustre-discuss
On Wed, Apr 29, 2009 at 10:42:44AM -0500, Troy Benjegerdes wrote:
>On Wed, Apr 29, 2009 at 10:39:20AM -0400, Robin Humble wrote:
>> we are (happily) using read-only root-on-Lustre in production with
>> oneSIS, but have noticed something odd...
>>
>> if a root-on-Lustre client node has been up for more than 10 or 12 hours
>> then it survives an MDS failure/failover/reboot event(*), but if the
>> client is newly rebooted and has been up for less than this time, then
>> it doesn't successfully reconnect after an MDS event and the node is
>> ~dead.
>>
>> by trial and error I've also found that if I rsync /lib64, /bin, and
>> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
>> and symlink the rest of the dirs to Lustre, then the node sails through
>> MDS events. leaving out any one of the dirs/steps leads to a dead node.
>> so it looks like the Lustre kernel's recovery process is somehow tied
>> to userspace via apps in /bin and /sbin?
>
>Now that's interesting... What distro are you using? I have been toying
>with the idea of modifying the Debian initramfs-tools boot ramdisk to
>include busybox and dropbear-ssh in order to debug these kinds of
>root-network-filesystem bugs.

yeah, putting an ssh server into the initramfs is certainly possible.
I've mostly used IPMI Serial-over-LAN and lots of echo's and occasional
dropping into /bin/ash to debug problems.

>In my case, I'm running AFS as the root filesystem, and I have 'afsd'
>in the ramdisk, started at boot. I'm wondering if the necessary Lustre
>binaries could be placed in the initrd as well.

cool.

I'm mostly working with CentOS 5.2 and 5.3, with a oneSIS initramfs as
a starting point.
  http://onesis.sourceforge.net/

for pure root-on-Lustre, and with a recent kernel that accepts a huge
initramfs (RHEL/CentOS kernels are too old), the minimum initramfs
requirements would probably just be a 64bit busybox build with
/sbin/mount.lustre and piles of IB and Lustre kernel modules. the /init
script can then be altered slightly to mount a Lustre fs and then bind
mount the OS image sub-directory to the right place before you
switch_root to it, and after that it's just like the normal oneSIS NFS
read-only root... nothing particularly tricky.

for those older kernels I found the IB modules would fit but the Lustre
modules were too large for the initramfs to handle (I thought the bad
old days of initrd size limitations were over?!), so I needed to rsync
the correct /lib/modules/`uname -r`/ tree, or just specific modules,
into the ramfs before I could fire up Lustre. hence rsync needed to be
in the initramfs. once rsync is there, things get pretty flexible, and
hybrid approaches with some/all of the OS in ramdisk (or on local disk)
and some on Lustre become pretty easy to play with :-)

I followed the oneSIS approach and pass a bunch of possible boot
variants to the /init script via /proc/cmdline, so a single initramfs
can be pointed at different OS root images on different Lustre fs's, do
different bind mounts, or be told to install various parts of the OS
onto different media. in production for OSS's and MDS's we use an
all-on-ramdisk Lustre-free (for obvious reasons) variant, and we will
probably migrate our current pure-Lustre-root compute nodes to the
hybrid model soon. hopefully I'll tidy/generalise the code and push
some of this back to oneSIS at some stage.

the key changes I made from the basic oneSIS initramfs are probably:
 - compile up a 64bit busybox (big filesystems didn't seem to work with
   32bit busybox IIRC) against glibc (not uClibc), as glibc is needed
   for rsync anyway. I just used the oneSIS "busybox bbconfig" config
   'cos I don't know much about busybox.
 - get ssh and rsync running in the initramfs. I put the cluster's
   usual ssh in there rather than dropbear as I needed it working
   without a passwd. an rsync server to boot from would also be
   possible, and then maybe ssh wouldn't be needed in the initramfs.
   quite a few shared libs are needed to get ssh and rsync working.
 - put mount.lustre into the initramfs
 - include IB and (if they fit) Lustre modules in the initramfs
 - start editing /init to mount, rsync, bind mount, ... things to where
   you want them to be.

>It would be nice if various distros could work 'out of the box' with
>readonly network filesystems.

definitely.

sadly the hybrid approach (which will probably always need quite a bit
of tweaking) ultimately might be the best way forward, as it's good to
have the option of unloading some commonly used libs and dirs from
Lustre and having them in local ram or on local SSD/disk/USB, etc. - a
bit more scalable. having said that, we haven't noticed any scalability
problems with 150+ clients yet, except a little load on the MDS when
all nodes execute the same command at once (cexec, pdsh etc.).

BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
OSS read cache should help things like root-on-Lustre because small
commonly used files will likely be cached on the OSS's and won't result
in disk accesses.

cheers,
robin
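the mount/bind-mount/switch_root dance in /init is roughly the
following. this is only a sketch - the NID, fs name, image path, and
mount points are placeholders, and the real oneSIS /init does a lot
more (cmdline parsing, the rsync of modules, error handling):

```shell
#!/bin/sh
# Sketch of a oneSIS-style initramfs /init for root-on-Lustre.
# All names here (NID, fsname, image path) are illustrative only.

# load IB + Lustre modules (rsync'ed into the ramfs first if they
# were too big to ship inside the initramfs itself)
modprobe lustre

# mount the Lustre filesystem that holds the OS images
mkdir -p /mnt/lustre /newroot
mount -t lustre 10.8.30.213@o2ib:/test /mnt/lustre

# bind mount the read-only OS image sub-directory over the new root...
mount --bind /mnt/lustre/images/centos5 /newroot

# ...then hand over to the real init, as oneSIS does for NFS root
exec switch_root /newroot /sbin/init
```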
On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote:
> BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
> OSS read cache should help things like root-on-Lustre because small
> commonly used files will likely be cached on the OSS's and won't result
> in disk accesses.

Yes, imagine what the ROSS cache can do for 150 clients all booting
(and executing the same scripts/binaries) at the same time. Imagine
what the OSS disk did/does before the cache. :-)

Certainly, I am not without bias, but the feature set of 1.8 looks
compelling enough to make me want to upgrade my own little "dogfood"
cluster here to 1.8. :-)

b.
On Thu, Apr 30, 2009 at 12:51:00PM -0400, Brian J. Murrell wrote:
>On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote:
>> BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
>> OSS read cache should help things like root-on-Lustre because small
>> commonly used files will likely be cached on the OSS's and won't result
>> in disk accesses.
>
>Yes, imagine what the ROSS cache can do for 150 clients all booting
>(and executing the same scripts/binaries) at the same time. Imagine
>what the OSS disk did/does before the cache. :-)

hopefully most of the frequently used parts of the OS are in page cache
on the clients after the first read or two, but if new parts are
accessed (or if everything boots at once) then yes, the OSS read cache
should definitely help lots.

currently the only load we notice from root-on-Lustre is on the MDS,
but I can't say we've been actively monitoring and categorising all the
traffic - we really haven't felt the need, because there haven't been
slowdowns to speak of - and that's a good thing :)

actually, just thinking about it, it'd be good if you could tell Lustre
(llite) to be lazy about re-stat'ing files in what is mostly an
un-changing read-only image. is it possible to do this?

>Certainly, I am not without bias, but the feature set of 1.8 looks
>compelling enough to make me want to upgrade my own little "dogfood"
>cluster here to 1.8. :-)

yes, the features are shiny :)

cheers,
robin
FYI we have been testing FS-Cache from Red Hat (landed in 2.6.30) -
http://people.redhat.com/~dhowells/fscache. A common read-only NFS root
is cached by the clients to the local disk. Not quite a "diskless"
system, but a good hybrid between booting entirely from the network and
booting from the local drive. You can make the NFS lookups a little
more lazy with the mount options "actimeo=7200,nocto" (say). It has
been working well so far, but more testing is required before we go
into production. Combined with UnionFS for writes, we can network boot
desktop machines without users noticing the difference from a local HD
install.

I've never tried oneSIS, but we use the "buildroot" initrd which uses
uClibc and busybox - quite easy to port apps to it
(http://buildroot.uclibc.org).

Daire

----- "Brian J. Murrell" <Brian.Murrell@Sun.COM> wrote:
> On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote:
> > BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
> > OSS read cache should help things like root-on-Lustre because small
> > commonly used files will likely be cached on the OSS's and won't result
> > in disk accesses.
>
> Yes, imagine what the ROSS cache can do for 150 clients all booting
> (and executing the same scripts/binaries) at the same time. Imagine
> what the OSS disk did/does before the cache. :-)
>
> Certainly, I am not without bias, but the feature set of 1.8 looks
> compelling enough to make me want to upgrade my own little "dogfood"
> cluster here to 1.8. :-)
>
> b.
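for reference, the lazy FS-Cache-backed NFS root mount Daire describes
might look like the fstab line below. the server name and paths are
placeholders, and cachefilesd must be running on the client for the
"fsc" option to actually cache to local disk:

```shell
# /etc/fstab entry for a read-only, FS-Cache-backed NFS root (sketch).
# "fsc" enables the local disk cache; "nocto" plus a long actimeo make
# attribute/lookup revalidation lazy for a rarely-changing root image.
nfsserver:/images/rootfs  /  nfs  ro,fsc,nocto,actimeo=7200  0 0
```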
On May 01, 2009 16:33 +0100, Daire Byrne wrote:
> FYI we have been testing FS-Cache from Red Hat (landed in 2.6.30) -
> http://people.redhat.com/~dhowells/fscache. A common read-only NFS root
> is cached by the clients to the local disk. Not quite a "diskless"
> system, but a good hybrid between booting entirely from the network and
> booting from the local drive. You can make the NFS lookups a little
> more lazy with the mount options "actimeo=7200,nocto" (say). It has
> been working well so far, but more testing is required before we go
> into production. Combined with UnionFS for writes, we can network boot
> desktop machines without users noticing the difference from a local HD
> install.

There was an external project to add fscache support to Lustre (it was
for use over a WAN, but it would be equally valuable for many diskless
clients). It would be nice to hear whether that project is still
underway and what its status is. Is anyone involved reading this?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.