we are (happily) using read-only root-on-Lustre in production with
oneSIS, but have noticed something odd...

if a root-on-Lustre client node has been up for more than 10 or 12 hours
then it survives an MDS failure/failover/reboot event(*), but if the
client is newly rebooted and has been up for less than this time, then
it doesn't successfully reconnect after an MDS event and the node is
~dead.

by trial and error I've also found that if I rsync /lib64, /bin, and
/sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
and symlink the rest of the dirs to Lustre, then the node sails through
MDS events. leaving out any one of the dirs/steps leads to a dead node.
so it looks like the Lustre kernel's recovery process is somehow tied
to userspace via apps in /bin and /sbin?

I can reproduce the weird 10-12hr behaviour at will by changing the
clock on nodes in a toy Lustre test setup. ie.
 - servers and client all have the correct time
 - reboot client node
 - stop ntpd everywhere
 - use 'date --set ...' to set all clocks to be X hours in the future
 - cause an MDS event(*)
 - wait for recovery to complete
 - if X <= ~10 to 12 then the client will be dead

it's no big deal to put those 3 dirs into ramdisk as they're really
small (and the part-on-ramdisk model is nice and flexible too), so
we'll probably move to running this way anyway, but I'm still curious
as to why a kernel-only system like Lustre a) cares about userspace at
all during recovery, and b) why it has a 10-12hr timescale :-)

changing the contents of /proc/sys/lnet/upcall to some path stat'able
without Lustre being up doesn't change anything.

BTW, OSS reboot/failover is handled fine with root on Lustre, as are
regular (non-root-on-Lustre) clients - this behaviour seems to be
limited to MDS/MGS failure when all/almost-all of the OS is on Lustre.

our setup is patchless 1.6.4.3 clients, 1.6.6 servers, rhel5.2/5.3
x86_64, but the behaviour seems the same with much newer Lustre too,
eg. patched b_release_1_8_0.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

(*) umount mdt and mgs, lustre_rmmod, wait 10 mins, mount them again
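for the archives, the clock-skew reproducer above can be sketched as a
script. everything here is illustrative - hostnames, service names, and
the mdt/mgs device and mount paths are placeholders for whatever a given
test cluster uses, and the MDS event is the same umount/rmmod/remount
sequence described in (*):

```shell
#!/bin/sh
# Sketch of the reproduction recipe; all names/paths are placeholders.
MDS=fox3 CLIENT=fox1 SKEW_HOURS=6

# 1. start with synchronised clocks, then reboot the client so it is "young"
ssh "$CLIENT" reboot && sleep 300

# 2. stop ntpd everywhere and jump all clocks X hours into the future
for h in "$MDS" "$CLIENT"; do
    ssh "$h" "service ntpd stop; date --set '+${SKEW_HOURS} hours'"
done

# 3. cause an MDS event: umount mdt+mgs, unload modules, wait, remount
ssh "$MDS" "umount /mnt/mdt /mnt/mgs && lustre_rmmod; sleep 600; \
    mount -t lustre /dev/md0 /mnt/mgs && mount -t lustre /dev/md1 /mnt/mdt"

# 4. watch recovery; with SKEW_HOURS <= ~10-12 the client never comes back
ssh "$MDS" cat /proc/fs/lustre/mds/*-MDT0000/recovery_status
```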
On Wed, Apr 29, 2009 at 10:39:20AM -0400, Robin Humble wrote:
> we are (happily) using read-only root-on-Lustre in production with
> oneSIS, but have noticed something odd...
>
> if a root-on-Lustre client node has been up for more than 10 or 12 hours
> then it survives an MDS failure/failover/reboot event(*), but if the
> client is newly rebooted and has been up for less than this time, then
> it doesn't successfully reconnect after an MDS event and the node is
> ~dead.
>
> by trial and error I've also found that if I rsync /lib64, /bin, and
> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
> and symlink the rest of the dirs to Lustre, then the node sails through
> MDS events. leaving out any one of the dirs/steps leads to a dead node.
> so it looks like the Lustre kernel's recovery process is somehow tied
> to userspace via apps in /bin and /sbin?

Now that's interesting... What distro are you using? I have been toying
with the idea of modifying the Debian initramfs-tools boot ramdisk to
include busybox and dropbear-ssh in order to debug these kinds of
root-network-filesystem bugs. In my case, I'm running AFS as the root
filesystem, and I have 'afsd' in the ramdisk, started at boot. I'm
wondering if the necessary Lustre binaries could be placed in the
initrd as well. It would be nice if various distros could work 'out of
the box' with read-only network filesystems.
On Apr 29, 2009 10:39 -0400, Robin Humble wrote:
> we are (happily) using read-only root-on-Lustre in production with
> oneSIS, but have noticed something odd...
>
> if a root-on-Lustre client node has been up for more than 10 or 12 hours
> then it survives an MDS failure/failover/reboot event(*), but if the
> client is newly rebooted and has been up for less than this time, then
> it doesn't successfully reconnect after an MDS event and the node is
> ~dead.
>
> by trial and error I've also found that if I rsync /lib64, /bin, and
> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
> and symlink the rest of the dirs to Lustre, then the node sails through
> MDS events. leaving out any one of the dirs/steps leads to a dead node.
> so it looks like the Lustre kernel's recovery process is somehow tied
> to userspace via apps in /bin and /sbin?
>
> I can reproduce the weird 10-12hr behaviour at will by changing the
> clock on nodes in a toy Lustre test setup. ie.
>  - servers and client all have the correct time
>  - reboot client node
>  - stop ntpd everywhere
>  - use 'date --set ...' to set all clocks to be X hours in the future
>  - cause an MDS event(*)
>  - wait for recovery to complete
>  - if X <= ~10 to 12 then the client will be dead

This shouldn't really happen. We of course test failover with client
uptimes a lot less than 10-12h without problems, though not with root
filesystems on Lustre. Providing any MDS console messages that are
unique to a failing short-lived client vs. a long-lived client might
point us in the right direction.

One of the few things that is time-dependent on the client is the
DLM lock LRU list. Idle locks will expire from the client cache
over time. You can force a flush of the client's MDS lock cache with:

# check how many metadata locks the client currently has
client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size

client# echo clear > /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size

The MGC shouldn't be the culprit since it only holds a single lock
that never expires.

> it's no big deal to put those 3 dirs into ramdisk as they're really
> small (and the part-on-ramdisk model is nice and flexible too), so
> we'll probably move to running this way anyway, but I'm still curious
> as to why a kernel-only system like Lustre a) cares about userspace at
> all during recovery, and b) why it has a 10-12hr timescale :-)

It would be good to know the root cause of this problem, as it may
expose a defect in another part of the code.

> changing the contents of /proc/sys/lnet/upcall to some path stat'able
> without Lustre being up doesn't change anything.

There are no longer any upcalls needed on the client for recovery, and
having the upcall inside Lustre when Lustre itself is not accessible is
always a bad idea.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Wed, Apr 29, 2009 at 02:48:34PM -0600, Andreas Dilger wrote:
>On Apr 29, 2009 10:39 -0400, Robin Humble wrote:
>> we are (happily) using read-only root-on-Lustre in production with
>> oneSIS, but have noticed something odd...
>>
>> if a root-on-Lustre client node has been up for more than 10 or 12 hours
>> then it survives an MDS failure/failover/reboot event(*), but if the
>> client is newly rebooted and has been up for less than this time, then
>> it doesn't successfully reconnect after an MDS event and the node is
>> ~dead.
>>
>> by trial and error I've also found that if I rsync /lib64, /bin, and
>> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
>> and symlink the rest of the dirs to Lustre, then the node sails through
>> MDS events. leaving out any one of the dirs/steps leads to a dead node.
>> so it looks like the Lustre kernel's recovery process is somehow tied
>> to userspace via apps in /bin and /sbin?
>>
>> I can reproduce the weird 10-12hr behaviour at will by changing the
>> clock on nodes in a toy Lustre test setup. ie.
>>  - servers and client all have the correct time
>>  - reboot client node
>>  - stop ntpd everywhere
>>  - use 'date --set ...' to set all clocks to be X hours in the future
>>  - cause an MDS event(*)
>>  - wait for recovery to complete
>>  - if X <= ~10 to 12 then the client will be dead
>
>This shouldn't really happen. We of course test failover with client
>uptimes a lot less than 10-12h without problems, though not with root
>filesystems on Lustre. Providing any MDS console messages that are
>unique to a failing short-lived client vs. a long-lived client might
>point us in the right direction.

sadly, there's not much there... our consoles lose characters over
IPMI, so the below are from syslog.

fox3 (.213) is the mds/mgs, fox4,6 (.214,.216) are oss's, and fox1
(.211) is the client. mds/oss's are 1.6.6, the client is 1.6.4.3.
success (setting clocks forward):

Apr 30 22:58:31 fox3 kernel: Lustre: MGS MGS started
Apr 30 22:58:31 fox3 kernel: Lustre: Server MGS on device /dev/md0 has started
...
Apr 30 22:58:31 fox3 kernel: Lustre: Enabling user_xattr
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service test-MDT0000, 1 recoverable clients, last_transno 180390470883
Apr 30 22:58:31 fox3 kernel: Lustre: MDT test-MDT0000 now serving test-MDT0000_UUID (test-MDT0000/50ae1644-01d5-bc1a-4e65-954d53f99b50), but will be in recovery for at least 5:00, or until 1 client reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/test-MDT0000/recovery_status.
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) test-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 30 22:58:31 fox3 kernel: Lustre: test-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0000_UUID
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0001_UUID
Apr 30 22:58:31 fox3 kernel: Lustre: Server test-MDT0000 on device /dev/md1 has started
Apr 30 22:58:32 fox6 kernel: Lustre: 11544:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0x18a8172888292c35 to 0xfdb564f8289dd54
Apr 30 22:58:32 fox6 kernel: but is still in recovery
Apr 30 22:58:32 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:32 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:38 fox4 kernel: Lustre: 11583:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0x18a8172888292c0b to 0xfdb564f8289dd69
Apr 30 22:58:38 fox4 kernel: but is still in recovery
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox1 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:47 fox1 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox3 kernel: Lustre: 8690:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) test-MDT0000: starting recovery timer
Apr 30 22:58:47 fox1 kernel: LustreError: 724:0:(client.c:1750:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff81025ec35800 x1960/t180390441407 o101->test-MDT0000_UUID@10.8.30.213@o2ib:12 lens 440/768 ref 2 fl Complete:RP/4/0 rc 301/301
Apr 30 22:58:47 fox1 kernel: LustreError: 724:0:(client.c:1750:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff81025dd4c400 x5132/t180390443405 o101->test-MDT0000_UUID@10.8.30.213@o2ib:12 lens 440/768 ref 2 fl Complete:RP/4/0 rc 301/301
Apr 30 22:58:47 fox3 kernel: Lustre: test-MDT0000: sending delayed replies to recovered clients
Apr 30 22:58:47 fox3 kernel: Lustre: test-MDT0000: recovery complete: rc 0
Apr 30 22:58:47 fox1 kernel: Lustre: test-MDT0000-mdc-ffff81025a2b7800: Connection restored to service test-MDT0000 using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox6 kernel: Lustre: test-OST0000: received MDS connection from 10.8.30.213@o2ib
Apr 30 22:58:47 fox3 kernel: Lustre: MDS test-MDT0000: test-OST0000_UUID now active, resetting orphans
Apr 30 22:58:47 fox4 kernel: Lustre: test-OST0001: received MDS connection from 10.8.30.213@o2ib
Apr 30 22:58:47 fox3 kernel: Lustre: MDS test-MDT0000: test-OST0001_UUID now active, resetting orphans
Apr 30 23:03:07 fox3 kernel: Lustre: MGS: haven't heard from client 83d64a6f-bf4f-6d1f-231f-a3e0b08b9ab7 (at 10.8.30.211@o2ib) in 243 seconds. I think it's dead, and I am evicting it.

where the last line above confuses me - that looks like something
failed, but the node is completely happy :-/

# cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1241136707
recovery_duration: 0
completed_clients: 1/1
replayed_requests: 0
last_transno: 180390470883

fail (node recently booted):

Apr 30 23:40:43 fox3 kernel: Lustre: MGS MGS started
Apr 30 23:40:43 fox3 kernel: Lustre: Server MGS on device /dev/md0 has started
...
Apr 30 23:40:44 fox3 kernel: Lustre: Enabling user_xattr
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service test-MDT0000, 1 recoverable clients, last_transno 180390483549
Apr 30 23:40:44 fox3 kernel: Lustre: MDT test-MDT0000 now serving test-MDT0000_UUID (test-MDT0000/50ae1644-01d5-bc1a-4e65-954d53f99b50), but will be in recovery for at least 5:00, or until 1 client reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/test-MDT0000/recovery_status.
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) test-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 30 23:40:44 fox3 kernel: Lustre: test-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0000_UUID
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0001_UUID
Apr 30 23:40:44 fox3 kernel: Lustre: Server test-MDT0000 on device /dev/md1 has started
Apr 30 23:40:52 fox3 kernel: Lustre: 10671:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) test-MDT0000: starting recovery timer
Apr 30 23:41:02 fox6 kernel: Lustre: 11544:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0xfdb564f8289dd54 to 0xc63759a4fedd9a8
Apr 30 23:41:02 fox6 kernel: but is still in recovery
Apr 30 23:41:02 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 23:41:02 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
...

# cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1241098852
time_remaining: 278
connected_clients: 1/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 180390483550

...
Apr 30 23:42:23 fox4 kernel: Lustre: 11583:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0xfdb564f8289dd69 to 0xc63759a4fedd9bd
Apr 30 23:42:23 fox4 kernel: but is still in recovery
Apr 30 23:42:23 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 23:42:23 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 23:45:52 fox3 kernel: LustreError: 0:0:(ldlm_lib.c:1161:target_recovery_expired()) test-MDT0000: recovery timed out, aborting

# cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1241098852
time_remaining: 0
connected_clients: 1/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 180390483550

and 'RECOVERING' doesn't stop anytime soon in this case. only when I
reset the node (or maybe wait 10+ hours for something to time out and
evict) does it go to COMPLETE. I'm not sure, but I think with the newer
Lustre versions that I tried, recovery does complete in a timely manner
and go COMPLETE, but the client is still dead.

>One of the few things that is time-dependent on the client is the
>DLM lock LRU list. Idle locks will expire from the client cache
>over time. You can force a flush of the client's MDS lock cache with:

cool - thanks - makes sense.

># check how many metadata locks the client currently has
>client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
>
>client# echo clear > /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
>
>The MGC shouldn't be the culprit since it only holds a single lock
>that never expires.

darn, doesn't seem to help :-/
I also did a clear into all /proc/fs/lustre/ldlm/namespaces/*/lru_size,
and that didn't seem to change anything.

cheers,
robin

>> it's no big deal to put those 3 dirs into ramdisk as they're really
>> small (and the part-on-ramdisk model is nice and flexible too), so
>> we'll probably move to running this way anyway, but I'm still curious
>> as to why a kernel-only system like Lustre a) cares about userspace at
>> all during recovery, and b) why it has a 10-12hr timescale :-)
>
>It would be good to know the root cause of this problem, as it may
>expose a defect in another part of the code.
>
>> changing the contents of /proc/sys/lnet/upcall to some path stat'able
>> without Lustre being up doesn't change anything.
>
>There are no longer any upcalls needed on the client for recovery, and
>having the upcall inside Lustre when Lustre itself is not accessible is
>always a bad idea.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
>
>_______________________________________________
>Lustre-discuss mailing list
>Lustre-discuss@lists.lustre.org
>http://lists.lustre.org/mailman/listinfo/lustre-discuss
On Wed, Apr 29, 2009 at 10:42:44AM -0500, Troy Benjegerdes wrote:
>On Wed, Apr 29, 2009 at 10:39:20AM -0400, Robin Humble wrote:
>> we are (happily) using read-only root-on-Lustre in production with
>> oneSIS, but have noticed something odd...
>>
>> if a root-on-Lustre client node has been up for more than 10 or 12 hours
>> then it survives an MDS failure/failover/reboot event(*), but if the
>> client is newly rebooted and has been up for less than this time, then
>> it doesn't successfully reconnect after an MDS event and the node is
>> ~dead.
>>
>> by trial and error I've also found that if I rsync /lib64, /bin, and
>> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
>> and symlink the rest of the dirs to Lustre, then the node sails through
>> MDS events. leaving out any one of the dirs/steps leads to a dead node.
>> so it looks like the Lustre kernel's recovery process is somehow tied
>> to userspace via apps in /bin and /sbin?
>
>Now that's interesting... What distro are you using? I have been toying
>with the idea of modifying the Debian initramfs-tools boot ramdisk to
>include busybox and dropbear-ssh in order to debug these kinds of
>root-network-filesystem bugs.

yeah, putting an ssh server into the initramfs is certainly possible.
I've mostly used IPMI Serial-over-LAN and lots of echo's and occasional
dropping into /bin/ash to debug problems.

>In my case, I'm running AFS as the root filesystem, and I have 'afsd'
>in the ramdisk, started at boot. I'm wondering if the necessary Lustre
>binaries could be placed in the initrd as well.

cool.

I'm mostly working with CentOS 5.2 and 5.3, with a oneSIS initramfs as
a starting point.
  http://onesis.sourceforge.net/

for pure root-on-Lustre, and with a recent kernel that accepts a huge
initramfs (RHEL/CentOS kernels are too old), the minimum initramfs
requirements would probably just be a 64bit busybox build with
/sbin/mount.lustre and piles of IB and Lustre kernel modules. the /init
script can then be altered slightly to mount a Lustre fs and then bind
mount the OS image sub-directory to the right place before you
switch_root to it, and after that it's just like the normal oneSIS NFS
read-only root... nothing particularly tricky.

for those older kernels I found the IB modules would fit but the Lustre
modules were too large for the initramfs to handle (I thought the bad
old days of initrd size limitations were over?!), so I needed to rsync
the correct /lib/modules/`uname -r`/ tree, or just specific modules,
into the ramfs before I could fire up Lustre. hence rsync needed to be
in the initramfs. once rsync is there, things get pretty flexible, and
hybrid approaches with some/all of the OS in ramdisk (or on local disk)
and some on Lustre become pretty easy to play with :-)

I followed the oneSIS approach and pass a bunch of possible boot
variants to the /init script via /proc/cmdline, so a single initramfs
can be pointed at different OS root images on different Lustre fs's, do
different bind mounts, or be told to install various parts of the OS
onto different media. in production for OSS's and MDS's we use an
all-on-ramdisk Lustre-free (for obvious reasons) variant, and we will
probably migrate our current pure-Lustre-root compute nodes to the
hybrid model soon. hopefully I'll tidy/generalise the code and push
some of this back to oneSIS at some stage.

the key changes I made from the basic oneSIS initramfs are probably:
 - compile up a 64bit busybox (big filesystems didn't seem to work with
   32bit busybox IIRC) against glibc (not uClibc), as glibc is needed
   for rsync anyway. I just used the oneSIS "busybox bbconfig" config
   'cos I don't know much about busybox.
 - get ssh and rsync running in the initramfs. I put the cluster's
   usual ssh in there rather than dropbear as I needed it working
   without a passwd. an rsync server to boot from would also be
   possible, and then maybe ssh wouldn't be needed in the initramfs.
   quite a few shared libs are needed to get ssh and rsync working.
 - put mount.lustre into the initramfs
 - include IB and (if they fit) Lustre modules in the initramfs
 - start editing /init to mount, rsync, bind mount, ... things to where
   you want them to be.

>It would be nice if various distros could work 'out of the box' with
>readonly network filesystems.

definitely.

sadly the hybrid approach (which will probably always need quite a bit
of tweaking) ultimately might be the best way forward, as it's good to
have the option of unloading some commonly used libs and dirs from
Lustre and having them in local ram or on local SSD/disk/USB, etc. - a
bit more scalable. having said that, we haven't noticed any scalability
problems with 150+ clients yet, except a little load on the MDS when
all nodes execute the same command at once (cexec, pdsh etc.).

BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
OSS read cache should help things like root-on-Lustre because small
commonly used files will likely be cached on the OSS's and won't result
in disk accesses.

cheers,
robin
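the mount/bind-mount/switch_root dance in /init is roughly the
following. this is only a sketch - the NID, fs name, image path, and
mount points are placeholders, and the real oneSIS /init does a lot
more (cmdline parsing, the rsync of modules, error handling):

```shell
#!/bin/sh
# Sketch of a oneSIS-style initramfs /init for root-on-Lustre.
# All names here (NID, fsname, image path) are illustrative only.

# load IB + Lustre modules (rsync'ed into the ramfs first if they
# were too big to ship inside the initramfs itself)
modprobe lustre

# mount the Lustre filesystem that holds the OS images
mkdir -p /mnt/lustre /newroot
mount -t lustre 10.8.30.213@o2ib:/test /mnt/lustre

# bind mount the read-only OS image sub-directory over the new root...
mount --bind /mnt/lustre/images/centos5 /newroot

# ...then hand over to the real init, as oneSIS does for NFS root
exec switch_root /newroot /sbin/init
```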
On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote:
> BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
> OSS read cache should help things like root-on-Lustre because small
> commonly used files will likely be cached on the OSS's and won't result
> in disk accesses.

Yes, imagine what the ROSS cache can do for 150 clients all booting
(and executing the same scripts/binaries) at the same time. Imagine
what the OSS disk did/does before the cache. :-)

Certainly, I am not without bias, but the feature set of 1.8 looks
compelling enough to make me want to upgrade my own little "dogfood"
cluster here to 1.8. :-)

b.
On Thu, Apr 30, 2009 at 12:51:00PM -0400, Brian J. Murrell wrote:
>On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote:
>> BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
>> OSS read cache should help things like root-on-Lustre because small
>> commonly used files will likely be cached on the OSS's and won't result
>> in disk accesses.
>
>Yes, imagine what the ROSS cache can do for 150 clients all booting
>(and executing the same scripts/binaries) at the same time. Imagine
>what the OSS disk did/does before the cache. :-)

hopefully most of the frequently used parts of the OS are in page cache
on the clients after the first read or two, but if new parts are
accessed (or if everything boots at once) then yes, the OSS read cache
should definitely help lots.

currently the only load we notice from root-on-Lustre is on the MDS,
but I can't say we've been actively monitoring and categorising all the
traffic - we really haven't felt the need, because there haven't been
slowdowns to speak of - and that's a good thing :)

actually, just thinking about it, it'd be good if you could tell Lustre
(llite) to be lazy about re-stat'ing files in what is mostly an
un-changing read-only image. is it possible to do this?

>Certainly, I am not without bias, but the feature set of 1.8 looks
>compelling enough to make me want to upgrade my own little "dogfood"
>cluster here to 1.8. :-)

yes, the features are shiny :)

cheers,
robin
FYI we have been testing FS-Cache from Red Hat (landed in 2.6.30) -
http://people.redhat.com/~dhowells/fscache. A common read-only NFS root
is cached by the clients to the local disk. Not quite a "diskless"
system, but a good hybrid between booting entirely from the network and
booting from the local drive. You can make the NFS lookups a little
more lazy with the mount options "actimeo=7200,nocto" (say). It has
been working well so far, but more testing is required before we go
into production. Combined with UnionFS for writes, we can network boot
desktop machines without users noticing the difference from a local HD
install.

I've never tried oneSIS, but we use the "buildroot" initrd which uses
uClibc and busybox - quite easy to port apps to it
(http://buildroot.uclibc.org).

Daire

----- "Brian J. Murrell" <Brian.Murrell@Sun.COM> wrote:
> On Thu, 2009-04-30 at 11:48 -0400, Robin Humble wrote:
> > BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
> > OSS read cache should help things like root-on-Lustre because small
> > commonly used files will likely be cached on the OSS's and won't result
> > in disk accesses.
>
> Yes, imagine what the ROSS cache can do for 150 clients all booting
> (and executing the same scripts/binaries) at the same time. Imagine
> what the OSS disk did/does before the cache. :-)
>
> Certainly, I am not without bias, but the feature set of 1.8 looks
> compelling enough to make me want to upgrade my own little "dogfood"
> cluster here to 1.8. :-)
>
> b.
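for reference, the lazy FS-Cache-backed NFS root mount Daire describes
might look like the fstab line below. the server name and paths are
placeholders, and cachefilesd must be running on the client for the
"fsc" option to actually cache to local disk:

```shell
# /etc/fstab entry for a read-only, FS-Cache-backed NFS root (sketch).
# "fsc" enables the local disk cache; "nocto" plus a long actimeo make
# attribute/lookup revalidation lazy for a rarely-changing root image.
nfsserver:/images/rootfs  /  nfs  ro,fsc,nocto,actimeo=7200  0 0
```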
On May 01, 2009 16:33 +0100, Daire Byrne wrote:
> FYI we have been testing FS-Cache from Red Hat (landed in 2.6.30) -
> http://people.redhat.com/~dhowells/fscache. A common read-only NFS root
> is cached by the clients to the local disk. Not quite a "diskless"
> system, but a good hybrid between booting entirely from the network and
> booting from the local drive. You can make the NFS lookups a little
> more lazy with the mount options "actimeo=7200,nocto" (say). It has
> been working well so far, but more testing is required before we go
> into production. Combined with UnionFS for writes, we can network boot
> desktop machines without users noticing the difference from a local HD
> install.

There was an external project to add fscache support to Lustre (it was
for use over a WAN, but it would be equally valuable for many diskless
clients). It would be nice to hear whether that project is still
underway and what its status is. Is anyone involved reading this?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.