Hello. I'm new to Lustre. I'm writing because I changed some IPs and now I can't get everything mounted.

Four servers in total (I changed the IPs of just the OSS servers):

head - MGS/MDT (no IP change)
oss0 - OSS (changed IP)
oss1 - OSS (changed IP)
oss2 - OSS (changed IP)

brianjm was helpful on IRC and hinted that I needed to read the manual about changing a NID, so I followed procedure 4.3.12, "Changing a Server NID".

On all servers, I made sure everything was unmounted (umount /mnt/mdt and umount /mnt/lustre).

On head, I ran: tunefs.lustre --writeconf /dev/sdb1
On oss0, I ran: tunefs.lustre --writeconf /dev/sda5
On oss1, I ran: tunefs.lustre --writeconf /dev/sda5
On oss2, I ran: tunefs.lustre --writeconf /dev/sda5

I then mounted everything, the MDT on server head first.

On head: mount /mnt/mdt -- this didn't throw any errors; I checked the logs and saw no errors there either.

On oss0: mount /mnt/lustre

mount.lustre: mount jupiter@tcp0:/mylustre at /mnt/lustre failed: No medium found
This filesystem needs at least 1 OST

Error log on oss0:

Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 151-5: There are no OST's in this filesystem. There must be at least one active OST for a client to start.
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 4136:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 4136:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 4136:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-123)

The same error appears on the other OSS servers. Any help would be greatly appreciated.

-Brendon
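For reference, here is roughly the order of operations I was aiming for per 4.3.12, with the device paths above; the OST mount point /mnt/ost0 is just a placeholder for wherever each OST normally gets mounted:

    # 1. Unmount everything: clients first, then OSTs, then the MDT
    umount /mnt/lustre     # on every client
    umount /mnt/ost0       # on each OSS (placeholder mount point)
    umount /mnt/mdt        # on head

    # 2. Regenerate the configuration logs so the new NIDs get picked up
    tunefs.lustre --writeconf /dev/sdb1   # on head (MGS/MDT)
    tunefs.lustre --writeconf /dev/sda5   # on each OSS

    # 3. Remount in order: MGS/MDT first, then each OST, then the clients
    mount -t lustre /dev/sdb1 /mnt/mdt                  # on head
    mount -t lustre /dev/sda5 /mnt/ost0                 # on each OSS
    mount -t lustre jupiter@tcp0:/mylustre /mnt/lustre  # on each client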
Since the last post, I realized the fstab didn't have an entry for the OST to mount. It's not clear to me how this was working before, because I don't recall seeing an OST mounted when running "mount" before. Anyway, continuing on...

On oss0 I ran: mount -t lustre /dev/sda5 /mnt/ost0

Logs on oss0 say:

Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError: 4458:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003: denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID): 3 clients in recovery for 300s
Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError: 4458:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-16) req@ffff81021e84da00 x72/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1294785313 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:33:58 compute-oss-0-0 kernel: LustreError: 4459:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003: denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID): 3 clients in recovery for 275s
...
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003: denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID): 3 clients in recovery for 74s
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:806:target_handle_connect()) Skipped 1 previous similar message
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-16) req@ffff81021e310800 x91/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1294785538 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous similar message
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError: 0:0:(ldlm_lib.c:1105:target_recovery_expired()) mylustre-OST0003: recovery timed out, aborting
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError: 4471:0:(genops.c:1024:class_disconnect_stale_exports()) mylustre-OST0003: disconnecting 3 stale clients

Logs on the head server:

Jan 11 14:33:33 jupiter kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.2@tcp. The ost_connect operation failed with -16
Jan 11 14:34:23 jupiter last message repeated 2 times
Jan 11 14:35:38 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter kernel: LustreError: Skipped 1 previous similar message

Any help is much appreciated.
-Brendon
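A minimal sketch of the sort of fstab entry that was missing (the _netdev option is an assumption on my part so the mount waits for networking; device and mount point as above):

    # /etc/fstab on oss0 (oss1/oss2 analogous, with their own OST devices)
    /dev/sda5   /mnt/ost0   lustre   defaults,_netdev   0 0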
I continued mounting the OSTs on n1 and n2. I received the same errors when mounting them as on n0.

I then tried mounting the Lustre filesystem as a client. That worked, but touching a file on it caused it to hang and spit out the errors below on server head. After about a minute the server became unresponsive to ping. I'm guessing it has oopsed.

I googled the ost_connect error -16, but haven't found anything relevant yet that appears useful. I'm going to take a break. I've been working this one all day... time for a late lunch.

Any insight is much appreciated.
-Brendon

Jan 11 14:54:19 jupiter kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.3@tcp. The ost_connect operation failed with -16
Jan 11 14:54:19 jupiter kernel: LustreError: Skipped 5 previous similar messages
Jan 11 14:57:14 jupiter kernel: LustreError: 7694:0:(osc_create.c:348:osc_create()) mylustre-OST0002-osc: oscc recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError: 7693:0:(osc_create.c:348:osc_create()) mylustre-OST0001-osc: oscc recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError: 7694:0:(lov_obd.c:1074:lov_clear_orphans()) error in orphan recovery on OST idx 2/4: rc = -110
Jan 11 14:57:14 jupiter kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.4@tcp. The ost_connect operation failed with -16
Jan 11 14:57:14 jupiter kernel: LustreError: Skipped 9 previous similar messages
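For what it's worth, -16 is EBUSY, and the OSS logs in my previous message say the targets are still in recovery, so one thing worth watching is the recovery status on each target. A rough way to do that (proc paths as on our 1.6 install; they may differ on other versions):

    # On each OSS: recovery state of the local OSTs
    cat /proc/fs/lustre/obdfilter/*/recovery_status
    # On head: recovery state of the MDT
    cat /proc/fs/lustre/mds/*/recovery_status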
Hi Brendon,

Can you please provide the following:

1) output of ifconfig run on each OSS, the MDS, and at least one client
2) output of lctl list_nids run on each OSS, the MDS, and at least one client
3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from each OSS

Wojciech

On 11 January 2011 23:07, Brendon <b@brendon.com> wrote:
> I continued mounting the OSTs on n1 and n2. I received the same errors
> when mounting them as on n0.
>
> I then tried mounting the Lustre filesystem as a client. That worked, but
> touching a file on it caused it to hang and spit out the errors below on
> server head.
> [...]

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
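If it is easier, something like this collects all three in one pass. The host names are placeholders for your MDS, OSSes, and one client, and it assumes passwordless ssh and that /dev/sda5 is the OST device on every OSS:

    # Placeholders: head = MDS/MGS, oss0..oss2 = OSSes, client1 = a client
    for h in head oss0 oss1 oss2 client1; do
        echo "===== $h ====="
        ssh "$h" 'ifconfig; lctl list_nids'
    done

    # OST configuration, OSSes only
    for h in oss0 oss1 oss2; do
        echo "===== $h ====="
        ssh "$h" 'tunefs.lustre --print --dryrun /dev/sda5'
    done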
On Tue, Jan 11, 2011 at 3:35 PM, Wojciech Turek <wjt27@cam.ac.uk> wrote:
> Hi Brendon,
>
> Can you please provide the following:
> 1) output of ifconfig run on each OSS, the MDS, and at least one client
> 2) output of lctl list_nids run on each OSS, the MDS, and at least one client
> 3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from each OSS
>
> Wojciech

After someone looked at the emails I sent out, they grabbed me on IRC. We had a discussion, and basically they read the emails as saying everything should be working and that I just needed to wait for recovery to run to completion. What I then learned is that, first, a client has to connect before recovery will start, and second, the code isn't perfect: the MDS kernel oopsed twice before recovery finally completed. I was in the process of disabling panic-on-oops when it finished successfully. Once that was done, I got a clean bill of health.

Just to complete this discussion, I have listed the requested output below. I might still learn something :)

...Looks like I did learn something. OSS0 has an issue with its root filesystem and was remounted read-only, which I discovered when running tunefs.lustre --print --dryrun /dev/sda5 (more on how I plan to deal with that after the output below).

The fun never ends :)
-Brendon

1) ifconfig info

MDS: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:15:17:5E:46:64
          inet addr:10.1.1.1  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe5e:4664/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:49140546 errors:0 dropped:0 overruns:0 frame:0
          TX packets:63644404 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18963170801 (17.6 GiB)  TX bytes:65261762295 (60.7 GiB)
          Base address:0xcc00 Memory:f58e0000-f5900000

eth1      Link encap:Ethernet  HWaddr 00:15:17:5E:46:65
          inet addr:192.168.0.181  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe5e:4665/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:236738842 errors:0 dropped:0 overruns:0 frame:0
          TX packets:458503163 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:15562858193 (14.4 GiB)  TX bytes:686167422947 (639.0 GiB)
          Base address:0xc880 Memory:f5880000-f58a0000

OSS: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1D:60:E0:5B:B2
          inet addr:10.1.1.2  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:60ff:fee0:5bb2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3092588 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3547204 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1320521551 (1.2 GiB)  TX bytes:2670089148 (2.4 GiB)
          Interrupt:233

client: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1E:8C:39:E4:69
          inet addr:10.1.1.5  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21e:8cff:fe39:e469/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:727922 errors:0 dropped:0 overruns:0 frame:0
          TX packets:884188 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:433349006 (413.2 MiB)  TX bytes:231985578 (221.2 MiB)
          Interrupt:50

2) lctl list_nids

client: lctl list_nids
10.1.1.5@tcp

MDS: lctl list_nids
10.1.1.1@tcp

OSS: lctl list_nids
10.1.1.2@tcp

3) tunefs.lustre --print --dryrun /dev/sda5

OSS0: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
tunefs.lustre: Can't create temporary directory /tmp/dirCZXt3k: Read-only file system

tunefs.lustre FATAL: Failed to read previous Lustre data from /dev/sda5 (30)
tunefs.lustre: exiting with 30 (Read-only file system)

OSS1: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     mylustre-OST0001
Index:      1
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

   Permanent disk data:
Target:     mylustre-OST0001
Index:      1
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

exiting before disk write.

OSS2: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     mylustre-OST0002
Index:      2
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

   Permanent disk data:
Target:     mylustre-OST0002
Index:      2
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

exiting before disk write.
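For completeness, here is roughly how I plan to deal with the two loose ends above: the read-only root on oss0 and panic-on-oops on the MDS. This is only a sketch; the right fix for the root filesystem depends on whatever error flipped it read-only, and I am assuming a stock ext3 root here:

    # oss0: confirm / is read-only and find out why
    mount | grep ' / '                            # look for "ro" in the options
    dmesg | grep -i -e 'read-only' -e 'EXT3-fs error'

    # If the underlying problem looks resolved, try remounting read-write;
    # otherwise force an fsck of / on the next boot
    mount -o remount,rw /
    touch /forcefsck && reboot

    # MDS: keep the box up through an oops instead of panicking
    sysctl -w kernel.panic_on_oops=0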
Hi Brendon,

So it looks like your Lustre was just stuck in recovery after all. It is a bit concerning that you had kernel panics on the MDS during recovery. Which Lustre version are you using? Do you have stack traces from the kernel panics?

Wojciech

On 13 January 2011 17:41, Brendon <b@brendon.com> wrote:
> After someone looked at the emails I sent out, they grabbed me on IRC.
> We had a discussion, and basically they read the emails as saying
> everything should be working and that I just needed to wait for recovery
> to run to completion.
> [...]
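If you do not have a serial console, netconsole is a cheap way to catch the next oops over the network. A rough sketch, with 10.1.1.5 standing in for any box on the same LAN that can run a listener (double-check the module parameters against Documentation/networking/netconsole.txt for your kernel, and note that nc option syntax varies between netcat builds):

    # On the MDS: stream kernel messages over UDP to 10.1.1.5:6666
    # format: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
    modprobe netconsole netconsole=@/eth0,6666@10.1.1.5/

    # On 10.1.1.5: capture whatever arrives
    nc -u -l -p 6666 | tee mds-console.log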
Kernel panics were due to bugs in the Lustre version you are running. If you want to avoid this sort of trouble in the future and make your filesystem stable, then you should upgrade to 1.8.5.

On 13 January 2011 18:04, Brendon <b@brendon.com> wrote:
> Wojciech-
>
> Before this, I did read a bit about Lustre, but not much. Just some
> high-level stuff. It was definitely a "crash course".
>
> It looks like version 1.6.5. I don't have stack traces. The kernel
> panicked each time and I don't have a console server.
>
> # uname -a
> Linux jupiter.nanostellar.com 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1
> SMP Wed Jun 18 19:45:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> -Brendon
>
> On Thu, Jan 13, 2011 at 10:01 AM, Wojciech Turek <wjt27@cam.ac.uk> wrote:
> > Hi Brendon,
> >
> > So it looks like your Lustre was just stuck in recovery after all.
> > It is a bit concerning that you had kernel panics on the MDS during
> > recovery. Which Lustre version are you using? Do you have stack
> > traces from the kernel panics?
> > [...]
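As a quick sanity check before and after the upgrade, you can confirm which Lustre version is actually loaded on each node (the /proc path is where 1.6/1.8 expose it; it moved in later releases):

    cat /proc/fs/lustre/version
    lctl lustre_build_version    # if your lctl build has this subcommand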