Hello. I'm new to Lustre. I'm writing because I changed some IPs and now I can't get everything mounted.

Four servers in total (I changed the IPs of just the OSS servers):

head - MGS/MDT (no IP change)
oss0 - OSS (changed IP)
oss1 - OSS (changed IP)
oss2 - OSS (changed IP)

brianjm was helpful on IRC and hinted that I needed to read the manual about changing a NID, so I followed procedure 4.3.12, "Changing a Server NID".

On all servers, I made sure everything was unmounted (umount /mnt/mdt and umount /mnt/lustre).

On head, I ran: tunefs.lustre --writeconf /dev/sdb1
On oss0, I ran: tunefs.lustre --writeconf /dev/sda5
On oss1, I ran: tunefs.lustre --writeconf /dev/sda5
On oss2, I ran: tunefs.lustre --writeconf /dev/sda5

I then mounted everything, the MDT on server head first.

On head: mount /mnt/mdt -- this didn't throw any errors; I checked the logs and saw no errors there either.

On oss0: mount /mnt/lustre

mount.lustre: mount jupiter@tcp0:/mylustre at /mnt/lustre failed: No medium found
This filesystem needs at least 1 OST

Error log on oss0:

Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 151-5: There are no OST's in this filesystem. There must be at least one active OST for a client to start.
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 4136:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 4136:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jan 11 14:12:25 compute-oss-0-0 kernel: LustreError: 4136:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-123)

The same error appears on the other OSS servers. Any help would be greatly appreciated.

-Brendon
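For reference, here is roughly the order of operations I was aiming for per 4.3.12, with the device paths above; the OST mount point /mnt/ost0 is just a placeholder for wherever each OST normally gets mounted:

    # 1. Unmount everything: clients first, then OSTs, then the MDT
    umount /mnt/lustre     # on every client
    umount /mnt/ost0       # on each OSS (placeholder mount point)
    umount /mnt/mdt        # on head

    # 2. Regenerate the configuration logs so the new NIDs get picked up
    tunefs.lustre --writeconf /dev/sdb1   # on head (MGS/MDT)
    tunefs.lustre --writeconf /dev/sda5   # on each OSS

    # 3. Remount in order: MGS/MDT first, then each OST, then the clients
    mount -t lustre /dev/sdb1 /mnt/mdt                  # on head
    mount -t lustre /dev/sda5 /mnt/ost0                 # on each OSS
    mount -t lustre jupiter@tcp0:/mylustre /mnt/lustre  # on each client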
Since the last post, I realized the fstab didn't have an entry for the OST to mount. It's not clear to me how this was working before, because I don't recall seeing an OST mounted when running "mount" before. Anyway, continuing on...

On oss0 I ran: mount -t lustre /dev/sda5 /mnt/ost0

Logs on oss0 say:

Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError: 4458:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003: denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID): 3 clients in recovery for 300s
Jan 11 14:33:33 compute-oss-0-0 kernel: LustreError: 4458:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-16) req@ffff81021e84da00 x72/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1294785313 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:33:58 compute-oss-0-0 kernel: LustreError: 4459:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003: denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID): 3 clients in recovery for 275s
...
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:806:target_handle_connect()) mylustre-OST0003: denying connection for new client 10.1.1.1@tcp (mylustre-mdtlov_UUID): 3 clients in recovery for 74s
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:806:target_handle_connect()) Skipped 1 previous similar message
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-16) req@ffff81021e310800 x91/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1294785538 ref 1 fl Interpret:/0/0 rc -16/0
Jan 11 14:37:18 compute-oss-0-0 kernel: LustreError: 4468:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 1 previous similar message
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError: 0:0:(ldlm_lib.c:1105:target_recovery_expired()) mylustre-OST0003: recovery timed out, aborting
Jan 11 14:38:33 compute-oss-0-0 kernel: LustreError: 4471:0:(genops.c:1024:class_disconnect_stale_exports()) mylustre-OST0003: disconnecting 3 stale clients

Logs on the head server:

Jan 11 14:33:33 jupiter kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.2@tcp. The ost_connect operation failed with -16
Jan 11 14:34:23 jupiter last message repeated 2 times
Jan 11 14:35:38 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter last message repeated 3 times
Jan 11 14:37:18 jupiter kernel: LustreError: Skipped 1 previous similar message

Any help is much appreciated.
-Brendon
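A minimal sketch of the sort of fstab entry that was missing (the _netdev option is an assumption on my part so the mount waits for networking; device and mount point as above):

    # /etc/fstab on oss0 (oss1/oss2 analogous, with their own OST devices)
    /dev/sda5   /mnt/ost0   lustre   defaults,_netdev   0 0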
I continued mounting the OSTs on n1 and n2. I received the same errors when mounting them as on n0.

I then tried mounting the Lustre filesystem as a client. That worked, but touching a file on it caused it to hang and spit out the errors below on server head. After about a minute the server became unresponsive to ping. I'm guessing it has oopsed.

I googled the ost_connect error -16, but haven't found anything relevant yet that appears useful. I'm going to take a break. I've been working this one all day... time for a late lunch.

Any insight is much appreciated.
-Brendon

Jan 11 14:54:19 jupiter kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.3@tcp. The ost_connect operation failed with -16
Jan 11 14:54:19 jupiter kernel: LustreError: Skipped 5 previous similar messages
Jan 11 14:57:14 jupiter kernel: LustreError: 7694:0:(osc_create.c:348:osc_create()) mylustre-OST0002-osc: oscc recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError: 7693:0:(osc_create.c:348:osc_create()) mylustre-OST0001-osc: oscc recovery failed: -110
Jan 11 14:57:14 jupiter kernel: LustreError: 7694:0:(lov_obd.c:1074:lov_clear_orphans()) error in orphan recovery on OST idx 2/4: rc = -110
Jan 11 14:57:14 jupiter kernel: LustreError: 11-0: an error occurred while communicating with 10.1.1.4@tcp. The ost_connect operation failed with -16
Jan 11 14:57:14 jupiter kernel: LustreError: Skipped 9 previous similar messages
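For what it's worth, -16 is EBUSY, and the OSS logs in my previous message say the targets are still in recovery, so one thing worth watching is the recovery status on each target. A rough way to do that (proc paths as on our 1.6 install; they may differ on other versions):

    # On each OSS: recovery state of the local OSTs
    cat /proc/fs/lustre/obdfilter/*/recovery_status
    # On head: recovery state of the MDT
    cat /proc/fs/lustre/mds/*/recovery_status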
Hi Brendon,

Can you please provide the following:

1) output of ifconfig run on each OSS, the MDS, and at least one client
2) output of lctl list_nids run on each OSS, the MDS, and at least one client
3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from each OSS

Wojciech

On 11 January 2011 23:07, Brendon <b@brendon.com> wrote:
> I continued mounting the OSTs on n1 and n2. I received the same errors
> when mounting them as on n0.
>
> I then tried mounting the Lustre filesystem as a client. That worked, but
> touching a file on it caused it to hang and spit out the errors below on
> server head.
> [...]

--
Wojciech Turek
Senior System Architect
High Performance Computing Service
University of Cambridge
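If it is easier, something like this collects all three in one pass. The host names are placeholders for your MDS, OSSes, and one client, and it assumes passwordless ssh and that /dev/sda5 is the OST device on every OSS:

    # Placeholders: head = MDS/MGS, oss0..oss2 = OSSes, client1 = a client
    for h in head oss0 oss1 oss2 client1; do
        echo "===== $h ====="
        ssh "$h" 'ifconfig; lctl list_nids'
    done

    # OST configuration, OSSes only
    for h in oss0 oss1 oss2; do
        echo "===== $h ====="
        ssh "$h" 'tunefs.lustre --print --dryrun /dev/sda5'
    done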
On Tue, Jan 11, 2011 at 3:35 PM, Wojciech Turek <wjt27@cam.ac.uk> wrote:
> Hi Brendon,
>
> Can you please provide the following:
> 1) output of ifconfig run on each OSS, the MDS, and at least one client
> 2) output of lctl list_nids run on each OSS, the MDS, and at least one client
> 3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from each OSS
>
> Wojciech

After someone looked at the emails I sent out, they grabbed me on IRC. We had a discussion, and basically they read the emails as saying everything should be working and that I just needed to wait for recovery to run to completion. What I then learned is that, first, a client has to connect before recovery will start, and second, the code isn't perfect: the MDS kernel oopsed twice before recovery finally completed. I was in the process of disabling panic-on-oops when it finished successfully. Once that was done, I got a clean bill of health.

Just to complete this discussion, I have listed the requested output below. I might still learn something :)

...Looks like I did learn something. OSS0 has an issue with its root filesystem and was remounted read-only, which I discovered when running tunefs.lustre --print --dryrun /dev/sda5 (more on how I plan to deal with that after the output below).

The fun never ends :)
-Brendon

1) ifconfig info

MDS: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:15:17:5E:46:64
          inet addr:10.1.1.1  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe5e:4664/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:49140546 errors:0 dropped:0 overruns:0 frame:0
          TX packets:63644404 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18963170801 (17.6 GiB)  TX bytes:65261762295 (60.7 GiB)
          Base address:0xcc00 Memory:f58e0000-f5900000

eth1      Link encap:Ethernet  HWaddr 00:15:17:5E:46:65
          inet addr:192.168.0.181  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe5e:4665/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:236738842 errors:0 dropped:0 overruns:0 frame:0
          TX packets:458503163 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:15562858193 (14.4 GiB)  TX bytes:686167422947 (639.0 GiB)
          Base address:0xc880 Memory:f5880000-f58a0000

OSS: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1D:60:E0:5B:B2
          inet addr:10.1.1.2  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:60ff:fee0:5bb2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3092588 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3547204 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1320521551 (1.2 GiB)  TX bytes:2670089148 (2.4 GiB)
          Interrupt:233

client: # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1E:8C:39:E4:69
          inet addr:10.1.1.5  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21e:8cff:fe39:e469/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:727922 errors:0 dropped:0 overruns:0 frame:0
          TX packets:884188 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:433349006 (413.2 MiB)  TX bytes:231985578 (221.2 MiB)
          Interrupt:50

2) lctl list_nids

client: lctl list_nids
10.1.1.5@tcp

MDS: lctl list_nids
10.1.1.1@tcp

OSS: lctl list_nids
10.1.1.2@tcp

3) tunefs.lustre --print --dryrun /dev/sda5

OSS0: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
tunefs.lustre: Can't create temporary directory /tmp/dirCZXt3k: Read-only file system

tunefs.lustre FATAL: Failed to read previous Lustre data from /dev/sda5 (30)
tunefs.lustre: exiting with 30 (Read-only file system)

OSS1: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     mylustre-OST0001
Index:      1
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

   Permanent disk data:
Target:     mylustre-OST0001
Index:      1
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

exiting before disk write.

OSS2: # tunefs.lustre --print --dryrun /dev/sda5
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     mylustre-OST0002
Index:      2
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

   Permanent disk data:
Target:     mylustre-OST0002
Index:      2
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.1.1.1@tcp

exiting before disk write.
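For completeness, here is roughly how I plan to deal with the two loose ends above: the read-only root on oss0 and panic-on-oops on the MDS. This is only a sketch; the right fix for the root filesystem depends on whatever error flipped it read-only, and I am assuming a stock ext3 root here:

    # oss0: confirm / is read-only and find out why
    mount | grep ' / '                            # look for "ro" in the options
    dmesg | grep -i -e 'read-only' -e 'EXT3-fs error'

    # If the underlying problem looks resolved, try remounting read-write;
    # otherwise force an fsck of / on the next boot
    mount -o remount,rw /
    touch /forcefsck && reboot

    # MDS: keep the box up through an oops instead of panicking
    sysctl -w kernel.panic_on_oops=0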
Hi Brendon,

So it looks like your Lustre was just stuck in recovery after all. It is a bit concerning that you had kernel panics on the MDS during recovery. Which Lustre version are you using? Do you have stack traces from the kernel panics?

Wojciech

On 13 January 2011 17:41, Brendon <b@brendon.com> wrote:
> After someone looked at the emails I sent out, they grabbed me on IRC.
> We had a discussion, and basically they read the emails as saying
> everything should be working and that I just needed to wait for recovery
> to run to completion.
> [...]
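If you do not have a serial console, netconsole is a cheap way to catch the next oops over the network. A rough sketch, with 10.1.1.5 standing in for any box on the same LAN that can run a listener (double-check the module parameters against Documentation/networking/netconsole.txt for your kernel, and note that nc option syntax varies between netcat builds):

    # On the MDS: stream kernel messages over UDP to 10.1.1.5:6666
    # format: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
    modprobe netconsole netconsole=@/eth0,6666@10.1.1.5/

    # On 10.1.1.5: capture whatever arrives
    nc -u -l -p 6666 | tee mds-console.log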
Kernel panics were due to bugs in the Lustre version you are running. If you want to avoid this sort of trouble in the future and make your filesystem stable, then you should upgrade to 1.8.5.

On 13 January 2011 18:04, Brendon <b@brendon.com> wrote:
> Wojciech-
>
> Before this, I did read a bit about Lustre, but not much. Just some
> high-level stuff. It was definitely a "crash course".
>
> It looks like version 1.6.5. I don't have stack traces. The kernel
> panicked each time and I don't have a console server.
>
> # uname -a
> Linux jupiter.nanostellar.com 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1
> SMP Wed Jun 18 19:45:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> -Brendon
>
> On Thu, Jan 13, 2011 at 10:01 AM, Wojciech Turek <wjt27@cam.ac.uk> wrote:
> > Hi Brendon,
> >
> > So it looks like your Lustre was just stuck in recovery after all.
> > It is a bit concerning that you had kernel panics on the MDS during
> > recovery. Which Lustre version are you using? Do you have stack
> > traces from the kernel panics?
> > [...]
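As a quick sanity check before and after the upgrade, you can confirm which Lustre version is actually loaded on each node (the /proc path is where 1.6/1.8 expose it; it moved in later releases):

    cat /proc/fs/lustre/version
    lctl lustre_build_version    # if your lctl build has this subcommand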