Nirmal Seenu
2010-Sep-09 16:56 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
I just upgraded my Lustre version from 1.8.1.1 to 1.8.4 and I can't reboot my Lustre clients cleanly anymore. I am using the latest RHEL kernel and the openibd that comes as part of that RHEL kernel, plus the patchless Lustre client installed from the tarball.

The Lustre client gets unmounted cleanly, but the system deadlocks once the openibd driver is removed. I had to modify the openibd stop script to include "umount lustre" and "lustre_rmmod" as a workaround.

The following is the error message that I get when I try to reboot the Lustre client:

Scientific Linux SLF release 5.3 (Lederman)
Kernel 2.6.18-194.11.1.el5 on an x86_64

INIT:Shutting down smartd: [ OK ]
Stopping atd: [ OK ]
Shutting down process accounting: [ OK ]
Stopping xinetd: [ OK ]
Stopping autofs: Stopping automount: [ OK ]
[ OK ]
Stopping acpi daemon: [ OK ]
Shutting down ntpd: [ OK ]
Unmounting network block filesystems: LustreError: 3697:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 3697:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff81020f145400 umount complete
[ OK ]
Unmounting NFS filesystems: [ OK ]
Stopping system message bus: [ OK ]
Stopping RPC idmapd: [ OK ]
Stopping NFS locking: [ OK ]
Stopping NFS statd: [ OK ]
Stopping portmap: [ OK ]
Stopping PC/SC smart card daemon (pcscd): [ OK ]
Shutting down kernel logger: [ OK ]
Shutting down system logger: [ OK ]
Unloading OpenIB kernel modules:NET: Unregistered protocol family 27

Failed to unload rdma_cm

Failed to unload ib_cm

Failed to unload iw_cm
LustreError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed
INFO: task rmmod:4151 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rmmod D ffff810227061420 0 4151 3795 (NOTLB)
ffff81021c8ddce8 0000000000000082 000000000000000f 0000000000000292
00000000000000ef 0000000000000001 ffff81020ecdd100 ffff8102271ef040
0000004a957c4bd9 000000000095dc57 ffff81020ecdd2e8 0000000480076646
Call Trace:
[<ffffffff80063167>] wait_for_completion+0x79/0xa2
[<ffffffff8008cfa1>] default_wake_function+0x0/0xe
[<ffffffff80063b05>] mutex_lock+0xd/0x1d
[<ffffffff8838d155>] :rdma_cm:cma_remove_one+0x171/0x1a2
[<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
[<ffffffff8817d5f0>] :ib_core:ib_unregister_device+0x30/0xdb
[<ffffffff881a918a>] :ib_mthca:__mthca_remove_one+0x30/0x11a
[<ffffffff80063b05>] mutex_lock+0xd/0x1d
[<ffffffff881a928c>] :ib_mthca:mthca_remove_one+0x18/0x25
[<ffffffff8015daeb>] pci_device_remove+0x24/0x3a
[<ffffffff801c7a3e>] __device_release_driver+0x9f/0xe9
[<ffffffff801c7e04>] driver_detach+0xad/0x101
[<ffffffff801c6ffe>] bus_remove_driver+0x6f/0x92
[<ffffffff801c7e8b>] driver_unregister+0xd/0x16
[<ffffffff8015ddb4>] pci_unregister_driver+0x2a/0x79
[<ffffffff881bc398>] :ib_mthca:mthca_cleanup+0x10/0x16
[<ffffffff800a6674>] sys_delete_module+0x196/0x1c5
[<ffffffff8005d116>] system_call+0x7e/0x83

Nirmal
Andreas Dilger
2010-Sep-09 19:28 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
On 2010-09-09, at 10:56, Nirmal Seenu wrote:
> The lustre client gets unmounted cleanly but the system deadlocks once the openibd driver is removed. I had to modify the openibd stop script to
> include "umount lustre" and "lustre_rmmod" as a work around.

If you put "_netdev" in the lustre mount options, the shutdown scripts should unmount it before trying to stop the networking.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
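A quick way to check whether the "_netdev" suggestion is already in place is to scan the fstab for Lustre entries that lack the option. The following is only a minimal sketch: the function name lustre_missing_netdev is hypothetical (it is not part of Lustre or of any distribution), and it assumes the usual whitespace-separated fstab layout with the filesystem type in the third field and the options in the fourth.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): print the mount point of every
# Lustre entry in an fstab-style file that does not carry _netdev.
# Usage: lustre_missing_netdev [FSTAB]    (defaults to /etc/fstab)
lustre_missing_netdev() {
    awk '
        $1 ~ /^#/ { next }                      # skip comment lines
        NF >= 4 && $3 == "lustre" {
            n = split($4, opts, ",")            # options are comma-separated
            found = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "_netdev") found = 1
            if (!found) print $2                # report the mount point
        }
    ' "${1:-/etc/fstab}"
}
```

Running lustre_missing_netdev with no argument inspects /etc/fstab; an empty result means every Lustre entry already has the option.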
Nirmal Seenu
2010-Sep-09 19:33 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
Lustre does get unmounted before the NFS filesystems, as seen in the log message. The problem is that LNET is still up when openibd gets removed.

Nirmal

On 09/09/2010 02:28 PM, Andreas Dilger wrote:
> If you put "_netdev" in the lustre mount options, the shutdown scripts should unmount it before trying to stop the networking.
Ken Hornstein
2010-Sep-09 19:44 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
> lustre does get unmounted before NFS filesystem as seen in the log message...
> the problem is due to the fact that LNET is still up when openibd gets
> removed.

Huh, I'm wondering how it ever worked "right" before. Certainly on the systems I have at 1.8.1.1, I always had to have a Lustre start/stop script which did a lustre_rmmod as part of the stop sequence.

--Ken
Nirmal Seenu
2010-Sep-09 19:57 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
I guess the trick is to use _netdev as an option in the mount command or the /etc/fstab entry, as Andreas mentioned.

I used to have the _netdev option when I was using Lustre over Ethernet, which made the automounts work correctly, and I didn't have the LNET problem. With InfiniBand (openibd) the _netdev option doesn't mount Lustre correctly, and I had to mount Lustre from rc.local after the InfiniBand network comes up.

Nirmal

On 09/09/2010 02:44 PM, Ken Hornstein wrote:
> Huh, I'm wondering how it ever worked "right" before. Certainly on the systems
> I have at 1.8.1.1, I always had to have a Lustre start/stop script which did
> a lustre_rmmod as part of the stop sequence.
Mike Hanby
2010-Sep-09 22:17 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
That's odd. I have 1.8.1.1 and use _netdev for both my InfiniBand and GigE clients, and both mount successfully.

For IB clients:

10.1.11.30@o2ib:/lustre /lustre lustre _netdev 0 0

And GigE clients:

10.1.10.20.30@tcp:/lustre /lustre lustre _netdev 0 0

-----Original Message-----
From: Nirmal Seenu
Sent: Thursday, September 09, 2010 2:58 PM
Subject: Re: [Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting

> With infiniband(openibd) the _netdev option doesn't mount lustre correctly and I had to mount lustre from rc.local after the infiniband networks comes up.
Roy Dragseth
2010-Sep-22 12:26 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
(couldn't decide on top-post or bottom-post here, so I deleted the whole original message)

We have just upgraded our Rocks cluster to use the CentOS 5.5 rpms, which include a complete OFED stack (v1.4.2?), so we decided to just ditch our own self-compiled version of OFED 1.4.1.

We then ran into the same problems with openibd hanging on shutdown. After a futile attempt to inject a lustre-unload-modules service between netfs and openibd to run lustre_rmmod, I tried to hack modprobe.conf to eject the Lustre modules by inserting this:

remove rdma_cm /usr/sbin/lustre_rmmod && /sbin/modprobe -r --ignore-remove rdma_cm

This didn't work either, because the openibd service script uses rmmod instead of modprobe -r (aargghh).

So, the solution that seems to work is to disable openibd (chkconfig openibd off) and let the network initialization take care of loading the right modules by putting this into modprobe.conf:

alias ib0 ib_ipoib
install ib_ipoib modprobe mlx4_ib && /sbin/modprobe --ignore-install ib_ipoib

Then network startup will load the right IB modules, and the netfs service will automatically load the Lustre modules when mounting the Lustre partitions. The downside might be that you will not get a clean unload of either the Lustre or the OFED modules on shutdown/reboot.

If you run different hardware than we do, you might have to replace the mlx4_ib module with whatever you need.

(wasted two days on this, sometimes I make really good use of taxpayers' money...)

r.
Bernd Schubert
2010-Sep-22 13:41 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
> We then ran into the same problems with openibd hanging on shutdown. After
> a futile attempt trying to inject a lustre-unload-modules service between
> netfs and openib to run lustre_rmmod. I tried to hack modprobe.conf to
> eject the lustre modules by inserting this
>
> remove rdma_cm /usr/sbin/lustre_rmmod && /sbin/modprobe -r --ignore-remove
> rdma_cm
>
> this didn't work either because the openibd service script use rmmod
> instead of modprobe -r (aargghh).

All of these seem to be rather ugly workarounds. I think we need to figure out why rmmod of the InfiniBand modules does not simply fail when they are still in use by Lustre's o2ib modules.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
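The thread does not answer this question, but one way to see which modules still hold a reference on an InfiniBand module before rmmod is attempted is to read /proc/modules, where the fourth field lists the modules that use the one named in the first field. This is a minimal sketch only: the function name module_users is hypothetical, and the module names in the usage note (rdma_cm, ko2iblnd) are given as likely examples for an o2ib client, not as output taken from the systems discussed here.

```shell
#!/bin/sh
# Hypothetical helper: print the comma-separated list of modules that
# currently use MODULE, read from a /proc/modules-format file.
# Usage: module_users MODULE [FILE]    (FILE defaults to /proc/modules)
module_users() {
    awk -v mod="$1" '
        $1 == mod {
            users = $4
            sub(/,$/, "", users)            # drop the trailing comma
            if (users != "-") print users   # "-" means no users
        }
    ' "${2:-/proc/modules}"
}
```

For example, "module_users rdma_cm" would be expected to list ko2iblnd while LNET is still configured over o2ib, and to print nothing once lustre_rmmod has run.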
Nirmal Seenu
2010-Sep-22 14:25 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
I added the "_netdev" option to the Lustre mount command, even though I mount /lustre from rc.local, and that seems to work fine.

I also included "umount /lustre; lustre_rmmod" in the stop section of the openibd init script.

Nirmal
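The stop-section workaround above can be sketched as a dry run that only prints the commands it would execute, so the order can be reviewed before editing an init script. This is an illustration under stated assumptions: the function name lustre_stop_plan is hypothetical, the mount list is read from a /proc/mounts-format file, and the only real commands referenced are umount and lustre_rmmod, both named in this thread.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print, in order, the commands a "stop"
# action would need before the InfiniBand modules can be unloaded.
# Usage: lustre_stop_plan [MOUNTS]    (MOUNTS defaults to /proc/mounts)
lustre_stop_plan() {
    # Unmount Lustre filesystems in reverse mount order.
    awk '
        $3 == "lustre" { mp[++n] = $2 }
        END { for (i = n; i >= 1; i--) print "umount " mp[i] }
    ' "${1:-/proc/mounts}"
    # Then unload the Lustre and LNET modules so nothing holds the IB stack.
    echo "lustre_rmmod"
}
```

Printing the plan instead of running it keeps the sketch safe to try on a live client; the printed lines correspond to the "umount /lustre; lustre_rmmod" sequence described above.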
Josh Moles
2010-Sep-23 04:47 UTC
[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
We were experiencing a similar issue on a 1.8.4 install. I found an init.d script, authored by someone in a previous thread, that is pretty robust. See http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg06035.html for the thread; the script is at http://github.com/morrone/lustre/raw/1.8.2.0-5chaos/lustre/scripts/lnet.

We just had to add a drive unmount to that script and it has been working great.

Josh

On Wed, Sep 22, 2010 at 10:25 AM, Nirmal Seenu wrote:
> I also included "umount /lustre; lustre_rmmod" in the stop section of the
> openibd init script.