Hello,

Not sure if this is the right forum: I'm encountering difficulties with o2ib which prevent an LNET shutdown from proceeding:

    Unloading OpenIB kernel modules:NET: Unregistered protocol family 27
    Failed to unload rdma_cm
    Failed to unload rdma_cm
    Failed to unload ib_cm
    Failed to unload ib_sa
    LustreError: 131-3: Received notification of device removal
    Please shutdown LNET to allow this to proceed

This happens on server and client nodes alike. We run RHEL5.1 and OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.

I narrowed it down to module ko2iblnd, which I attempt to remove first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it doesn't work. Strangely, in "lsmod" the use count of the module is one, but I don't see where it's used.

    # umount /mnt/lustre
    # ifconfig ib0 down
    # modprobe -r ko2iblnd
    FATAL: Module ko2iblnd is in use.
    # lsmod | grep ko2
    ko2iblnd              143136  1
    lnet                  258088  5 lustre,ksocklnd,ko2iblnd,ptlrpc,obdclass
    libcfs                189784  12 osc,mgc,lustre,lov,lquota,mdc,ksocklnd,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
    rdma_cm                65940  4 ko2iblnd,ib_iser,rdma_ucm,ib_sdp
    ib_core                88576  16 ko2iblnd,ib_iser,rdma_ucm,ib_ucm,ib_srp,ib_sdp,rdma_cm,ib_cm,iw_cm,ib_local_sa,ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad

I'd be grateful for any hints.

Regards, Michael
On Tue, 2008-04-15 at 12:07 -0500, Michael Sternberg wrote:
> I narrowed it down to module ko2iblnd, which I attempt to remove
> first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it
> doesn't work. Strangely, in "lsmod" the use count of the module is
> one, but I don't see where it's used.

To ask what might sound like a stupid question, but you do have all of your lustre filesystems unmounted before you try to unload ko2iblnd, yes? Can you show us what's in /proc/mounts when you try to unload ko2iblnd but it shows a refcount > 0?

b.
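The check asked for above can be scripted. The following is only a minimal sketch, assuming that mounted Lustre filesystems show up in /proc/mounts with the filesystem type "lustre" in the third field; the function names lustre_mounts and check_unmounted are hypothetical helpers, not part of Lustre or OFED.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Lustre).

# Print the mount point of every entry whose filesystem type (field 3)
# is "lustre". Reads /proc/mounts unless another file is given.
lustre_mounts() {
    awk '$3 == "lustre" { print $2 }' "${1:-/proc/mounts}"
}

# Succeed only if no Lustre filesystem is still mounted.
check_unmounted() {
    mounts=$(lustre_mounts "$1")
    if [ -n "$mounts" ]; then
        echo "still mounted: $mounts"
        return 1
    fi
    echo "no lustre mounts"
}
```

Run as "check_unmounted && modprobe -r ko2iblnd", the unload is only attempted once the mount table is clean. A clean mount table does not by itself guarantee a zero use count on ko2iblnd.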
On Apr 15, 2008, at 12:15, Brian J. Murrell wrote:
> To ask what might sound like a stupid question, but you do have all of
> your lustre filesystems unmounted before you try to unload ko2iblnd,
> yes? Can you show us what's in /proc/mounts when you try to unload
> ko2iblnd but it shows a refcount > 0?

No problem with the question - anything that helps:

    # cat /proc/mounts
    rootfs / rootfs rw 0 0
    /dev/root / ext3 rw,data=ordered 0 0
    /dev /dev tmpfs rw 0 0
    /proc /proc proc rw 0 0
    /sys /sys sysfs rw 0 0
    /proc/bus/usb /proc/bus/usb usbfs rw 0 0
    devpts /dev/pts devpts rw 0 0
    /dev/sda1 /boot ext3 rw,data=ordered 0 0
    tmpfs /dev/shm tmpfs rw 0 0
    none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
    sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
    172.16.100.3:/drbd/exports/opt /opt nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.100.3 0 0
    /etc/auto.misc /misc autofs rw,fd=6,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
    -hosts /net autofs rw,fd=11,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0

This was even after:

    # ifconfig ib0 down

I also have:

    # grep lnet /etc/modprobe.conf
    options lnet networks="o2ib0,tcp0(eth0)" accept_port=6988

(the accept_port spec doesn't work either on a tcp-only node, but that's a separate issue, or so I believe.)

Regards, Michael
Christopher J. Morrone
2008-Apr-15 17:32 UTC
[Lustre-discuss] o2ib module prevents shutdown
Is the ptlrpc module still loaded? Off the top of my head, I think that is the module that is holding the reference.

It is normal for the ko2iblnd to have a ref count of 1, but no "Used by" modules listed.

Michael Sternberg wrote:
> # umount /mnt/lustre
> # ifconfig ib0 down
> # modprobe -r ko2iblnd
> FATAL: Module ko2iblnd is in use.
> # lsmod | grep ko2
> ko2iblnd              143136  1
> lnet                  258088  5 lustre,ksocklnd,ko2iblnd,ptlrpc,obdclass
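To follow up on the ptlrpc suggestion, the lsmod output can be inspected per module. This is a small sketch under the assumption that the input has the usual lsmod columns (module name, size, use count, optional "Used by" list); module_loaded and module_refs are hypothetical helper names, and both read lsmod-style text on stdin so they can be tried on saved output as well.

```shell
#!/bin/sh
# Hypothetical helpers; both read lsmod-style text on stdin:
#   Module  Size  Used  by

# Exit status 0 if the named module appears in the listing.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Print the use count and the "Used by" list ("-" if empty) of a module.
module_refs() {
    awk -v m="$1" '$1 == m {
        print "refcount=" $3 " used_by=" ($4 == "" ? "-" : $4)
    }'
}
```

For example, "lsmod | module_loaded ptlrpc && echo 'ptlrpc still loaded'" reports whether ptlrpc is present, and "lsmod | module_refs ko2iblnd" shows the count with an empty "Used by" list as "-". Whether ptlrpc is in fact what holds the reference is the poster's guess, not something this check proves.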
Hi,

This usually happens when you try to remove the IB card drivers before stopping the Lustre network. What I do is, after a clean umount, run the lustre_rmmod script, which removes all Lustre modules and stops the Lustre network. Then you can safely remove the IB card driver and nothing should get stuck.

Cheers,

Wojciech

On 15 Apr 2008, at 18:22, Michael Sternberg wrote:
> This was even after:
>
> # ifconfig ib0 down
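The order described in this reply (unmount, then lustre_rmmod, then the IB drivers) can be written down as a short script. This is a sketch only: the wrapper names run and lustre_ib_shutdown and the DRY_RUN variable are made up for illustration, and stopping the IB stack via "/etc/init.d/openibd stop" is an assumption based on the init script named earlier in the thread.

```shell
#!/bin/sh
# Hypothetical wrapper; run() prints the command instead of executing it
# when DRY_RUN=1, so the sequence can be reviewed safely.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

lustre_ib_shutdown() {
    run umount -a -t lustre || return 1   # 1. unmount all Lustre filesystems
    run lustre_rmmod || return 1          # 2. unload Lustre modules, stopping LNET
    run /etc/init.d/openibd stop          # 3. only then unload the IB stack (assumed command)
}
```

Calling "DRY_RUN=1 lustre_ib_shutdown" prints the three steps without touching the system. Each step stops the sequence on failure so that the IB modules are never unloaded while LNET is still up.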
Hello Wojciech,

Sorry for the delayed response. lustre_rmmod worked in a manual test to remove the modules after ib0 was down. I have yet to try this as part of the init.d shutdown scripts; an alternate solution with a script didn't quite work.

Thanks for the hint!

Regards, Michael

On Apr 15, 2008, at 12:33, Wojciech Turek wrote:
> This usually happens when you try to remove IB card drivers before
> stopping lustre network. What I do is after clean umount I run
> lustre_rmmod script which removes all lustre modules and stops
> lustre network. Then you can safely remove IB card driver and
> nothing should get stuck.