Dennis Nelson
2009-Mar-24 23:30 UTC
[Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16.24@o2ib. The ost_connect operation failed with -19
Hi,

I have encountered an issue with Lustre that has happened a couple of times now. I am beginning to suspect an issue with the IB fabric but wanted to reach out to the list to confirm my suspicions. The odd part is that even when the MDS complains that it cannot connect to a given OST, lctl ping to the OSS that owns the OST works without an issue. Also, the OSS in question has other OSTs which, in the latest case, have not reported any errors.

I have attached a file with the errors that I encountered from the MDS. I am running Lustre 1.6.6 with a pair of MDSs and 8 OSSs, with 28 OSTs spread across the 8 OSSs. I am using IB DDR interconnects between all systems.

Thanks,

[Attachment: errors, application/octet-stream, 33745 bytes - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090324/c25b0e15/attachment-0001.obj]
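A minimal sketch of the connectivity check described above (the NID is taken from the error message in the subject line; hostnames and installed OFED tools are assumptions, not something stated in the post):

    # Verify LNET connectivity from the MDS to the OSS named in the error;
    # substitute the NID from your own error messages.
    lctl ping 192.168.16.24@o2ib

    # Optionally check the IB fabric itself with the standard OFED tools,
    # assuming they are installed on the node.
    ibstat
    ibdiagnet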
Kevin Van Maren
2009-Mar-25 16:12 UTC
[Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16.24@o2ib. The ost_connect operation failed with -19
Dennis,

You haven't provided enough context for people to help.

What have you done to determine if the IB fabric is working properly?

What are the hostnames and NIDs for the 10 servers (lctl list_nids)?

Which OSTs are on which servers?

OST4 is on a machine at 192.168.16.23. What machine is 192.168.16.24? Is that the OST4 failover partner?

You have a client at 192.168.16.1?

Kevin
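A minimal sketch of how the requested information could be collected (the hostnames and passwordless ssh access are assumptions; only lctl and df are used, and server targets mount as filesystem type lustre):

    # List NIDs on every server (hostnames are illustrative).
    for h in mds1 mds2 oss1 oss2 oss3 oss4 oss5 oss6 oss7 oss8; do
        echo "Executing on $h"
        ssh "$h" lctl list_nids
    done

    # Show which OSTs are currently mounted on which OSS.
    for h in oss1 oss2 oss3 oss4 oss5 oss6 oss7 oss8; do
        echo "Lustre filesystems on $h"
        ssh "$h" df -t lustre
    done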
Dennis Nelson
2009-Mar-25 17:03 UTC
[Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16.24@o2ib. The ost_connect operation failed with -19
On 3/25/09 11:12 AM, "Kevin Van Maren" <Kevin.Vanmaren@Sun.COM> wrote:

> Dennis,
>
> You haven't provided enough context for people to help.
>
> What have you done to determine if the IB fabric is working properly?

Basic functionality appears to be there. I can lctl ping between all servers. I have run ibdiagnet and it appears to be clean. I have run several instances of ib_rdma_bw between various Lustre servers and it completes with good performance.

> What are the hostnames and NIDs for the 10 servers (lctl list_nids)?

Executing on mds2
192.168.17.11@o2ib
Executing on mds1
192.168.16.11@o2ib
Executing on oss1
192.168.16.21@o2ib
Executing on oss2
192.168.16.22@o2ib
Executing on oss3
192.168.16.23@o2ib
Executing on oss4
192.168.16.24@o2ib
Executing on oss5
192.168.17.21@o2ib
Executing on oss6
192.168.17.22@o2ib
Executing on oss7
192.168.17.23@o2ib
Executing on oss8
192.168.17.24@o2ib

> Which OSTs are on which servers?

Lustre filesystems on mds2
Lustre filesystems on mds1
/dev/mapper/mdt     2009362216  485528 2008876688  1% /mnt/mdt
Lustre filesystems on oss1
/dev/mapper/ost0000 1130279280  715816 1129563464  1% /mnt/ost0000
/dev/mapper/ost0001 1130279280  659436 1129619844  1% /mnt/ost0001
/dev/mapper/ost000f 1130279280  667208 1129612072  1% /mnt/ost000f
Lustre filesystems on oss2
/dev/mapper/ost0002 1130279280  697520 1129581760  1% /mnt/ost0002
/dev/mapper/ost0003 1130279280  585260 1129694020  1% /mnt/ost0003
/dev/mapper/ost0010 1130279280  600640 1129678640  1% /mnt/ost0010
Lustre filesystems on oss3
/dev/mapper/ost0004 1130279280  515628 1129763652  1% /mnt/ost0004
/dev/mapper/ost0005 1130279280  549292 1129729988  1% /mnt/ost0005
/dev/mapper/ost0011 1130279280  697956 1129581324  1% /mnt/ost0011
Lustre filesystems on oss4
/dev/mapper/ost0006 1130279280  565684 1129713596  1% /mnt/ost0006
/dev/mapper/ost0012 1130279280  482856 1129796424  1% /mnt/ost0012
/dev/mapper/ost0013 1130279280  482856 1129796424  1% /mnt/ost0013
Lustre filesystems on oss5
/dev/mapper/ost0007 1130279280  532844 1129746436  1% /mnt/ost0007
/dev/mapper/ost0008 1130279280  682308 1129596972  1% /mnt/ost0008
/dev/mapper/ost0014 1130279280  532016 1129747264  1% /mnt/ost0014
/dev/mapper/ost0015 1130279280  482856 1129796424  1% /mnt/ost0015
Lustre filesystems on oss6
/dev/mapper/ost0009 1130279280  482860 1129796420  1% /mnt/ost0009
/dev/mapper/ost000a 1130279280  585260 1129694020  1% /mnt/ost000a
/dev/mapper/ost0016 1130279280  499244 1129780036  1% /mnt/ost0016
/dev/mapper/ost0017 1130279280  482856 1129796424  1% /mnt/ost0017
Lustre filesystems on oss7
/dev/mapper/ost000b 1130279280  482852 1129796428  1% /mnt/ost000b
/dev/mapper/ost000c 1130279280  482872 1129796408  1% /mnt/ost000c
/dev/mapper/ost0018 1130279280  581172 1129698108  1% /mnt/ost0018
/dev/mapper/ost0019 1130279280  665556 1129613724  1% /mnt/ost0019
Lustre filesystems on oss8
/dev/mapper/ost000d 1130279280  687688 1129591592  1% /mnt/ost000d
/dev/mapper/ost000e 1130279280  606008 1129673272  1% /mnt/ost000e
/dev/mapper/ost001a 1130279280  511600 1129767680  1% /mnt/ost001a
/dev/mapper/ost001b 1130279280  482852 1129796428  1% /mnt/ost001b

> OST4 is on a machine at 192.168.16.23.

Yes, oss3.

> What machine is 192.168.16.24? Is that the OST4 failover partner?

Yes, oss4 is the failover partner.

> You have a client at 192.168.16.1?

Yes, it is hanging each time I attempt IO.
oss3:~ # tunefs.lustre --dryrun /dev/mapper/ost0004
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lustre-OST0004
Index:      4
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.16.11@o2ib mgsnode=192.168.17.11@o2ib failover.node=192.168.16.24@o2ib

   Permanent disk data:
Target:     lustre-OST0004
Index:      4
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.16.11@o2ib mgsnode=192.168.17.11@o2ib failover.node=192.168.16.24@o2ib

exiting before disk write.
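Error -19 is -ENODEV, which typically means the target is not set up on the node being contacted (here, the client is apparently trying the failover NID 192.168.16.24@o2ib while OST0004 is mounted on oss3). A minimal follow-up sketch, assuming the hostnames and device names shown above and standard OFED perftest tools (not commands from the thread itself):

    # Confirm which node currently has lustre-OST0004 set up; lctl dl lists the
    # local Lustre devices on each server.
    ssh oss3 'lctl dl | grep OST0004'
    ssh oss4 'lctl dl | grep OST0004'

    # Re-test RDMA between the primary OSS and its failover partner.
    ssh oss4 ib_rdma_bw          # start the server side on oss4 (leave running)
    ssh oss3 ib_rdma_bw oss4     # then run the client side on oss3 against oss4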