My apologies for posting this to both the Rancid list and FreeBSD-STABLE; however, I am not sure where to start troubleshooting this issue. I suspect it is a FreeBSD issue, but we are probably not the only shop running RANCID (ports/net-mgmt/rancid) on FreeBSD (it is quite popular in ISP environments), so hopefully someone can look at it from the RANCID angle and give some helpful input on how to troubleshoot this further.

The problem: After finally giving in and starting to phase out some of our oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on fresh hardware, I got around to moving our RANCID server. This, however, has been the start of a real nightmare. I don't think the problems I am seeing are in RANCID itself; however, the issue can be reliably reproduced every time I run RANCID, and I have not been able to reproduce it in any other way with pure Expect test cases directly.

What happens:

Expect processes "hang" during RANCID runs and go into infinite loops eating 100% CPU (on one CPU core). The problem is reliably reproduced every time we do a full rancid-run, but the actual device it chokes on varies between runs, so it is not device-related. It does seem to happen most often when collecting Juniper M-series gear with large configurations, though, using jrancid and ssh.

We can NOT seem to reproduce it by running jrancid (or any other script) on a single device at a time - which is somewhat confusing, as it DOES happen when setting PAR_COUNT to 1 and doing a rancid-run (which should IMHO be pretty much the same as doing sequential single-device runs...)

Our environment:

We run RANCID extensively to collect a few hundred devices, including Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone, FortiNet/FortiGate, etc.
We want to start storing CPE configs in addition to our own core gear in RANCID, which means we will be putting several thousand routers into RANCID - which also explains the need for fresher hardware...

RANCID version does not seem to matter; I have tested with both some ancient 2.3.0 scripts and 2.3.2a7, with the same behaviour.

Using the same RANCID instance (I have tarballed it up and installed it on a bunch of servers, i.e. using the same CVS repository and the same router.db files etc.), it fails on:

FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335)
FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon)

Both have perl-5.8.8_1, expect-5.43.0_3 and tcl-8.4.16,1 built from ports.

It however seems to work fine on:

Linux CentOS 4.5, x86-64, 4 x CPU cores (2 x dual Xeon 5130)
FreeBSD 4.11, i386, UP kernel, 1 x CPU core (1 x single-core Xeon)
FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4)

(The Linux box has Expect 5.42 and Tcl 8.3...)

So it only seems to happen on newer FreeBSD with SMP. (If anyone has RANCID working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me know...)

Now, for some details, if anyone has any ideas. What is actually happening is this, when truss'ing the stuck Expect process:

fcntl(4,F_GETFL,) = 0 (0x0)
fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device'
fcntl(4,F_GETFL,) = 0 (0x0)
fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device'
<looping ad nauseam>

So, which device is it trying to fcntl, and what is it trying to do? lsof shows the following:

expect  1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
expect  1417 rancid rtd VDIR 0,81  512       2 /
expect  1417 rancid  0r VCHR 0,24  0t0      24 /dev/null
expect  1417 rancid  2r VCHR 0,24  0t0      24 /dev/null
expect  1417 rancid  3r VCHR 0,24  0t0      24 /dev/null
expect  1417 rancid  4r VCHR 0,24  0t0      24 /dev/null

File descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null to BLOCKING mode (which is failing)?
Why should it be playing with /dev/null at all? Well, digging a little, this is what the lsof output looked like 10 seconds earlier:

expect  1417 rancid cwd VDIR 0,86                2048 7607662 /local/rancid/var/core/configs
expect  1417 rancid rtd VDIR 0,81                 512       2 /
expect  1417 rancid  0r VCHR 0,24                 0t0      24 /dev/null
expect  1417 rancid  1u PIPE 0x38bfcf8              0         ->0xffffff00038bfba0
expect  1417 rancid  2w VREG 0,86                  76 7583772 /local (/dev/mfid0s1f)
expect  1417 rancid  3u VCHR 0,108                0t0     108 /dev/ttyp2
expect  1417 rancid  4u VCHR 0,117               0t45     117 /dev/ptyp7
ssh     1418 rancid cwd VDIR 0,86                2048 7607662 /local/rancid/var/core/configs
ssh     1418 rancid rtd VDIR 0,81                 512       2 /
ssh     1418 rancid txt unknown file system type: 8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE
ssh     1418 rancid  0u VCHR 0,118                0t0     118 /dev/ttyp7
ssh     1418 rancid  1u VCHR 0,118                0t0     118 /dev/ttyp7
ssh     1418 rancid  2u VCHR 0,118                0t0     118 /dev/ttyp7
ssh     1418 rancid  3w VREG 0,86                  76 7583772 /local (/dev/mfid0s1f)
ssh     1418 rancid  4u IPv4 0xffffff008c030240   0t0         TCP *:27776->*:49323
ssh     1418 rancid  5u VCHR 0,118               0t45     118 /dev/ttyp7

Here, fd 4 is actually a pty (ptyp7), which seems to be forked to PID 1418, the ssh session to the router, and everything seems to be normal.

PID 1418 is no longer there in the most recent lsof, so 1418 seems to have died(?), and PID 1417 now has /dev/null on its file descriptor 4. I don't know why that is, but why is it trying to fcntl it to blocking I/O mode? Here is a gdb attach to the PID and a backtrace:

(gdb) bt
#0  0x0000000800aefc9c in fcntl () from /lib/libc.so.7
#1  0x00000000004072c5 in ?? ()
#2  0x00000008006a8c18 in StackSetBlockMode () from /usr/local/lib/libtcl84.so.1
#3  0x00000008006a8c54 in SetBlockMode () from /usr/local/lib/libtcl84.so.1
#4  0x00000008006acf75 in Tcl_SetChannelOption () from /usr/local/lib/libtcl84.so.1
#5  0x00000008006aeda0 in TclFinalizeIOSubsystem () from /usr/local/lib/libtcl84.so.1
#6  0x0000000800697f74 in Tcl_FinalizeThread () from /usr/local/lib/libtcl84.so.1
#7  0x0000000800698081 in Tcl_Finalize () from /usr/local/lib/libtcl84.so.1
#8  0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1
#9  0x0000000000409610 in ?? ()
#10 0x00000008006742be in TclInvokeStringCommand () from /usr/local/lib/libtcl84.so.1
#11 0x0000000800675944 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#13 0x00000008006b55d9 in Tcl_FSEvalFile () from /usr/local/lib/libtcl84.so.1
#14 0x00000008006b5690 in Tcl_EvalFile () from /usr/local/lib/libtcl84.so.1
#15 0x0000000000404f58 in ?? ()
#16 0x0000000000404d47 in ?? ()

From the functions it is running in Tcl, it seems it is Tcl's cleanup code that is failing, when it is trying to restore a Tcl "channel" to normal (blocking) mode during an exit event.

This is where my clue runs out, and I am at a loss as to how to proceed from here. I have tried digging in both the Tcl and Expect source code to see if I can catch anything obvious, but alas, this is somewhat outside my area of expertise (I am a networking guy, not a programmer)...

Any suggestions on how to proceed to find and fix this issue would be welcome, as the only other option for us is to abandon FreeBSD and go with Linux on the server, and we have already replaced too many FreeBSD boxes with Linux for my liking; I don't want to see yet another one go...

Regards,
Lars Erik Gullerud
Lars Erik Gullerud wrote:
> So it only seems to be on newer FreeBSD with SMP. (If anyone have
> RANCID working okay on FreeBSD 6.x/7.x on SMP systems at all, please
> let me know...)

This probably won't be of much help, but I figured I'd chime in anyway since it may affect me in the future: I just haven't had any trouble with rancid on 7.x yet. It's only been in operation for 11 days, though.

FreeBSD xxxx.xxxxxxxxxxxxx.com 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #1: Fri Dec 28 21:48:39 EST 2007
CPU: Intel(R) Xeon(TM) CPU 3.06GHz (3059.98-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf29  Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
real memory  = 4026376192 (3839 MB)
avail memory = 3937681408 (3755 MB)
ACPI APIC Table: <IBM SERONYXP>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 6

I'm currently monitoring 18 Cisco routers using ssh login, probably not enough to trigger the problem.

grep -c up router.db
18

pkg_info -r rancid-2.3.1_2
Information for rancid-2.3.1_2:

Depends on:
Dependency: kbproto-1.0.3
Dependency: inputproto-1.4.2.1
Dependency: tcl-8.4.16,1
Dependency: perl-5.8.8_1
Dependency: p5-Scalar-List-Utils-1.19,1
Dependency: pkg-config-0.22_1
Dependency: xtrans-1.0.4
Dependency: xproto-7.0.10_1
Dependency: libXdmcp-1.0.2
Dependency: libXau-1.0.3_2
Dependency: libX11-1.1.3,1
Dependency: tk-8.4.16,2
Dependency: expect-5.43.0_3
Dependency: p5-PathTools-3.2501
Dependency: p5-CGI.pm-3.31,1
Dependency: p5-LockFile-Simple-0.2.6
Hi Lars,

You should use the expect-devel port to avoid hanging on the pty - have a look at http://www.freebsd.org/cgi/query-pr.cgi?pr=118452

Janos Mohacsi
Network Engineer, Research Associate, Head of Network Planning and Projects
NIIF/HUNGARNET, HUNGARY
Key 70EF9882: DEC2 C685 1ED4 C95A 145F 4300 6F64 7B00 70EF 9882

On Thu, 10 Jan 2008, Lars Erik Gullerud wrote:

> My apologies for posting this both to the Rancid list and FreeBSD-STABLE,
> however I am not sure where to start troubleshooting this issue [...]
>
> Expect processes "hang" during RANCID runs, and go into infinite loops
> eating 100% CPU (on one CPU core). [...]
>
> So it only seems to be on newer FreeBSD with SMP. (If anyone have RANCID
> working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me
> know...)
>
> [rest of original message trimmed]

_______________________________________________
Rancid-discuss mailing list
Rancid-discuss@shrubbery.net
http://www.shrubbery.net/mailman/listinfo.cgi/rancid-discuss