On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:
! > Sorry for the bad news.
! >
! You appear to be triggering two or three different bugs there.
That is possible. Then there are two or three different bugs in the
production code.
In any case, my current workaround, i.e. adding a delay in exec.poststop
> exec.poststop = "
> sleep 6 ;
> /usr/sbin/ngctl shutdown ${ifname1l}: ;
> ";
helps with all of it and makes the system behave solidly. This is true
with and without Your patch.
! Can you reduce your netgraph use case to a small test case that can trigger
! the problem?
I'm sorry, I fear I don't get Your point.
Assuming there are actually two or three bugs here, You are asking me
to reduce the config so that it will trigger only one of them? Is that
correct?
Then let me put this differently: assume this is the OS for the life
support system of the manned Jupiter mission. Then, which one of the
bugs do You want to get fixed, and which would You prefer to keep and
make Your oxygen supply cut off?
https://www.youtube.com/watch?v=BEo2g-w545A
! I'm not likely to be able to do anything unless I can reproduce
! the problem(s).
I understand that.
From Your previous mail I get the impression that You prefer to rely
on tests. I consider this a bad habit[1] and prefer logical thinking.
So let's try that:
We know that there is a problem with taking down an interface from a
VIMAGE, in the way it is done by "jail -r". We know this problem can
be reliably worked around by delaying the interface takedown for a
short time.
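For illustration, the fixed "sleep 6" could in principle be replaced by
a bounded wait. This is only a minimal sketch: the helper name
wait_until is hypothetical, and the idea that the interface becoming
visible on the host again is a usable "ready" signal is an untested
assumption, not something established in this thread.

```shell
#!/bin/sh
# wait_until MAX_RETRIES COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND once per second until it succeeds,
# giving up after MAX_RETRIES failed retries. Returns 0 on success, 1 on timeout.
wait_until()
{
    _max=$1
    shift
    _n=0
    while ! "$@" > /dev/null 2>&1; do
        if [ "$_n" -ge "$_max" ]; then
            return 1
        fi
        sleep 1
        _n=$((_n + 1))
    done
    return 0
}

# Possible use in exec.poststop (ASSUMPTION: a successful ifconfig on the
# host means the interface has been handed back from the jail's vnet):
#   wait_until 10 /sbin/ifconfig nge_rail_1l && \
#       /usr/sbin/ngctl shutdown nge_rail_1l:
```

Whether such a probe actually closes the window that the plain delay
covers is exactly what would have to be verified; the sketch only bounds
the waiting time, it does not explain the crash.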
Now with Your patch, we do not get the typical crash at interface
takedown. Instead, all of a sudden, there are strange crashes from
various other places. And, interestingly, we get these also when
STARTING a jail.
I think this is not an additional problem; it is instead valuable
information (albeit not the kind You might like to get).
Furthermore, these new crashes are always triggered by "ifconfig",
and they seem to have in common that somebody tries to obtain
information about some interface configuration and receives something
bogus. I might conclude, just from the gut and without looking into
details, that either
- Your patch manages to garble some internal interface data,
instead of doing what it is intended to do, or
- the original problem manages to garble internal interface data
(leading to the usual crash), and Your patch does not solve this,
but only protects against the immediate consequence.
It might also be worth considering that, while the problem may be
easier to reproduce with epair, this effect may or may not be
netgraph-specific[2].
Now let's keep in mind that a successful test means EXACTLY NOTHING.
By which other means can we confirm that Your patch fully achieves
what it is intended for? (E.g. something like dumping and verifying
the respective internal tables in-vivo)
(Background: It is not that I would be unwilling to create clean and
precisely reproducible scenarios. But one of my problems currently is
that I only have two machines available: the graphical one where
I'm typing right now, and the backend server with the jails that does
practically everything.
Therefore, experimenting on either of them causes considerable pain.
I'm working on that issue, trying to get a real server board for the
backend so as to free the current one for testing - but what I would
like to use, e.g. ASUS Z10PE+cores+regECC, is not something one would
easily find at yard sales - and seldom for an acceptable price.)
cheerio,
PMc
[1] Rationale: a failing test tells us that either the test or the
application has a bug (50/50 chance). A succeeding test tells us
that 1 equals 1, which we knew already before.
In fact, tests tell us *nothing at all* about the state of our
code, and specifically, 'successful' outcomes do NOT mean that
things are all correct.
The only true usefulness of tests is to protect against
re-introducing a fault that was already fixed before,
i.e. regressions.
[2] My netgraph configuration consists of bringing up some bridges
and then attaching the jails to them.
Here is the bridge starter (only the relevant component;
there are more of these configured, but they probably do not
influence the issue):
------------------------------------------------
#! /bin/sh
# PROVIDE: netgraphs
# REQUIRE: netwait
# BEFORE: NETWORKING
. /etc/rc.subr
name="netgraphs"
start_cmd="${name}_start"
stop_cmd="${name}_stop"
load_rc_config $name
netgraphs_graphs="svc"
netgraphs_svc_if1_name="nge_svc_1u"
netgraphs_svc_if1_mac="00:1d:92:01:02:01"
netgraphs_svc_if1_addr="***.***.***.***/29"
netgraphs_svc_start()
{
    local _ifname
    if ngctl info svcswitch: > /dev/null 2>&1; then
        netgraphs_svc_stop
    fi
    echo "Creating SVC Switch"
    ngctl -f - <<EOF
mkpeer bridge crhook link16
name .:crhook svcswitch
mkpeer svcswitch: eiface link0 ether
name svcswitch:link0 $netgraphs_svc_if1_name
EOF
    _ifname=`ngctl msg ${netgraphs_svc_if1_name}: getifname | \
        awk '$1 == "Args:" { print substr($2, 2, length($2)-2)}'`
    ifconfig $_ifname name $netgraphs_svc_if1_name
    ifconfig $netgraphs_svc_if1_name link $netgraphs_svc_if1_mac
    ifconfig $netgraphs_svc_if1_name inet $netgraphs_svc_if1_addr
}
netgraphs_svc_stop()
{
    echo "Shutting down SVC switch"
    ngctl shutdown svcswitch:
    ngctl shutdown ${netgraphs_svc_if1_name}:
}
netgraphs_start()
{
    local _cmd
    for i in "$@"; do
        eval _cmd=netgraphs_${i}_start
        if type $_cmd > /dev/null 2>&1; then
            $_cmd
        else
            echo "netgraphs-start: object $i not found" >&2
        fi
    done
}
netgraphs_stop()
{
    local _cmd
    for i in "$@"; do
        eval _cmd=netgraphs_${i}_stop
        if type $_cmd > /dev/null 2>&1; then
            $_cmd
        else
            echo "netgraphs-stop: object $i not found" >&2
        fi
    done
}
netgraphs_tasks=""
if test $# -eq 1; then
    if test "$1" = "stop"; then
        for i in $netgraphs_graphs; do
            netgraphs_tasks="$i $netgraphs_tasks"
        done
    else
        for i in $netgraphs_graphs; do
            netgraphs_tasks="$netgraphs_tasks $i"
        done
    fi
fi
run_rc_command "$@" "$netgraphs_tasks"
------------------------------------------------
And here is the full jail config (only the relevant jail):
------------------------------------------------
allow.set_hostname = "false";
allow.mount.procfs = "false";
allow.mount.devfs = "false";
allow.raw_sockets = "false";
enforce_statfs = 1;
devfs_ruleset = 4;
securelevel = 2;
mount.devfs;
exec.start = "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.consolelog = "/var/log/jail_${name}_console.log";
path = "/j/$name";
interface = "lo0";
ip4.saddrsel = "false";
rail {
jid = 10;
devfs_ruleset = 11;
host.hostname = "rail.***********.org";
vnet = "new";
sysvshm;
$ifname1l = nge_${name}_1l;
$ifname1l_mac = 00:1d:92:01:01:0a;
vnet.interface = "$ifname1l";
exec.prestart = "
echo -e \"mkpeer eiface crhook ether\nname .:crhook $ifname1l\" \
| /usr/sbin/ngctl -f -
/usr/sbin/ngctl connect ${ifname1l}: svcswitch: ether link2
ifname=`/usr/sbin/ngctl msg ${ifname1l}: getifname | \
awk '$1 == \"Args:\" { print substr($2, 2, length($2)-2)}'`
/sbin/ifconfig \$ifname name $ifname1l
/sbin/ifconfig $ifname1l link $ifname1l_mac
";
exec.poststart = "
/usr/sbin/jexec $name /sbin/sysctl kern.securelevel=3 ;
";
exec.poststop = "
# sleep 6 ;
/usr/sbin/ngctl shutdown ${ifname1l}: ;
";
exec.start = "/bin/sleep 4 &";
}
------------------------------------------------