Sorry everyone, I'm late.. always!!
Here is a new version of the kernel packet traveling diagram. Thanks a lot
to Julian Anastasov for his comments (notes at the end, as I understood them).
I insist it's nice to have this diagram ready and updated. Notice that
when Jan Coppens needed to tell us where he needs to mark packets,
he just said "At this point I should need another mangle table ->",
using the diagram as reference.
We understand that internal kernel code is complex and interlaced, and
it is not always possible to identify each part of it clearly in a simple
diagram. But we keep on trying.
I've got some comments:
1) I didn't know there was ipchains code in kernel 2.4; I supposed the
new iptables code replaced the old ipchains code completely. Any feedback
about this would be useful.
2) Below I enclose a link to an article by Harald Welte, "The journey
of a packet through the linux 2.4 network stack"; it could help the
discussion and lead to an improved diagram, if that's possible.
http://www.gnumonks.org/ftp/pub/doc/packet-journey-2.4.html
3) TODO: include LVS in the diagram. Julian gave us this link to study the
issue and try to complete the diagram.
http://www.linuxvirtualserver.org/Joseph.Mack/HOWTO/LVS-HOWTO-19.html#ss19.21
4) Of course, the diagram is ready to be shot at. Any comment, criticism,
etc. is welcome.
Best regards,
Leonardo Balliache
Network
-----------+-----------
|
+-------+------+
| mangle |
| PREROUTING | <- MARK REWRITE
+-------+------+
|
+-------+------+
| nat |
| PREROUTING | <- DEST REWRITE
+-------+------+
|
+-------+------+
| ipchains |
| FILTER |
+-------+------+
|
+-------+------+
| QOS |
| INGRESS | <- controlled by tc
+-------+------+
|
packet is for +-------+------+ packet is for
this address | INPUT | another address
+--------------+ ROUTING +---------------+
| | + PRDB | |
| +--------------+ |
+-------+------+ |
| filter | |
| INPUT | |
+-------+------+ |
| |
+-------+------+ |
| Local | |
| Process | |
+-------+------+ |
| |
+-------+------+ |
| OUTPUT | +-------+-------+
| ROUTING | | filter |
+-------+------+ | FORWARD |
| +-------+-------+
+-------+------+ |
| mangle | |
| OUTPUT | MARK REWRITE |
+-------+------+ |
| |
+-------+------+ |
| nat | |
| OUTPUT | DEST REWRITE |
+-------+------+ |
| |
+-------+------+ |
| filter | |
| OUTPUT | |
+-------+------+ |
| |
+----------------+ +--------------------+
| |
+--+-------+---+
| ipchains |
| FILTER |
+-------+------+
|
+-------+------+
| nat |
| POSTROUTING | SOURCE REWRITE
+-------+------+
|
+-------+------+
| QOS |
| EGRESS | <- controlled by tc
+-------+------+
|
-----------+-----------
Network
Notes:
1) The input routing determines local/forward.
2) ip rule (the policy routing database, PRDB) is input routing; more
correctly, part of the input routing.
3) The output routing is performed from "higher layer".
4) nexthop and output device are determined both from the input and the
output routing.
5) The forwarding process is called at input routing by functions from a
specific place in the code. It executes after input routing and does not
perform nexthop/outdev selection. It's the process of receiving and
sending the same packet, but in the context of all these hooks it is the code
that sends ICMP redirects (demanded from input routing), decrements the IP TTL,
performs dumb NAT and calls the filter chain. This code is used only for
forwarded packets.
6) Sometimes the word "Forwarding", with a big F, is used to refer to both
the routing and the forwarding process.
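As a concrete illustration of the "MARK REWRITE" stage (mangle PREROUTING) and how a mark can be picked up again at the "QOS EGRESS" stage, here is a minimal hedged sketch; it is not from the original mail, and the device name, mark value, ports and rates are all assumptions:

```shell
# Mark SSH traffic in the mangle table at PREROUTING
# (the "MARK REWRITE" box in the diagram above):
iptables -t mangle -A PREROUTING -p tcp --dport 22 -j MARK --set-mark 1

# Later, at the QOS EGRESS stage, tc can classify on that fwmark
# using the fw classifier (device and rates are illustrative):
tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:10 htb rate 256kbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 64kbit
tc filter add dev eth0 parent 1: protocol ip handle 1 fw flowid 1:10
```

These commands require root and a 2.4 kernel with netfilter and HTB support; the point is only to show where in the diagram each command acts.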
Hello,

On Thu, 27 Jun 2002, Leonardo Balliache wrote:

> I've got some comments:
>
> 1) I didn't know there was ipchains code in kernel 2.4; I supposed
> iptables new code replace totally old ipchains code. Any feedback
> about it would be useful.

You can run ipchains or iptables, but not both.

> 3) TODO: include LVS in the diagram. Julian give us this link to study the
> issue and trying to complete the diagram.

It was JFYI. I'm not sure whether we can find a place in the diagram for
all the programs in the world that are using NF hooks :) Of course, you can
go further and build a jumbo picture of the NF world :)

> http://www.linuxvirtualserver.org/Joseph.Mack/HOWTO/LVS-HOWTO-19.html#ss19.21

> Best regards,
>
> Leonardo Balliache

Regards

--
Julian Anastasov <ja@ssi.bg>
> 4) Of course, diagram is ready to be shoot it off. Any comment, criticism,
> etc,. is welcome.

Where does the IMQ device belong? For ingress, it registers the needed
netfilter hooks right after the mangle table. For egress, it registers the
needed netfilter hooks after all other tables, so after POSTROUTING in the
diagram.

I think the packet is also redirected to the imq device at the same place.
But I'm not sure.

Stef

PS. I put the diagram online on www.docum.org and called it KPTD :)
(I will upload it tonight from home.docum.org to www.docum.org)

--
stef.coene@docum.org
"Using Linux as bandwidth manager" http://www.docum.org/
#lartc @ irc.openprojects.net
----- Original Message -----
From: "Leonardo Balliache" <leoball@opalsoft.net>
To: <lartc@mailman.ds9a.nl>
Sent: Thursday, June 27, 2002 8:02 PM
Subject: [LARTC] kernel packet traveling diagram

[...]

> 1) The input routing determines local/forward.
> 2) ip rule (policy routing database PRDB) is input routing, more correctly,
> part of the input routing.

Are you sure? In the previous diagram, the PRDB was checked before the
packet hits the QOS Ingress. If the PRDB indeed is checked after QOS Ingress
(i.e. in INPUT ROUTING), which seems the logical way, is it possible (with a
patch???) to check the tc_index in "ip rule"? This would make it possible to
let the output of the QOS ingress participate in the policy routing.
FYI, there is an iptables patch out there, called mangle5hooks, so that the
mangle table registers all 5 netfilter hooks. This implies that the mangle
table has 5 chains instead of 2: PREROUTING, INPUT, OUTPUT, FORWARD and
POSTROUTING.

Cheers,
Jan
Hello!

I want to let FTP traffic pass two TBFs (A and B) and let HTTP traffic
pass only TBF A, i.e. one token bucket filter is shared and the other is
not.

Any ideas?

Regards,
Gabriel Paues
Swedish Institute of Computer Science
On Friday 28 June 2002 13:39, Gabriel Paues wrote:

> Hello!
>
> I want to let FTP traffic pass two TBFs (A and B) and let HTTP traffic
> pass only TBF A, i.e one token bucket filter is shared and the other is
> not.
>
> Any ideas?

Use htb. This is a classful tbf. In fact, an htb qdisc with 1 class with
rate = ceil is a tbf qdisc.

Stef

--
stef.coene@docum.org
"Using Linux as bandwidth manager" http://www.docum.org/
#lartc @ irc.openprojects.net
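Stef's suggestion can be sketched with HTB's class hierarchy: a common inner class plays the role of the shared limiter (TBF A), and a leaf class under it adds the second limit (TBF B) for FTP only. This is a hedged sketch, not from the original mails; the device, ports and rates are assumptions:

```shell
# Root HTB; unclassified traffic falls into the HTTP leaf by default.
tc qdisc add dev eth0 root handle 1: htb default 10

# Inner class 1:1 acts as the shared "TBF A" (1 Mbit overall cap).
tc class add dev eth0 parent 1:  classid 1:1  htb rate 1mbit   ceil 1mbit
# HTTP leaf: limited only by A (may borrow up to A's ceiling).
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 500kbit ceil 1mbit
# FTP leaf: additionally capped at 512 kbit, acting as "TBF B".
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 256kbit ceil 512kbit

# Classify FTP control traffic to 1:20 and HTTP to 1:10 via u32 port matches.
tc filter add dev eth0 parent 1: protocol ip u32 match ip dport 21 0xffff flowid 1:20
tc filter add dev eth0 parent 1: protocol ip u32 match ip dport 80 0xffff flowid 1:10
```

The design choice here is exactly Stef's observation: an HTB class with rate = ceil behaves like a TBF, and nesting such classes composes the two limits.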
Hi, Jan.

You wrote:

> Are you sure? In the previous diagram, the PRDB was checked before the
> packet hits the QOS Ingress. If the PRDB indeed is checked after QOS Ingress
> (i.e. in INPUT ROUTING), which seems the logical way, is it possible (with a
> patch???) to check the tc_index in "ip rule"? This would make it possible to
> let the output of the QOS ingress participate in the policy routing.
As I understand it:

After Julian's observation I believe the first diagram was wrong; reading
again from the "IPROUTE2 Utility Suite Howto":
-----------
Rules in routing policy database controlling route selection algorithm.
Classic routing algorithms used in the Internet make routing decisions based
only on the destination address of packets and in theory, but not in practice,
on the TOS field. In some circumstances we want to route packets differently
depending not only on the destination addresses, but also on other packet
fields such as source address, IP protocol, transport protocol ports or even
packet payload. This task is called "policy routing".
To solve this task the conventional destination based routing table, ordered
according to the longest match rule, is replaced with the "routing policy
database" or RPDB, which selects the appropriate route through execution of
some set of rules. These rules may have many keys of different natures and
therefore they have no natural ordering excepting that which is imposed by the
network administrator. In Linux the RPDB is a linear list of rules ordered by
a numeric priority value. The RPDB explicitly allows matching packet source
address, packet destination address, TOS, incoming interface (which is packet
metadata, rather than a packet field), and using fwmark values for matching
IP protocols and transport ports.
-------------
ip rule is input routing (and has access to the TOS field).
Reading from Almesberger:
------------------------------
1) The DSCP is the upper six bits of the DS field.
2) The DS field is the same as the TOS field.
------------------------------
Reading the ip rule code in the iproute2 package from Kuznetsov:
-----------------------------
1) ip rule uses the iprule_modify function to set rules.
2) iprule_modify uses rtnetlink calls through libnetlink. The rtmsg structure
is used as a channel to interchange information.
3) One of the fields of the rtmsg structure is rtm_tos.
4) You can check this octet through the ip rule "tos TOS" selector.
---------------------------------
Also from Differentiated Services on Linux (Almesberger) - 06/1999:
--------------------------------
When using "sch_dsmark", the class number returned by the
classifier is stored in skb->tc_index. This way, the result can be
re-used during later processing steps.
Nodes in multiple DS domains must also be able to distinguish
packets by the inbound interface in order to translate the DSCP to
the correct PHB. This can be done using the "route" classifier, in
combination with the "ip rule" command interface subset.
-----------------------
I hope this answers your question; any feedback from experienced
people on the list is welcome.
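For reference, the "tos" and "fwmark" selectors discussed above look roughly like this in practice. A minimal sketch under stated assumptions: the addresses, mark value, table numbers and priorities are all hypothetical:

```shell
# Route packets carrying fwmark 1 (set earlier in mangle PREROUTING)
# through routing table 100:
iptables -t mangle -A PREROUTING -s 10.0.0.0/8 -j MARK --set-mark 1
ip rule add fwmark 1 table 100 prio 100
ip route add default via 192.168.1.1 table 100

# The TOS selector from the iproute2 notes above (0x10 = minimize-delay):
ip rule add tos 0x10 table 101 prio 110
ip route add default via 192.168.1.2 table 101
```

Since the mark is applied in mangle PREROUTING, it is already present when the INPUT ROUTING + PRDB box in the diagram consults these rules.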
> FYI, there is a iptables patch out there, called mangle5hooks, so the mangle
> table registers all 5 netfilter hooks. This implies that the mangle table
> has 5 chains instead of 2, PREROUTING, INPUT, OUTPUT, FORWARD and
> POSTROUTING.
I will try to update the diagram. These words from Julian scare me a little:
----------------
It was JFYI. I'm not sure whether we can find a place in the
diagram for all programs in the world that are using NF hooks :) Of course,
you can go further and to build a jumbo picture of the NF world :)
----------------
Best regards,
Leonardo Balliache
Hi Stef:

You wrote:

> Where belongs the IMQ device ? For ingress, it registers the needed
> netfilter hooks right after the mangle table. For egress, it registers the
> needed netfilter hooks after all other tables so after POSTROUTING in the
> diagram.
>
> I think the packet is also redirected to the imq device at the same place.
> But I'm not sure.
I don't know either, but...
From "The intermediate queueing device" by Patrick McHardy:
-------------------------
The Intermediate queueing device can be used for advanced traffic control.
You can use it to implement egress + ingress traffic control, possibly over
multiple network devices. All packets entering/leaving the ipstack marked
with an special iptables target will be directed through the qdisc attached
to an imq device. After enqueueing the decision what happens to a packet is
up to the qdisc. It can reorder/drop packets according to local policies.
This allows you to treat network devices as classes and distribute bandwidth
among them as well as doing real ingress traffic control using egress qdiscs.
-------------------------
The ipstack Patrick is talking about is after input mangle.
Reading from "The journey of a packet through the linux 2.4 network stack"
by Harald Welte, we have:
----------------------------
The IP packet handler is registered via net/core/dev.c:dev_add_pack() called
from net/ipv4/ip_output.c:ip_init().
The IPv4 packet handling function is net/ipv4/ip_input.c:ip_rcv(). After some
initial checks (if the packet is for this host, ...) the ip checksum is
calculated. Additional checks are done on the length and IP protocol
version 4.
Every packet failing one of the sanity checks is dropped at this point.
If the packet passes the tests, we determine the size of the ip packet and
trim the skb in case the transport medium has appended some padding.
Now it is the first time one of the netfilter hooks is called.
Netfilter provides a generic and abstract interface to the standard routing
code. This is currently used for packet filtering, mangling, NAT and queuing
packets to userspace. For further reference see my conference paper "The
netfilter subsystem in Linux 2.4" or one of Rusty's unreliable guides, i.e.
the netfilter-hacking-guide.
-------------------------------
The ipstack Patrick uses must be what Harald called (after the first group of
netfilter hooks) "queueing packets to userspace".

I suppose IMQ is an iptables target extension like QUEUE, just before ingress
queueing. Packets are marked in PREROUTING mangle and taken from the ipstack
to enter the dummy device, and "on exit" they are policed using some of the
queue disciplines.
+-------+------+
| nat |
| PREROUTING | <- DEST REWRITE
+-------+------+
|
+-------+------+
| ipchains |
| FILTER |
+-------+------+
|
is IMQ probably here ??
|
+-------+------+
| QOS |
| INGRESS | <- controlled by tc
+-------+------+
|
packet is for +-------+------+ packet is for
this address | INPUT | another address
+--------------+ ROUTING +---------------+
| | + PRDB | |
| +--------------+ |
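For what it's worth, classic IMQ usage looks roughly like this. A hedged sketch only: it assumes the IMQ kernel and iptables patches are applied and the device is imq0; none of this is from the original mails:

```shell
# Bring up the intermediate queueing device (exists only with the IMQ patch).
ip link set imq0 up

# Redirect packets arriving on eth0 into imq0 at PREROUTING (ingress side).
iptables -t mangle -A PREROUTING -i eth0 -j IMQ --todev 0

# Attach an ordinary egress qdisc to imq0, effectively shaping ingress
# traffic with egress-style qdiscs, as Patrick's text describes.
tc qdisc add dev imq0 root handle 1: htb default 10
tc class add dev imq0 parent 1: classid 1:10 htb rate 512kbit
```

If this is right, the iptables rule is what fixes the "is IMQ probably here ??" position in the ingress half of the diagram.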
If we keep on reading, we have:
----------------------------------------------
After successful traversal of the netfilter hook,
net/ipv4/ip_input.c:ip_rcv_finish() is called.
Inside ip_rcv_finish(), the packet's destination is determined by calling the
routing function net/ipv4/route.c:ip_route_input(). Furthermore, if our IP
packet has IP options, they are processed now. Depending on the routing
decision made by net/ipv4/route.c:ip_route_input_slow(), the journey of our
packet continues in one of the following functions:
net/ipv4/ip_input.c:ip_local_deliver()
The packet's destination is local; we have to process the layer 4 protocol
and pass it to a userspace process.
net/ipv4/ip_forward.c:ip_forward()
The packet's destination is not local; we have to forward it to another
network.
net/ipv4/route.c:ip_error()
An error occurred; we are unable to find an appropriate routing table entry
for this packet.
net/ipv4/ipmr.c:ip_mr_input()
It is a multicast packet and we have to do some multicast routing.
If the routing decided that this packet has to be forwarded to another device,
the function net/ipv4/ip_forward.c:ip_forward() is called.
The first task of this function is to check the ip header's TTL. If it
is <= 1 we drop the packet and return an ICMP time exceeded message to the
sender.
We check whether the header has enough tailroom for the destination
device's link layer header, and expand the skb if necessary.
Next the TTL is decremented by one.
If our new packet is bigger than the MTU of the destination device and the
don't fragment bit in the IP header is set, we drop the packet and send an
ICMP frag needed message to the sender.
Finally it is time to call another one of the netfilter hooks - this time it
is the NF_IP_FORWARD hook.
Assuming that the netfilter hook returns a NF_ACCEPT verdict, the
function net/ipv4/ip_forward.c:ip_forward_finish() is the next step in our
packet's journey.
ip_forward_finish() itself checks if we need to set any additional options in
the IP header, and has ip_opt *FIXME* doing this. Afterwards it calls
include/net/ip.h:ip_send().
If we need some fragmentation, *FIXME*:ip_fragment gets called, otherwise we
continue in net/ipv4/ip_forward:ip_finish_output().
ip_finish_output() again does nothing else than calling the netfilter
postrouting hook NF_IP_POST_ROUTING and calling ip_finish_output2() on
successful traversal of this hook.
ip_finish_output2() prepends the hardware (link layer) header to our
skb and calls net/ipv4/ip_output.c:ip_output().
---------------------
The *FIXME* markers are actually present in Harald's document.
Ok, as I understand it, the second IMQ hook must be after the netfilter
postrouting hook NF_IP_POST_ROUTING, but before calling the link layer
function ip_output in ip_output.c.
|
+-------+------+
| nat |
| POSTROUTING | SOURCE REWRITE
+-------+------+
|
is IMQ probably here ??
|
+-------+------+
| QOS |
| EGRESS | <- controlled by tc
+-------+------+
|
-----------+-----------
Network
I'm not sure again. Perhaps if Patrick is reading this he can help a little.
Best regards,
Leonardo Balliache
PS: thanks a lot for uploading the diagram to your site.