Sorry everyone, I'm late.. always!!

Here is a new version of the kernel packet traveling diagram. Thanks a
lot to Julian Anastasov for his comments (notes at the end, as I
understood them).

I insist it's nice to have this diagram ready and updated. Notice that
when Jan Coppens needed to tell us where he needs to mark packets, he
just said "At this point I should need another mangle table->", using
the diagram as a reference.

We understand that the internal kernel code is complex and interlaced,
and it is not always possible to identify each part of it clearly in a
simple diagram. But we keep on trying.

I've got some comments:

1) I didn't know there was ipchains code in kernel 2.4; I supposed the
   new iptables code replaced the old ipchains code entirely. Any
   feedback about this would be useful.

2) Below I enclose a link to an article by Harald Welte, "The journey
   of a packet through the linux 2.4 network stack"; it could help the
   discussion and lead to an improved diagram, if that's possible.

   http://www.gnumonks.org/ftp/pub/doc/packet-journey-2.4.html

3) TODO: include LVS in the diagram. Julian gave us this link to study
   the issue and try to complete the diagram.

   http://www.linuxvirtualserver.org/Joseph.Mack/HOWTO/LVS-HOWTO-19.html#ss19.21

4) Of course, the diagram is ready to be shot at. Any comment,
   criticism, etc. is welcome.
Best regards,

Leonardo Balliache


                      Network
                -----------+-----------
                           |
                   +-------+------+
                   |    mangle    |
                   |  PREROUTING  |  <- MARK REWRITE
                   +-------+------+
                           |
                   +-------+------+
                   |     nat      |
                   |  PREROUTING  |  <- DEST REWRITE
                   +-------+------+
                           |
                   +-------+------+
                   |   ipchains   |
                   |    FILTER    |
                   +-------+------+
                           |
                   +-------+------+
                   |     QOS      |
                   |   INGRESS    |  <- controlled by tc
                   +-------+------+
                           |
     packet is for +-------+------+ packet is for
     this address  |    INPUT     | another address
          +--------+   ROUTING    +--------+
          |        |    + PRDB    |        |
          |        +--------------+        |
    +-----+--------+                       |
    |    filter    |                       |
    |    INPUT     |                       |
    +-------+------+                       |
            |                              |
    +-------+------+                       |
    |    Local     |                       |
    |   Process    |                       |
    +-------+------+                       |
            |                              |
    +-------+------+               +-------+-------+
    |    OUTPUT    |               |    filter     |
    |   ROUTING    |               |    FORWARD    |
    +-------+------+               +-------+-------+
            |                              |
    +-------+------+                       |
    |    mangle    |                       |
    |    OUTPUT    | MARK REWRITE          |
    +-------+------+                       |
            |                              |
    +-------+------+                       |
    |     nat      |                       |
    |    OUTPUT    | DEST REWRITE          |
    +-------+------+                       |
            |                              |
    +-------+------+                       |
    |    filter    |                       |
    |    OUTPUT    |                       |
    +-------+------+                       |
            |                              |
            +-----------+      +-----------+
                        |      |
                   +----+------+--+
                   |   ipchains   |
                   |    FILTER    |
                   +-------+------+
                           |
                   +-------+------+
                   |     nat      |
                   | POSTROUTING  | SOURCE REWRITE
                   +-------+------+
                           |
                   +-------+------+
                   |     QOS      |
                   |    EGRESS    |  <- controlled by tc
                   +-------+------+
                           |
                -----------+-----------
                      Network


Notes:

1) The input routing determines local/forward.
2) ip rule (the policy routing database, PRDB) is input routing; more
   correctly, part of the input routing.
3) The output routing is performed from a "higher layer".
4) The nexthop and the output device are determined by both the input
   and the output routing.
5) The forwarding process is called from a specific place in the input
   routing code. It executes after input routing and does not perform
   nexthop/outdev selection. It is the process of receiving and sending
   the same packet, but in the context of all these hooks it is the
   code that sends ICMP redirects (demanded by the input routing),
   decrements the IP TTL, performs dumb NAT and calls the filter
   chain. This code is used only for forwarded packets.
6) Sometimes the word "Forwarding", with a capital F, is used to refer
   to both the routing and the forwarding process.
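As a small illustration of note 2 (the PRDB consulted during input
routing), here is a sketch of how fwmark-based policy routing is
typically configured. The table number, mark value, port and gateway
are invented values, not taken from the thread:

```shell
# Hypothetical sketch: route marked packets via a separate table.
# Mark interesting packets in the mangle PREROUTING chain...
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1

# ...add a PRDB rule: marked packets consult table 100 during
# input routing...
ip rule add fwmark 1 table 100

# ...and let table 100 send them to an alternate gateway.
ip route add default via 192.168.1.254 table 100

# Inspect the priority-ordered rule list, as note 2 describes.
ip rule list
```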
Hello,

On Thu, 27 Jun 2002, Leonardo Balliache wrote:

> I've got some comments:
>
> 1) I didn't know there was ipchains code in kernel 2.4; I supposed
> iptables new code replace totally old ipchains code. Any feedback
> about it would be useful.

You can run ipchains or iptables, but not both.

> 3) TODO: include LVS in the diagram. Julian give us this link to study
> the issue and trying to complete the diagram.

It was JFYI. I'm not sure whether we can find a place in the diagram
for all the programs in the world that are using NF hooks :) Of course,
you can go further and build a jumbo picture of the NF world :)

> http://www.linuxvirtualserver.org/Joseph.Mack/HOWTO/LVS-HOWTO-19.html#ss19.21
>
> Best regards,
>
> Leonardo Balliache

Regards

--
Julian Anastasov <ja@ssi.bg>
> 4) Of course, diagram is ready to be shoot it off. Any comment,
> criticism, etc,. is welcome.

Where does the IMQ device belong? For ingress, it registers the needed
netfilter hooks right after the mangle table. For egress, it registers
the needed netfilter hooks after all the other tables, so after
POSTROUTING in the diagram.

I think the packet is also redirected to the imq device at the same
place. But I'm not sure.

Stef

PS. I put the diagram online on www.docum.org and called it KPTD :)
(I will upload it tonight from home.docum.org to www.docum.org)

--
stef.coene@docum.org
"Using Linux as bandwidth manager" http://www.docum.org/
#lartc @ irc.openprojects.net
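For reference, attaching a qdisc to an IMQ device usually looks like
the following sketch. It assumes the IMQ kernel and iptables patches
are applied (providing the IMQ target and the imq0 device); the
interface name and rate are invented:

```shell
# Hypothetical IMQ setup sketch (requires the IMQ patches).
# Send all packets arriving on eth0 through imq0...
iptables -t mangle -A PREROUTING -i eth0 -j IMQ --todev 0

# ...bring the device up and attach an egress qdisc to it, so that
# "real" ingress shaping is done with an egress qdisc, as Patrick's
# text describes.
ip link set imq0 up
tc qdisc add dev imq0 root handle 1: htb default 10
tc class add dev imq0 parent 1: classid 1:10 htb rate 1mbit
```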
----- Original Message -----
From: "Leonardo Balliache" <leoball@opalsoft.net>
To: <lartc@mailman.ds9a.nl>
Sent: Thursday, June 27, 2002 8:02 PM
Subject: [LARTC] kernel packet traveling diagram

> Sorry everyone, I'm late.. always!!
>
> [comments snipped; quoted in full earlier in the thread]
> Best regards,
>
> Leonardo Balliache
>
> [kernel packet traveling diagram snipped; see the original message]
>
> Notes:
>
> 1) The input routing determines local/forward.
> 2) ip rule (policy routing database PRDB) is input routing, more
>    correctly, part of the input routing.

Are you sure? In the previous diagram, the PRDB was checked before the
packet hits the QOS Ingress. If the PRDB is indeed checked after QOS
Ingress (i.e. in INPUT ROUTING), which seems the logical way, is it
possible (with a patch???) to check the tc_index in "ip rule"? This
would make it possible to let the output of the QOS ingress participate
in the policy routing.
FYI, there is an iptables patch out there, called mangle5hooks, with
which the mangle table registers all 5 netfilter hooks. This implies
that the mangle table has 5 chains instead of 2: PREROUTING, INPUT,
OUTPUT, FORWARD and POSTROUTING.

Cheers,
Jan

> 3) The output routing is performed from "higher layer".
> 4) nexthop and output device are determined both from the input and
>    the output routing.
> 5) The forwarding process is called at input routing by functions from
>    specific place in the code. It executes after input routing and
>    does not perform nexthop/outdev selection. It's the process of
>    receiving and sending the same packet but in the context of all
>    these hooks the code that sends ICMP redirects (demanded from input
>    routing), decs the IP TTL, performs dumb NAT and calls the filter
>    chain. This code is used only for forwarded packets.
> 6) Sometimes the word "Forwarding" with "big F", is used for
>    referencing both, the routing and forwarding process.

_______________________________________________
LARTC mailing list / LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/mailman/listinfo/lartc HOWTO: http://lartc.org/
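Assuming the mangle5hooks patch is applied, the new FORWARD chain lets
you mark packets after the routing decision; a sketch with an invented
mark value, match and interface:

```shell
# Hypothetical sketch, assuming the mangle5hooks patch: with all five
# chains available, forwarded packets can be marked between the
# routing decision and POSTROUTING.
iptables -t mangle -A FORWARD -o eth1 -j MARK --set-mark 2

# The mark can then be matched by tc's fw classifier on the
# outgoing interface (assumes a classful qdisc with handle 1: exists).
tc filter add dev eth1 parent 1:0 protocol ip handle 2 fw flowid 1:2
```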
Hello!

I want to let FTP traffic pass two TBFs (A and B) and let HTTP traffic
pass only TBF A, i.e. one token bucket filter is shared and the other
is not.

Any ideas?

Regards,

Gabriel Paues
Swedish Institute of Computer Science
On Friday 28 June 2002 13:39, Gabriel Paues wrote:

> Hello!
>
> I want to let FTP traffic pass two TBFs (A and B) and let HTTP traffic
> pass only TBF A, i.e one token bucket filter is shared and the other is
> not.
>
> Any ideas?

Use htb. It is a classful tbf. In fact, an htb qdisc with 1 class with
rate = ceil is a tbf qdisc.

Stef
--
stef.coene@docum.org
"Using Linux as bandwidth manager" http://www.docum.org/
#lartc @ irc.openprojects.net
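Stef's suggestion can be sketched like this: a parent HTB class plays
the role of the shared limit A, and a child class adds the extra limit
B for FTP only. The device, rates and port matches are invented for
illustration:

```shell
# Hypothetical sketch: HTB as nested "TBFs" (rate = ceil, no borrowing).
tc qdisc add dev eth0 root handle 1: htb default 20

# Class 1:1 is the shared TBF A.
tc class add dev eth0 parent 1: classid 1:1 htb rate 2mbit ceil 2mbit

# FTP gets the extra limit B inside A...
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 512kbit ceil 512kbit
# ...while HTTP only sees limit A.
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1500kbit ceil 2mbit

# Classify by destination port (ftp control 21, http 80).
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dport 21 0xffff flowid 1:10
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dport 80 0xffff flowid 1:20
```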
Hi, Jan. You wrote:

> Are you sure? In the previous diagram, the PRDB was checked before the
> packet hits the QOS Ingress. If the PRDB indeed is checked after QOS
> Ingress (i.e. in INPUT ROUTING), which seems the logical way, is it
> possible (with a patch???) to check the tc_index in "ip rule"? This
> would make it possible to let the output of the QOS ingress
> participate in the policy routing.

As I understand it: after Julian's observation I believe the first
diagram was wrong; I re-read the "IPROUTE2 Utility Suite Howto", which
says:

-----------
Rules in routing policy database controlling route selection algorithm.

Classic routing algorithms used in the Internet make routing decisions
based only on the destination address of packets and, in theory but not
in practice, on the TOS field. In some circumstances we want to route
packets differently depending not only on the destination addresses,
but also on other packet fields such as source address, IP protocol,
transport protocol ports or even packet payload. This task is called
"policy routing".

To solve this task the conventional destination based routing table,
ordered according to the longest match rule, is replaced with the
"routing policy database" or RPDB, which selects the appropriate route
through execution of some set of rules. These rules may have many keys
of different natures and therefore they have no natural ordering
excepting that which is imposed by the network administrator. In Linux
the RPDB is a linear list of rules ordered by a numeric priority value.
The RPDB explicitly allows matching packet source address, packet
destination address, TOS, incoming interface (which is packet metadata,
rather than a packet field), and using fwmark values for matching IP
protocols and transport ports.
-----------

So ip rule is input routing (and has access to the TOS field).

Reading from Almesberger:
------------------------------
1) DSCP are the upper six bits of the DS field.
2) The DS field is the same as the TOS field.
------------------------------

Reading the ip rule code in the iproute2 package from Kuznetsov:
-----------------------------
1) ip rule uses the iprule_modify function to set rules.
2) iprule_modify uses rtnetlink calls through libnetlink. The rtmsg
   structure is used as a channel to interchange information.
3) One of the fields of the rtmsg structure is rtm_tos.
4) You have access to check this octet through the ip rule "tos TOS"
   selector.
-----------------------------

Also from Differentiated Services on Linux (Almesberger) - 06/1999:
--------------------------------
When using "sch_dsmark", the class number returned by the classifier is
stored in skb->tc_index. This way, the result can be re-used during
later processing steps. Nodes in multiple DS domains must also be able
to distinguish packets by the inbound interface in order to translate
the DSCP to the correct PHB. This can be done using the "route"
classifier, in combination with the "ip rule" command interface subset.
--------------------------------

I hope this answers your question; any feedback from experienced people
on the list is welcome.

> FYI, there is a iptables patch out there, called mangle5hooks, so the
> mangle table registers all 5 netfilter hooks. This implies that the
> mangle table has 5 chains instead of 2, PREROUTING, INPUT, OUTPUT,
> FORWARD and POSTROUTING.

I will try to update the diagram. These words from Julian scare me a
little:

----------------
It was JFYI. I'm not sure whether we can find a place in the diagram
for all programs in the world that are using NF hooks :) Of course, you
can go further and to build a jumbo picture of the NF world :)
----------------

Best regards,

Leonardo Balliache
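The "tos TOS" selector described in point 4 above can be exercised like
this; the TOS value, table number and gateway are invented for the
example:

```shell
# Hypothetical sketch: select a routing table by the TOS octet, which
# ip rule passes down via rtmsg's rtm_tos field.
ip rule add tos 0x10 table 10       # "minimize delay" traffic -> table 10
ip route add default via 10.0.0.1 table 10

# Show the rule list to confirm the tos match was installed.
ip rule list
```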
Hi Stef:

You wrote:

> Where belongs the IMQ device ? For ingress, it registers the needed
> netfilter hooks right after the mangle table. For egress, it registers
> the needed netfilter hooks after all other tables so after POSTROUTING
> in the diagram.
>
> I think the packet is also redirected to the imq device at the same
> place. But I'm not sure.

I don't know either, but...

From "The intermediate queueing device" by Patrick McHardy:
-------------------------
The Intermediate queueing device can be used for advanced traffic
control. You can use it to implement egress + ingress traffic control,
possibly over multiple network devices. All packets entering/leaving
the ipstack marked with a special iptables target will be directed
through the qdisc attached to an imq device. After enqueueing, the
decision what happens to a packet is up to the qdisc. It can
reorder/drop packets according to local policies. This allows you to
treat network devices as classes and distribute bandwidth among them,
as well as doing real ingress traffic control using egress qdiscs.
-------------------------

The ipstack Patrick is talking about is after input mangle.

Reading from "The journey of a packet through the linux 2.4 network
stack" by Harald Welte we have:
----------------------------
The IP packet handler is registered via net/core/dev.c:dev_add_pack()
called from net/ipv4/ip_output.c:ip_init(). The IPv4 packet handling
function is net/ipv4/ip_input.c:ip_rcv(). After some initial checks (if
the packet is for this host, ...) the ip checksum is calculated.
Additional checks are done on the length and IP protocol version 4.

Every packet failing one of the sanity checks is dropped at this point.
If the packet passes the tests, we determine the size of the ip packet
and trim the skb in case the transport medium has appended some
padding.

Now it is the first time one of the netfilter hooks is called.
Netfilter provides a generic and abstract interface to the standard
routing code. This is currently used for packet filtering, mangling,
NAT and queuing packets to userspace. For further reference see my
conference paper 'The netfilter subsystem in Linux 2.4' or one of
Rusty's unreliable guides, i.e. the netfilter-hacking-guide.
----------------------------

The ipstack Patrick uses must be what Harald called (after the first
group of netfilter hooks) "queueing packets to userspace". I suppose
IMQ is an iptables target extension like QUEUE, placed just before
ingress queueing. Packets are marked in PREROUTING mangle and taken
from the ipstack to enter the dummy device, and "on exit" they are
policed using some of the queue disciplines.

               +-------+------+
               |     nat      |
               |  PREROUTING  |  <- DEST REWRITE
               +-------+------+
                       |
               +-------+------+
               |   ipchains   |
               |    FILTER    |
               +-------+------+
                       |
                       |   is IMQ probably here ??
                       |
               +-------+------+
               |     QOS      |
               |   INGRESS    |  <- controlled by tc
               +-------+------+
                       |
 packet is for +-------+------+ packet is for
 this address  |    INPUT     | another address
      +--------+   ROUTING    +--------+
      |        |    + PRDB    |        |
      |        +--------------+        |

If we keep on reading, we have:
----------------------------------------------
After successful traversal of the netfilter hook,
net/ipv4/ip_input.c:ip_rcv_finish() is called. Inside ip_rcv_finish(),
the packet's destination is determined by calling the routing function
net/ipv4/route.c:ip_route_input(). Furthermore, if our IP packet has IP
options, they are processed now. Depending on the routing decision made
by net/ipv4/route.c:ip_route_input_slow(), the journey of our packet
continues in one of the following functions:

net/ipv4/ip_input.c:ip_local_deliver()
  The packet's destination is local, we have to process the layer 4
  protocol and pass it to a userspace process.

net/ipv4/ip_forward.c:ip_forward()
  The packet's destination is not local, we have to forward it to
  another network.

net/ipv4/route.c:ip_error()
  An error occurred, we are unable to find an appropriate routing table
  entry for this packet.

net/ipv4/ipmr.c:ip_mr_input()
  It is a Multicast packet and we have to do some multicast routing.

If the routing decided that this packet has to be forwarded to another
device, the function net/ipv4/ip_forward.c:ip_forward() is called. The
first task of this function is to check the ip header's TTL. If it is
<= 1 we drop the packet and return an ICMP time exceeded message to the
sender. We check if we have enough tailroom for the destination
device's link layer header and expand the skb if necessary. Next the
TTL is decremented by one. If our new packet is bigger than the MTU of
the destination device and the don't fragment bit in the IP header is
set, we drop the packet and send an ICMP frag needed message to the
sender.

Finally it is time to call another one of the netfilter hooks - this
time it is the NF_IP_FORWARD hook. Assuming that the netfilter hook
returns an NF_ACCEPT verdict, the function
net/ipv4/ip_forward.c:ip_forward_finish() is the next step in our
packet's journey.

ip_forward_finish() itself checks if we need to set any additional
options in the IP header, and has ip_opt *FIXME* doing this. Afterwards
it calls include/net/ip.h:ip_send(). If we need some fragmentation,
*FIXME*:ip_fragment gets called, otherwise we continue in
net/ipv4/ip_forward:ip_finish_output().

ip_finish_output() again does nothing else than calling the netfilter
postrouting hook NF_IP_POST_ROUTING and calling ip_finish_output2() on
successful traversal of this hook. ip_finish_output2() prepends the
hardware (link layer) header to our skb and calls
net/ipv4/ip_output.c:ip_output().
----------------------------------------------

The *FIXME*s are actually placed in Harald's document.

Ok, as I understand it, the second IMQ hook must be after the netfilter
postrouting hook NF_IP_POST_ROUTING but before calling the link layer
function ip_output in ip_output.c.

                       |
               +-------+------+
               |     nat      |
               | POSTROUTING  | SOURCE REWRITE
               +-------+------+
                       |
                       |   is IMQ probably here ??
                       |
               +-------+------+
               |     QOS      |
               |    EGRESS    |  <- controlled by tc
               +-------+------+
                       |
            -----------+-----------
                  Network

I'm not sure again. Perhaps if Patrick is reading this he can help a
little.

Best regards,

Leonardo Balliache

PS: thanks a lot for uploading the diagram to your site.