====================================================================== CMSC 417-F02 NOTES ON CHAPTER 4: NETWORK LAYER SHANKAR - Network layer consists of a network of nodes and links. - Network layer service: any two nodes can exchange messages with some specified performance - Network layer functions - Routing: distributed algorithm run by the nodes to - update set of reachable nodes - compute/update good paths between any two nodes - Forwarding: at each node a forwarding table indicating for each incoming data packet the appropriate outgoing link. ---------------------------------------------------------------------- VIRTUAL-CIRCUIT APPROACH - Paths/packet header/forwarding are flow-specific. - When node A wants to send data to node B, - routing function provides a path from A to B - attempt made to establish a path (reserve resource) along the path. (call-establishment) - if successful, data can be sent from A to B with some specified performance guarantees (eg, CBR, VBR, depending on resources reserved). Otherwise, call is rejected ---------------------------------------------------------------------- DATAGRAM APPROACH: - Paths/packet header/forwarding are not flow-specific, but based only on destination (and perhaps QoS attributes). - When node A wants to send data to node B, it just sends (connectionless network layer). Only provides best-effort delivery. - Routing info updates independent of start/end of flows. ---------------------------------------------------------------------- GENERAL ROUTING PROBLEM - Given network of - nodes and bidirectional links - each link has a time-varying (perhaps traffic-dependent) cost Cost is always greater than zero, and is "infinity" if link is down. - each node monitors its outgoing links costs (eg, up/down, queue size) - Want a distributed algorithm that keeps the nodes updated regarding: - set of (reachable) nodes in network - the "good" paths between them. - "Good" can be based on performance objectives: - dollar cost, delay, reliability, etc, - define link cost and compute min cost paths - "Good" can be based on administrative policy objectives: - domains to use, avoid, not allow transit traffic ---------------------------------------------------------------------- ATTRIBUTES OF ROUTING ALGORITHMS - Global view vs local view Each node has a VIEW of the network, representing the state of the network to some extent. - Global view: the view indicates the full topology of the network and all the link costs. Link state routing (OSPF). - Local view: a node's view is more detailed for close by parts of the network, and less detailed for far-away parts. Distance-vector routing (RIP). - Link cost dynamics: Slowly changing: eg, up/down, queue size averaged over hours, $cost. Rapidly changing: queue size averaged over seconds or minutes. Load independent or load dependent (rapidly changing). Dynamic link cost and single-path routing can cause oscillations if the link cost changes faster than the routing feedback loop. - Centralized vs decentralized - centralized: a special node collects link cost info from around the network, computes paths, informs all nodes of paths - decentralized: each node collects link cost info from around the network, computes paths for itself to use. ---------------------------------------------------------------------- LINK STATE ROUTING - global and decentralized - Each node x maintains a directed graph representing its global view: - vertices/arcs are x's current estimate of the network's nodes/links - each arc has a link cost (x's current estimate of the actual link cost) - a sequence number for each vertex (identifying the last link cost update for this vertex received by x). - Each node x's forwarding table is obtained from its directed graph by computing a shortest path tree from the x vertex to all other vertices. The next-hop to node y is simply the first vertex in the path from x to y in this tree. - Each node x periodically floods its outgoing link costs to the network A link cost message contains: - source node id (the node originating the message) - msg sequence number (or timestamp), to detect duplicates and stop flooding) - for each outgoing link: - node id on the other end of the link - link cost - When a node receives a link cost message from a neighbour, If this is a new message (from source node id/timestamp), the node - forwards the message to its neighbours (except the one from which it received the msg). - updates its view, computes paths, updates forwarding table if paths changed. - Node memory cost: O(N*E) Node computation cost: O(N*logN) for each view update Message complexity: O(N*E) - Responsiveness: Transient loops are possible. That is, the next-hops at the nodes can form loops. This happens only during a routing update in progress, that is, during the time when some of the nodes have reacted to a link cost change and the other nodes have not yet. During this time, data packets can go in loops. These routing loops are short-lived, that is, their duration is of the order of the network diameter (assuming that routing messages get priority over data messages). ---------------------------------------------------------------------- DISTANCE-VECTOR ROUTING - local and decentralized - Key idea behind distance-vector routing. Let x be a node, z a neighbor of x, and y any node other than x. Let C(x,z) denote the cost of the link from x to z. Let C_x(y) denote the minimum cost path from x to y. Let C_x(y,z) denote the minimum cost path from x to y where the first hop is to z (z may be y). Distance-vector routing is based on the following. - C_x(y) = min{C_x(y,z): z ranges over x's neigbors} - C_x(y,z) = C(x,z) + min{C_z(y,u): u ranges over z's neigbors} - In the distance-vector routing algorithm, each node maintains estimates of these costs and updates them according to the above equations. We use the term "distances" to refer to these estimates. When the algorithm has stabilized (i.e., finished "responding" to a link cost change), the distances at each node are accurate. We now describe the distance-vector routing algorithm. - Each node x maintains a local view of the network as follows. For each node y, x maintains the following: - for every neighbor z of x: D_x(y,z), an estimate of the min cost to y via z. D_x(y,z) is referred to as the distance from x to y via z. D_x(y,z) is "infinity" means that y cannot reach z. - Nhop_x(y), the next hop to y Nhop_x(y) is a neighbor of x such that D_x(y,Nhop_x(y)) = min{D_x(y,z): z ranges over x's neigbors}. Nhop_x(y) is NIL means there is no path from x to y. - D_x(y), the shortest distance to y. D_x(y) = D_x(y,Nhop_x(y)) D_x(y) is infinity means there is no path from x to y. - The forwarding table at node x consists of {: all nodes in View} - Each node x does the following upon a change n an outgoing link cost C(x,z), where z is a neighbor: if C(x,z) changes by d (positive or negative), for all y, D_x(y,z) := D_x(y,z) + d // update distances-via if ( D_x(y,z) < D_x(y) ) // update distances,nexthops then { D_x(y) := D_x(y,z) ; Nhop_x(y) := z } else if ( Nhop_x(y) = z ) then update NHop_x(y) and D_x(y) - Let F be the average number of neighbors of a node. Node memory cost: O(N*F) Node computation cost: O(N*F) for each view update. Message complexity: O(N*F) - Responsiveness: - Transient loops (ie, loops in next hops) are possible. - Short-lived (order of network diameter) if initiated by link cost decrease. ("Good news" travels fast) - Long-lived (order of link cost change) if initiated by link cost increase. ("Bad news" travels slowly, count to infinity) There are fixes - Poisoned reverse is a partial fix. - Path-vector algorithms, ie, including path in views and routing updates is a fix but leads to same cost as link state. - Other fixes that do not incur extra cost. EXAMPLES ---------------------------------------------------------------------- HIERARCHICAL ROUTING Basic idea: - Organize nodes into a hierarchy of clusters. Each cluster has one or more nodes that serve to "represent" the cluster to the next higher level in the hieararchy. At any level: - Routing paths within a cluster are obtained as usual: by running a routing algorithm (DV or LS) in the cluster. - Routing paths between clusters is done by using the path supplied by the next higher level. - Reduces overhead cost but increases data path length Hence relevant for very large networks (eg, 10000 nodes). We now expand on this for a two-level hieararchy. Two-level hiearchy: - Consider a network of nodes and links. We can assume that any two neighboring nodes can exchange messages. We cannot assume that two nonneigboring nodes can exchange messages, because this is what routing is supposed to achieve. - Each node is a level-0 node. Each link is a level-0 link. The level-0 nodes and links form the level-0 network; it is the same as the original network. - Organize the level-0 nodes into "clusters" such that - the clusters are non-overlapping and cover all the nodes (ie, every node is in exactly one cluster), and - any two nodes in a cluster can be connected by a path that stays within the cluster. - Each cluster has an id, eg, A, B, ...., and each node has a globally unique id of the form X.y, where X is its cluster id and y is its id within the cluster. - A link is intra-cluster link if its end-points are in the same cluster. A link is inter-cluster link if its end-points are in different clusters. The nodes and intra-cluster links of a cluster define a clusternet. [If we were considering a hierarchy of more than two levels, we would treat these clusters and clusternets as level-0 entities.] - Define a node to be level-1 iff it is attached to an inter-cluster link (the node is also a level-0 node). - Define a level-1 link to exist between two level-1 nodes x and y iff x and y are connected by - a inter-cluster link (the link is both level-0 and level-1), or - a path of intra-cluster links (x and y are in the same cluster). - The level-1 nodes and level-1 links define the level-1 network. Observe that the level-1 network is "overlaid" on the level-0 network. Routing algorithms: - Each clusternet runs a routing algorithm, which yields intra-cluster paths between the nodes of the cluster. The algorithm can be run because any two nodes in the cluster connected by a (level-0) link can exchange messages. - The level-1 network runs a routing algorithm, which yields level-1 paths between the level-1 nodes. The algorithm can be run because any two level-1 neighbors, x and y, (ie, connected by a level-1 link) can exchange (routing) messages. Specifically: - If x and y are in different clusters, they can exchange messages over the level-0 (inter-cluster) link connecting them. - If x and y are in the same cluster, they are connected by an intra-cluster path over which they can exchange messages using the clusternet's routing functionality (eg, using IP or TCP). The linkcost of a level-1 link would be some function (eg, the sum) of the linkcosts of its constituent level-0 links. - Each clusternet runs a distributed algorithm in which the level-1 nodes of the cluster disseminate their "external" reachability info (eg, ids of clusters they can reach, perhaps with some cost estimate). to the level-0 nodes of the cluster. This distributed algorithm does not have to be a full-fledged routing algorithm. For example, it can be periodic broadcasts within the clusternet by the level-1 nodes; or each level-0 node can be statically assigned a level-1 node as its gateway. - Hierarchical routing reduces the cost of routing. Suppose C(x,y) is the cost of a routing algorithm (eg, DV, LS) in a network of x nodes and y edges. Consider a network of M clusters and F inter-cluster links, where each cluster has N nodes, E edges, and a single level-1 node. The cost of routing (ignoring the dissemination cost) is M*C(N,E)+C(M,F), instead of C(M*N,M*N+F) if no hierarchy was used. Internet: - The Internet uses a two-level hierarchy, with clusters corresponding to autonomous systems (AS). [ASs are also called domains (unrelated to naming domains).] Level-0 routing algorithms are typically OSPF and RIP. The level-1 routing algorithm is BGP (more precisely, E-BGP). The dissemination of external AS info within a cluster is done in various ways: DHCP (at boot-time), BGP (more precisely, I-BGP). ---------------------------------------------------------------------- IP (version 4) INTERNET PROTOCOL Purpose - IP intended to interconnect networks (perhaps of different kinds), by providing a common packet and addressing format. IP does not dictate the internal operation of a network. - The networks are connected at special nodes, called IP routers. An IP router has interfaces (links) to nodes in two or more networks. of other networks. Addressing - An IP address consists of an IP NETWORK ID and an IP INTERFACE ID. Each network gets a unique ip network id. Each host gets an unique ip address for its interface to its network; the network id part equals the ip network id of its network. Each IP router gets a unique ip address for each of its interfaces; the network id part equals the ip network id of the network to which the link leads. - Note that a network can have interfaces that have no ip address. Because a host has only one interface, the ip address of the interface is also called the ip address of the host. - IP address is 32 bits (dotted decimal convention: xxx.xxx.xxx.xxx). - Originally, IP had "classfull" addressing: - 4 classes of network ids: A, B, C, D. A, B, C are unicast addresses. D are multicast addresses. First few bits of an address identified its class. - An IP router's forwarding table has an entry for each network id. To forward an ip packet, it looks up the entry corresponding to the network id of the packet's destination ip address. - Classfull addressing resulted in large unusable gaps in address space. - Currently, IP has "classless" addressing: - A, B, C classes are now merged. - The network id of an ip address is now denoted using a NETMASK. So xxx.xxx.xxx.xxx/n means that the first n bits of the ip address consitute the network id. - But an ip packet's destination address does not indicate the netmask. Consequently, care has to taken in assigning network ids. Also, IP routers must now do "longest prefix" matching. The forwarding table still has an entry for every network id, except that now the network ids have the form xxxx/n. To forward an ip packet, the router has to look up the entry corresponding to the longest network id that matches the packet's destination address - But this "longest prefix" matching adds some flexibility in that a network that changes to another ISP may not need to change its IP addresses. Obtaining host IP addresses - A host's ip address can be statically assigned (stored in host OS) or dynamically obtained using DHCP. IP packet fields - version number, header length, datagram length, header checksum - fragmentation stuff: packet identifier, flags, offset - protocol number: TCP, UDP, ICMP, DNS, ... - TOS: mostly ignored by routers. - options (source routing, etc): mostly ignored by routers. - source/destination ip addresses. - data ---------------------------------------------------------------------- ICMP - Protocol for exchanging control information between routers and hosts. - echo - packet dropped (TTL exceeded, too large, destination unknown, etc) - source quench - Uses IP (ie, ICMP packets carried in IP packets). ---------------------------------------------------------------------- DHCP - For dynamically assigning IP addresses to hosts ---------------------------------------------------------------------- NAT - Mechanism for multiplexing many IP addresses, say k1, k2, ..., kn, on one IP address, say j, by using TCP port numbers to distinguish between k1, k2, ..., kn. ---------------------------------------------------------------------- IPv6 - IPv6 packet header - version, payload length - traffic class (like IPv4 TOS) - hop limit (like TTL) - flow label - source/destination addresses (128 bit) - next header (like IPv4 protocol id, but also for IP options) - ICMPv6: ICMP for IPv6 - Transitioning from IPv4 to IPv6 - Dual-stack approach: Use v6 if next-hop is v6 enabled and packet is v6, else use v4 (note that some v6 fields are lost when changing to v4). - Tunneling: v6 packets are ins v4 packets, and travel between v6-enabled routers ---------------------------------------------------------------------- ROUTER - A router with N links incident consists of - N input ports (packets arrive here) - N output ports (packets depart from here) - Switch fabric (transfers packets from input to output ports) - Routing processor (handles routing algorithm) - When a packet arrives, the router does the following: - extract packet's header - if the packet is a data packet, do a look up in forwarding table and transfer packet to appropriate outgoing port queue. - if the packet is a routing packet, do the routing alg update (ie, transfer packet to routing processor and handle there). - Queueing at the output ports is always needed because packets arriving at the same time on different links can be all destined for the same outgoing link. - If the router is fast enough to process every incoming packet as soon as it arrives [ie, O(N*B) for link bandwidth B], there is no need for queueing at the input ports. - In practice, the router is not so fast, and queuing is needed at the input ports, resulting in HOL (head-of-line) blocking. - Switch fabric choices: Memory, Bus, interconnection networks. - Queueing disciplines (also called link scheduling) at output ports: - FIFO/drop-tail is the usual discipline: packet dropped iff it arrives to a full queue. - FIFO/RED is a big improvement: Uses min and max thresholds for queue size. Arriving packet is dropped with prob that depends on queue size: = p1 if queue size is less than min threshold. = p2 if queue size is higher than max threshold. = linear interpolation between p1 and p2 otherwise. Helps TCP flows do fast retransmit/recovery (rather than slow-start) - There are more general class-based queueing disciplines, where packets are scheduled depending on their class (service level). - In general, the more sophisticated the queueing discipline, the slower the router becomes. ---------------------------------------------------------------------- MULTICAST Multicast (mcast) service (Internet version) - Mcast group consists of a set of hosts. - Hosts can join and leave the group. - Every host in the group receives every msg sent by any host in group. Hosts not in the group should not receive these msgs (for efficiency reasons). The obvious (stateless) implementation is hopelessly inefficient: - every sender in the group knows the ip addresses of all hosts in the group; sender does unicast to all hosts in group. We need something better: - Mcast address: ip address denoting a mcast group. Packet with destination mcast addr should go to all hosts in group. We use "unicast" (or "ucast") address to refer to a host ip address. - For each mcast group and mcast source, we need to overlay a "mcast tree" on the network such that the tree is rooted at the source, its leaves are all the other hosts in the group, and every node in the tree forwards every incoming mcast packet on all outgoing links of the tree. - There are two classes of mcast-trees: - Group-shared tree: One undirected (ie, bidirectional links) tree spanning all the hosts in the mcast group. The optimal tree would be a Steiner-tree. - Source-based trees: One directed (ie, unidirectional links) tree for every source in the mcast group, rooted at the source and reaching all hosts in the group. The optimal tree would be the shortest-path tree. Implementation overview: - Consider an underlying network of nodes (routers, hosts) and links. Assume each unicast routing is already in place; ie, a unicast message can be transferred between any two nodes. - We overlay a "mcast network" on this network. - The mcast network implements the mcast-trees referred to above. Mcast-network definition: - Let MCAST-HOSTS refer to the hosts that may want to participate in mcasts. Typically, every host is a mcast-host. Let MCAST-RELAYS refer to the subset of nodes that are willing to participate in building mcast trees (ie, willing to forward and duplicate mcast messsages). Note that a relay can be a router or a host. [In network-level mcasting, all relays are routers. In application-level mcasting, all relays are hosts.] Let MCAST-NODES be the union of mcast-hosts and mcast-relays. - Let MCAST-LINKS refer to "virtual" links between the mcast-nodes. An mcast-links is either between relays or between relay and host. An mcast-link corresponds to a path of one or more links of the underlying network. - Let MCAST-NETWORK refer to the network of mcast-nodes and mcast-links. Mcast-trees will be computed on this mcast-network. NOTE: - Because the underlying network provides unicast routing, we could introduce an mcast-link between any two mcast-relays (ie, a complete mcast-network). but that would result in very inefficient mcast-trees. We want to have only "good" mcast-links. - In general, network-level mcasting is more efficient because the mcast-network would have better links and better topology (more use of backbone links and less use of edge links). Mcast network algorithms: - Mcast group management: An mcast-host joins a mcast group by sending its mcast-relay a join-request (with mcast-addr and any relevant info). Similarly for leaving (leaving can also be done by soft state). - Forwarding: - Each mcast-relay maintains a forwarding table as follows: For each mcast address, for each source unicast address, a list of next-hops (mcast-node unicast addresses). - When the relay receives a mcast packet (ie, packet with destination mcast address and source unicast address), it looks for a matching entry in the forwarding table and sends the packet to every next hop via unicast tunneling. - If the mcast-tree is a group-shared tree, then the entry need not have the source unicast addr field; a received mcast packet is simply forwarded to all next hops except for the one from which the packet was received. - Routing: - The mcast-relays execute a distributed algorithm that yields them a view sufficient to compute the forwarding tables. Several algorithms are discussed next. A NAIVE MCAST-ROUTING ALGORITHM: - Each mcast-relay disseminates the mcast-addresses of its associated mcast-hosts (learned from the mcast group management protocol). - Each mcast-relay maintains a global view of the mcast-network: - The cost of each mcast-link in the view is obtained from the underlying unicast routing view (eg, the mcast-link cost is the unicast distance from itself to the mcast-node at the other end of the mcast-link). - For each mcast-node in the view, the associated mcast addresses. - Periodically, the mcast-relay executes an algorithm that yields mcast-trees (group-shared or source-based) for every mcast address in its view, and updates the forwarding table. If all mcast-relays use the same algorithm, then they would get consistent trees, and hence consistent forwarding tables. CENTER-BASED MCAST-ROUTING ALG FOR GROUP-SHARED TREE: - Each mcast group has a specfied mcast-relay as its "center" (or rendezvous or core). - For a mcast-relay and a mcast address, let "parent" refer to the mcast-relay that is next in the mcast-network path to the center (this path would correspond to the unicast path to the center). The mcast-group center itself has no parent. - A host join message includes the unicast addr of the center. - For any mcast group, we say the mcast-relay is in the group's shared tree iff its forwarding table has an entry for that group. - When a mcast-relay receives from mcast-node S a join message with mcast address mA and center unicast address uA: - If the relay is not in mA's shared-tree (ie, no mA entry in forwarding table) - it forwards the join msg to its parent, say R, for group mA (ie, next mcast-relay towards the center). - it adds to the forwarding table an entry with mcast addr mA and next hops R and S; it is now on mA's shared tree. - If the relay is in mA's shared-tree - it adds S as a next hop to the forwarding table entry for mA - When a mcast-relay receives from mcast-node S a data packet with mcast address mA, it forwards the packet to all next hops other than S in the entry for mA in the forwarding table. - When a mcast-relay receives from a mcast-node, say U, a leave message for mcast addr mA, it deletes from its mA entry in the forwarding table the next hop U. If the mA entry now has only one next hop (that would be the parent), it forwards the leave message to the parent and deletes the mA entry from its forwarding table. REVERSE PATH FORWADING (RPF) MCAST-ROUTING ALG FOR SOURCE-BASED TREES: - This algorithm does a limited form of flooding in the mcast-network. - Requires the mcast-nodes to have unicast forwarding tables, ie, a next hop mcast-node for every mcast-node in the mcast-network. [This forwarding table can be statically defined or obtained from the underlying unicast routing algorithm if it's LS.] - Consider a mcast-node X, a mcast-node Y that is a neighbor of X (in the mcast-network), and a mcast-host Z. With respect to Z, we say Y is upstream (downstream) of X iff Y is (is not) X's next-hop to Z. - When a mcast-relay X receives from a (mcast-relay) neighbor, say Y, a data packet with mcast addr mA and source unicast addr Z, X does the following: if Y is X's upstream neighbor wrt Z (ie, next-hop for the unicast route to X) X forwards the packet to all its downstream mcast neighbors otherwise X drops the packet - The above limited flooding gets the packet to every mcast-node, even those that are not in the mcast group mA. We can improve by using prune/graft messages: - Each mcast-relay now maintains a mcast forwarding table. - When a mcast-node receives a data packet of a mcast group that it does not belong to, it sends back a prune message to the mcast-relay that sent the data packet. - When a mcast-relay receives a prune message from a mcast-node X, it removes X from the mcast forwarding table entry for the group indicated in the prune message. If there are no next-hops left for that group, it sends a prune message to its upstream mcast-relay. - When a mcast-node joins a mcast group, it sends a graft message. Figure out the details. INTERNET MCAST ROUTING ALGORITHMS - DVMRP (Distance Vector Multicast Routing Protocol) - source-based trees - RPF with pruning, grafting, and soft state - IP tunneling to connect mcast-nodes. - PIM (Protocol Independent Multicast) - Has a dense mode and a sparse mode of operation. - Dense mode uses source-based trees built with RPF. - Sparse mode uses core-based group-shared tree with soft state. - MOSPF (Multicast OSPF) - Similar to the NAIVE algorithm described above. ---------------------------------------------------------------------- MOBILITY Want a laptop (or any IP-enabled device) to maintain its TCP/IP or UDP/IP connections when it moves from one IP network to another, ie, to maintain the laptop's IP address when it across such moves. Clearly, some "mobility management" is needed at the IP layer. Network-based solution (routers do all the work) - When the laptop visits a foreign network, it advertises its permanent ip address. This information is disseminated (with netmask 255.255.255.255) to all routers by the usual routing protocols. Because routers do longest prefix matching, laptop can maintain its IP based connections across moves. - Very expensive for routers because forwarding tables become enormous (one entry for each ip address at a foreign network). - Network administrators don't like this: insecure, spoofing, ... Edge-based solution: - Terminology: - PA: permanent ip address of laptop (with network id indicating the home network). - home network: the permanent home network of the laptop. - home agent: entity in home network doing mobility management. - foreign network: the ip network that the laptop is visiting. - foreign agent: entity in foreign network doing mobility management. - When laptop enters foreign network: - Laptop register with foreign agent (supplies PA and any other relevant info to foreign agent). - Foreign agent assigns the laptop a care-of-address (COA), which is an ip address with network part equal to foreign network id. - Foreign agent informs home agent of laptop's PA and COA. - Home agent adds the entry (PA,COA) to its "forwarding table". - As long as the home agent has the (PA,COA) entry: - It intercepts any packet entering the home network with destination addr equal to laptop's PA and tunnels it to the foreign agent. - As long as the laptop is registered with foreign agent: - It intercepts any incoming packet with dest addr equal to COA, decapsulates it, and forwards the content IP packet to the laptop. - When the laptop sends a IP packet, it uses its PA as source addr. - In the above solution, to-laptop packets go via the home agent but from-laptop packets go straight to the destination. It is also possible for to-laptop packets to avoid the home agent, by letting the home agent inform the sender of the COA. ======================================================================