CMSC 417

NOTE 5

Shankar: F02

TCP Congestion Control

Revised: October 3 2002, February 24 2004, March 1 2004

1. Introduction

Congestion control deals with adapting the source send rate to the bandwidth available to the transport connection, which varies over time in an unpredictable way because the network is shared by many applications. Consider the data flow from a TCP source to a TCP sink. The goal of TCP congestion control is three-fold.

Because TCP was designed to operate over unreliable, heterogeneous networks, its approach to congestion control relies on minimal knowledge of the network state. In particular, a TCP entity's knowledge about the current network state is derived solely from its recent history of message sends and receptions. Initial versions of TCP (e.g., Tahoe) used roundtrip times, where a roundtrip time is the time elapsed from sending a primary message (e.g., a data message) to receiving an ack for that message. Later versions of TCP also make use of the presence of duplicate acks (Reno) and the variation in the time intervals between acks (Vegas). In any case, TCP's view of the current network state is, to put it kindly, not very accurate. Consequently, TCP's congestion control must be very conservative.

The smoothest way to control the send rate is to directly adjust the time between successive packet sends. But TCP (like most computer software) does not operate at such "fine" time scales. Instead, TCP sends bursts of packets and adjusts its send rate by varying the number of packets in a burst. In a similar vein, TCP does not usually maintain separate timers for every outstanding packet (which would be needed to measure the roundtrip times of all primary packets [recall that a primary packet is one that is resent until a response is received]). Instead, TCP usually uses one timer to keep track of the last primary packet sent. If a primary packet is sent when the timer is running, it usually restarts the timer.
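
A rough C sketch of this single-timer discipline (the names are illustrative, not from any particular TCP implementation):

  /* Sketch of the single retransmission timer: it tracks only the most
   * recently sent primary packet, and is restarted on each new primary
   * send. */
  #include <stdbool.h>

  struct rtx_timer {
      bool   running;
      double expires_at;    /* absolute time at which the timer fires */
  };

  static void on_primary_send(struct rtx_timer *t, double now, double rto)
  {
      t->running    = true;           /* start, or restart if running */
      t->expires_at = now + rto;
  }

  static void on_all_outstanding_acked(struct rtx_timer *t)
  {
      t->running = false;             /* nothing left to time         */
  }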

A TCP source entity does the following:

The congestion control algorithm for each TCP version (Old-Tahoe, Tahoe, Reno, Vegas, ...) is specified by stating what happens at each protocol event, i.e., message transmission, message reception, timeout. This is given in RFCs. However, the specifications are ambiguous in places, resulting in slightly different interpretations. Furthermore, each operating system has its own way of managing timers and other resources. Consequently, there are significant differences between different implementations (Windows NT, BSD Unix, NetBSD, Linux, Solaris, ...) of the same version of TCP, and the only "documentation" is the source code. Currently, TCP Reno is the dominant TCP version.

Below, we describe TCP congestion control, first at the level of individual message transmissions and receptions, and then at the level of window transmissions and receptions. We then provide some analysis of the window-based model. Throughout:

2. Message-Level Description

We describe TCP congestion control at the level of individual message transmissions and receptions. We first describe a TCP Tahoe source, then a TCP Reno source, and finally a TCP sink (applicable to both Tahoe and Reno).

2.1. TCP-Tahoe Source

Define the following variables at the source:


The above variables are updated as follows:

The rtt-related variables are not updated upon reception of a new ack if it acks data that was retransmitted, the motivation being that the ack may not have been sent in response to the last transmission of the data.
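
For concreteness, the following C sketch shows the standard smoothed-rtt/rto estimator (the RFC 2988 constants) together with Karn's rule just described; the variable and function names are illustrative:

  /* Sketch of the standard smoothed-rtt/rto estimator (RFC 2988
   * constants alpha = 1/8, beta = 1/4), with Karn's rule: samples for
   * data that was ever retransmitted are skipped. */
  #include <math.h>

  static double srtt   = 0.0;   /* smoothed rtt estimate (seconds) */
  static double rttvar = 0.0;   /* smoothed rtt deviation          */
  static double rto    = 3.0;   /* retransmission timeout          */

  static void on_rtt_sample(double sample, int data_was_retransmitted)
  {
      if (data_was_retransmitted)
          return;                        /* Karn's rule: ambiguous sample */
      if (srtt == 0.0) {                 /* first sample                  */
          srtt   = sample;
          rttvar = sample / 2.0;
      } else {
          double err = sample - srtt;
          srtt   += err / 8.0;                     /* alpha = 1/8 */
          rttvar += (fabs(err) - rttvar) / 4.0;    /* beta  = 1/4 */
      }
      rto = srtt + 4.0 * rttvar;
  }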

2.2. TCP-Reno Source

TCP Reno is like TCP Tahoe except that if the source receives three "duplicate" acks, it considers this indicative of transient buffer overflow rather than congestion, and reacts less drastically (fast retransmit and fast recovery). The TCP-Reno source is the TCP-Tahoe source augmented with one additional variable and one additional event, as follows:
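
For concreteness, the following C sketch shows the usual Reno reaction to duplicate acks (fast retransmit on the third duplicate ack, then fast recovery). Implementations differ in the details, and the example in Section 4 uses a simplified variant that just sets cw to sst; the names here are illustrative:

  /* Sketch of Reno fast retransmit / fast recovery on duplicate acks.
   * cw = congestion window, sst = slow-start threshold, in packets. */
  static double cw  = 1.0;
  static double sst = 64.0;
  static int numDupAck = 0;

  static void on_duplicate_ack(void)
  {
      numDupAck++;
      if (numDupAck == 3) {
          sst = cw / 2.0;         /* halve the threshold                  */
          /* fast retransmit: resend the segment starting at na           */
          cw  = sst + 3.0;        /* inflate by the three dup'd packets   */
      } else if (numDupAck > 3) {
          cw += 1.0;              /* each further dup ack frees a slot    */
      }
  }

  static void on_fresh_ack(void)  /* ack that advances na                 */
  {
      if (numDupAck >= 3)
          cw = sst;               /* deflate: leave fast recovery         */
      numDupAck = 0;
  }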

2.3. TCP Sink

TCP RFCs recommend that a TCP sink (in Tahoe or Reno) wait for a MAX_ACKWAIT delay (= 500ms) before acking the first in-order data received after the last ack sent. If more new data is received before that time, a single ack is sent covering both the data messages.

This is motivated by applications (such as remote login) where the user often generates data in small chunks. Without the MAX_ACKWAIT delay, the sink would be sending acks with small increments, causing the source to send small amounts of data (why?), leading to a vicious cycle (the so-called "Silly Window Syndrome"). With the MAX_ACKWAIT delay, both the acks and the data blocks become larger.

This is modeled by sending acks according to the following:
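
A rough C sketch of this behavior, consistent with the description above (only in-order data is modeled; the names are illustrative):

  /* Sketch of delayed acks at the sink. */
  #define MAX_ACKWAIT 0.5             /* seconds (= 500 ms) */

  static int    ack_pending = 0;      /* in-order data received, not yet acked */
  static double ack_deadline;         /* time by which a pending ack must go   */

  static void send_ack(void) { /* transmit a cumulative ack */ }

  static void on_in_order_data(double now)
  {
      if (!ack_pending) {
          ack_pending  = 1;           /* wait up to MAX_ACKWAIT for more data */
          ack_deadline = now + MAX_ACKWAIT;
      } else {
          send_ack();                 /* one ack covers both data messages    */
          ack_pending = 0;
      }
  }

  static void on_clock_tick(double now)
  {
      if (ack_pending && now >= ack_deadline) {
          send_ack();                 /* MAX_ACKWAIT expired                  */
          ack_pending = 0;
      }
  }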


3. Window-level Description: TCP Tahoe

For understanding the congestion control behavior, it is helpful to take a more abstract (higher-level) view. Specifically, we view the source behavior at a "per-window" time scale, rather than "per-message" time scale. Here, the source execution consists of a sequence of window send events. A send event updates variables and sends the new window; the event is triggered either by timeout or by reception of acks for the previous window send. The per-window view is a good approximation of the per-message view when the roundtrip time is large compared to the message transmission time (why?).

We now give a window-level description of TCP Tahoe congestion control. The variables are as defined above. Instead of the above events, define the following Window Send Event:
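
A rough C sketch of such a window send event is the following; the exact update rules are assumptions, written so as to match the window-level example in Section 5:

  /* Sketch of one window send event for a TCP-Tahoe source, written to
   * match the window-level example in Section 5.  cw and sst are the
   * congestion window and slow-start threshold (in packets). */
  #include <math.h>

  static double cw  = 1.0;   /* initial values as in the examples */
  static double sst = 5.0;

  static void window_send_event(int timed_out)
  {
      if (timed_out) {                /* previous window suffered a loss  */
          sst = ceil(cw / 2.0);       /* multiplicative decrease          */
          cw  = 1.0;                  /* re-enter slow start              */
      } else if (2.0 * cw <= sst) {
          cw = 2.0 * cw;              /* slow start: double every rtt     */
      } else if (cw < sst) {
          cw = sst + 1.0;             /* sst reached mid-window, then +1  */
      } else {
          cw = cw + 1.0;              /* additive increase: +1 per rtt    */
      }
      /* now send (or resend) the window: packets na .. na + cw - 1 */
  }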

4. Packet-level Evolution Example

Consider a TCP connection whose sole bottleneck is a link on the source-to-sink path that serves one packet every d seconds (i.e., bandwidth of 1/d packets/second) and has a maximum buffering capacity of 4 packets. Assume the following: no cross-traffic; all packets have size of 1 MSS; rtt consists only of the queueing delay at this link; rto is constant at 5d; the sink sends an ack for every packet received; the source implements TCP Tahoe; initially, cw=1, sst=5, and the queue is empty.

The evolution of the TCP connection is shown below, with each new line representing a time increment of d seconds and the values of variables just after the event at that time executes.

time|ackrcvd  na   cw     sst    send       queue (head at left)
  0 |   -     0    1       5      0           0
  1 |   1     1    2       5      1,2         1,2
  2 |   2     2    3       5      3,4         2,3,4
  3 |   3     3    4       5      5,6         3,4,5,6
  4 |   4     4    5       5      7,8         4,5,6,7      (8 lost)
  5 |   5     5    5+1/5   5      9           5,6,7,9
  6 |   6     6    5+2/5*  5      10          6,7,9,10
  7 |   7     7    5+3/5*  5      11          7,9,10,11
  8 |   8     8    5+4/5*  5      12          9,10,11,12
  9 |   8     8    5+4/5*  5      -           10,11,12     (sink buffers 9)
 10 |   8     8    5+4/5*  5      -           11,12        (sink buffers 10)
 11 |   8     8    5+4/5*  5      -           12           (sink buffers 11)
 12 |   8     8    5+4/5*  5      -           -            (sink buffers 12)
 13 |   -     8    1       3      8           8            (rto timeout)
 14 |  13    13    2       3      13,14       13,14
 15 |  14    14    3       3      15,16       14,15,16
 16 |  15    15    3+1/3   3      17          15,16,17
 17 |  16    16    3+2/3*  3      18          16,17,18
 18 |  17    17    3+3/3*  3      19          17,18,19
 19 |  18    18    4+1/4*  3      20,21       18,19,20,21
 20 |  19    19    4+2/4*  3      22          19,20,21,22
 21 |  20    20    4+3/4*  3      23          20,21,22,23
 22 |  21    21    4+4/4*  3      24          21,22,23,24
 23 |  22    22    5+1/5*  3      25,26       22,23,24,25  (26 lost)
 24 |  23    23    5+2/5*  3      27          23,24,25,27
 25 |  24    24    5+3/5*  3      28          24,25,27,28
 26 |  25    25    5+4/5*  3      29          25,27,28,29
 27 |  26    26    5+4/5*  3      30          27,28,29,30
 28 |  26    26    5+4/5*  3      -           28,29,30
 29 |  26    26    5+4/5*  3      -           29,30
 30 |  26    26    5+4/5*  3      -           30
 31 |  26    26    5+4/5*  3      -           -
  32 |   -    26    1       3      26          26           (rto timeout)
  33 |  31    31    2       3      31,32       31,32
   ...
   ...
The rows for times 13 through 31 repeat periodically, with every sequence number increased by 18 (=26-8) in each period.

Ignoring the initial transient, the average goodput (packets transferred/second) is 18 packets every 19d (=(32-13)*d) seconds, or 18/(19d) packets/second. Ignoring the initial transient, the average number of packets sent is 19 packets every 19d seconds (count the send column over one period), or 1/d packets/second.

Including the initial transient, what is the goodput and what is the average number of packets sent per second?

If the source implemented TCP-Reno (fast-recovery and fast-retransmit with the numDupAck trigger at 3), then times 0 through 10 would be the same and fast-recovery would be triggered at time 11:

time|ackrcvd  na   cw     sst    send       queue (head at left)
  0 |   -     0    1       5      0           0
  1 |   1     1    2       5      1,2         1,2
  2 |   2     2    3       5      3,4         2,3,4
  3 |   3     3    4       5      5,6         3,4,5,6
  4 |   4     4    5       5      7,8         4,5,6,7      (8 lost)
  5 |   5     5    5+1/5   5      9           5,6,7,9
  6 |   6     6    5+2/5*  5      10          6,7,9,10
  7 |   7     7    5+3/5*  5      11          7,9,10,11
  8 |   8     8    5+4/5*  5      12          9,10,11,12
  9 |   8     8    5+4/5*  5      -           10,11,12     (sink buffers 9)
 10 |   8     8    5+4/5*  5      -           11,12        (sink buffers 10)
 11 |   8     8    3       3      8           12,8  (fast recovery, sink buffers 11)
 12 |   8     8    3       3      -           8
 13 |  13    13    3+1/3   3      13,14,15    13,14,15
   ...
   ...

5. Window-level Evolution Example

Consider a TCP connection whose sole bottleneck is a link on the source-to-sink path with maximum buffering capacity of 4 packets. Assume the following: no cross-traffic; all packets have size of 1 MSS; rtt is constant; rto is constant and is slightly higher than rtt; the sink sends an ack for every packet received; the source implements TCP Tahoe; initially, cw=1, sst=5, and the queue is empty.

The evolution of the TCP connection is shown below, with each new line representing a time increment of rtt seconds and the values of variables just after the event at that time executes.

time|ackrcvd  na   cw     sst    send           queue (head at left)
  0 |   -     0    1       5      0                0
  1 |   1     1    2       5      1,2              1,2
  2 |   3     3    4       5      3,4,5,6          3,4,5,6
  3 |   7     7    6       5      7,8,9,10,11,12   7,8,9,10     (11,12 lost)
  4 |  11    11    1       3      11               11           (rto timeout)
  5 |  12    12    2       3      12,13            12,13
  6 |  14    14    4       3      14,15,16,17      14,15,16,17
  7 |  18    18    5       3      18,19,20,21,22   18,19,20,21  (22 lost)
  8 |  22    22    1       3      22               22           (rto timeout)
  ....
   ...
The rows for times 4 through 7 repeat periodically, with every sequence number increased by 11 (=22-11) in each period.

Ignoring the initial transient, the average goodput (packets transferred/second) is 11 packets every 4 (=8-4) rtt units, or 11/(4*rtt) packets/second. Ignoring the initial transient, the average number of packets sent is 12 packets every 4 (=8-4) rtt units, or 3/rtt packets/second.

Including the initial transient, what is the goodput and what is the average number of packets sent per second?

6. Window-level Analysis: Two TCP connections sharing a queue

Consider two TCP connections sharing a queue on the forward path. Let K denote the buffer size (in packets) of the shared queue. Let d denote the time required by this queue to serve a packet. Assume the following: updates at the two TCP sources are in synch; both sources see the same rtt values and packet drops; the sources are either in multiplicative decrease or additive increase (i.e., no slow-start); the roundtrip times are mostly due to the delay in the queue (i.e., acks encounter essentially zero delay).

We model this system using the window-level model described above. Thus the system evolves by a sequence of send events. At each event, each TCP source updates its variables and sends its current window; the event is triggered either by a timeout or by reception of ack of previous send window.

Let w1(j) and w2(j) denote the congestion window sizes at source 1 and source 2, respectively, just after the jth send. Let rtt(j) denote the roundtrip time for the packets sent in the jth send.

Let us further assume that the timeout value is never an under-estimate; that is, upon a timeout there are no data or acks in transit. This implies that the queue is empty at timeout (because there is no other source of traffic for the queue).

Putting all this together, we have the following update at the j+1 send event:

 // At previous send, rtt(j) = (w1(j) + w2(j))*d

 if ( w1(j) + w2(j) <= K ) then {
   // no packets were lost in last send event, all were acked,
   // elapsed time since last send is rtt(j)

    w1(j+1)  =  w1(j) + 1 ;
    w2(j+1)  =  w2(j) + 1 ;
     rtt(j+1) =  (w1(j+1) + w2(j+1))*d ;
 }

 else {
   // w1(j) + w2(j) > K, packets were lost in last send event
   // elapsed time since last send is rto(j) (> rtt(j))
   
    w1(j+1)  =  ceiling(w1(j)/2) ;
    w2(j+1)  =  ceiling(w2(j)/2) ;
     rtt(j+1) =  (w1(j+1) + w2(j+1))*d ;
 }
Suppose the system starts with initial values w1(0)=a1 and w2(0)=a2 such that a1>a2 and a1+a2<K. Does the system reach a fair state, i.e., as j increases, does w1(j) become closer to w2(j)?

Clearly, w1 and w2 will increase linearly in j, i.e., w1(j)=a1+j and w2(j)=a2+j, until w1+w2 exceeds K, at which point both w1 and w2 are halved and the entire process repeats.

Let m denote the value of j when w1+w2 exceeds K, that is, a1+a2+2m > K. So:

m = floor((K-a1-a2)/2) + 1
w1(m+1) = ceiling((a1+m)/2)
w2(m+1) = ceiling((a2+m)/2)

To show that the system converges to a state where w1 equals w2, it suffices to show that the difference w1(m+1)-w2(m+1) between the new starting points is smaller than the previous starting difference a1-a2.

This is indeed the case. Ignoring the floors and ceilings (which arise because of the discreteness of integers), we have w1(m+1) = (a1 + (K-a1-a2)/2)/2 and w2(m+1) = (a2 + (K-a1-a2)/2)/2. So w1(m+1) - w2(m+1) = (a1-a2)/2.

So after each timeout, the difference between w1 and w2 is halved. So after a while, w1 and w2 will equal each other (or differ by at most 1).
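
A small C simulation of the update rule above illustrates this convergence; the values K = 20, a1 = 11, a2 = 3 are arbitrary examples:

  /* Simulate the update rule above and watch w1 - w2 shrink. */
  #include <stdio.h>

  int main(void)
  {
      int K = 20, w1 = 11, w2 = 3;
      for (int j = 0; j < 60; j++) {
          if (w1 + w2 <= K) {            /* no loss: additive increase      */
              w1 += 1;
              w2 += 1;
          } else {                       /* loss: multiplicative decrease   */
              w1 = (w1 + 1) / 2;         /* ceiling(w1/2) for positive ints */
              w2 = (w2 + 1) / 2;
              printf("after drop: w1=%d w2=%d diff=%d\n", w1, w2, w1 - w2);
          }
      }
      return 0;
  }

The printed difference goes 4, 2, 1, 1, ...: it halves at each drop until the two windows differ by at most 1, as argued above.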

7. Window-level Analysis: TCP with Additive Decrease

[Problem from the text (3-20 in edition 1, 3-27 in edition 2).]

Consider two TCP connections sharing a queue on the forward path, as in the previous section. However, the sources now do additive decrease instead of multiplicative decrease. So we have the following update at the j+1 send event:

 if ( w1(j) + w2(j) <= K ) then {
   // no packets were lost in last send event, all were acked,
   // elapsed time since last send is rtt(j)

    w1(j+1)  =  w1(j) + 1 ;
    w2(j+1)  =  w2(j) + 1 ;
     rtt(j+1) =  (w1(j+1) + w2(j+1))*d ;
 }

 else {
   // w1(j) + w2(j) > K, packets were lost in last send event
   // elapsed time since last send is rto(j) (> rtt(j))
   
    w1(j+1)  =  w1(j) - 1 ;
    w2(j+1)  =  w2(j) - 1 ;
     rtt(j+1) =  (w1(j+1) + w2(j+1))*d ;
 }
Suppose the system starts with initial values w1(0)=a1 and w2(0)=a2 such that a1>a2. Does the system reach a fair state, that is, as j increases, does w1(j) become closer to w2(j)?

Note that whenever w1 increases, w2 increases by the same amount (namely 1). Similarly, whenever w1 decreases, w2 decreases by the same amount. So obviously, w1(j)-w2(j) will remain constant at a1-a2. So the system will never reach a fair state.
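
The same simulation as in the previous section, with the decrease rule changed to additive, shows the gap never closing (again with arbitrary example values):

  /* Same setup, but with additive decrease on loss.
   * The difference w1 - w2 stays fixed at a1 - a2 = 8. */
  #include <stdio.h>

  int main(void)
  {
      int K = 20, w1 = 11, w2 = 3;
      for (int j = 0; j < 60; j++) {
          if (w1 + w2 <= K) {            /* no loss: additive increase */
              w1 += 1;
              w2 += 1;
          } else {                       /* loss: additive decrease    */
              w1 -= 1;
              w2 -= 1;
              printf("after drop: w1=%d w2=%d diff=%d\n", w1, w2, w1 - w2);
          }
      }
      return 0;
  }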

NOTE: Like all models, this is approximate. For example, if a1+a2 is odd, then w1(j)+w2(j) is always odd. So when it exceeds K, it does so by 1, and so we would expect only one packet to be lost; hence we would expect one connection's window to go down and the other's to go up. Instead, in the model we bring down both windows.

8. Window-level Analysis: Steady-State Loss and Throughput

[Problem from the text (3-21 in edition 1, 3-30 in edition 2).]

Suppose a TCP source is either in multiplicative decrease or additive increase (i.e., no slow-start). Then its window size evolves in a saw-tooth fashion, increasing linearly until it drops at a timeout, and then repeating.

Let W(j) be the value of the window size just prior to the jth timeout. Clearly, W(1), W(2), ..., will not be the same. But in a steady-state situation (where the cross traffic in the network is not changing significantly), the W(j)'s will be close to each other. With this motivation, let us make the additional assumption that W(j) equals a constant W for all j.

We also assume that if a timeout occurs, then exactly one packet is lost (and that would be the last packet in the previous send window).

We also assume that the roundtrip time is a constant, say RTT.

Subject to these assumptions, it is easy to compute the steady-state loss rate and throughput. We start just after a timeout, when the window size is W/2. In subsequent sends, the window size linearly increases until it equals W (after which it drops to W/2). The number of packets sent in this phase is as follows, where X denotes W/2:

   (X) + (1+X) + (2+X) + ... + (X+X)
 = (0+X) + (1+X) + (2+X) + ... + (X+X)
 = (0+1+2+ ... + X) + (X + X + ... + X)   // X+1 X's in second term
 =  X(X+1)/2        + X(X+1)              // using 0+1+2+...+n = n(n+1)/2
 = (3/2)(X)(X+1)
 = (3/2)(W/2)(W/2 + 1)                    // X = W/2
 = (3/8)(W*W + 2W)

Only one packet out of all these is lost. So the loss rate is
  1/[(3/8)(W*W + 2W)]

The time taken for all these window sends is RTT*X, so the
send rate is
  (3/2)(W/2 + 1)/RTT

The goodput (send rate not counting retransmits) is
  (3/2)(W/2)/RTT
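
As a quick numeric check of these formulas, the following C snippet plugs in example values (W = 16 packets, RTT = 0.1 seconds; these values are assumptions, not from the text):

  /* Plug example values into the steady-state formulas above. */
  #include <stdio.h>

  int main(void)
  {
      double W   = 16.0;    /* window size just before a timeout (packets) */
      double RTT = 0.1;     /* roundtrip time (seconds)                    */

      double pkts_per_cycle = (3.0 / 8.0) * (W * W + 2.0 * W);
      double loss_rate      = 1.0 / pkts_per_cycle;
      double send_rate      = (3.0 / 2.0) * (W / 2.0 + 1.0) / RTT;
      double goodput        = (3.0 / 2.0) * (W / 2.0) / RTT;

      printf("packets per cycle: %.0f\n", pkts_per_cycle);   /* 108      */
      printf("loss rate:         %.5f\n", loss_rate);        /* ~0.00926 */
      printf("send rate:         %.0f pkts/s\n", send_rate); /* 135      */
      printf("goodput:           %.0f pkts/s\n", goodput);   /* 120      */
      return 0;
  }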

9. Conclusions

TCP's performance over large-rtt connections (e.g., rtt above 70ms) is typically governed by its congestion control mechanism, and window-level models are usually adequate. TCP's performance over low-rtt connections (e.g., rtt below 30ms) is governed by the sink's ability to absorb data.

As Internet traffic has grown both in quantity and variety, various retransmission and windowing policies have been attempted in TCP, but all under the premise that TCP's knowledge of the network state should be derived solely from end-point behavior. As a result, TCP is very conservative. While this is very robust, it tends to underutilize the network resources. Currently, a connection can easily outperform TCP by using a more aggressive congestion control scheme, implemented either in a "rogue" TCP or on top of UDP. Of course, if everyone did that, the Internet may frequently become congested and break down. The only real fix for this is for routers to do per-flow throttling, but this is technically costly (because the number of flows going through a router can be immense) and so far the ISPs have no financial incentive to do this. Per-flow throttling would also allow ISPs to offer different levels of service to different classes of customers.

Another major performance issue is fast packet processing, that is, speeding up the processing of received packets by checking for the likely situations first and handling the less likely situations afterward (e.g., after receiving a data packet with sequence number n, it is quite likely that the next packet will be a data packet of the same connection with sequence number n+1). Speeding up packet processing was not an issue many years ago, when processor speeds were much higher than link speeds. But that is no longer the case, thanks to fibre optic links.