Scheduling to the bottleneck or Congestion control with Rate-Based Pacing

(for XTP Technical Advisory Board)
(Draft, Aug89)

Introduction

When I first implemented (in sixties) the scheduler that was going to be called "fair-share", it also had another attribute that I called "scheduling to the bottleneck". Something similar needs to be introduced for network scheduling and congestion control. End-to-end pacing in complex networks also must worry about scheduling to the bottleneck.

Much of the rate-based information originated with discussions on the subject during the early to mid-80s while working on the HSDT (High Speed Data Transport) project. It become readily apparent that rate-based controls where the only solutions to high-speed, complex, and/or "long" latency networks.

Network Environment

If a packet is transmitted at T0, an ACK would return at T0+k. The round trip time (RTT) is defined:

RTT = (T0+k) - T0

The elapsed time k can be viewed as being composed of a sequence of packet events having elapsed time e(i). Events are actually a variety of things like link transmission, queueing delays and/or router (or other types of store&forward processing elements) processing. RTT can then be expressed as:

RTT = Sumof(e(1), e(2), ..., e(n))

Pacing is concerned with controlling congestion and/or not saturating some resource in the network. Slow start pacing with fewer than two packets provides no really useful information. Slow start pacing with two packets provides some useful information (at least in situations where paths/flows are consistent).

Assume packets A and B are essentially transmitted at the same time (initial start-up conditions with window=2). Then ACK(a) for packet A returns at time T(a) and ACK(b) for packet B returns at time T(b).

Now the events that compose RTT can be classified as two types:

  1. non-overlapped, sequential processing of single packets (transmission & router processing)
  2. overlapped, parallel events like queueing delays

Assuming various simplifications, the difference between T(a) and T(b) will tend towards:

(T(b)-T(a)) = max(f(1), f(2), ..., f(k))

Where f(j) is a non-overlapped processing event. This can be viewed as the time separation (that occurs for packet A and B transversing the network) for the maximum delay associated with a specific sequential processing server.

For long propagation delay networks, i.e.

RTT >> (T(b)-T(a))

it is more efficient to use rate based pacing where the sender uses time (T(b)-T(a)) as the initial estimate of the minimum interval between packet transmission (i.e. regulating packet production to packet service/consumption rates).

The justification is that (T(b)-T(a)) should be the maximum service time of the slowest component in the network. If the slowest component is a transmission link, packet handling is slowed to the throughput of that link. If the slowest component is a router (intermediate node, etc), packets arriving more frequently result in congestion.

Queue Delays

Queueing delays don't result in direct (first order) time separation of consecutive packets. Typically, there are second order effects which may cause time-separation of packets. At a particular network node, packets for other streams may be ordered in the (same) queue between packets A and B (possibly because they arrived on other interfaces between the time packet A was received and the time packet B was received). The time spent processing such intervening packets will cause additional end-to-end temporal separation of two consecutive packets.

Resource contention delay may grow to dominate the RTT elapsed time equation. Furthermore, it is such stuff that congestion is likely to be made of (i.e. not flooding by single origin/destination, but aggregate activity from lots of different packet flows). Temporal separation because of intervening packet processing shows up in T(b)-T(a) automatically. However, total queue delay and/or maximum queue delay is not directly associated with end-to-end temporal separation transit information.

Total queue delay for each network node is information that should be calculated by the network layer and can be made available to the transport layer (as well as other nodes).

Rate-Based Pacing

In general, rate-based pacing should apply at three levels, link, network, and transport. Simple queueing theory indicates that when the arrival rate exceeds the service rate, congestion (server saturation) will occur. The objective of rate based pacing is to regulate the "arrival rate" in order to avoid congestion.

Simple link layer rate-based pacing can be effectively applied in situations where there is a single link in and a single link out. Therefor, any link-level rate-based back-pressure applies equally to all transiting packets. However, in more complex nodes, congestion may occur only for specific traffic flows. To optimally regulate such situations, the network layer at the two nodes must exchange bottleneck information regarding regulating packets that are part of specific traffic flows. As environments become more complex, it is advantageous to propagate the back-pressure up to the transport layer to further regulate traffic congestion (and make optimal use of available resources).

Rate-based pacing subsumes much of the function of window-based pacing. Window-based pacing is much more applicable to the link layer and/or situations involving low latency. Basically in window-based pacing, buffers are pre-allocated at receiving end to handle incoming packets when there isn't sufficient latency to control things at a finer granularity. As latency becomes more significant, then there are significantly more optimal ways of managing the available resources. When transport sessions nearly map one-to-one on top of very low latency link layer, then link-level, window-based pacing can be effective. As the latency increases (either at the link layer, network layer, and/or transport layer), much more complex resource optimization techniques become necessary.

In window pacing, buffers are allocated on the receiving end of a circuit (link-level circuit, or virtual circuit or transport-level session). This allows the receiver to have some overlap of packet processing while packets are being received. Simple double buffering is adequate in some situations. However, on multi-tasking systems there are numerous conditions that contribute to bursty processing (time-slices, DASD queues, etc). The net result is that simple double buffering is not sufficient to overlap incoming packets with average processing throughput (and so a larger number of buffers become necessary to match a sustained-rate transmission layer to a bursty packet processor layer).

As transit latency begins to dominate the elapsed time, then (at link, network, or transport layers) window buffering begins to break down as a technique for optimizing resources (other than simple packet destination allocation).

For low latency transit time, it is relatively easy to establish an implicit coupling of the rate of packet production (by the transmitter) to the rate of packet consumption (processing; by the receiver). As the environment becomes more complex (and/or the transit time begins to dominate the environment), implicit coupling becomes inadequate and explicit control is required. For high latency transit time environments, explicit inter-packet delay pacing (i.e. controlling packet production) is required for optimal operation.

     A          a         A          a         A
     B             b         B          b         B
     C                c         C          c         C
     D                   d         D          d         D
     +----------+----------+----------+----------+----------+
     0          1          2          3          4          5


            Window-ACK protocol example

The Window-ACK protocol example depicts an environment where the window size is set to four, no slow start is used, and the RTT time is 19. Capital letters are used to indicate a packet transmission from the sender and lower-case letters are used to indicate ACK transmission. Packet transmission elapsed time is 10, while ACK transmission elapsed time is 9 (total RTT is 19).

During initial start-up all packets are transmitted (essentially) simultaneously. The packets arrive at the destination at times 10, 13, 16, and 19 and are immediately ACK'ed on arrival. The ACKs for packets A, B, C, and D arrive back at the source at times 19, 22, 25, and 28 respectively (RTT calculation can't use start-up times because of the initial stretch-out). Immediately on arrival of the corresponding ACKs, new packets can be immediately sent.

In this environment, the first approximation to inter-packet separation should be taken to be three. However, with RTT being 19, the pipe is only 2/3rds full with a window size of four. It would be necessary to increase the window size to six to almost fill the pipe. However, the pipe is actually unfilled for 1/19th of the time. If rate-based pacing where used, the inter-packet transmission delay could be set to 3, and the pipe would be kept filled at all times (rate-based pacing also provides finer granularity of controls compared to window-ack).

Slow-start

Window pacing is originally designed to provide some overlap in processing with transmission in situations involving low latency transit time (avoid stop&wait and decouple direct producer/consumer synchronization). However, as the transit time increases, stop&wait situations will still occur unless the window is opened large enough to provide for filling the total transit/network environment equal to the RTT (round trip time). As the transit time further increases the number of buffers required to keep the transit pipe full greatly exceeds the number of buffers to provide efficient overlap of the receiving hardware and the packet consuming process.

Ideally, the window size (in low latency transit case) is just sufficient to smooth out the producer/consumer relationship, matching the average transmission time to the average consuming/processing time (while recognizing that at the microscopic scheduling level, packet processing may actually be extremely bursty). It is a simple protocol to implement and because of the return ACK separation, can create implicit rate pacing between the production and consumption of packets.

As transit time latency increases, the "simple" solution to maintaining efficient resource utilization (as well as minimizing saturation/congestion situations) is to increase the window size. However, this is prone to a number of pitfalls.

The initial pitfall that is relatively easy to recognize is the start-up condition. With a large window, there can be the (essentially) simultaneously injection of a large number of packets (equal to the window size) into the transmission media. In response to this condition, the "slow-start" algorithm has been developed. In chapter three on TCP/IP (by Comer and Narten), of Unix Networking (Hayden Books, 1989, Kochan and Wood editors), there is a description of slow start on pages 68-69. In the description there is the statement that slow-start guarantees that no more than two segments will be transmitted at any one time. The statement is frequently wrong in the real world. Also see V. Jacobson's "Congestion avoidance and control" proceedings of ACM SIGCOMM 8/16-19, 1988, V18N4

Slow-start Not Adequate

Based on a great deal of real world scheduling experience, in real world situations, systems have to deal with

  1. start-up conditions,
  2. running conditions,
  3. terminating conditions

For many people, it is easy to see that the basic window-pacing algorithm (may) handles the running-condition, but ignores the start-up condition. Slow-start attempts to handle the start-up condition. However, built into the window algorithm is an implicit assumption that with ACK-pacing the packet production rate is controlled and things eventually reach semi-steady-state condition (once things are up and running). During the early '80s in the HSDT project, it was frequently observed that for transit times that had high latency and high latency*bandwidth products, that activity would be very bursty.

One of the attributes of a bursty environment (at the sender) is that several ACKs may arrive during some inter-burst interval. This (and other conditions) result in the packet sender building up a (potentially large) packet credit window. Using the standard window algorithm, the sender is then able to (essentially) simultaneously inject packets into the network up to the credit (i.e. multiple, possibly >>2, back-to-back segments).

Rate-Based & Bursty Environment

The basic problem to be solved is a combination of efficient utilization of the available resources while avoiding resource saturation (congestion). The congestion problem basically boils down to keeping the total packet production rate at (or below) the total packet processing rate (at link, network, and transport layers). Window+ACK implicit rate-based pacing only works in a steady-state environment. It quickly breaks down for non-steady-state cases (especially for networks with large latency*bandwidth products). Non-steady-state occurs not only at start-up (addressed by slow start), but during normal operation, especially when numerous bursty computer applications are involved.

The actual problem to be solved (in congestion control) is not to limit the total number of outstanding packets. At a congested node, the problem is more easily expressed as "the packets are arriving faster than they are being processed" (simple queueing theory resource saturation). Multiple back-to-back packets will have a tendency to push nodes (that are operating near capacity) across the line to saturation (resulting in congestion). Slow-start attempts to (partially) address the problem (trying to limit the number of back-to-back packets). However, there is an implicit assumption that window-ACK (implicit rate pacing) protocol will approximate steady-state running once slow-start reaches an equilibrium. However, this totally ignores the fact that numerous computer situations involve bursty processing and that the packet-producer can frequently accumulate a large packet credit window.

Packet credit accumulation is relatively easy in the window-ACK protocol and slow-start is not sufficient to guarantee the limit on back-to-back transmitted packets. To minimize back-to-back transmitted packets (and/or to limit high arrival rates at congested nodes), direct pacing of packet injection is required. One simple method of implementing packet injection control is to calculate a minimum inter-packet transmission interval (which can be derived from a more generalized rate-based calculation).

A minimum value is selected for inter-packet transmission interval (delay between transmitting packets) to prevent saturation of the most congested resource (occurring in the packet route/path). This minimum packet transmission interval guarantees that no back-to-back packets are transmitted.

Scheduling To The Bottleneck

Another way of presenting the problem is "scheduling to the bottleneck" or directly "managing the problem". The basic objective is to maximize the packet throughput, while limiting the packet arrival rate at any node/server to below the saturation/congestion level. A congestion/saturation threshold occurs when the packet arrival rate exceeds the server processing rate (resulting in congestion or server saturation).

Window pacing primarily presents a solution involving second order effects. Solutions involving second order effects break-down as the direct correlation with first order effects weaken (less than explicit one-to-one). Window-ACK protocol works nominally well with low latency (small &/or stable RTT) networks. Effectiveness (high correlation) can be achieved in more complex networks only when relatively constant steady-state running operation can be achieved. Relatively minor events (like large packet credit balances) can violate basic, underlying, implicit assumptions regarding operational environment (invalidating criteria for successful operation of window-ACK protocol).

Inter-packet transmission delay intervals (i.e. rate-based pacing) presents a solution that directly correlates with the objective (as stated: direct control of packet arrival rates). Given that resources and processing capacity limits exist at the link, network, and transport layer, it is appropriate that rate based pacing be applied at all three levels. Furthermore, interaction between the rate controls at the separate levels is required (for optimal operation).

In the past, network (intermediate) nodes where somewhat insulated from transient back-to-back packet arrivals by slow speed links. This insulation layer evaporates with the newer, higher speed link technologies. Furthermore, faster links have tended to exacerbate bursty operational characteristics. A quantitative view of the change in technology has been the significant reduction in transmission "e(i)" values compared to various intermediate node values (while router processing rates have had to keep pace, queueing delays at intermediate nodes can frequently far exceed elapsed link transmission times).

The first approximation for inter-packet transmission delays (at the transport session layer) involves tracking the ACK intervals for successive packets (travelling on the same route/path). This value gives some indication of elapsed processing times in various components along the network path. However, it doesn't directly reflect resource queueing delays (although indirect indication of contention from other transport sessions is present).

HSDT Buffer Management

There was an additional realization in the HSDT project regarding rate-based vis-a-vis window-ack ... which specifically involved buffer allocation at the link, network and transport level. The average number of simultaneously active buffers tended to stabilize for the system as a whole (except when the situation destabilized with onset of saturation/congestion).

The net result was that it was possible to allocate a common buffer pool (assuming some inter-level cooperation), the size of which depended on the average activity in the system (plus an overflow fudge factor for temporary transient conditions). The number of required buffers (for incoming traffic) was dependent on:

Some degree of packet latency (in the system) is perfectly normal (and can be allowed for). Temporary transients (increases) in the packet latency time is also acceptable. However, significant increases in packet latency would result in significant increases in the number of required buffers. However large increases in latency (number of required buffers) would also indicate the onset of congestion/saturation.

The HSDT (in-coming) buffer pool size was totally independent of the end-to-end latency and/or RTT of various sessions (transport level) and/or connections (network and link level). In window-ack, the "theoretical" buffer pool size would be proportional to the combination of all the various RTTs.

A simple example is an end-point node with a single network connection. In rate-pacing, the (in-coming) buffer pool size is basically a function of the network link speed and the processor activity. For window-ack, the aggregate (in-coming) buffer pool size would be a function of the number of sessions and the RTT for the various sessions. Furthermore, because of queueing delays (on the network link), average session RTT time increases with the number of sessions (positive feedback between number of sessions and session RTT), which further increases the total buffer requirements (i.e. packet arrival rate may stay constant and number of buffers for efficient processing stays fairly constant, but window-ack protocol can require large increases in number of allocated buffers).

The rate-based common buffer pool management allowed for a significant reduction in the number of required buffers (real storage) compared to a window-ack approach (especially in situations where there is a large latency*bandwidth product).

In addition, for situations where the buffer pool is allocated out of virtual memory, the rate-based common buffer pool management also had the attribute of maintaining a much smaller working set size with a dense page reference pattern. There are, in fact, some superficial analogies between the rate-based common pool vis-a-vis window-ack and global LRU page replacement vis-a-vis local LRU page replacement. Global LRU normally requires much less real storage, as well as significantly less processing time to provide the same level of service.

HSDT also investigated go-back-N and selective resend. For an error/loss in the selective resend case, the buffer requirements for a particular session (requiring in-order delivery) would temporarily balloon to 2*RTT. However, this would be viewed as a temporary, acceptable transient (with the overall common buffer pool size being managed by normal queueing theory controls). This was the only situation where buffer requirements was proportional to RTT and/or latency*bandwidth product. Given a large pipe, large number of sessions, and low error rate; a transient 2*RTT buffer requirement for a specific session was a minor transient in the overall buffer pool size. The size of the transient would increase as the error rate increased and/or as the percentage of the pipe bandwidth for the session increased.

Adaptive Rate-Pacing

Adaptive techniques for adjusting inter-packet transmission delays involve:

  1. exchanging information between network & link layer
  2. exchanging information with adjacent network nodes
  3. watching for ACK intervals greater than transmission intervals
  4. periodically probing network with a packet pair having shorten inter-packet transmission delay

Of course, various existing information techniques are also applicable:

  1. RTT (round-trip times) and trends in RTT changes (at link, network, and transport layers)
  2. missing packets
  3. dropped packets, ICMP source quench, etc.

Adaptive rate-pacing problems:

  1. increase in complexity
  2. defining interaction & exchange of information between layers
  3. heterogeneous routes/paths for packets
  4. early ACK'ing
  5. ACK batching
  6. large random fluctuations in RTT
  7. router queueing delay doesn't directly show up in "T(b)-T(a)" calculations
  8. network partitioning of bandwidth between end-to-end traffic flows