In the series 'Notes on flow control', Diptanshu Singh shares notes from his studies of scheduling mechanisms used in router architectures. The series has covered the principles of flow control and rate-based versus credit-based flow control mechanisms, and concludes here by exploring how flow control is implemented in switch ASICs.
Flow control in switch application-specific integrated circuits (ASICs) presents unique challenges. Hardware constraints, latency requirements, and scalability considerations all influence how flow control is implemented at the chip level.
This post explores how these issues come together to shape real-world hardware designs.
Why do we need parallel pipelines?
Assume that we want to build a 25.6Tbps (32 x 400G) switch. This translates into a packet processing rate of ≈ 38 billion packets-per-second (pps) for small 64 byte packets and ≈ 11.6 billion pps for 256 byte packets (including 20 bytes of per-packet overhead for the preamble, start frame delimiter, and interframe gap). If the switch ASIC is clocked at 1.2GHz with a 256B-wide data bus and processes one packet per clock cycle, a single pipeline can handle at most ≈ 1.2 billion pps.
Addressing this mismatch requires parallelism in the hardware design, which brings tradeoffs of its own. We accept that the switch won't reach full line rate for very small packets so that the number of parallel pipelines stays manageable, and we use parallel pipelines with wider internal data paths, which allow us to process more packets per second and maintain the overall switch throughput.
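As a quick sanity check, here is a small Python sketch that reproduces the packet rates quoted above (assuming the same 20B of per-packet Ethernet overhead):

```python
# Back-of-the-envelope check of the packet-rate mismatch described above.
def pps(link_bps: float, packet_bytes: int, overhead_bytes: int = 20) -> float:
    """Average packet arrival rate on a link, in packets per second."""
    return link_bps / (8 * (packet_bytes + overhead_bytes))

link = 25.6e12                                            # 25.6Tbps switch
print(f"64B packets : {pps(link, 64) / 1e9:.1f} Gpps")    # ~38.1 Gpps
print(f"256B packets: {pps(link, 256) / 1e9:.1f} Gpps")   # ~11.6 Gpps
print(f"One pipeline: {1.2e9 / 1e9:.1f} Gpps")            # 1.2GHz, one packet per cycle
```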
Minimum required parallelism
A simple, crude way to estimate the number of parallel pipelines we need is as follows:

1. Calculate the packet arrival rate (R). Packets on a link with bandwidth $B$ bits per second, packet size $S$ bytes, and overhead $G$ bytes (which includes the preamble, start frame delimiter, interframe gap, and so on) arrive at an average rate given by:

$$R = \frac{B}{8(S + G)}$$

2. Calculate the single-pipeline processing rate (r). If each pipeline operates at a clock frequency $f$ (in Hz), has a datapath width $W$ (in bytes), can process $c$ packets per clock cycle, and needs multiple cycles for packets that exceed the bus width, the single-pipeline processing rate is:

$$r = \frac{f}{\left\lceil \dfrac{(S + G)/W}{c} \right\rceil}$$

where $\left\lceil \dfrac{(S + G)/W}{c} \right\rceil$ is the number of cycles required to process each packet, determined by the datapath width and the packets per clock cycle.

3. Determine the required parallelism (P). To avoid packet loss, the number of parallel pipelines required is:

$$P = \frac{R}{r} = \frac{B}{8(S + G)} \times \frac{\left\lceil \dfrac{(S + G)/W}{c} \right\rceil}{f}$$

If $P$ is greater than one, multiple pipelines are required.
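Putting the three steps together, here is a small Python sketch of the same calculation, with the earlier 25.6Tbps example values (256B datapath, 1.2GHz clock, one packet per clock cycle, 20B overhead) used as defaults:

```python
import math

def required_pipelines(B: float, S: int, W: int = 256, f: float = 1.2e9,
                       c: int = 1, G: int = 20) -> int:
    """Minimum number of parallel pipelines P = R / r.

    B: link bandwidth (bits/s), S: packet size (bytes), G: overhead (bytes),
    W: datapath width (bytes), f: clock (Hz), c: packets per clock cycle.
    """
    R = B / (8 * (S + G))                        # packet arrival rate (pps)
    cycles_per_packet = math.ceil((S + G) / W / c)
    r = f / cycles_per_packet                    # single-pipeline rate (pps)
    return math.ceil(R / r)

for size in (64, 256, 1500):
    print(f"{size}B packets -> {required_pipelines(25.6e12, size)} pipelines")
```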
The plot below shows the number of pipelines needed, assuming a 1.3GHz clock and a 256B data bus, for rates of 12.8Tbps, 25.6Tbps, and 51.2Tbps.
Below is an abstracted logical view of multiple parallel ingress and egress pipelines connected by a fabric. Some vendors will also refer to this as a slice.
Flow control between ingress and egress
This brings us to the flow control mechanism between the ingress and egress pipelines within modern switches. Broadly, these ASICs can be classified into two categories based on their internal memory architecture.
Monolithic ASICs have ingress and egress sides that share a common memory domain. Due to this shared memory architecture, congestion and buffer occupancy can be effectively managed using simple threshold-based counters. In these switches, the scheduler can directly observe the occupancy state of each egress queue by checking local occupancy counters, removing the need for explicit flow-control messaging.
On the other hand, there is a category of ASICs designed for distributed-style switches, whether in a single chassis or in distributed architectures like Disaggregated Scheduled Fabric (DSF). These designs distribute buffering and forwarding responsibilities across separate ASICs, which are interconnected by a switching fabric.
The internal fabric will usually have a speed-up; for example, Jericho 2c+ has a 33% speed-up. Since ingress and egress ASICs maintain separate memory domains without direct visibility into each other's occupancy states, explicit flow-control communication becomes necessary. Typically, this is implemented using credit-based flow control (CBFC), where ingress ASICs transmit data only after receiving explicit credits (grants) from egress ASICs. This ensures data transmission never exceeds the available buffering capacity at the egress.
The receiver grants explicit credits to the sender, with each credit representing permission to send a fixed number of bytes, known as the credit quantum. Many (but not all) distributed-style systems route packets internally as fixed-size cells to simplify scheduling and improve efficiency.
Below is an abstracted view where the egress scheduler issues a credit grant to the ingress before the ingress can send data to the egress. Some people assume that if an ASIC has a virtual output queue (VOQ) it must be using CBFC, but this is not the case; monolithic ASICs, for example, do not use CBFC.
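To make the credit/grant exchange concrete, here is a deliberately simplified Python sketch of the idea. The class names, quantum, and buffer sizes are illustrative only and are not modelled on any particular ASIC; egress buffer drain to the output port is omitted for brevity:

```python
CREDIT_QUANTUM = 256          # bytes granted per credit message
EGRESS_BUFFER = 40 * 1024     # egress buffer in bytes (~1x BDP, see later)

class Egress:
    """Egress side: grants credits only against uncommitted buffer space."""
    def __init__(self):
        self.free = EGRESS_BUFFER
        self.outstanding = 0              # bytes granted but not yet received

    def grant(self) -> int:
        if self.free - self.outstanding >= CREDIT_QUANTUM:
            self.outstanding += CREDIT_QUANTUM
            return CREDIT_QUANTUM
        return 0

    def receive(self, nbytes: int) -> None:
        self.outstanding -= nbytes
        self.free -= nbytes               # data now occupies the egress buffer

class Ingress:
    """Ingress side: only transmits data it holds credits for."""
    def __init__(self, queued: int):
        self.credits = 0                  # bytes we are allowed to send
        self.queued = queued              # bytes waiting in the VOQ

    def send(self, egress: Egress) -> None:
        burst = min(self.credits, self.queued)
        if burst:
            self.credits -= burst
            self.queued -= burst
            egress.receive(burst)

ingress, egress = Ingress(queued=10_000), Egress()
while ingress.queued:
    ingress.credits += egress.grant()     # credit/grant from the egress scheduler
    ingress.send(egress)                  # data moves only against credits
```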
Credit quantum sizing
How big or small should the credit quantum size be?
A small credit quantum means each credit message covers fewer bytes, which requires the receiver to issue more frequent credit messages, placing a higher demand on control plane resources. On the other hand, a credit quantum that is too large can be inefficient, for example when a grant is only partially used because less data is queued at the ingress than the quantum allows.
The credit scheme also needs to release enough credit to cover the delay between the sender and receiver, with no loss of throughput at the receiver. For example, let's assume that we have a 400G port (≈ 50 GBps) and a credit engine that can issue 0.5 billion credit messages per second (one credit every 2ns).
To keep the port busy in this example, the minimum credit quantum needs to be at least:

$$C_{min} = \frac{50\ \mathrm{GBps}}{0.5\ \mathrm{Gcredits/s}} = 100\ \mathrm{B}$$

If we account for a 5% speed-up in the fabric, that becomes 105B. A typical cell size is 256 bytes, so we can round this up to 256 bytes per credit.
What if credits are given at the slice level rather than the port level? If, for example, 18 x 400G ports make up one slice, the credit size will be around 1800B, which can be rounded up to 2KB.
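The same sizing arithmetic in a few lines of Python; the helper name is mine, and the 0.5-billion-credits-per-second message rate, 5% speed-up, and 256B cell size are the example assumptions from the text:

```python
def credit_quantum(port_gbps: float, credit_msgs_per_sec: float,
                   speedup: float = 1.05, cell_bytes: int = 256) -> int:
    """Minimum credit quantum, rounded up to a whole number of cells."""
    bytes_per_sec = port_gbps / 8 * 1e9                  # link rate in bytes/s
    c_min = bytes_per_sec * speedup / credit_msgs_per_sec
    cells = -(-c_min // cell_bytes)                      # ceiling division
    return int(cells * cell_bytes)

print(credit_quantum(400, 0.5e9))        # per-port credit: ~105B -> 256B
print(credit_quantum(18 * 400, 0.5e9))   # per-slice (18 x 400G): ~1890B -> 2KB
```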
When sizing the credit quantum, some practical limits to keep in mind are:
Credits issued at the slice level can cause unfairness.
This can be fixed by using port-level credit allocation instead of slice-level credit allocation, though I think this results in more control plane overhead in terms of credit message generation.
It's worth mentioning that this issue stems from how the scheduler allocates resources, not from the credit/grant system itself. The issue applies to monolithic ASICs as well.
The highest amount of data that can be outstanding between ingress and egress is roughly 1x bandwidth-delay product (BDP). In our example this is around 400Gbps (50 GBps) x 800ns = 40KB. Assuming 256B per cell, that is about ~156 in-flight cells. This also means the egress needs a small buffer of at least ~40KB to absorb the in-flight data and avoid any drops.
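The same numbers, as a quick check:

```python
port_gbps = 400
rtt_ns = 800                                  # ingress-to-egress round trip
cell_bytes = 256

bdp_bytes = port_gbps / 8 * rtt_ns            # (Gbps x ns) / 8 = bytes
print(bdp_bytes)                              # 40000.0 -> ~40KB of in-flight data
print(bdp_bytes / cell_bytes)                 # ~156 cells of 256B in flight
```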
Buffering in the cell fabric
The cell fabric, whether it's internal to a single switch or part of a distributed fabric, must have some minimal buffering to handle contention when cells from multiple inputs compete for an output link. Shallow buffers are also important to keep latency and jitter minimal for the cells travelling through the fabric.
The amount of buffering needed can be modelled by the M/D/1 queueing model, where M is Markovian arrivals, D is a deterministic service time, and 1 is a single service channel.
Intuition behind using the M/D/1 queuing model
Imagine you're at a toll booth on a highway. Each car represents a data cell arriving at an output link. Now, consider two different scenarios: in the first, a human operator handles each car and takes a variable, unpredictable amount of time per vehicle; in the second, an automated machine processes every car in exactly the same fixed amount of time.
Now ask yourself - which scenario would you rather be stuck in as a driver?
The second one, with the predictable machine, will always have shorter average queues.
Why?
Because unpredictability compounds. In the M/M/1 case, you might face a burst of cars and slow processing at the same time - a worst-case scenario. But in M/D/1, you remove one major source of randomness - the service time. This greatly improves queue behaviour and average wait times.
This is exactly what happens with a cell-based switching fabric. Cells may arrive randomly from multiple ingress points, but because each cell has a fixed size, the time to transmit each one is deterministic. That is, the service time is constant. This deterministic behaviour in transmission mirrors the M/D/1 scenario. By eliminating variability in service time, the system maintains shorter queues and better overall throughput - even under bursty traffic conditions.
Why deterministic service helps
As we mentioned earlier, in an M/M/1 queue, you have two sources of randomness fighting each other. The queue length at any moment depends on the accumulated difference between random arrivals and random departures. It's like taking a random walk where both your forward and backward steps are random; you can wander quite far from the origin.
In an M/D/1 queue, you've replaced one random process with a deterministic process. Every tick, exactly one customer leaves (if there's anyone in the queue). The only randomness comes from arrivals. This is like a random walk where you always take exactly one step backward each time unit, but a random number of steps forward. You will still wander, but much less wildly.
Mathematically, this manifests as the average waiting queue length in M/D/1 being exactly half of that in M/M/1. For instance, at a utilization of ρ=0.9, an M/M/1 queue averages about 8.1 customers waiting, while M/D/1 averages only half of that, about 4.05.
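These means can be checked with the Pollaczek-Khinchine formula; the sketch below counts customers waiting in the queue (not including the one in service):

```python
def lq(rho: float, cs2: float) -> float:
    """Mean number waiting (Lq) from Pollaczek-Khinchine:
    Lq = rho^2 * (1 + Cs^2) / (2 * (1 - rho)), where Cs^2 is the squared
    coefficient of variation of the service time."""
    return rho ** 2 * (1 + cs2) / (2 * (1 - rho))

rho = 0.9
print(f"M/M/1 Lq: {lq(rho, 1.0):.2f}")   # exponential service, Cs^2 = 1 -> ~8.10
print(f"M/D/1 Lq: {lq(rho, 0.0):.2f}")   # deterministic service, Cs^2 = 0 -> ~4.05
```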
See the table below for a comparison of queuing results between M/M/1 and M/D/1, where μ is the service rate, λ is the arrival rate, and ρ is the utilization:

$$\rho = \frac{\lambda}{\mu} \quad (0 < \rho < 1)$$

Tail probability behaviour
Typically, we don't just care about average behaviour. We also care about rare events. What's the probability that the queue exceeds some large value N? This helps in determining how much buffer memory we need to avoid dropping cells.
Imagine tracking the queue size over time as a graph. Most of the time, it hovers around its average value. Occasionally, it spikes up due to a burst of arrivals. The tail probability tells us how often these spikes exceed level N.
For both M/M/1 and M/D/1, these probabilities decay exponentially with N, but the decay rates differ. The M/D/1 queue 'decays' faster, meaning large queues are even rarer than in M/M/1.
Queueing theory is a rabbit hole of its own, and diving deep into M/D/1 would take us too far afield. I'll likely cover that in a separate post down the line. However, we can look at M/D/1 tail probability behaviour using a large deviation (LD) approximation, which can be given as:
$$P\{Q > N\} \approx C_q e^{-\theta N}$$

θ can be solved for using iterative algorithms like the Newton-Raphson method. If we want the probability of the queue exceeding N to be no larger than some target $P_{max}$ (for example, $10^{-6}$), we can find the required N from the inequality:

$$C_q e^{-\theta N} \le P_{max}$$

which reduces to:

$$N \ge \frac{\log C_q - \log P_{max}}{\theta}$$

For example, if we ask how large the queue needs to be so that, on average, at most one packet in a million arrives to a full queue at a utilization of 90% (ρ=0.9), we get ~53 cells, which is ≈ 13.5kB.
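As a rough illustration of the recipe (not a reproduction of the exact figure above), here is a Python sketch. It assumes the standard large-deviations decay-rate equation for an M/D/1 queue, ρ(e^θ − 1) = θ, solved with Newton-Raphson, and takes the prefactor C_q = 1; the resulting cell count is sensitive to the choice of C_q and the other parameters, so it will not match the ~53 cells above exactly:

```python
import math

def mdl_theta(rho: float, iterations: int = 50) -> float:
    """Positive root of f(theta) = rho * (e^theta - 1) - theta via Newton-Raphson."""
    theta = 1.0                                    # initial guess above the root
    for _ in range(iterations):
        f = rho * (math.exp(theta) - 1) - theta
        df = rho * math.exp(theta) - 1
        theta -= f / df
    return theta

def buffer_cells(rho: float, p_max: float, c_q: float = 1.0) -> int:
    """Smallest N with C_q * exp(-theta * N) <= p_max."""
    theta = mdl_theta(rho)
    return math.ceil((math.log(c_q) - math.log(p_max)) / theta)

cells = buffer_cells(rho=0.9, p_max=1e-6)          # C_q = 1 is an assumption
print(cells, "cells,", cells * 256 / 1024, "KB")   # same order of magnitude as above
```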
In this series we covered the principles of flow control, compared rate-based and credit-based flow control systems, and looked at how flow control is implemented in ASICs.
This turned out longer than planned - a classic case of flow control without a rate limiter. I will leave the final conclusions to you. If you found it helpful, great! And if you spot any errors, think of them as lost packets - feel free to resend with corrections.
Diptanshu Singh is a Network Engineer with extensive experience designing, implementing and operating large and complex customer networks.
This post was adapted from the original on his blog.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.