How to Configure Lossless RoCE
This page discusses real-life configuration of Mellanox Spectrum-based Ethernet switches for lossless RoCE and TCP traffic. The switch will be enabled with PFC and ECN.
Note: This page is a translation of an article titled How to Configure Mellanox Spectrum Switch for Lossless RoCE into the language of Linux Switch.
- Overview of Configuration
- Configuring Shared Buffer Pools
- Configuring Traffic Prioritization
- Configuring Traffic Scheduling
- Configuring Priority Group Buffers
- Configure Mapping of Traffic to Pools
- Configure ECN
- Configure PFC
- Further Resources
There are three principal traffic flows: RDMA, CNP, and everything else. The following is an overview of how each of these traffic types will be treated. Traffic prioritization is based on trust DSCP.
Type | DSCP | Prio | Buf | Pools | Scheduling |
---|---|---|---|---|---|
TCP | 0 | 0, PFC off | PG0 | ing 0 / eg 4 | TC0, WRR |
RDMA | 24 | 3, PFC on | PG3 | ing 1 / eg 5 | TC3, WRR, ECN |
CNP | 48 | 6, PFC off | PG6 | ing 1 / eg 5 | TC6, strict |
See the QoS page for details about configuration of shared buffer pools for lossless traffic and in general.
Pools 0 (ingress) and 4 (egress) will be used for lossy traffic, pools 1 (ingress) and 5 (egress) for lossless traffic. Pools 0, 1 and 4 will each use half of the available chip memory:
$ devlink -j sb show pci/0000:03:00.0 | jq '.sb[][0].size / 2'
7012352
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 7012352 thtype dynamic # ingress lossy
$ devlink sb pool set pci/0000:03:00.0 pool 4 size 7012352 thtype dynamic # egress lossy
$ devlink sb pool set pci/0000:03:00.0 pool 1 size 7012352 thtype dynamic # ingress lossless
Pool 5 (egress lossless) will be as large as the chip permits:
$ devlink -j sb show pci/0000:03:00.0 | jq '.sb[][0].size'
14024704
$ devlink sb pool set pci/0000:03:00.0 pool 5 size 14024704 thtype dynamic # egress lossless
Note: In practice, the limit just needs to be large enough to not be a limiting factor. This is a good way of making sure it is.
Finally, configure the port-pool quotas of pools 1 (ingress lossless) and 5 (egress lossless) to not be a limiting factor as well:
$ devlink sb port pool set swp1 pool 1 th 16 # ingress lossless
$ devlink sb port pool set swp1 pool 5 th 16 # egress lossless
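The resulting pool configuration can be double-checked with devlink. This is just a quick sketch of a verification step, assuming the same device and port names as in the examples above:

```shell
# Verify the global size and threshold type of the lossless pools.
devlink sb pool show pci/0000:03:00.0 pool 1
devlink sb pool show pci/0000:03:00.0 pool 5
# Verify the per-port quotas configured on swp1.
devlink sb port pool show swp1 pool 1
devlink sb port pool show swp1 pool 5
```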
See the QoS page for details about configuration of traffic prioritization in general and Trust DSCP in particular.
Use the iproute2 dcb tool to install prioritization rules for DSCP values of 0, CS3 and CS6:
$ dcb app flush dev swp1 dscp-prio
$ dcb app add dev swp1 dscp-prio 0:0
$ dcb app add dev swp1 dscp-prio CS3:3
$ dcb app add dev swp1 dscp-prio CS6:6
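The installed APP table entries can be reviewed afterwards; a quick check, assuming the same port name as above:

```shell
# Show the DSCP-to-priority rules currently installed on swp1.
dcb app show dev swp1 dscp-prio
```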
To switch the port headroom to TC mode, and thus permit manual configuration of TC mapping and buffer sizes, a qdisc needs to be installed first. See the Queues Management page for details about qdiscs in general, and ETS in particular.
Add the ETS qdisc to configure strict and WRR TCs. The configuration described above needs TC6 (and therefore band 1) to be strict, but that means TC7 needs to be strict as well, and thus two strict bands are needed. The priority map should forward all priorities to TC0 (band 7), except for priority 3, which should go to TC3 (band 4) and priority 6, which should go to TC6 (band 1).
$ tc qdisc replace dev swp1 root handle 1: \
ets bands 8 strict 2 quanta 250 250 2000 250 250 2000 \
priomap 7 7 7 4 7 7 1 7
The quanta shown above are to a degree arbitrary. Only bands 4 and 7, both with a quantum of 2000, will see WRR traffic. The 2000 / 2000 split simply means that both bands should have the same weight. Note that the HW is configured using percentages, not quanta. The quanta for the remaining bands are chosen so that each band's quantum ends up being a nice even percentage of the total. Here, the HW will be configured 5% : 5% : 40% : 5% : 5% : 40%. With no traffic hitting the 5% traffic classes, the split among the relevant ones is 1:1.
In this particular case, we could have left the quanta configuration out altogether. By default, each DRR band gets a quantum of one MTU. But because 100% does not split evenly among 6 bands, the HW configuration would be 16% : 17% : 17% : 16% : 17% : 17%. The relevant bands are therefore 1:1 again, but we would have to rely on knowledge of how the algorithm splits the rounding error among bands, so the result is not as self-evident as above.
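The percentage split described above can be reproduced with a few lines of shell arithmetic. This is only an illustration of the calculation; integer division suffices here precisely because the quanta were chosen to divide evenly:

```shell
# Compute each ETS band's share of the WRR bandwidth from its quantum.
quanta="250 250 2000 250 250 2000"
total=0
for q in $quanta; do
	total=$((total + q))            # total ends up being 5000
done
for q in $quanta; do
	printf '%d%% ' $((100 * q / total))
done
printf '\n'
# Prints: 5% 5% 40% 5% 5% 40%
```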
See the QoS page for details about configuration of priority group buffers.
Traffic with priority 0 should go to PG buffer 0, priority 3 to PG3 and priority 6 to PG6. This has to be configured using the iproute2 dcb tool. PG3 needs to be given a non-zero size to cover traffic that arrives after Xoff is transmitted. Other PGs can be set to zero, so that mlxsw autoconfigures them according to the port MTU at the time the command is issued.
To determine the size of the buffer, one needs to take into consideration all the traffic to be accommodated after the need to emit a PAUSE or PFC frame is identified by the chip, but before the frame is actually emitted, received, and acted upon. This has to take into account e.g. line rate, cable length, MTU and various delays and latencies. There is correspondingly no "rule of thumb" value to use. The necessary size can be determined using the hdroom_sz tool:
$ hdroom_sz --asic spc1 --linerate 100G --mtu 9000 --cable-length 1.0
xon_thresh 19456
xoff_thresh 19456
headroom_size 96432
$ dcb buffer set dev swp1 prio-buffer all:0 3:3
$ dcb buffer set dev swp1 buffer-size all:0 3:97K
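The headroom configuration can be checked afterwards. Note that the driver may round the buffer size up to the chip's cell size, so the reported value can differ slightly from the requested 97K:

```shell
# Show the priority-to-buffer mapping and the resulting buffer sizes.
dcb buffer show dev swp1 prio-buffer
dcb buffer show dev swp1 buffer-size
```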
See the QoS page for details about configuration of pool binding.
Lossy traffic from PG0 and TC0 should go to pools 0 and 4, respectively:
$ devlink sb tc bind set swp1 tc 0 type ingress pool 0 th 11 # ingress lossy
$ devlink sb tc bind set swp1 tc 0 type egress pool 4 th 13 # egress lossy
Lossless PG3 and TC3 should go to, respectively, pool 1 (ingress lossless) and pool 5 (egress lossless). The egress pool quota should again be effectively infinite:
$ devlink sb tc bind set swp1 tc 3 type ingress pool 1 th 11 # ingress lossless
$ devlink sb tc bind set swp1 tc 3 type egress pool 5 th 16 # egress lossless
Lossy CNP traffic from PG6 and TC6 will likewise go to the lossless pools:
$ devlink sb tc bind set swp1 tc 6 type ingress pool 1 th 13 # ingress lossless
$ devlink sb tc bind set swp1 tc 6 type egress pool 5 th 13 # egress lossless
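With the bindings in place, buffer usage can be observed at runtime through devlink shared-buffer occupancy snapshots, which is useful for sanity-checking the thresholds above under real traffic (again assuming the device and port names from the examples):

```shell
# Take a snapshot of the current shared-buffer occupancy...
devlink sb occupancy snapshot pci/0000:03:00.0
# ...and display the per-pool and per-TC occupancy of swp1.
devlink sb occupancy show swp1
```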
See the description of RED qdisc for more details.
RED / ECN should be configured on TC3 (band 4, parent 1:5). In this example, we use a minimum and maximum such that:
- When the queue length reaches 150KB, packets will start being randomly marked with congestion in the ECN bits of the IP header.
- When the queue length reaches 1500KB, all packets will be marked with congestion in the ECN bits of the IP header.
$ tc qdisc replace dev swp1 parent 1:5 handle 15: \
red ecn limit 2M avpkt 1000 probability 0.1 min 150K max 1.5M
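Whether ECN marking is actually taking place can be observed in the qdisc statistics. The RED qdisc reports the number of ECN-marked packets in its "marked" counter:

```shell
# The "marked" counter of the RED qdisc at handle 15: increases as
# packets get ECN-marked.
tc -s qdisc show dev swp1
```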
See the QoS page for details about configuration of lossless traffic.
Use the iproute2 dcb tool to enable PFC for priority 3:
$ dcb pfc set dev swp1 prio-pfc all:off 3:on
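The per-priority PFC state can be inspected with the same tool, and with the -s switch it should also show PFC frame counters, which help confirm that pause frames are actually being exchanged:

```shell
# Show the PFC configuration and (with -s) the pause frame statistics.
dcb -s pfc show dev swp1
```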