Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NDP.p4 trimming switch code targetting Tofino TNA #92

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 78 additions & 0 deletions research_projects/NDP/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# NDP trimming switch

This directory hosts the `ndp.p4` trimming switch
implementation targetting the TNA architecture.
For more details, see our paper [Re-architecting datacenter networks and stacks for low latency and high performance](http://nets.cs.pub.ro/~costin/files/ndp.pdf).

The p4 code, table population scripts and instructions for building and running
NDP for the Tofino switch are in the `dev_root/` directory.

# How it works

To summarize, `ndp.p4` keeps an under-approximation of
the buffer occupancy for each port in ingress (by means of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: 'each port' -> 'each egress port'

three-color meter). Whenever the meter turns red, it means
that the buffer is full and packet undergoes trimming. To achieve
that, we mark the packet to be cloned to egress and setup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'to be cloned from ingress to egress'

the clone session to truncate the packet in such way as to
only keep the packet headers.

Since Tofino is keeping per-pipeline meter state, we may
end up in the situation where multiple ingress pipelines are
flooding a single output port without any of the meters turning
red. To solve this situation, we devise a three-level meter
strategy and make use of the Deflect-on-Drop capabilities
on the Tofino to make ingress meters trim more aggressively for
the port which is experiencing drops. After a pre-defined
interval, the meters switch to an intermediate level of trimming
and after even more time, when the incast has passed, to their original trim rate.

A more detailed description of the implementation:

* on ingress

0) the packet undergoes regular ipv4 forwarding with fwd decision to port egport
1) if packet is ndp control => output packet to HIGH_PRIORITY_QUEUE
2) if packet is ndp data => pass packet through meter[egport]

2.1) if meter color is GREEN => output packet to LOW_PRIORITY_QUEUE

2.2) if meter color != GREEN => clone packet to sid where (sid maps to egport, HIGH_PRIORITY_QUEUE,
packet length = 80B)
3) if packet is not ndp => proceed with forwarding on OTHERS_QUEUE

* on egress:
1) if packet is ndp data and comes in from DoD port (dropped due to congestion)
2) when trimmed or normal packets come in => do rewrites (mac src and dst addresses) and set ndp trim flags
3) when clone packet back to egress to sesssion esid (esid maps to recirculation port, HIGH_PRIORITY_QUEUE, packet length = 80B)
4) when packet comes back from egress clone => forward as-is (i.e. recirculate back into ingress) and notify all pipelines
to transition into pessimistic mode

### NDP modes:
* Each egress port works in 3 modes:
- optimistic
- pessimistic
- "halftimistic"

The mode decides what meter will be used for NDP packets going out on the given port

* In optimistic, we use meter_optimistic (line-rate)

* In pessimistic, we use meter_pessimistic (1/4 * line-rate)

* In halftimistic, we use meter_halftimistic (1/2 * line-rate)

Initially, the switch starts in optimistic mode for all ports.

Whenever a DoD packet is received in egress => all ingress pipelines are notified to
trim more aggressively (i.e. transition into pessimistic mode).

A port remains in pessimistic mode for T0 ns if no extra DoDs occur.
After T0 ns, the port transitions into halftimistic mode.

A port remains in halftimistic mode for T1 ns if no other DoDs occur.
After T1 ns, the port transitions back into optimistic mode.

NB: T0 and T1 are hardcoded into ndp.p4 and are currently set
to 6us and 24us respectively.

20 changes: 20 additions & 0 deletions research_projects/NDP/dev_root/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
P4C ?= $(SDE_INSTALL)/bin/bf-p4c

all: .ndp.ts

.ndp.ts: ndp.p4 common/util.p4 common/headers.p4
ifndef SDE_INSTALL
$(error SDE_INSTALL is undefined)
endif
$(P4C) $(OPT) $< --bf-rt-schema ndp.tofino/bfrt.json
cp -r ndp.tofino $(SDE_INSTALL)/
@touch $@

package:
tar -cvf ndp.tar -T package.txt

clean:
rm -rf .ndp.ts ndp.tar
rm -rf .dodtest.ts

.PHONY: clean package
136 changes: 136 additions & 0 deletions research_projects/NDP/dev_root/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# ndp.p4

This repo contains TNA P4 code for running NDP with trimming.

* ndp.p4 contains the P4 source code
* setup_ndp.py contains the control-plane code which populates
NDP's tables
* samples/ - contains configuration samples for Tofino

## Compiling

`ndp.p4` was tested against version `9.2.0` of the Intel SDE
(formerly `bf-sde`). We assume that the following environment variables are set prior to building and running: `$SDE`, `$SDE_INSTALL`.

```
make
```
The output of this command is `$SDE_INSTALL/ndp.tofino/`

## Running
Deploying the P4 switch on hardware:
```
$SDE/run_switchd.sh -p ndp -c $SDE_INSTALL/ndp.tofino/ndp.conf
```

## Control plane
The control plane consists of the script `setup_ndp.py`
which takes as input a configuration file and populates
the entries for NDP. The current configuration is static
(i.e. no dynamic routing, no ARP etc.).

`setup_ndp.py` works in two modes:
* single-pipe: no extra CLI arguments - the input is a *single-pipe* json - it will set all tables as symmetric (see below)
* multi-pipe: requires `-multi-pipe` CLI option. Expects as input a
multi-pipe json file which consists of a dictionary where keys are
strings representing pipe_ids and values are objects with the sole
attribute "file" whose value points to the single-pipe input json
for the particular pipe_id. The *single-pipe* input format is
described below

The *single-pipe* input to `setup_ndp.py` is a json file with the following contents:
- arp - the contents of the ARP table of the switch (maps IPv4 -> MAC) - a dictionary of **arp entry** objects key IPv4 address, value MAC address
- rates - a list of **rate** objects with the following attributes:
- eg_port - dev_port for which the following attributes apply
- rate_kbps - int - meter speed in kbps (required)
- burst_kbits - int - meter "buffer" (burst size) - in kBits (required)
- shaper_rate_kbps - int - meter speed in kbps (optional: default to rate_kbps) - NB: shaper_rate_kbps = 0 ==> shaper is disabled for given port
- shaper_burst_kbits - int - burst size of shaper (optional: defaults to shaper_burst_kbits)
- port_speed - str - one of 10G, 25G, 40G, 50G, 100G (required)
- port_bufsize - int
- fec - str - one of NONE, RS (required)
- entries - a list of **entry** objects with the following attributes (all of them are required):
- smac - source MAC of outgoing port
- dip - destination IPv4
- eg_port - dev_port - outgoing device port
- nhop - IPv4 of next hop or 0 if the destination IP subnet is directly connected
- allow_pessimism - bool - optional: default True. Disables optimistic/pessimistic modes and only considers optimistic meter

## Examples

First of all, set up the PYTHONPATH
```
PYTHONPATH=$SDE_INSTALL/lib/python2.7/site-packages/:$SDE_INSTALL/lib/python2.7/site-packages/tofino
```
* multi-pipe
```
python setup_ndp.py -multi-pipe samples/multi_pipe/r1.json
```

* single-pipe
```
python setup_ndp.py samples/single_pipe/r0_config.json
```

* troubleshooting
If running the script fails with something like `google.protobuf.internal`
not found, run the following (assuming original python site-packages is in /usr/local/lib/python2.7/site-packages)
```
cp -r /usr/local/lib/python2.7/site-packages/protobuf*/google/protobuf/ $SDE_INSTALL/lib/python2.7/site-packages/google/
```

Changing the running mode (single-pipe vs multi-pipe) between two
consecutive runs may sometimes lead to errors. If this is the case,
re-deploying ndp.p4 on the switch should solve the issue.

## How it works

Check out the original [NDP SIGCOMM'17 paper](https://dl.acm.org/doi/10.1145/3098822.3098825).

### Current implementation
* on ingress

0) the packet undergoes regular ipv4 forwarding with fwd decision to port egport
1) if packet is ndp control => output packet to HIGH_PRIORITY_QUEUE
2) if packet is ndp data => pass packet through meter[egport]

2.1) if meter color is GREEN => output packet to LOW_PRIORITY_QUEUE

2.2) if meter color != GREEN => clone packet to sid where (sid maps to egport, HIGH_PRIORITY_QUEUE,
packet length = 80B)
3) if packet is not ndp => proceed with forwarding on OTHERS_QUEUE

* on egress:
1) if packet is ndp data and comes in from DoD port (dropped due to congestion)
2) when trimmed or normal packets come in => do rewrites (mac src and dst addresses) and set ndp trim flags
3) when clone packet back to egress to sesssion esid (esid maps to recirculation port, HIGH_PRIORITY_QUEUE, packet length = 80B)
4) when packet comes back from egress clone => forward as-is (i.e. recirculate back into ingress) and notify all pipelines
to transition into pessimistic mode

### NDP modes:
* Each egress port works in 3 modes:
- optimistic
- pessimistic
- "halftimistic"

The mode decides what meter will be used for NDP packets going out on the given port

* In optimistic, we use meter_optimistic (line-rate)

* In pessimistic, we use meter_pessimistic (1/4 * line-rate)

* In halftimistic, we use meter_halftimistic (1/2 * line-rate)

Initially, the switch starts in optimistic mode for all ports.

Whenever a DoD packet is received in egress => all ingress pipelines are notified to
trim more aggressively (i.e. transition into pessimistic mode).

A port remains in pessimistic mode for T0 ns if no extra DoDs occur.
After T0 ns, the port transitions into halftimistic mode.

A port remains in halftimistic mode for T1 ns if no other DoDs occur.
After T1 ns, the port transitions back into optimistic mode.

NB: T0 and T1 are hardcoded into ndp.p4 and are currently set
to 6us and 24us respectively.
Loading