Switch Port Configuration

Danielle Ratson edited this page Oct 1, 2024 · 36 revisions
Table of Contents
  1. Port Identification
  2. Physical Port Identification
    1. Using udev Rules
    2. Using ethtool
    3. Using systemd
  3. Port Administrative State
  4. Link Down Reason
  5. Port MTU
  6. Port Speed
  7. Port Lanes
  8. Port Splitting
    1. Splitting
    2. Unsplitting
  9. Transceiver Module Information
  10. Transceiver Module Reset
  11. Transceiver Module Power Mode Policy
  12. Transceiver Module Firmware Flashing
  13. Further Resources

Features by Version

Kernel Version    Feature
4.6               Port splitting
5.9               Link down reason
5.12              ethtool lanes support
5.14              Transceiver module EEPROM full read access
5.16              Transceiver module reset and power-mode policy
6.11              Transceiver module firmware flashing

Port Identification

Management ports do not use the same driver as front panel ports and can therefore be distinguished using the Linux ethtool utility.

The following is an output example of a management port on a Mellanox SN2700 switch:

$ ethtool -i eth1
driver: e1000e
version: 3.2.6-k
firmware-version: 1.10-0
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

The following is an output example of a front panel port on a Mellanox SN2700 switch:

$ ethtool -i sw1p1
driver: mlxsw_spectrum
version: 1.0
firmware-version: 13.400.116
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

The output examples above show that the management port uses Intel's e1000e driver, whereas the front panel port uses Mellanox's mlxsw_spectrum driver.
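
The driver check can also be scripted. The following is a minimal sketch, assuming that front panel ports always report a driver name starting with mlxsw_spectrum; the helper name port_kind is made up for illustration and is not part of ethtool. It reads ethtool -i output on standard input:

```shell
# Classify a port from `ethtool -i` output read on stdin.
# Assumption: front panel ports report a driver starting with "mlxsw_spectrum".
# port_kind is a hypothetical helper name, not part of ethtool.
port_kind() {
    awk -F': ' '
        $1 == "driver" {
            if ($2 ~ /^mlxsw_spectrum/) print "front-panel"
            else print "other (" $2 ")"
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}

# Demo on the driver lines from the two examples above.
printf 'driver: mlxsw_spectrum\nversion: 1.0\n' | port_kind
printf 'driver: e1000e\nversion: 3.2.6-k\n' | port_kind
```

On a switch, the function would typically be fed directly, for example: ethtool -i sw1p1 | port_kind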

Physical Port Identification

Using udev rules

As of Linux 4.7 it has become possible to create udev rules which rename the software interfaces (port netdevs) corresponding to the front panel ports according to the front panel numbering. To do so, create the following rule in /etc/udev/rules.d/10-local.rules:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \
    NAME="sw$attr{phys_port_name}"

Using ethtool

It is possible to make the port LED blink using ethtool and thereby identify the corresponding physical interface:

$ ethtool -p sw1p1

This command turns on the LED next to the port until it is explicitly turned off by killing ethtool. It is possible to turn the LED on for a specific number of seconds by running:

$ ethtool -p sw1p1 5

Using systemd

systemd 234 can automatically rename the ports according to their front panel numbering without user intervention. This results in names such as enp3s0np5, which represents front panel port 5.

Note: This functionality was backported to systemd 231 in Fedora and is thus available in Fedora 25 and later.

Port Administrative State

After booting the switch or loading the driver, all the ports go down. The following command changes the administrative state of the port to up:

$ ip link set dev sw1p5 up

However, the operational state of the port only changes to up if the port is able to negotiate the link with its partner, in which case the output appears as follows:

$ ip link show dev sw1p5
31: sw1p5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast switchid e41d2d45a9c0 state UP mode DEFAULT group default qlen 1000
    link/ether e4:1d:2d:45:a9:f5 brd ff:ff:ff:ff:ff:ff

To set the port to down, run:

$ ip link set dev sw1p5 down
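
The two states can be told apart from the flag list printed by ip link show: UP reflects the administrative state, while LOWER_UP is only present when the port has carrier. The following sketch extracts both from the first line of the output; the helper name link_flags is hypothetical:

```shell
# Report administrative and carrier state from `ip link show dev <port>`
# output read on stdin. link_flags is a hypothetical helper name.
link_flags() {
    # Take the comma-separated flag list between "<" and ">" on the first line.
    flags=$(sed -n '1s/^[^<]*<\([^>]*\)>.*$/\1/p')
    admin=down
    carrier=no
    case ",$flags," in *,UP,*) admin=up ;; esac
    case ",$flags," in *,LOWER_UP,*) carrier=yes ;; esac
    echo "admin=$admin carrier=$carrier"
}

# Demo on the flags from the example output above.
echo '31: sw1p5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP' | link_flags
```

On a switch: ip link show dev sw1p5 | link_flags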

Link Down Reason

If the administrative state of the port is up but its operational state is down, ethtool reports the reason in parentheses next to the link state. The format is:

$ ethtool ethX
Link detected: yes/no (extended_state, extended_substate)

For example:

$ ethtool ethX
...
Link detected: no (Autoneg, No partner detected)

The extended state is optional and there are cases where no reason will be provided.

In addition, there might be cases where only an extended state is provided, but without an extended substate. For example, when there is no link due to a missing cable, only an extended state is provided:

$ ethtool ethX
Link detected: no (No cable)

Descriptions of the extended states and substates can be found in the kernel source tree.

Note: ethtool version 5.8 or later is required in order to display the link down reason.
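
For monitoring scripts it can be handy to split the Link detected line into its parts. The following is a minimal sketch based only on the output format shown above; the helper name link_down_reason is hypothetical, and a missing state or substate is reported as none:

```shell
# Split the "Link detected" line of `ethtool <port>` output (stdin) into
# link state, extended state and extended substate.
# link_down_reason is a hypothetical helper name.
link_down_reason() {
    awk '
        /Link detected:/ {
            sub(/^.*Link detected: */, "")
            link = $1
            state = "none"; substate = "none"
            if (match($0, /\(.*\)/)) {
                # Text between the parentheses: "state" or "state, substate".
                detail = substr($0, RSTART + 1, RLENGTH - 2)
                n = index(detail, ", ")
                if (n > 0) {
                    state = substr(detail, 1, n - 1)
                    substate = substr(detail, n + 2)
                } else {
                    state = detail
                }
            }
            print "link=" link " state=" state " substate=" substate
        }
    '
}

# Demo on the two example lines above.
echo 'Link detected: no (Autoneg, No partner detected)' | link_down_reason
echo 'Link detected: no (No cable)' | link_down_reason
```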

Port MTU

To set the port MTU, run:

$ ip link set dev sw1p1 mtu 1400

The switch supports jumbo frames, so values higher than 1500 may be used.

Port Speed

Port speed settings are performed with the ethtool utility. Assuming the port's operational status is up, the user may query its current speed:

$ ethtool sw1p5 | grep Speed
        Speed: 40000Mb/s

In this case the port's speed is 40Gb/s. To set a different speed, run:

$ ethtool -s sw1p5 speed 10000 autoneg off

This sets the port's speed to 10Gb/s. Assuming the administrative state of the port is up, this command makes the port go through link negotiation again by toggling its administrative state to down and then up. However, the port only goes up if its partner also supports the configured speed.

The command also disables speed auto-negotiation by setting only one desired speed. To allow the switch to auto-negotiate and choose the highest advertised speed, the user may enable auto-negotiation by running:

$ ethtool -s sw1p5 autoneg on

To query the port speed after speed negotiation, run:

$ ethtool sw1p5 | grep Speed
        Speed: 40000Mb/s

In Spectrum-2 it is not possible to advertise a specific link mode. Instead, all the supported link modes of a given speed must be advertised. For example, in order to advertise only 100Gb/s, run:

$ ethtool -s sw1p5 advertise 0xF000000000

Where 0xF000000000 is the result of OR-ing the hexadecimal values of all the supported 100Gb/s link modes:

0x1000000000      100000baseKR4 Full
0x2000000000      100000baseSR4 Full
0x4000000000      100000baseCR4 Full
0x8000000000      100000baseLR4_ER4 Full

The complete list can be found in man ethtool.
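
The mask arithmetic can be reproduced in the shell. The following sketch ORs the four link mode bits listed above; the variable names are arbitrary:

```shell
# OR the 100Gb/s link mode bits listed above into one advertise mask.
mask=0
for bit in 0x1000000000 0x2000000000 0x4000000000 0x8000000000; do
    mask=$(( $mask | $bit ))
done

# Format the result as the hexadecimal value expected by `ethtool -s ... advertise`.
advertise=$(printf '0x%X' "$mask")
echo "$advertise"    # prints 0xF000000000
```

The result can then be passed on, for example: ethtool -s sw1p5 advertise "$advertise"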

Port Lanes

In Spectrum-2 and later ASICs, some speeds can be achieved with different numbers of lanes. The number of lanes can be configured with the ethtool utility. Assuming the port's operational status is up, the user can query the number of lanes currently used by the port:

$ ethtool swp1 | grep Lanes
        Lanes: 4

To advertise all link modes with a speed of 100Gb/s that use two lanes, run:

$ ethtool -s swp1 speed 100000 lanes 2 autoneg on

To force a speed of 100Gb/s using four lanes, run:

$ ethtool -s swp1 speed 100000 lanes 4 autoneg off

Port Statistics

Two types of statistics exist for each port:

  • Software
  • Hardware

Software statistics account for packets trapped to the CPU or packets sent from the CPU. Hardware statistics account for all packets going through the port, including those not trapped to or originating from the CPU.

Software Statistics

The ifstat utility is used to query the port's software statistics:

$ ifstat -x cpu sw1p5
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate
sw1p5                  0 0             0 0             0 0             0 0
                       0 0             0 0             0 0             0 0

Hardware Statistics

Two utilities can be used to query the port's hardware statistics:

  • ip utility
  • ethtool utility

Using ip:

Note: The ip utility only shows hardware statistics for netdevs representing physical ports. The statistics of virtual netdevs (e.g., vlan, bridge) are software statistics.

Using ethtool:

Resetting Statistics

The port's statistics are never reset while the driver is loaded. They can only be reset by removing and inserting the driver.

However, it is possible to see the difference in the hardware statistics using iproute2's ifstat utility. When executed, it shows the difference between the last and the current call:

$ ifstat sw1p5
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate
sw1p5                  1 0             1 0            98 0           114 0
                       0 0             0 0             0 0             0 0

(... after some time passes ...)

$ ifstat sw1p5
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate
sw1p5                  9 0             9 0           882 0          1026 0
                       0 0             0 0             0 0             0 0

Port Splitting

As of Linux 4.6 it has become possible to split and unsplit the front panel ports using the devlink utility, which is part of the iproute2 package. Note that devlink is available in iproute2 starting with version 4.6.0.

Splitting

The following command splits the first front panel port into 4 ports:

$ devlink port split pci/0000:03:00.0/61 count 4

Where pci/0000:03:00.0/61 is the DEV/PORT_INDEX handle used by devlink and can be retrieved using the command devlink port show:

$ devlink port show
...
pci/0000:03:00.0/61: type eth netdev sw1p1
...
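
The handle lookup can be scripted when only the netdev name is known. The following is a minimal sketch that scans devlink port show output for a given netdev; the helper name port_handle is hypothetical:

```shell
# Look up the devlink DEV/PORT_INDEX handle of a netdev from
# `devlink port show` output read on stdin.
# port_handle is a hypothetical helper name.
port_handle() {
    awk -v dev="$1" '
        {
            # Find the "netdev <name>" pair anywhere on the line.
            for (i = 1; i < NF; i++)
                if ($i == "netdev" && $(i + 1) == dev) {
                    sub(/:$/, "", $1)    # drop the trailing colon of the handle
                    print $1
                    exit
                }
        }
    '
}

# Demo on the line shown above.
echo 'pci/0000:03:00.0/61: type eth netdev sw1p1' | port_handle sw1p1
```

On a switch: devlink port show | port_handle sw1p1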

Assuming the previously described udev rule is used, sw1p1 disappears and the following net devices are created:

$ devlink port show
...
pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0
pci/0000:03:00.0/62: type eth netdev sw1p1s1 split_group 0
pci/0000:03:00.0/63: type eth netdev sw1p1s2 split_group 0
pci/0000:03:00.0/64: type eth netdev sw1p1s3 split_group 0
...

Note: In SN2700 and SN2410, splitting a port by four disables the adjacent port in the front panel column. So in the case above, both sw1p1 and sw1p2 disappear.

Unsplitting

The following command unsplits the previously split sw1p1 port:

$ devlink port unsplit pci/0000:03:00.0/62

The handle DEV/PORT_INDEX of any of the split ports can be used when unsplitting. The unsplit command re-spawns the previously present front panel ports: sw1p1 and sw1p2.

Transceiver Module Information

In order to access the transceiver module internal EEPROM info, use the ethtool -m command. For example:

$ ethtool -m sw1p7
Identifier                                : 0x0d (QSFP+)
Extended identifier                       : 0x00
Extended identifier description           : 1.5W max. Power consumption
Extended identifier description           : No CDR in TX, No CDR in RX
Extended identifier description           : High Power Class (> 3.5 W) not enabled
Connector                                 : 0x23 (No separable connector)
Transceiver codes                         : 0x88 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Transceiver type                          : 40G Ethernet: 40G Base-CR4
Transceiver type                          : 100G Ethernet: 100G Base-CR4 or 25G Base-CR CA-L
Encoding                                  : 0x00 (unspecified)
BR, Nominal                               : 25500Mbps
Rate identifier                           : 0x00
Length (SMF,km)                           : 0km
Length (OM3 50um)                         : 0m
Length (OM2 50um)                         : 0m
Length (OM1 62.5um)                       : 0m
Length (Copper or Active cable)           : 1m
Transmitter technology                    : 0xa0 (Copper cable unequalized)
Attenuation at 2.5GHz                     : 2db
Attenuation at 5.0GHz                     : 3db
Attenuation at 7.0GHz                     : 4db
Attenuation at 12.9GHz                    : 7db
Vendor name                               : Mellanox
Vendor OUI                                : 00:02:c9
Vendor PN                                 : MCP1600-E00A
Vendor rev                                : A2
Vendor SN                                 : MT1526VS05742
Revision Compliance                       : SFF-8636 Rev 1.5
Module temperature                        : 0.00 degrees C / 32.00 degrees F
Module voltage                            : 0.0000 V

Since kernel 5.14, user space can request the kernel to retrieve specific module EEPROM pages that might not be available through the legacy IOCTL interface. For example, to retrieve Page 02h of a QSFP-DD (CMIS) module, run:

$ ethtool -m swp11 offset 128 length 128 page 2
Offset          Values
------          ------
0x0080:         50 00 f6 00 4b 00 fb 00 8e 00 74 00 87 70 7a 48
0x0090:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00a0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00b0:         c3 c6 03 7b 7b 86 06 f2 1d 4a 08 c8 19 63 09 c4
0x00c0:         c3 c6 02 85 7b 86 05 08 00 00 00 00 00 00 00 00
0x00d0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00e0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00f0:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e5

Note: ethtool version 5.13 or later is required to retrieve specific module EEPROM pages.
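
The raw bytes can be decoded by hand. The following is a minimal sketch, assuming that Page 02h starts with the module temperature thresholds (signed 16-bit big-endian values in units of 1/256 degrees C) followed by the supply voltage thresholds (unsigned 16-bit values in units of 100 microvolts). This layout is an assumption made for illustration, although it agrees with the threshold values that ethtool prints for swp11 in the parsed output further down. The helper names are hypothetical:

```shell
# s16_be HI LO: combine two hex bytes into a signed 16-bit big-endian integer.
s16_be() {
    v=$(( (0x$1 << 8) | 0x$2 ))
    if [ "$v" -ge 32768 ]; then v=$(( v - 65536 )); fi
    echo "$v"
}

# temp_c HI LO: temperature in degrees C, assuming units of 1/256 degree.
temp_c() {
    awk -v r="$(s16_be "$1" "$2")" 'BEGIN { printf "%.2f\n", r / 256 }'
}

# volt HI LO: voltage in volts, assuming unsigned units of 100 microvolts.
volt() {
    awk -v r="$(( (0x$1 << 8) | 0x$2 ))" 'BEGIN { printf "%.4f\n", r / 10000 }'
}

# First row of the dump above: 50 00 f6 00 4b 00 fb 00 8e 00 74 00 87 70 7a 48
temp_c 50 00    # assumed temperature high alarm threshold, prints 80.00
temp_c f6 00    # assumed temperature low alarm threshold, prints -10.00
volt 8e 00      # assumed voltage high alarm threshold, prints 3.6352
```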

This functionality is especially useful when used by ethtool to retrieve optional EEPROM pages which are then parsed and displayed. For example, without this functionality, ethtool could not have displayed diagnostic information from CMIS compliant modules, as this information resides in optional and banked EEPROM pages.

$ ethtool -m swp11
	Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
	Power class                               : 5
	Max power                                 : 10.00W
	Connector                                 : 0x23 (No separable connector)
	Cable assembly length                     : 10.00m
	Tx CDR bypass control                     : No
	Rx CDR bypass control                     : No
	Tx CDR                                    : Yes
	Rx CDR                                    : Yes
	Transmitter technology                    : 0x00 (850 nm VCSEL)
	Laser wavelength                          : 850.000nm
	Laser wavelength tolerance                : 10.000nm
	Length (SMF)                              : 0.00km
	Length (OM5)                              : 0m
	Length (OM4)                              : 0m
	Length (OM3 50/125um)                     : 0m
	Length (OM2 50/125um)                     : 0m
	Vendor name                               : INNOLIGHT
	Vendor OUI                                : 44:7c:7f
	Vendor PN                                 : C-DQ8FNM010-N00
	Vendor rev                                : 2A
	Vendor SN                                 : INKBUA280241B
	Date code                                 : 201121__
	Revision compliance                       : Rev. 4.0
	Module State                              : 0x03 (ModuleReady)
	LowPwrAllowRequestHW                      : Off
	LowPwrRequestSW                           : Off
	Module temperature                        : 56.18 degrees C / 133.13 degrees F
	Module voltage                            : 3.2831 V
	Laser tx bias current (Channel 1)         : 7.922 mA
	Laser tx bias current (Channel 2)         : 7.636 mA
	[...]
	Laser tx bias current (Channel 16)        : 8.120 mA
	Transmit avg optical power (Channel 1)    : 1.2861 mW / 1.09 dBm
	Transmit avg optical power (Channel 2)    : 1.2385 mW / 0.93 dBm
	[...]
	Transmit avg optical power (Channel 16)   : 1.3090 mW / 1.17 dBm
	Rcvr signal avg optical power (Channel 1) : 1.1158 mW / 0.48 dBm
	Rcvr signal avg optical power (Channel 2) : 1.1329 mW / 0.54 dBm
	[...]
	Rcvr signal avg optical power (Channel 16) : 1.1536 mW / 0.62 dBm
	Module temperature high alarm             : Off
	Module temperature low alarm              : Off
	Module temperature high warning           : Off
	Module temperature low warning            : Off
	Module voltage high alarm                 : Off
	Module voltage low alarm                  : Off
	Module voltage high warning               : Off
	Module voltage low warning                : Off
	Laser bias current high alarm   (Chan 1)  : Off
	Laser bias current low alarm    (Chan 1)  : Off
	Laser bias current high warning (Chan 1)  : Off
	Laser bias current low warning  (Chan 1)  : Off
	Laser tx power high alarm   (Channel 1)   : Off
	Laser tx power low alarm    (Channel 1)   : Off
	Laser tx power high warning (Channel 1)   : Off
	Laser tx power low warning  (Channel 1)   : Off
	Laser rx power high alarm   (Channel 1)   : Off
	Laser rx power low alarm    (Channel 1)   : Off
	Laser rx power high warning (Channel 1)   : Off
	Laser rx power low warning  (Channel 1)   : Off
	Laser bias current high alarm   (Chan 2)  : Off
	Laser bias current low alarm    (Chan 2)  : Off
	Laser bias current high warning (Chan 2)  : Off
	Laser bias current low warning  (Chan 2)  : Off
	Laser tx power high alarm   (Channel 2)   : Off
	Laser tx power low alarm    (Channel 2)   : Off
	Laser tx power high warning (Channel 2)   : Off
	Laser tx power low warning  (Channel 2)   : Off
	Laser rx power high alarm   (Channel 2)   : Off
	Laser rx power low alarm    (Channel 2)   : Off
	Laser rx power high warning (Channel 2)   : Off
	Laser rx power low warning  (Channel 2)   : Off
	[...]
	Laser bias current high alarm   (Chan 16) : Off
	Laser bias current low alarm    (Chan 16) : Off
	Laser bias current high warning (Chan 16) : Off
	Laser bias current low warning  (Chan 16) : Off
	Laser tx power high alarm   (Channel 16)  : Off
	Laser tx power low alarm    (Channel 16)  : Off
	Laser tx power high warning (Channel 16)  : Off
	Laser tx power low warning  (Channel 16)  : Off
	Laser rx power high alarm   (Channel 16)  : Off
	Laser rx power low alarm    (Channel 16)  : Off
	Laser rx power high warning (Channel 16)  : Off
	Laser rx power low warning  (Channel 16)  : Off
	Laser bias current high alarm threshold   : 14.996 mA
	Laser bias current low alarm threshold    : 4.496 mA
	Laser bias current high warning threshold : 12.998 mA
	Laser bias current low warning threshold  : 5.000 mA
	Laser output power high alarm threshold   : 5.0118 mW / 7.00 dBm
	Laser output power low alarm threshold    : 0.0891 mW / -10.50 dBm
	Laser output power high warning threshold : 3.1622 mW / 5.00 dBm
	Laser output power low warning threshold  : 0.1778 mW / -7.50 dBm
	Module temperature high alarm threshold   : 80.00 degrees C / 176.00 degrees F
	Module temperature low alarm threshold    : -10.00 degrees C / 14.00 degrees F
	Module temperature high warning threshold : 75.00 degrees C / 167.00 degrees F
	Module temperature low warning threshold  : -5.00 degrees C / 23.00 degrees F
	Module voltage high alarm threshold       : 3.6352 V
	Module voltage low alarm threshold        : 2.9696 V
	Module voltage high warning threshold     : 3.4672 V
	Module voltage low warning threshold      : 3.1304 V
	Laser rx power high alarm threshold       : 5.0118 mW / 7.00 dBm
	Laser rx power low alarm threshold        : 0.0645 mW / -11.90 dBm
	Laser rx power high warning threshold     : 3.1622 mW / 5.00 dBm
	Laser rx power low warning threshold      : 0.1288 mW / -8.90 dBm

Transceiver Module Reset

Transceiver modules can be reset either via a dedicated hardware signal or by setting a bit in their EEPROM. This is useful in order to allow a module to transition out of a fault. For example, section 6.3.2.13 in CMIS 5.0 states: "Except for a power cycle, the only exit path from the ModuleFault state is to perform a module reset by taking an action that causes the ResetS transition signal to become TRUE (see Table 6-11)".

Transceiver module reset can be performed using ethtool. For example:

$ ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Components reset:     0x40

Reset will fail if the port is administratively up:

$ ip link set dev swp11 up
$ ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Cannot issue ETHTOOL_RESET: Invalid argument

If multiple ports are using the same module (that is, the port was split), the phy-shared flag must be used, as the reset affects all the ports using the module:

$ ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Components reset:     0x400000

Similarly, reset will fail if any of the split ports are administratively up:

$ ip link set dev swp11s0 up
$ ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Cannot issue ETHTOOL_RESET: Invalid argument

Transceiver Module Power Mode Policy

Active optical cable (AOC) transceiver modules can operate in either low or high power mode. In low power mode, the power consumption of the module is reduced to the minimum, the management interface towards the host is available, but the data path is deactivated. In high power mode, the module is fully operational and its power consumption is according to its advertised maximum.

By default, the Spectrum firmware will try to transition modules to high power mode upon a plug-in event, as can be seen in the following ethtool output:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high
power-mode high

Alternatively, the power mode policy can be changed to auto. In this power mode policy, the module is transitioned by the host to high power mode when the first port using it is put administratively up and to low power mode when the last port using it is put administratively down. For example:

$ ethtool --set-module swp11 power-mode-policy auto
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy auto
power-mode low

This is useful when user space wishes to limit the power consumption of the module when not in use and to reduce its temperature. For example, temperature in high power mode:

$ ethtool -m swp11
...
        Module temperature                        : 57.60 degrees C / 135.68 degrees F

Temperature in low power mode:

$ ethtool -m swp11
...
        Module temperature                        : 39.89 degrees C / 103.81 degrees F

The trade-off is that link-up times will increase, as the transition from low power mode to high power mode takes a few seconds.

See more info in these commit messages.
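
When auditing many ports, the policy and the current power mode can be pulled out of the ethtool --show-module output. The following is a minimal sketch based on the output format shown in this section; the helper name module_power is hypothetical:

```shell
# Summarize `ethtool --show-module <port>` output read on stdin.
# module_power is a hypothetical helper name.
module_power() {
    awk '
        $1 == "power-mode-policy" { policy = $2 }
        $1 == "power-mode"        { mode = $2 }
        END { print "policy=" policy " mode=" mode }
    '
}

# Demo on the output shown above for the auto policy.
printf 'Module parameters for swp11:\npower-mode-policy auto\npower-mode low\n' | module_power
```

On a switch: ethtool --show-module swp11 | module_power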

Transceiver Module Firmware Flashing

CMIS compliant modules such as QSFP-DD might run firmware that can be updated in a vendor-neutral way by exchanging messages between the host and the module, as described in section 7.3.1 of CMIS 5.2.

The active and inactive firmware versions can be queried using ethtool along with other CDB messaging support advertisement information. For example:

# ethtool -m swp23
...
        Active firmware version                   : 2.6
        Inactive firmware version                 : 2.7
        CDB instances                             : 1
        CDB background mode                       : Supported
        CDB EPL pages                             : 0
        CDB Maximum EPL RW length                 : 128
        CDB Maximum LPL RW length                 : 128
        CDB trigger method                        : Single write
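
Before flashing, a script may want to compare the two versions. The following is a minimal sketch that extracts them from ethtool -m output, relying only on the field labels shown above; the helper name module_fw_versions is hypothetical:

```shell
# Extract the active and inactive firmware versions from `ethtool -m <port>`
# output read on stdin. module_fw_versions is a hypothetical helper name.
module_fw_versions() {
    awk -F: '
        {
            key = $1; val = $2
            # Trim the padding around the label and the value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "Active firmware version"   { active = val }
        key == "Inactive firmware version" { inactive = val }
        END { print "active=" active " inactive=" inactive }
    '
}

# Demo on the two version lines shown above.
printf '        Active firmware version                   : 2.6\n        Inactive firmware version                 : 2.7\n' | module_fw_versions
```

On a switch: ethtool -m swp23 | module_fw_versions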

Transceiver module firmware flashing can be performed using ethtool, where the firmware file is specified as a path relative to the /lib/firmware directory. For example:

# ethtool --flash-module-firmware swp40 file test.bin

Transceiver module firmware flashing started for device swp40
Transceiver module firmware flashing in progress for device swp40
Progress: 99%
Transceiver module firmware flashing completed for device swp40

The kernel will prevent user space from flashing the module in the following cases:

  • The module does not support firmware flashing.

  • The net device associated with the module is administratively up.

  • The port where the module is connected was split.

  • Flashing is already in progress.

In addition, in order not to interrupt the flashing process, the following transceiver module operations are rejected while flashing is in progress: EEPROM dump, reset and power mode policy set / get.

Further Resources

  1. SN2000 Series User Manual
  2. SN3000 Series User Manual
  3. SN4000 Series User Manual
  4. man ethtool
  5. Writing udev rules
  6. man ip
  7. man devlink
  8. man devlink-dev
  9. man devlink-port
  10. man ifstat