Switch Port Configuration
- Port Identification
- Physical Port Identification
- Port Administrative State
- Link Down Reason
- Port MTU
- Port Speed
- Port Lanes
- Port Splitting
- Transceiver Module Information
- Transceiver Module Reset
- Transceiver Module Power Mode Policy
- Transceiver Module Firmware Flashing
- Further Resources
Kernel Version | Feature |
---|---|
4.6 | Port splitting |
5.9 | Link down reason |
5.12 | ethtool lanes support |
5.14 | Transceiver module EEPROM full read access |
5.16 | Transceiver module reset and power-mode policy |
6.11 | Transceiver Module Firmware Flashing |
Management ports do not use the same driver as front panel ports and can
therefore be distinguished using the Linux ethtool utility.
The following is an output example of a management port on a Mellanox SN2700 switch:
$ ethtool -i eth1
driver: e1000e
version: 3.2.6-k
firmware-version: 1.10-0
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
The following is an output example of a front panel port on a Mellanox SN2700 switch:
$ ethtool -i sw1p1
driver: mlxsw_spectrum
version: 1.0
firmware-version: 13.400.116
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
The output examples above show that the management port uses Intel's e1000e
driver whereas the front panel port uses Mellanox's mlxsw_spectrum driver.
As of Linux 4.7 it has become possible to create udev rules which rename the
software interfaces (port netdevs) corresponding to the front panel ports
according to the front panel numbering. To do so, create the following rule in
/etc/udev/rules.d/10-local.rules:
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \
NAME="sw$attr{phys_port_name}"
It is possible to make the port LED blink using ethtool and thereby
identify the corresponding physical interface:
$ ethtool -p sw1p1
This command turns on the LED next to the port until it is explicitly turned
off by killing ethtool. It is possible to turn the LED on for a specific number
of seconds by running:
$ ethtool -p sw1p1 5
systemd 234 can automatically rename the ports according to their
front panel numbering without user intervention. This results in
names such as enp3s0np5, which represents front panel port 5.
Note: This functionality was backported to systemd 231 in Fedora
and is thus available in Fedora 25 and onwards.
After booting the switch or loading the driver, all the ports go down. The following command changes the administrative state of the port to up:
$ ip link set dev sw1p5 up
However, the operational state of the port only changes to up if the port is able to negotiate the link with its partner. In that case, the output appears as follows:
$ ip link show dev sw1p5
31: sw1p5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast switchid e41d2d45a9c0 state UP mode DEFAULT group default qlen 1000
link/ether e4:1d:2d:45:a9:f5 brd ff:ff:ff:ff:ff:ff
To set the port to down, run:
$ ip link set dev sw1p5 down
If the administrative state of the port is up but its operational state is down, ethtool reports the reason in parentheses next to the link state. The format is:
$ ethtool ethX
Link detected: yes/no (extended_state, extended_substate)
For example:
$ ethtool ethX
...
Link detected: no (Autoneg, No partner detected)
The extended state is optional and there are cases where no reason will be provided.
In addition, there might be cases where only an extended state is provided, but without an extended substate. For example, when there is no link due to a missing cable, only an extended state is provided:
$ ethtool ethX
Link detected: no (No cable)
Description of the extended states and substates can be found in the kernel source tree.
Note: ethtool version 5.8 is required in order to display the
link down reason.
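When scripting around the link down reason, the text in parentheses can be extracted from the ethtool output. The following is a minimal sketch, not part of ethtool itself: link_down_reason is a hypothetical helper name, and it assumes the "Link detected: no (...)" format shown above. It reads from standard input, so the sample line below can be replaced with the output of ethtool for a real port.

```shell
# Print the text inside the parentheses of a "Link detected: no (...)" line.
# Prints nothing when the link is up or when no reason is provided.
link_down_reason() {
    sed -n 's/^[[:space:]]*Link detected: no (\(.*\))[[:space:]]*$/\1/p'
}

# Sample line copied from the example above. On a switch, the real output
# would be piped instead: ethtool sw1p5 | link_down_reason
printf '%s\n' '        Link detected: no (Autoneg, No partner detected)' |
    link_down_reason
```

The helper prints the extended state and, when present, the extended substate, separated by a comma exactly as ethtool reports them.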
To set the port MTU, run:
$ ip link set dev sw1p1 mtu 1400
The switch supports jumbo frames, so values higher than 1500 may be used.
Port speed settings are performed with the ethtool utility. Assuming the
port's operational status is up, the user may query its current speed:
$ ethtool sw1p5 | grep Speed
Speed: 40000Mb/s
In this case the port's speed is 40Gb/s. To set a different speed, run:
$ ethtool -s sw1p5 speed 10000 autoneg off
This sets the port's speed to 10Gb/s. Assuming the administrative state of the port is up, this command makes the port go through link negotiation again by toggling its administrative state to down and then up. However, the port only goes up if its partner also supports the configured speed.
The command also disables speed auto-negotiation by setting only one desired speed. To allow the switch to auto-negotiate and choose the highest advertised speed, the user may enable auto-negotiation by running:
$ ethtool -s sw1p5 autoneg on
To query the port speed after speed negotiation, run:
$ ethtool sw1p5 | grep Speed
Speed: 40000Mb/s
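Since ethtool reports the speed in Mb/s, scripts that compare speeds often convert the value first. The sketch below is one possible way to do so; speed_gbps is a hypothetical helper name, and it assumes the "Speed: <number>Mb/s" line format shown above. Lines that do not carry a numeric speed are ignored.

```shell
# Read ethtool output on standard input and print the speed in Gb/s.
speed_gbps() {
    awk '$1 == "Speed:" && $2 ~ /^[0-9]+Mb\/s$/ {
        sub(/Mb\/s$/, "", $2)   # strip the unit, leaving the number
        print $2 / 1000
    }'
}

# Sample line copied from the example above. On a switch, the real output
# would be piped instead: ethtool sw1p5 | speed_gbps
printf '%s\n' '        Speed: 40000Mb/s' | speed_gbps
```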
In Spectrum-2 it is not possible to advertise a specific link mode. Instead, all the supported link modes of a given speed must be advertised. For example, in order to advertise only 100Gb/s, run:
$ ethtool -s sw1p5 advertise 0xF000000000
Where 0xF000000000 is the result of OR-ing the hexadecimal values of
all the supported 100Gb/s link modes:
0x1000000000 100000baseKR4 Full
0x2000000000 100000baseSR4 Full
0x4000000000 100000baseCR4 Full
0x8000000000 100000baseLR4_ER4 Full
The complete list can be found in man ethtool.
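The OR operation described above can be done with shell arithmetic rather than by hand. In this small sketch, advertise_mask is a hypothetical helper name; the four values passed to it are the 100Gb/s link modes listed above.

```shell
# OR together link mode bit masks given as hexadecimal arguments and
# print the result in the form expected by "ethtool -s ... advertise".
advertise_mask() {
    mask=0
    for mode in "$@"; do
        mask=$(( $mask | $mode ))
    done
    printf '0x%X\n' "$mask"
}

# The four 100Gb/s link modes listed above.
advertise_mask 0x1000000000 0x2000000000 0x4000000000 0x8000000000
# prints 0xF000000000
```

The same helper can be given the masks of another speed taken from man ethtool.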
In Spectrum-2 and later ASICs, some speeds can be achieved with
different numbers of lanes. In order to choose the number of lanes, port
lanes configuration can be performed with the ethtool utility.
Assuming the port's operational status is up, the user can query the
number of lanes currently used by the port:
$ ethtool swp1 | grep Lanes
Lanes: 4
To advertise all link modes with a speed of 100Gbps that use two lanes, run:
$ ethtool -s swp1 speed 100000 lanes 2 autoneg on
To force a speed of 100Gbps using four lanes, run:
$ ethtool -s swp1 speed 100000 lanes 4 autoneg off
Two types of statistics exist for each port:
- Software
- Hardware
Software statistics account for packets trapped to the CPU or packets sent from the CPU. Hardware statistics account for all packets going through the port, including those not trapped to or originating from the CPU.
The ifstat utility is used to query the port's software statistics:
$ ifstat -x cpu sw1p5
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
sw1p5 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Two utilities can be used to query the port's hardware statistics:
- ip utility
- ethtool utility
Using ip:
Note: The ip utility only shows hardware statistics for netdevs
representing physical ports. The statistics of virtual netdevs (e.g.,
vlan, bridge) are software statistics.
Using ethtool:
The port's statistics are never reset while the driver is loaded. They can only be reset by removing and inserting the driver.
However, it is possible to see the difference in the hardware statistics using
iproute2's ifstat utility. When executed, it shows the difference between the
last and the current call:
$ ifstat sw1p5
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
sw1p5 1 0 1 0 98 0 114 0
0 0 0 0 0 0 0 0
(... after some time passes ...)
$ ifstat sw1p5
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
sw1p5 9 0 9 0 882 0 1026 0
0 0 0 0 0 0 0 0
As of Linux 4.6 it has become possible to split and unsplit the front panel
ports using the devlink utility, which is part of the iproute2 package.
Note that devlink is available in iproute2 starting with version 4.6.0.
The following command splits the first front panel port into 4 ports:
$ devlink port split pci/0000:03:00.0/61 count 4
Where pci/0000:03:00.0/61 is the DEV/PORT_INDEX handle used by devlink and
can be retrieved using the command devlink port show:
$ devlink port show
...
pci/0000:03:00.0/61: type eth netdev sw1p1
...
Assuming the previously described udev rule is used, sw1p1 disappears
and the following net devices are created:
$ devlink port show
...
pci/0000:03:00.0/61: type eth netdev sw1p1s0 split_group 0
pci/0000:03:00.0/62: type eth netdev sw1p1s1 split_group 0
pci/0000:03:00.0/63: type eth netdev sw1p1s2 split_group 0
pci/0000:03:00.0/64: type eth netdev sw1p1s3 split_group 0
...
Note: In SN2700 and SN2410, splitting a port by four disables the adjacent
port in the front panel column. So in the case above, both sw1p1 and sw1p2
disappear.
The following command unsplits the previously split sw1p1 port:
$ devlink port unsplit pci/0000:03:00.0/62
The handle DEV/PORT_INDEX of any of the split ports can be used when
unsplitting. The unsplit command re-spawns the previously present front
panel ports: sw1p1 and sw1p2.
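Because split and unsplit take a DEV/PORT_INDEX handle rather than a netdev name, it can be convenient to look the handle up by name. The sketch below is illustrative only: devlink_handle is a hypothetical helper name, and it assumes the devlink port show line format shown above. It reads that output on standard input.

```shell
# Read "devlink port show" output on standard input and print the handle
# of the port whose netdev name equals the first argument.
devlink_handle() {
    awk -v dev="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == "netdev" && $(i + 1) == dev) {
                sub(/:$/, "", $1)   # drop the colon after the handle
                print $1
            }
    }'
}

# Sample line copied from the example above. On a switch, the real output
# would be piped instead: devlink port show | devlink_handle sw1p1
printf '%s\n' 'pci/0000:03:00.0/61: type eth netdev sw1p1' |
    devlink_handle sw1p1
```

The printed handle can then be passed to devlink port split or devlink port unsplit.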
In order to access the transceiver module internal EEPROM info, use the
ethtool -m command. For example:
$ ethtool -m sw1p7
Identifier : 0x0d (QSFP+)
Extended identifier : 0x00
Extended identifier description : 1.5W max. Power consumption
Extended identifier description : No CDR in TX, No CDR in RX
Extended identifier description : High Power Class (> 3.5 W) not enabled
Connector : 0x23 (No separable connector)
Transceiver codes : 0x88 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Transceiver type : 40G Ethernet: 40G Base-CR4
Transceiver type : 100G Ethernet: 100G Base-CR4 or 25G Base-CR CA-L
Encoding : 0x00 (unspecified)
BR, Nominal : 25500Mbps
Rate identifier : 0x00
Length (SMF,km) : 0km
Length (OM3 50um) : 0m
Length (OM2 50um) : 0m
Length (OM1 62.5um) : 0m
Length (Copper or Active cable) : 1m
Transmitter technology : 0xa0 (Copper cable unequalized)
Attenuation at 2.5GHz : 2db
Attenuation at 5.0GHz : 3db
Attenuation at 7.0GHz : 4db
Attenuation at 12.9GHz : 7db
Vendor name : Mellanox
Vendor OUI : 00:02:c9
Vendor PN : MCP1600-E00A
Vendor rev : A2
Vendor SN : MT1526VS05742
Revision Compliance : SFF-8636 Rev 1.5
Module temperature : 0.00 degrees C / 32.00 degrees F
Module voltage : 0.0000 V
Since kernel 5.14, user space can request the kernel to retrieve specific module EEPROM pages that might not be available through the legacy IOCTL interface. For example, to retrieve Page 02h of a QSFP-DD (CMIS) module, run:
$ ethtool -m swp11 offset 128 length 128 page 2
Offset Values
------ ------
0x0080: 50 00 f6 00 4b 00 fb 00 8e 00 74 00 87 70 7a 48
0x0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00b0: c3 c6 03 7b 7b 86 06 f2 1d 4a 08 c8 19 63 09 c4
0x00c0: c3 c6 02 85 7b 86 05 08 00 00 00 00 00 00 00 00
0x00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e5
Note: ethtool version 5.13 is required to retrieve specific
module EEPROM pages.
This functionality is especially useful when used by ethtool to
retrieve optional EEPROM pages which are then parsed and displayed. For
example, without this functionality, ethtool could not have displayed
diagnostic information from CMIS compliant modules, as this information
resides in optional and banked EEPROM pages.
$ ethtool -m swp11
Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
Power class : 5
Max power : 10.00W
Connector : 0x23 (No separable connector)
Cable assembly length : 10.00m
Tx CDR bypass control : No
Rx CDR bypass control : No
Tx CDR : Yes
Rx CDR : Yes
Transmitter technology : 0x00 (850 nm VCSEL)
Laser wavelength : 850.000nm
Laser wavelength tolerance : 10.000nm
Length (SMF) : 0.00km
Length (OM5) : 0m
Length (OM4) : 0m
Length (OM3 50/125um) : 0m
Length (OM2 50/125um) : 0m
Vendor name : INNOLIGHT
Vendor OUI : 44:7c:7f
Vendor PN : C-DQ8FNM010-N00
Vendor rev : 2A
Vendor SN : INKBUA280241B
Date code : 201121__
Revision compliance : Rev. 4.0
Module State : 0x03 (ModuleReady)
LowPwrAllowRequestHW : Off
LowPwrRequestSW : Off
Module temperature : 56.18 degrees C / 133.13 degrees F
Module voltage : 3.2831 V
Laser tx bias current (Channel 1) : 7.922 mA
Laser tx bias current (Channel 2) : 7.636 mA
[...]
Laser tx bias current (Channel 16) : 8.120 mA
Transmit avg optical power (Channel 1) : 1.2861 mW / 1.09 dBm
Transmit avg optical power (Channel 2) : 1.2385 mW / 0.93 dBm
[...]
Transmit avg optical power (Channel 16) : 1.3090 mW / 1.17 dBm
Rcvr signal avg optical power (Channel 1) : 1.1158 mW / 0.48 dBm
Rcvr signal avg optical power (Channel 2) : 1.1329 mW / 0.54 dBm
[...]
Rcvr signal avg optical power (Channel 16) : 1.1536 mW / 0.62 dBm
Module temperature high alarm : Off
Module temperature low alarm : Off
Module temperature high warning : Off
Module temperature low warning : Off
Module voltage high alarm : Off
Module voltage low alarm : Off
Module voltage high warning : Off
Module voltage low warning : Off
Laser bias current high alarm (Chan 1) : Off
Laser bias current low alarm (Chan 1) : Off
Laser bias current high warning (Chan 1) : Off
Laser bias current low warning (Chan 1) : Off
Laser tx power high alarm (Channel 1) : Off
Laser tx power low alarm (Channel 1) : Off
Laser tx power high warning (Channel 1) : Off
Laser tx power low warning (Channel 1) : Off
Laser rx power high alarm (Channel 1) : Off
Laser rx power low alarm (Channel 1) : Off
Laser rx power high warning (Channel 1) : Off
Laser rx power low warning (Channel 1) : Off
Laser bias current high alarm (Chan 2) : Off
Laser bias current low alarm (Chan 2) : Off
Laser bias current high warning (Chan 2) : Off
Laser bias current low warning (Chan 2) : Off
Laser tx power high alarm (Channel 2) : Off
Laser tx power low alarm (Channel 2) : Off
Laser tx power high warning (Channel 2) : Off
Laser tx power low warning (Channel 2) : Off
Laser rx power high alarm (Channel 2) : Off
Laser rx power low alarm (Channel 2) : Off
Laser rx power high warning (Channel 2) : Off
Laser rx power low warning (Channel 2) : Off
[...]
Laser bias current high alarm (Chan 16) : Off
Laser bias current low alarm (Chan 16) : Off
Laser bias current high warning (Chan 16) : Off
Laser bias current low warning (Chan 16) : Off
Laser tx power high alarm (Channel 16) : Off
Laser tx power low alarm (Channel 16) : Off
Laser tx power high warning (Channel 16) : Off
Laser tx power low warning (Channel 16) : Off
Laser rx power high alarm (Channel 16) : Off
Laser rx power low alarm (Channel 16) : Off
Laser rx power high warning (Channel 16) : Off
Laser rx power low warning (Channel 16) : Off
Laser bias current high alarm threshold : 14.996 mA
Laser bias current low alarm threshold : 4.496 mA
Laser bias current high warning threshold : 12.998 mA
Laser bias current low warning threshold : 5.000 mA
Laser output power high alarm threshold : 5.0118 mW / 7.00 dBm
Laser output power low alarm threshold : 0.0891 mW / -10.50 dBm
Laser output power high warning threshold : 3.1622 mW / 5.00 dBm
Laser output power low warning threshold : 0.1778 mW / -7.50 dBm
Module temperature high alarm threshold : 80.00 degrees C / 176.00 degrees F
Module temperature low alarm threshold : -10.00 degrees C / 14.00 degrees F
Module temperature high warning threshold : 75.00 degrees C / 167.00 degrees F
Module temperature low warning threshold : -5.00 degrees C / 23.00 degrees F
Module voltage high alarm threshold : 3.6352 V
Module voltage low alarm threshold : 2.9696 V
Module voltage high warning threshold : 3.4672 V
Module voltage low warning threshold : 3.1304 V
Laser rx power high alarm threshold : 5.0118 mW / 7.00 dBm
Laser rx power low alarm threshold : 0.0645 mW / -11.90 dBm
Laser rx power high warning threshold : 3.1622 mW / 5.00 dBm
Laser rx power low warning threshold : 0.1288 mW / -8.90 dBm
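The raw Page 02h dump and the parsed thresholds above come from the same module, so the first bytes of the dump can be decoded by hand as a cross-check. The sketch below rests on an assumption about the CMIS layout: that bytes 0x80 to 0x87 hold the temperature thresholds as signed 16-bit big-endian values in units of 1/256 degrees C, and that bytes 0x88 to 0x8f hold the voltage thresholds as unsigned 16-bit values in units of 100 uV. The helper names s16_temp_c and u16_volt are hypothetical. The decoded values agree with the thresholds ethtool printed above, which supports the assumption for this module.

```shell
# Decode two hex bytes (high, low) as a signed 16-bit big-endian value
# and scale by 1/256 to obtain degrees Celsius (assumed CMIS encoding).
s16_temp_c() {
    raw=$(( (0x$1 << 8) | 0x$2 ))
    if [ "$raw" -ge 32768 ]; then
        raw=$(( raw - 65536 ))      # two's complement for negative values
    fi
    awk -v v="$raw" 'BEGIN { printf "%.2f\n", v / 256 }'
}

# Decode two hex bytes as an unsigned 16-bit value in units of 100 uV
# and print the result in volts (assumed CMIS encoding).
u16_volt() {
    raw=$(( (0x$1 << 8) | 0x$2 ))
    awk -v v="$raw" 'BEGIN { printf "%.4f\n", v / 10000 }'
}

s16_temp_c 50 00    # bytes 0x80-0x81, prints 80.00
s16_temp_c f6 00    # bytes 0x82-0x83, prints -10.00
u16_volt 8e 00      # bytes 0x88-0x89, prints 3.6352
```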
Transceiver modules can be reset either via a dedicated hardware signal or by setting a bit in their EEPROM. This is useful in order to allow a module to transition out of a fault. For example, section 6.3.2.13 in CMIS 5.0 states: "Except for a power cycle, the only exit path from the ModuleFault state is to perform a module reset by taking an action that causes the ResetS transition signal to become TRUE (see Table 6-11)".
Transceiver module reset can be performed using ethtool. For example:
$ ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Components reset: 0x40
Reset will fail if the port is administratively up:
$ ip link set dev swp11 up
$ ethtool --reset swp11 phy
ETHTOOL_RESET 0x40
Cannot issue ETHTOOL_RESET: Invalid argument
If multiple ports are using the same module (split), the phy-shared
flag needs to be used as the reset will affect all the ports using the
module:
$ ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Components reset: 0x400000
Similarly, reset will fail if any of the split ports are administratively up:
$ ip link set dev swp11s0 up
$ ethtool --reset swp11s0 phy-shared
ETHTOOL_RESET 0x400000
Cannot issue ETHTOOL_RESET: Invalid argument
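Since the reset is rejected while a port is administratively up, a script may want to check the administrative state first. The sketch below is one way to do that from ip link show output; is_admin_up is a hypothetical helper name. It looks for the exact UP flag inside the angle brackets, so LOWER_UP (the operational state) alone does not count.

```shell
# Read "ip link show" output on standard input and succeed only if the
# administrative UP flag appears in the <...> flag list.
is_admin_up() {
    sed -n 's/^[^<]*<\([^>]*\)>.*$/\1/p' | tr ',' '\n' | grep -qx 'UP'
}

# Sample line in the format shown earlier on this page. On a switch, the
# real output would be piped instead: ip link show dev swp11 | is_admin_up
sample='31: sw1p5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP'
if printf '%s\n' "$sample" | is_admin_up; then
    echo "port is administratively up; set it down before resetting the module"
fi
```

For a split port, the same check would have to pass for every port sharing the module before using phy-shared.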
Active optical cable (AOC) transceiver modules can operate in either low or high power mode. In low power mode, the power consumption of the module is reduced to the minimum, the management interface towards the host is available, but the data path is deactivated. In high power mode, the module is fully operational and its power consumption is according to its advertised maximum.
By default, the Spectrum firmware will try to transition modules to high
power mode upon a plug-in event, as can be seen in the following ethtool
output:
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high
power-mode high
Alternatively, the power mode policy can be changed to auto. In this
power mode policy, the module is transitioned by the host to high power
mode when the first port using it is put administratively up and to low
power mode when the last port using it is put administratively down. For
example:
$ ethtool --set-module swp11 power-mode-policy auto
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy auto
power-mode low
This is useful when user space wishes to limit the power consumption of the module when not in use and to reduce its temperature. For example, temperature in high power mode:
$ ethtool -m swp11
...
Module temperature : 57.60 degrees C / 135.68 degrees F
Temperature in low power mode:
$ ethtool -m swp11
...
Module temperature : 39.89 degrees C / 103.81 degrees F
The trade-off is that link-up times will increase, as the transition from low power mode to high power mode takes a few seconds.
See more info in these commit messages.
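To apply the same policy to several ports, the --set-module command shown above can be wrapped in a loop. The following is a sketch under stated assumptions: set_power_mode_policy and the ETHTOOL variable are hypothetical names introduced here, and swp12 is a made-up second port. Setting ETHTOOL to echo gives a dry run that prints the arguments instead of touching the hardware.

```shell
# Command used to apply the policy. Defaults to ethtool, and can be
# overridden (for example with "echo") for a dry run.
ETHTOOL="${ETHTOOL:-ethtool}"

# Set the power mode policy ("high" or "auto") on every port given.
# Stops at the first port for which the command fails.
set_power_mode_policy() {
    policy=$1
    shift
    for port in "$@"; do
        "$ETHTOOL" --set-module "$port" power-mode-policy "$policy" || return 1
    done
}

# Dry run: print the ethtool arguments rather than executing ethtool.
ETHTOOL=echo
set_power_mode_policy auto swp11 swp12
```

Removing the ETHTOOL=echo line makes the helper invoke ethtool itself.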
CMIS compliant modules such as QSFP-DD might be running a firmware that can be updated in a vendor-neutral way by exchanging messages between the host and the module as described in section 7.3.1 of CMIS 5.2.
The active and inactive firmware versions can be queried using ethtool along
with other CDB messaging support advertisement information. For example:
# ethtool -m swp23
...
Active firmware version : 2.6
Inactive firmware version : 2.7
CDB instances : 1
CDB background mode : Supported
CDB EPL pages : 0
CDB Maximum EPL RW length : 128
CDB Maximum LPL RW length : 128
CDB trigger method : Single write
Transceiver module firmware flashing can be performed using ethtool, where
the firmware file is given as a path relative to the /lib/firmware directory.
For example:
# ethtool --flash-module-firmware swp40 file test.bin
Transceiver module firmware flashing started for device swp40
Transceiver module firmware flashing in progress for device swp40
Progress: 99%
Transceiver module firmware flashing completed for device swp40
The kernel will prevent user space from flashing the module in the following cases:
- The module does not support firmware flashing.
- The net device associated with the module is administratively up.
- The port where the module is connected was split.
- Flashing is already in progress.
In addition, in order not to interrupt the flashing process, the following transceiver module operations are rejected while flashing is in progress: EEPROM dump, reset and power mode policy set / get.
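Because the firmware file is looked up relative to /lib/firmware, a quick check before flashing can catch a misplaced file. This is only a sketch: firmware_file_ok is a hypothetical helper name, and its optional second argument exists purely so that the check can be tried against another directory. It does not test any of the kernel-side conditions listed above.

```shell
# Succeed if the given relative path names a regular file under the
# firmware directory (second argument, defaulting to /lib/firmware).
firmware_file_ok() {
    fw_dir=${2:-/lib/firmware}
    case $1 in
        /*)
            echo "expected a path relative to $fw_dir" >&2
            return 1
            ;;
    esac
    [ -f "$fw_dir/$1" ]
}

# test.bin is the file name used in the flashing example above.
if firmware_file_ok test.bin; then
    echo "test.bin found under /lib/firmware"
else
    echo "test.bin not found under /lib/firmware"
fi
```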
- SN2000 Series User Manual
- SN3000 Series User Manual
- SN4000 Series User Manual
- man ethtool
- Writing udev rules
- man ip
- man devlink
- man devlink-dev
- man devlink-port
- man ifstat