EEH error reported during network card iperf test #280

Open
lili-lilili opened this issue Aug 11, 2023 · 1 comment

@lili-lilili

When we run an iperf test on the network card (on P10), an EEH error is always reported.

Here is the register dump from OPAL.
It looks like the PCIe link is having problems: the "phbRegbErrorStatus" register records a link-down error.

I also had trouble parsing the registers.
I only have a PCIe spec for P9, but it has no explanation for "phbRegbErrorStatus" bit 31.

Any suggestions? @fbarrat

[ 1005.489389250,3] PHB#006f[6:3]: PHB Freeze/Fence detected !
[ 1005.489438608,3] PHB#006f[6:3]: PCI FIR=0000000000000000
[ 1005.489474501,3] PHB#006f[6:3]: PCI FIR WOF=0000000000000000
[ 1005.489510147,3] PHB#006f[6:3]: NEST FIR=0800000000000000
[ 1005.489542000,3] PHB#006f[6:3]: NEST FIR WOF=0800000000000000
[ 1005.489577729,3] PHB#006f[6:3]: ERR RPT0=0010000000000000
[ 1005.489607921,3] PHB#006f[6:3]: ERR RPT1=0000000000000000
[ 1005.489642726,3] PHB#006f[6:3]: AIB ERR=00cc100000000000
[ 1005.490223446,3] PHB#006f[6:3]: brdgCtl = 00000002
[ 1005.490262691,3] PHB#006f[6:3]: deviceStatus = 00000140
[ 1005.490295542,3] PHB#006f[6:3]: slotStatus = 00402000
[ 1005.490335762,3] PHB#006f[6:3]: linkStatus = c1010008
[ 1005.490366070,3] PHB#006f[6:3]: devCmdStatus = 00100107
[ 1005.490397781,3] PHB#006f[6:3]: devSecStatus = 00000000
[ 1005.490428166,3] PHB#006f[6:3]: rootErrorStatus = 00000000
[ 1005.490463922,3] PHB#006f[6:3]: corrErrorStatus = 00000000
[ 1005.490497253,3] PHB#006f[6:3]: uncorrErrorStatus = 00000000
[ 1005.490530530,3] PHB#006f[6:3]: devctl = 00000140
[ 1005.490562582,3] PHB#006f[6:3]: devStat = 00000000
[ 1005.490593271,3] PHB#006f[6:3]: tlpHdr1 = 00000000
[ 1005.490623159,3] PHB#006f[6:3]: tlpHdr2 = 00000000
[ 1005.490654178,3] PHB#006f[6:3]: tlpHdr3 = 00000000
[ 1005.490687329,3] PHB#006f[6:3]: tlpHdr4 = 00000000
[ 1005.490721063,3] PHB#006f[6:3]: sourceId = 00000000
[ 1005.490751859,3] PHB#006f[6:3]: nFir = 0800000000000000
[ 1005.490786389,3] PHB#006f[6:3]: nFirMask = 003001d000000000
[ 1005.490820066,3] PHB#006f[6:3]: nFirWOF = 0800000000000000
[ 1005.490853079,3] PHB#006f[6:3]: phbPlssr = 0000001c00000000
[ 1005.490888389,3] PHB#006f[6:3]: phbCsr = 0000001c00000000
[ 1005.490926839,3] PHB#006f[6:3]: lemFir = 0000000100000100
[ 1005.490964803,3] PHB#006f[6:3]: lemErrorMask = 0000000000000000
[ 1005.491002854,3] PHB#006f[6:3]: lemWOF = 0000000000000100
[ 1005.491036256,3] PHB#006f[6:3]: phbErrorStatus = 000004e000000000
[ 1005.491069266,3] PHB#006f[6:3]: phbFirstErrorStatus = 0000040000000000
[ 1005.491102274,3] PHB#006f[6:3]: phbErrorLog0 = 0000000000000000
[ 1005.491140034,3] PHB#006f[6:3]: phbErrorLog1 = 0000000000000000
[ 1005.491177123,3] PHB#006f[6:3]: phbTxeErrorStatus = 0000000000000000
[ 1005.491213447,3] PHB#006f[6:3]: phbTxeFirstErrorStatus = 0000000000000000
[ 1005.491248051,3] PHB#006f[6:3]: phbTxeErrorLog0 = 0000000000000000
[ 1005.491281223,3] PHB#006f[6:3]: phbTxeErrorLog1 = 0000000000000000
[ 1005.491314903,3] PHB#006f[6:3]: phbRxeArbErrorStatus = 0000000000000000
[ 1005.491349975,3] PHB#006f[6:3]: phbRxeArbFrstErrorStatus = 0000000000000000
[ 1005.491388215,3] PHB#006f[6:3]: phbRxeArbErrorLog0 = 0000000000000000
[ 1005.491425458,3] PHB#006f[6:3]: phbRxeArbErrorLog1 = 0000000000000000
[ 1005.491459174,3] PHB#006f[6:3]: phbRxeMrgErrorStatus = 0000000000000001
[ 1005.491492464,3] PHB#006f[6:3]: phbRxeMrgFrstErrorStatus = 0000000000000001
[ 1005.491524353,3] PHB#006f[6:3]: phbRxeMrgErrorLog0 = 0000000000000000
[ 1005.491557762,3] PHB#006f[6:3]: phbRxeMrgErrorLog1 = 0000000000000000
[ 1005.491594088,3] PHB#006f[6:3]: phbRxeTceErrorStatus = 0000000000000000
[ 1005.491632865,3] PHB#006f[6:3]: phbRxeTceFrstErrorStatus = 0000000000000000
[ 1005.491671010,3] PHB#006f[6:3]: phbRxeTceErrorLog0 = 0000000000000000
[ 1005.491705600,3] PHB#006f[6:3]: phbRxeTceErrorLog1 = 0000000000000000
[ 1005.491738612,3] PHB#006f[6:3]: phbPblErrorStatus = 0100000000000000
[ 1005.491772417,3] PHB#006f[6:3]: phbPblFirstErrorStatus = 0100000000000000
[ 1005.491810821,3] PHB#006f[6:3]: phbPblErrorLog0 = 0000000000000000
[ 1005.491846866,3] PHB#006f[6:3]: phbPblErrorLog1 = 0000000000000000
[ 1005.491884228,3] PHB#006f[6:3]: phbPcieDlpErrorLog1 = 0000000000000000
[ 1005.491921107,3] PHB#006f[6:3]: phbPcieDlpErrorLog2 = 0000000000000000
[ 1005.491954581,3] PHB#006f[6:3]: phbPcieDlpErrorStatus = 0000000000000000
[ 1005.491988376,3] PHB#006f[6:3]: phbRegbErrorStatus = 0090001100000000
[ 1005.492022294,3] PHB#006f[6:3]: phbRegbFirstErrorStatus = 0000001000000000
[ 1005.492058404,3] PHB#006f[6:3]: phbRegbErrorLog0 = 2480005800000000
[ 1005.492094757,3] PHB#006f[6:3]: phbRegbErrorLog1 = 0000000000000000
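Most registers in a dump like this are zero, so it can help to filter it down to the ones that actually latched something. The following is a rough sketch, not an official tool: the function names `parse_phb_dump` and `nonzero_registers` are made up for illustration, and the regular expression only assumes the `PHB#xxxx[c:p]: NAME = VALUE` layout visible in the lines above.

```python
import re

# Assumed line layout (taken from the dump above):
#   [ <timestamp>,<level>] PHB#006f[6:3]: <name> = <hex value>
# The FIR lines use "NAME=VALUE" without spaces, so whitespace around "=" is optional.
LINE_RE = re.compile(
    r"PHB#[0-9a-fA-F]+\[\d+:\d+\]:\s*"
    r"(?P<name>[A-Za-z0-9 ]+?)\s*=\s*(?P<value>[0-9a-fA-F]+)\s*$"
)


def parse_phb_dump(text):
    """Return {register name: integer value} for every 'name = hex' line."""
    regs = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            regs[match.group("name")] = int(match.group("value"), 16)
    return regs


def nonzero_registers(regs):
    """Keep only the registers whose value is not zero."""
    return {name: value for name, value in regs.items() if value != 0}


# A few lines copied from the dump above.
SAMPLE = """\
[ 1005.489389250,3] PHB#006f[6:3]: PHB Freeze/Fence detected !
[ 1005.489510147,3] PHB#006f[6:3]: NEST FIR=0800000000000000
[ 1005.490335762,3] PHB#006f[6:3]: linkStatus = c1010008
[ 1005.491102274,3] PHB#006f[6:3]: phbErrorLog0 = 0000000000000000
[ 1005.491988376,3] PHB#006f[6:3]: phbRegbErrorStatus = 0090001100000000
"""

if __name__ == "__main__":
    for name, value in nonzero_registers(parse_phb_dump(SAMPLE)).items():
        print(f"{name}: {value:#x}")
```

Feeding the full log through `parse_phb_dump` instead of `SAMPLE` should work the same way, as long as every line keeps the layout assumed above.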

@fbarrat
Contributor

fbarrat commented Aug 11, 2023

REGB Error Status Register, bit 31 is:

31 pdlo_dl_target_speed_not_reached_err
Error bit to detect HW not training to negotiated speed due to EQ timeouts.

I'm getting mixed messages from some of those errors. I think it needs to be analyzed by the hw team. You must have some official support channel for hw, right?
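For reading the status values against the bit number quoted above, here is a small sketch that lists which bits are set. It assumes IBM-style numbering for a 64-bit register, where bit 0 is the most significant bit; under that assumption the dumped phbRegbErrorStatus value does have bit 31 set. The helper name `set_bits_msb0` is hypothetical, and the name table holds only the one bit described in this thread, so the other bits are left unnamed rather than guessed.

```python
def set_bits_msb0(value, width=64):
    """List the set bits of `value`, numbering bit 0 as the most significant bit."""
    return [bit for bit in range(width) if (value >> (width - 1 - bit)) & 1]


# Only the bit named in this thread; the other bits would need the P10 PHB spec.
REGB_BIT_NAMES = {
    31: "pdlo_dl_target_speed_not_reached_err",
}

# Values copied from the dump above.
REGB_ERROR_STATUS = 0x0090001100000000
REGB_FIRST_ERROR_STATUS = 0x0000001000000000

if __name__ == "__main__":
    for bit in set_bits_msb0(REGB_ERROR_STATUS):
        print(bit, REGB_BIT_NAMES.get(bit, "(not described in this thread)"))
    print("first error bits:", set_bits_msb0(REGB_FIRST_ERROR_STATUS))
```

With this numbering, phbRegbErrorStatus has bits 8, 11, 27 and 31 set, while phbRegbFirstErrorStatus has only bit 27 set, so bit 31 was not the first error latched in that register.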
