Bug 60799 - 82571EB: Detected Hardware Unit Hang (invalid MPS config)
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: http://lkml.kernel.org/r/4FFA9B96.604...
Keywords:
Depends on:
Blocks:
 
Reported: 2013-08-26 20:55 UTC by Bjorn Helgaas
Modified: 2013-08-26 21:07 UTC

See Also:
Kernel Version: 2.6.32
Subsystem:
Regression: No
Bisected commit-id:


Attachments
more lspci and dmesg info (111.15 KB, text/plain)
2013-08-26 20:55 UTC, Bjorn Helgaas
full lspci (BIOS default MPS settings) (64.94 KB, text/plain)
2013-08-26 20:59 UTC, Bjorn Helgaas
dmesg (BIOS MPS=128 setting) (81.27 KB, text/plain)
2013-08-26 21:01 UTC, Bjorn Helgaas

Description Bjorn Helgaas 2013-08-26 20:55:53 UTC
Created attachment 107323
more lspci and dmesg info

Joe Jin <joe.jin@oracle.com> reported this (URL above):

I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when running
an scp test. The issue is easy to reproduce on a SUN FIRE X2270 M2: copying
a big file (>500M) from another server triggers it at once.

device info:
# lspci -s 05:00.0 
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

# lspci -s 05:00.0 -n
05:00.0 0200: 8086:10bc (rev 06)

# ethtool -i eth0
driver: e1000e
version: 2.0.0-NAPI
firmware-version: 5.10-2
bus-info: 0000:05:00.0

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on

kernel log:
-----------
e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6c>
  TDT                  <81>
  next_to_use          <81>
  next_to_clean        <6b>
buffer_info[next_to_clean]:
  time_stamp           <fffc7a23>
  next_to_watch        <71>
  jiffies              <fffc8c0c>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
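
For readers decoding the log above, here is a minimal sketch of the ring arithmetic. The helper name `ring_distance` and the ring size of 256 descriptors are assumptions for illustration. The report does not state the configured ring size, and the result here does not depend on it because the indices do not wrap. The elapsed time is left in jiffies because HZ is not given.

```python
# Hypothetical helper: number of descriptors from index `start` up to
# (not including) index `end` on a circular ring.
def ring_distance(start, end, ring_size=256):  # ring_size is an assumption
    return (end - start) % ring_size

# Values copied from the kernel log above (hexadecimal).
TDH, TDT = 0x6C, 0x81
NEXT_TO_USE, NEXT_TO_CLEAN = 0x81, 0x6B
TIME_STAMP, JIFFIES = 0xFFFC7A23, 0xFFFC8C0C

# Descriptors between the hardware head and the software tail.
hw_pending = ring_distance(TDH, TDT)

# Descriptors the driver has queued but not yet cleaned.
sw_pending = ring_distance(NEXT_TO_CLEAN, NEXT_TO_USE)

# Age of the oldest uncleaned buffer, in jiffies.
age_jiffies = JIFFIES - TIME_STAMP

print(hw_pending, sw_pending, age_jiffies)
```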
Comment 1 Bjorn Helgaas 2013-08-26 20:59:13 UTC
Created attachment 107324
full lspci (BIOS default MPS settings)

Note the following MPS settings:

  00:07.0 Root Port bridge to [bus 02-05]       MPS=256
  02:00.0 bridge to [bus 03-05]                 MPS=256
  03:02.0 bridge to [bus 05]                    MPS=128
  03:04.0 bridge to [bus 04]                    MPS=128 <SERR+
  04:00.0 82571EB                               MPS=128 FatalErr+ MalfTLP+
  04:00.1 82571EB                               MPS=128 FatalErr+ MalfTLP+
  05:00.0 82571EB                               MPS=128 FatalErr+ MalfTLP+
  05:00.1 82571EB                               MPS=128 FatalErr+ MalfTLP+

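00:07.0 and 02:00.0 appear to be incorrect: they have MPS=256 but should be 128, because every device below them is set to MPS=128. A device that receives a TLP with a payload larger than its own MPS treats it as a Malformed TLP, which is consistent with the MalfTLP+ bits on the 82571EB functions.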
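
The check behind this observation can be sketched as follows. This is only an illustration, not kernel code. The MPS values come from the table above, the parent links are inferred from the bus ranges in that table, and the names `Dev`, `DEVICES`, and `oversized_upstream` are hypothetical. It flags every bridge whose MPS is larger than that of some device beneath it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dev:
    addr: str               # bus:device.function
    mps: int                # configured Max Payload Size in bytes
    parent: Optional[str]   # upstream bridge address, None for the Root Port

# MPS values from the table; parent links inferred from the bus ranges.
DEVICES = [
    Dev("00:07.0", 256, None),
    Dev("02:00.0", 256, "00:07.0"),
    Dev("03:02.0", 128, "02:00.0"),
    Dev("03:04.0", 128, "02:00.0"),
    Dev("04:00.0", 128, "03:04.0"),
    Dev("04:00.1", 128, "03:04.0"),
    Dev("05:00.0", 128, "03:02.0"),
    Dev("05:00.1", 128, "03:02.0"),
]

def oversized_upstream(devices):
    """Return addresses of bridges whose MPS exceeds that of a device below them."""
    by_addr = {d.addr: d for d in devices}
    flagged = set()
    for d in devices:
        # Walk from this device up to the Root Port.
        node = by_addr.get(d.parent) if d.parent else None
        while node is not None:
            if node.mps > d.mps:
                flagged.add(node.addr)
            node = by_addr.get(node.parent) if node.parent else None
    return sorted(flagged)

print(oversized_upstream(DEVICES))
```

With the values above, this reports 00:07.0 and 02:00.0, the same two bridges identified by inspection.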
Comment 2 Bjorn Helgaas 2013-08-26 21:01:12 UTC
Created attachment 107325
dmesg (BIOS MPS=128 setting)

The 82571EB works correctly when the BIOS sets MPS=128 for all devices.  The dmesg log contains no MPS information; it is attached for context only.
Comment 3 Bjorn Helgaas 2013-08-26 21:02:54 UTC
The incorrect MPS setting is a BIOS issue, and apparently this was resolved by a BIOS fix.

It would be good if Linux noticed the incorrect setting and at least warned about it.
Comment 4 Bjorn Helgaas 2013-08-26 21:07:20 UTC
http://lkml.kernel.org/r/509B5038.8090304@oracle.com is a similar report that might be the same issue, but I didn't see complete lspci output in that thread, so I can't be sure.
