Bug 6398

Summary: forcedeth broken powermanagement/irq handling ?
Product: Drivers Reporter: drago01
Component: Network Assignee: Francois Romieu (romieu)
Status: CLOSED CODE_FIX    
Severity: high CC: aabdulla, akpm, d.schroeter, ranma+kernel, romieu
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.16 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Awfully experimental suspend/resume support for the forcedeth driver

Description drago01 2006-04-17 02:15:59 UTC
Most recent kernel where this bug did not occur: N/A
Distribution: Fedora Core 5
Hardware Environment:
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller 
(rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev a2)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller 
(rev a3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
01:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host 
Controller (rev 80)
01:0a.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 
Gigabit Ethernet Controller (rev 13)
01:0b.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 0a)
01:0b.1 Input device controller: Creative Labs SB Live! MIDI/Game Port 
(rev 0a)
05:00.0 VGA compatible controller: nVidia Corporation GeForce 7800 GTX 
(rev a1)
Software Environment:
Problem Description:
I am running a 2.6.16.1 kernel on a DFI LP NF4 SLI-DR Expert mobo, which
has an nvidia chipset with an onboard nic. The nic works fine with the
forcedeth driver, performance is ok (good). The system is a x86_64 FC5
install on a dual core opteron 170 cpu with 2GB (2x1GB in dual channel)
of RAM installed.
When I suspend the machine using ACPI S3 or swsusp and resume it, the
network is dead. I cannot receive any packets. ifdown / ifup does not
help. Restarting the network using /sbin/service network restart also
does not get the network working. Unloading and reloading the driver
(modprobe -r forcedeth; modprobe forcedeth) has the same result: dead network.
I have to reboot in order to get the network working again. I have a
static IP, so this is not a dhcp issue.
This makes suspend useless on my box, because I have to reboot anyway to
get my network working.
What could be causing this? If there is any info that I can provide to
help fix this bug, tell me.
Steps to reproduce:
suspend or hibernate
resume
try to do anything with the onboard nic.
I mailed this to lkml but got no reply, so I filed it as a bug.
Comment 1 drago01 2006-04-17 02:17:47 UTC
I also noticed this (don't know if it is related or not but doubt it):
cat /proc/interrupts
          CPU0       CPU1
 0:     640968     628532    IO-APIC-edge  timer
 1:       4763       4745    IO-APIC-edge  i8042
 8:          0          0    IO-APIC-edge  rtc
 9:          0          0   IO-APIC-level  acpi
14:       1552       1082    IO-APIC-edge  ide0
15:      44443      44261    IO-APIC-edge  ide1
16:      57625      44633   IO-APIC-level  libata
17:     972904          0   IO-APIC-level  eth0
note: all irqs of eth0 are handled by cpu0 only, never by cpu1, even though irqbalance is running.
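
The per-CPU split described in this comment can be checked mechanically. The following is a minimal sketch in Python, assuming the /proc/interrupts layout quoted above (a header row of CPU names, then one row per IRQ); the function names and the shortened sample are illustrative only and are not part of any existing tool.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        irq, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        # Take leading numeric fields, at most one per CPU.
        while len(counts) < ncpus and len(counts) < len(fields) \
                and fields[len(counts)].isdigit():
            counts.append(int(fields[len(counts)]))
        table[irq.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def cpu_share(table, name):
    """Fraction of interrupts each CPU handled for the row that lists `name`."""
    for counts, desc in table.values():
        if name in desc.split():
            total = sum(counts)
            return [c / total if total else 0.0 for c in counts]
    raise KeyError(name)


# Two rows copied from the output quoted in this comment.
SAMPLE = """\
          CPU0       CPU1
 0:     640968     628532    IO-APIC-edge  timer
17:     972904          0   IO-APIC-level  eth0
"""

if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    print("timer:", cpu_share(table, "timer"))
    print("eth0: ", cpu_share(table, "eth0"))
```

On a live system the same functions could be fed the contents of /proc/interrupts instead of the sample string.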
Comment 2 Francois Romieu 2006-04-18 13:48:54 UTC
Created attachment 7895 [details]
Awfully experimental suspend/resume support for the forcedeth driver
Comment 3 Francois Romieu 2006-04-18 14:41:01 UTC
If you are really bored and you can afford to crash your computer a few times,
you can try the patch above. No warranty implied, really.

Don't worry about irqbalance. It is a different topic.

-- 
Ueimor
Comment 4 drago01 2006-04-19 07:15:54 UTC
ok I will try it this weekend.
Comment 5 drago01 2006-04-21 22:54:52 UTC
the patch works fine!!
thx.
Can this be included into the mainline kernel ?
suspend->resume->network works fine
What about the irq issue?
Comment 6 drago01 2006-05-13 23:54:28 UTC
is this patch going to be included in 2.6.17 ?
Comment 7 Ayaz Abdulla 2006-07-10 13:39:33 UTC
How do I suspend the ethernet device? Using "suspend -f" will only suspend the 
console. Also, there is no "resume" command.
Comment 8 drago01 2006-07-11 04:36:08 UTC
ifup the device
suspend the box (S3, S4 or software suspend)
resume and see if you are still able to receive/send anything.
Comment 9 Ayaz Abdulla 2006-07-11 11:48:54 UTC
That was my question. How do I suspend the box into S3 or S5?
Comment 10 drago01 2006-07-11 23:05:41 UTC
I don't know what S5 is but S3 (suspend to ram):
do as root:
echo mem > /sys/power/state
for S4 (suspend to disk):
echo disk > /sys/power/state
(on older kernels /proc/power/state)
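
For scripted testing, the two echo commands above can be wrapped as follows. This is a minimal sketch in Python, assuming the conventional mapping of S3 to "mem" and S4 to "disk" used in this comment, plus S1 to "standby"; the helper names are hypothetical. S5 is ACPI soft-off, i.e. a normal power-down, so it has no keyword here. The write needs root and only returns after the machine has resumed.

```python
# Conventional ACPI-state -> /sys/power/state keyword mapping (assumption:
# matches the kernels discussed in this thread).
ACPI_TO_KEYWORD = {"S1": "standby", "S3": "mem", "S4": "disk"}


def supported_sleep_states(path="/sys/power/state"):
    """Return the sleep keywords the running kernel advertises."""
    with open(path) as f:
        return f.read().split()


def request_sleep(acpi_state, path="/sys/power/state"):
    """Write the keyword for `acpi_state` to the state file.

    Raises ValueError if the state has no keyword (e.g. S5) or the
    kernel does not list it as supported.
    """
    keyword = ACPI_TO_KEYWORD.get(acpi_state)
    if keyword is None:
        raise ValueError(f"{acpi_state} has no /sys/power/state keyword")
    if keyword not in supported_sleep_states(path):
        raise ValueError(f"kernel does not offer '{keyword}'")
    with open(path, "w") as f:
        f.write(keyword)  # on a real system this blocks until resume


if __name__ == "__main__":
    print("supported:", supported_sleep_states())
    # request_sleep("S3")  # uncomment to actually suspend to RAM (root only)
```

The `path` parameter exists only so the helpers can be exercised against an ordinary file without suspending the machine.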
Comment 11 Tobias Diedrich 2006-07-14 08:04:42 UTC
I've recently got myself a new nforce 570 board (mcp55) with dual-gige onboard
lan and also noticed, that after S4 network is down.  After "ifconfig eth0 down
&& ifconfig eth0 up" it's coming back to life in my case (running kernel
2.6.18-rc1).

lspci:
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:01.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:06.0 PCI bridge: nVidia Corporation Unknown device 0370 (rev a2)
00:06.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:0a.0 PCI bridge: nVidia Corporation Unknown device 0376 (rev a2)
00:0b.0 PCI bridge: nVidia Corporation Unknown device 0374 (rev a2)
00:0c.0 PCI bridge: nVidia Corporation Unknown device 0374 (rev a2)
00:0d.0 PCI bridge: nVidia Corporation Unknown device 0378 (rev a2)
00:0e.0 PCI bridge: nVidia Corporation Unknown device 0375 (rev a2)
00:0f.0 PCI bridge: nVidia Corporation Unknown device 0377 (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:06.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100
TX2) (rev 02)
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)
01:08.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
01:08.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 07)
01:0b.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
Controller (PHY/Link)
07:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300
(PCIE)]
07:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]

(Only one GigE showing up, because I've disabled the second one in BIOS)
Comment 12 Tobias Diedrich 2006-07-15 05:52:31 UTC
I just tried the 'awfully experimental' patch from Romieu and found that it does
not blow up and it even _works_ when I disable msi/msix.
So for now I added "options forcedeth msi=0 msix=0" to /etc/modprobe.d/forcedeth.

With msi enabled I get
|Jul 15 12:29:37 melchior kernel: APIC error on CPU0: 00(40)
followed by a lot of
|Jul 15 12:29:37 melchior kernel: APIC error on CPU0: 40(40)
|Jul 15 12:30:07 melchior last message repeated 6122 times
|Jul 15 12:31:03 melchior last message repeated 11430 times

Also interesting:
|Jul 15 12:47:00 melchior kernel: pnp: Device 00:08 activated.
|Jul 15 12:47:00 melchior kernel: pnp: Failed to activate device 00:09.
even though I get this with the working case too.
|00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
|00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
Comment 13 drago01 2006-07-15 06:12:27 UTC
nice to hear that it works (with minor issues)
can you post the output of cat /proc/interrupts (one with msi enabled and one
with it off)?
also dual or single cpu system?
Comment 14 Tobias Diedrich 2006-07-15 06:48:47 UTC
Single core (Orleans):
|processor       : 0
|vendor_id       : AuthenticAMD
|cpu family      : 15
|model           : 79
|model name      : AMD Athlon(tm) 64 Processor 3200+
|stepping        : 2
|cpu MHz         : 1000.000
|cache size      : 512 KB
|fpu             : yes
|fpu_exception   : yes
|cpuid level     : 1
|wp              : yes
|flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
 pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm
3dnowext 3dnow pni cx16 lahf_lm svm cr8_legacy
|bogomips        : 2010.90
|TLB size        : 1024 4K pages
|clflush size    : 64
|cache_alignment : 64
|address sizes   : 40 bits physical, 48 bits virtual
|power management: ts fid vid ttp tm stc

/proc/interrupts, with msi
|           CPU0       
|  0:      18536    IO-APIC-edge  timer
|  1:          8    IO-APIC-edge  i8042
|  7:          1    IO-APIC-edge  parport0
|  8:          0    IO-APIC-edge  rtc
|  9:          0   IO-APIC-level  acpi
| 14:        100    IO-APIC-edge  ide0
| 50:        381   IO-APIC-level  ehci_hcd:usb1, HDA Intel
| 58:        293   IO-APIC-level  ohci_hcd:usb2
| 66:          0   IO-APIC-level  EMU10K1
| 74:          1   IO-APIC-level  gige0
| 98:          0       PCI-MSI-X  eth0
|106:          0       PCI-MSI-X  eth0
|114:       5221       PCI-MSI-X  eth0
|122:        236       PCI-MSI-X  eth1
|130:        284       PCI-MSI-X  eth1
|138:       5224       PCI-MSI-X  eth1
|233:       2304   IO-APIC-level  ide2
|NMI:         40 
|LOC:      18496 
|ERR:          0
|MIS:          0

/proc/interrupts, without msi
|           CPU0       
|  0:      84858    IO-APIC-edge  timer
|  1:         10    IO-APIC-edge  i8042
|  7:          1    IO-APIC-edge  parport0
|  8:          0    IO-APIC-edge  rtc
|  9:          0   IO-APIC-level  acpi
| 14:        100    IO-APIC-edge  ide0
| 50:        381   IO-APIC-level  ehci_hcd:usb1, HDA Intel
| 58:       4938   IO-APIC-level  ohci_hcd:usb2
| 66:          0   IO-APIC-level  EMU10K1
| 74:          1   IO-APIC-level  gige0
| 82:      24087   IO-APIC-level  eth0
| 90:      28469   IO-APIC-level  eth1
|233:      23167   IO-APIC-level  ide2, radeon@pci:0000:07:00.0
|NMI:         60 
|LOC:      84821 
|ERR:          0
|MIS:          0

gige0 is a realtek pci card from the old system with a udev renaming rule.
eth0 and eth1 are the two onboard interfaces.
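
The difference between the two tables above is easier to see when the rows are grouped by device: with MSI each onboard interface has three PCI-MSI-X vectors, without it each has a single IO-APIC-level line. The following is a minimal sketch in Python that does this grouping, assuming single-CPU output quoted with a leading "|" as in this comment; the function name and the shortened samples are illustrative only.

```python
from collections import defaultdict


def vectors_by_device(text):
    """Map device name -> list of (irq, count, kind) for single-CPU output."""
    result = defaultdict(list)
    for raw in text.splitlines():
        line = raw.lstrip("|").strip()  # drop the quoting bar
        irq, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip the CPU header, NMI/LOC/ERR/MIS counters and blank lines.
        if not sep or not irq.strip().isdigit() or len(fields) < 3:
            continue
        count, kind = int(fields[0]), fields[1]
        # Shared lines list several devices separated by ", ".
        for dev in " ".join(fields[2:]).split(", "):
            result[dev].append((int(irq), count, kind))
    return dict(result)


# Rows copied from the two tables quoted in this comment.
WITH_MSI = """\
|           CPU0
| 98:          0       PCI-MSI-X  eth0
|106:          0       PCI-MSI-X  eth0
|114:       5221       PCI-MSI-X  eth0
|NMI:         40
"""

WITHOUT_MSI = """\
|           CPU0
| 82:      24087   IO-APIC-level  eth0
|233:      23167   IO-APIC-level  ide2, radeon@pci:0000:07:00.0
"""

if __name__ == "__main__":
    for label, text in (("msi", WITH_MSI), ("no msi", WITHOUT_MSI)):
        for dev, vectors in vectors_by_device(text).items():
            print(label, dev, vectors)
```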

Which brings me to another issue I'll probably have to open a new bug for:
With both interfaces enabled I'm unable to get a 1000Mbit link, only 100Mbit
works and link detection takes ages (a few secs). If I disable the second
interface 1000Mbit works (would have to check for the link detection time).

And on eth0 (no cable attached) ethtool says:
|        Speed: Unknown! (65535)
|        Duplex: Unknown! (255)

As opposed to gige0 (no cable attached):
|        Speed: 100Mb/s
|        Duplex: Half

Now, I was about to say forcing speed with ethtool doesn't work, but just found
that "ethtool -s eth1 speed 1000 duplex full autoneg on" actually did get the
speed up to 1000 just now.  Weird.
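
A note on the "Unknown! (65535)" and "Unknown! (255)" values quoted above for eth0: these are what -1 looks like when stored in a 16-bit and an 8-bit unsigned field. That is consistent with a driver reporting -1 for speed and duplex while there is no link, although this reading is an assumption and not something confirmed in this thread. The arithmetic, as a minimal Python sketch with hypothetical helper names:

```python
def as_unsigned(value, bits):
    """Reinterpret a (possibly negative) integer as an unsigned `bits`-wide value."""
    return value & ((1 << bits) - 1)


def describe_link(speed_u16, duplex_u8):
    """Render speed/duplex roughly the way the ethtool lines above read.

    Assumption: 0 means half duplex and 1 means full duplex; anything
    else, including the all-ones value, is shown as Unknown.
    """
    if speed_u16 == as_unsigned(-1, 16):
        speed = "Unknown"
    else:
        speed = f"{speed_u16}Mb/s"
    duplex = {0: "Half", 1: "Full"}.get(duplex_u8, "Unknown")
    return speed, duplex


if __name__ == "__main__":
    print(as_unsigned(-1, 16), as_unsigned(-1, 8))
    print(describe_link(65535, 255))  # eth0, no cable attached
    print(describe_link(100, 0))      # gige0, no cable attached
```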
Comment 15 drago01 2006-08-14 07:40:19 UTC
any updates?
anything I can do to fix this one?
Comment 16 drago01 2006-08-14 07:40:58 UTC
sorry should be 
anything I can do to help fixing this one?
Comment 17 Tobias Diedrich 2006-09-30 06:12:16 UTC
I'm happy to report, that with 2.6.18-mm2 suspend to disk works without
additional patches, even with MSI interrupts enabled (the mm2 announcement said
something about MSI changes, so I figured I'd try both with msi disabled and
enabled).
Comment 18 Francois Romieu 2006-09-30 06:43:22 UTC
It would be better if the patch avoids a complete close/open cycle which can fail.

As:
1) I do not have a lot of time to poke in the guts of the driver without
documentation and figure the use of the registers (let alone seekreet ones :o) )
2) It would imply a new test cycle and more delay
3) The user experience without the patch sucks
4) The patch is not _that_ rotten

I see no reason to further delay the inclusion of the patch in the kernel.

I'll do a proper submission of the patch to Jeff.

Thanks for your help and patience.

-- 
Ueimor
Comment 19 Francois Romieu 2007-01-01 14:08:52 UTC
Fix has been included in mainline under id
a189317fa0e9d425cd3a4c248b06f96d876cf7fd :

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a189317fa0e9d425cd3a4c248b06f96d876cf7fd

It is available since 2.6.20-rc1.

-- 
Ueimor