Bug 8952

Summary: tulip driver oops in tulip_interrupt when hibernating with swsusp/suspend2
Product: Drivers Reporter: kernelbugs
Component: NetworkAssignee: Grant Grundler (grundler)
Status: CLOSED CODE_FIX    
Severity: normal CC: bunk, kyle, pavel, protasnb, rjwysocki
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22.5 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: Photo of 2.6.22.5 oops report
dmesg output after boot (kernel 2.6.22.5)
Config file for 2.6.22.5
Photo of 2.6.22.9 oops
objdump -rD of drivers/net/tulip/interrupt.o
Panic report from unpatched 2.6.24.4
First panic report from patched 2.6.24.4
Second panic report from patched 2.6.24.4
Kernel message log, 2.6.24.4, with printks added
PATCH suspend/resume rewrite...after checking other drivers.

Description kernelbugs 2007-08-28 05:36:50 UTC
Most recent kernel where this bug did not occur: I think this may have appeared when I switched from 2.6.17.11 to 2.6.18.8. I was originally using suspend2 but switched to mainline swsusp to determine whether the problem occurred with it.

Distribution: Debian sarge, x86

Hardware Environment:
/proc/cpuinfo:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 1
cpu MHz         : 599.825
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1200.83
clflush size    : 32

/proc/ioports:
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : 0000:00:07.1
  0170-0177 : ide1
01f0-01f7 : 0000:00:07.1
  01f0-01f7 : ide0
02f8-02ff : serial
0376-0376 : 0000:00:07.1
  0376-0376 : ide1
0378-037a : parport0
03c0-03df : vesafb
03f2-03f5 : floppy
03f6-03f6 : 0000:00:07.1
  03f6-03f6 : ide0
03f7-03f7 : floppy DIR
03f8-03ff : serial
0cf8-0cff : PCI conf1
4000-4003 : ACPI PM1a_EVT_BLK
4004-4005 : ACPI PM1a_CNT_BLK
4008-400b : ACPI PM_TMR
4010-4015 : ACPI CPU throttle
4020-4023 : ACPI GPE0_BLK
5000-5007 : vt596_smbus
d000-d00f : 0000:00:07.1
  d000-d007 : ide0
  d008-d00f : ide1
d400-d41f : 0000:00:07.2
  d400-d41f : uhci_hcd
dc00-dcff : 0000:00:0f.0
e000-e007 : 0000:00:12.0
  e000-e007 : au8830
e400-e407 : 0000:00:12.0
  e400-e407 : au8830
e800-e8ff : 0000:00:13.0
  e800-e8ff : tulip
ec00-ecff : 0000:00:14.0
  ec00-ecff : tulip

/proc/iomem:
00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000c9fff : pnp 00:00
000f0000-000fffff : System ROM
00100000-17feffff : System RAM
  00100000-00309b3a : Kernel code
  00309b3b-003d4f6b : Kernel data
17ff0000-17ff2fff : ACPI Non-volatile Storage
17ff3000-17ffffff : ACPI Tables
20000000-2001ffff : 0000:00:0f.0
d8000000-dbffffff : 0000:00:00.0
dc000000-dcffffff : 0000:00:0f.0
  dc000000-dc3fffff : vesafb
de000000-de01ffff : 0000:00:13.0
df000000-df01ffff : 0000:00:14.0
e0000000-e003ffff : 0000:00:12.0
  e0000000-e003ffff : au8830
e0040000-e0040fff : 0000:00:0f.0
e0041000-e00413ff : 0000:00:14.0
  e0041000-e00413ff : tulip
e0042000-e00423ff : 0000:00:13.0
  e0042000-e00423ff : tulip
ffff0000-ffffffff : reserved


/proc/interrupts:
           CPU0
  0:     386801    XT-PIC-XT        timer
  1:         42    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  5:          1    XT-PIC-XT        acpi
  6:          5    XT-PIC-XT        floppy
  7:          0    XT-PIC-XT        parport0
  8:          1    XT-PIC-XT        rtc
 10:      30561    XT-PIC-XT        au8830
 11:      27583    XT-PIC-XT        uhci_hcd:usb1, eth0
 14:     138592    XT-PIC-XT        ide0
 15:      11574    XT-PIC-XT        ide1
NMI:          0
ERR:          0

lspci -vvv:
00:00.0 Host bridge: VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x] (rev 82)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR+
        Latency: 0
        Region 0: Memory at d8000000 (32-bit, prefetchable) [size=64M]
        Capabilities: [a0] AGP version 2.0
                Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW- AGP3- Rate=x1,x2,x4
                Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=<none>

00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP] (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
        Latency: 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000f000-00000fff
        Memory behind bridge: fff00000-000fffff
        Prefetchable memory behind bridge: fff00000-000fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
        Capabilities: [80] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.0 ISA bridge: VIA Technologies, Inc. VT82C596 ISA [Mobile South] (rev 12)
        Subsystem: VIA Technologies, Inc. VT82C596/A/B PCI to ISA Bridge
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0

00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32
        Region 0: [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8]
        Region 1: [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1]
        Region 2: [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8]
        Region 3: [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1]
        Region 4: I/O ports at d000 [size=16]

00:07.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 08) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32, Cache Line Size: 32 bytes
        Interrupt: pin D routed to IRQ 11
        Region 4: I/O ports at d400 [size=32]

00:07.3 Host bridge: VIA Technologies, Inc. VT82C596 Power Management (rev 20)
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-

00:0f.0 VGA compatible controller: ATI Technologies Inc 3D Rage II+ 215GTB [Mach64 GTB] (rev 9a) (prog-if 00 [VGA])
        Subsystem: ATI Technologies Inc 3D Rage II+ 215GTB [Mach64 GTB]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (2000ns min), Cache Line Size: 32 bytes
        Region 0: Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: I/O ports at dc00 [size=256]
        Region 2: Memory at e0040000 (32-bit, non-prefetchable) [size=4K]
        [virtual] Expansion ROM at 20000000 [disabled] [size=128K]

00:12.0 Multimedia audio controller: Aureal Semiconductor Vortex 2 (rev fe)
        Subsystem: Aureal Semiconductor AU8830 Vortex 3D Digital Audio Processor
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (1000ns min, 3000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 10
        Region 0: Memory at e0000000 (32-bit, non-prefetchable) [size=256K]
        Region 1: I/O ports at e000 [size=8]
        Region 2: I/O ports at e400 [size=8]
        Capabilities: [dc] Power Management version 1
                Flags: PMEClk- DSI+ D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:13.0 Ethernet controller: ADMtek NC100 Network Everywhere Fast Ethernet 10/100 (rev 11)
        Subsystem: Accton Technology Corporation Unknown device 1216
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (63750ns min, 63750ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 11
        Region 0: I/O ports at e800 [size=256]
        Region 1: Memory at e0042000 (32-bit, non-prefetchable) [size=1K]
        Expansion ROM at de000000 [disabled] [size=128K]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:14.0 Ethernet controller: Microsoft Corporation Unknown device 0002 (rev 11)
        Subsystem: Microsoft Corporation Unknown device 0002
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (16000ns min, 32000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 12
        Region 0: I/O ports at ec00 [size=256]
        Region 1: Memory at e0041000 (32-bit, non-prefetchable) [size=1K]
        Expansion ROM at df000000 [disabled] [size=128K]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=100mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-



Software Environment:
Hibernate Script 1.93

Linux thursday 2.6.22.5 #1 Thu Aug 23 10:59:50 EDT 2007 i686 GNU/Linux

Gnu C                  4.1.2
Gnu make               3.81
binutils               2.17
util-linux             2.12r
mount                  2.12r
module-init-tools      3.3-pre2
e2fsprogs              1.40-WIP
Linux C Library        2.3.6
Dynamic linker (ldd)   2.3.6
Procps                 3.2.7
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.97
udev                   105
Modules Loaded         lp thermal fan button processor ac battery sha1 arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss usbhid snd_au8830 8250_pnp snd_ac97_codec 8250 snd_pcm snd_timer snd_page_alloc ac97_bus snd_mpu401_uart snd_rawmidi snd_seq_device serial_core tulip snd uhci_hcd


Problem Description:

The kernel reports an Oops in the tulip driver while hibernating, but not every time. (It occurs perhaps 30% to 50% of the time.) The system hangs, caps lock and scroll lock flash, and the computer has to be powered off by holding down the power button.

The original bug report (before reproducing the problem with swsusp) is on the Suspend2 bugzilla at: http://bugzilla.suspend2.net/show_bug.cgi?id=294

I am using the hibernate shell script to suspend the system. I will attach a photo of the Oops report from 2.6.22.5 to this bug, but there are several additional ones from older kernels on the original bug page.

Speculation: The network card is on a shared interrupt with the USB controller, so it may be receiving unexpected interrupts generated by USB.

Steps to reproduce:

1. Try suspending to disk.

2. Experience kernel Oops (but not every time).

Here is an Oops report from kernel 2.6.21.1 that I sent in a private mail concerning the bug:
Clocksource tsc unstable (delta = 4144113840505 ns)
pnp: Device 00:08 activated.
pnp: Device 00:09 activated.
pnp: Device 00:0a activated.
pnp: Failed to activate device 00:0b.
 usbdev1.2_ep00: PM: resume from 0, parent 1-1 still 1
usbhid 1-1:1.0: PM: resume from 1, parent 1-1 still 1
 usbdev1.2_ep01: PM: resume from 0, parent 1-1:1.0 still 1
 usbdev1.2: PM: resume from 0, parent 1-1 still 1
Restarting tasks ... <6>usb 1-1: USB disconnect, address 2
done.
Stopping tasks ... done.
Shrinking memory... -<1>BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008c
 printing eip:
d883e960
*pde = 00000000
Ooops: 0000 [#1]
Modules linked in: lp thermal fan processor ac battery sha1 arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_isa i2c_viapro i2c_core usbhid 8250_pnp 8250 serial_core uhci_hcd snd_page_alloc ac97_bus tulip
CPU: 0
EIP: 0060:[<d883e960>] Not tainted VLI
EFLGS: 00010286 (2.6.21.1 #1)
EIP is at tulip_interrupt+0x29b/0xb37 [tulip]
eax: 00000000 ebx: cfde7aa0 ecx: 0000003c edx: d7c68000
esi: 00000035 edi: d768ca12 ebp: d7c68340 esp: c03c9ee4
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process swapper (pid: 0, ti=c03c8000 task=c03a3380 task.ti=c03c8000)
Stack: ffffffff c03a6b40 c0126d99 6c3a9c20 00003d3f 80750c34 6c776d00 0001e828
d7c68000 0001e800 fc69c0d0 00000000 00000000 00000000 00000019 c011a58f
0000003c 00000000 0000007f d7b28220 00000000 00000000 0000000b c012f8a8
Call Trace:
[<c0126d99>] tick_program_event+0x2a/0x49
[<c011a58f>] run_timer_softirq+0x11/0x12f
[<c012f8a8>] handle_IRQ_event+0x1a/0x3f
[<c01308d4>] handl_level_irq+0x5c/0x94
[<c0105530>] do_IRQ+0x54/0x70
[<c0103cab>] common_interrupt+0x23/0x28
[<c02000d8>] acpi_ds_create_node+0x45/0x55
[<d8c1fe6a>] acpi_processor_idle+0x1a7/0x335 [processor]
[<c0102327>] cpu_idle+0x39/0x4e
[<c03caacb>] start_kernel+0x33b/0x343
[<c03ca42b>] unknown_bootoption+0x0/0x202
Code: 00 8b 80 90 00 00 00 8b 54 24 20 83 c7 12 83 c0 12 89 bb 8c 00 00 00 89 53 14 89 83 90 00 00 00 8b 84 f5 18 01 00 00 8b 4c 24 40 <8b> b0 8c 00 00 00 c1 e9 02 f3 a5 8b 4c 24 40 83 e1 03 74 02 f3
EIP: [<d883e960>] tulip_interrupt+0x29b/0xb37 [tulip] SS:ESP 0068:c03c9ee4
Kernel panic - not syncing: Fatal exception in interrupt
Comment 1 kernelbugs 2007-08-28 05:39:55 UTC
Created attachment 12582 [details]
Photo of 2.6.22.5 oops report
Comment 2 Andrew Morton 2007-08-28 20:13:27 UTC
that's a tulip bug, not a power-management bug.  I updated the report appropriately, thanks.
Comment 3 Grant Grundler 2007-08-30 09:56:43 UTC
Yeah, I agree with akpm that is most likely a tulip driver bug. I've several race conditions with interrupts and shutdown in the past. I'm not surprised there might be similar races in swsusp code path.

I need a few more bits of info:
1) Console output (e.g. dmesg) after booting 2.6.22.5 (please attach)
2) tulip driver CONFIG options (e.g. "fgrep TULIP /boot/config-2.6.22.5" output)

thanks,
grant
Comment 4 kernelbugs 2007-08-30 10:31:08 UTC
Created attachment 12636 [details]
dmesg output after boot (kernel 2.6.22.5)

A couple lines of "vortex: IRQ fifo error" appear at the end immediately after login, presumably when the mixer is initialized by KDE.
Comment 5 kernelbugs 2007-08-30 10:33:08 UTC
fgrep TULIP config-2.6.22.5 gives:
CONFIG_NET_TULIP=y
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
# CONFIG_TULIP_MMIO is not set
# CONFIG_TULIP_NAPI is not set

I will attach the entire config file in case other parts prove relevant.

By the way, I apologize for the huge information dump. I didn't realize how unreadable it was until after I had submitted the bug.
Comment 6 kernelbugs 2007-08-30 10:33:47 UTC
Created attachment 12637 [details]
Config file for 2.6.22.5
Comment 7 Grant Grundler 2007-08-30 10:39:04 UTC
(In reply to comment #5)
> fgrep TULIP config-2.6.22.5 gives:
> CONFIG_NET_TULIP=y
> CONFIG_TULIP=m
> # CONFIG_TULIP_MWI is not set
> # CONFIG_TULIP_MMIO is not set
> # CONFIG_TULIP_NAPI is not set

Ok - so using IO Port space and NAPI isn't enabled...hrm.
I'll stare at the code a bit and see where there might be races.

> 
> I will attach the entire config file in case other parts prove relevant.
> 
> By the way, I apologize for the huge information dump. I didn't realize how
> unreadable it was until after I had submitted the bug.

No problem. I don't mind. Attachments are sometime better (for long output like lspci) but I'm happy to get the info in the bug report.

thanks,
grant
Comment 8 Rafael J. Wysocki 2007-09-23 05:43:22 UTC
Can you tell me what the current status of this is, please?
Comment 9 kernelbugs 2007-09-28 08:01:09 UTC
Still broken in 2.6.22.9. I haven't tried any of the 2.6.23 rcs to see if they randomly fix or hide the bug. I have no idea what's going on on the "developing a fix" side of things.

-- Tap
Comment 10 Rafael J. Wysocki 2007-09-28 08:08:46 UTC
I'm not sure if we know what exactly the problem is.

In case we don't, it's better to try the newest kernel and check if the bug is gone, because it might be fixed as a result of seemingly unrelated changes.  This often is the case with suspend bugs.
Comment 11 Grant Grundler 2007-09-30 00:49:57 UTC
Apologies for dropping off the scene for a bit.

I've looked at the tulip code and didn't see anything obvious. But I'm not familiar with the calling path of the suspend and resume events. I'll take another quick look tonight and more later this week.

I built a 2.6.22.9 kernel on my Debian "testing" machine and then ran "objdump -rD tulip.ko" to see what is in tulip_interrupt(). But "tulip_interrupt+0x29b" doesn't look like a reasonable offset for my build:
     97c:       89 e9                   mov    %ebp,%ecx
     97e:       c1 e9 02                shr    $0x2,%ecx
     981:       8b b0 8c 00 00 00       mov    0x8c(%eax),%esi
     987:       f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
     989:       89 e9                   mov    %ebp,%ecx
     98b:       83 e1 03                and    $0x3,%ecx
     98e:       74 02                   je     992 <tulip_interrupt+0x2a9>
     990:       f3 a4                   rep movsb %ds:(%esi),%es:(%edi)

Hex math (tulip_interrupt starts at 0x6e9):
6E9+29B = 984

Can you "objdump -rD" the tulip.ko (or interrupt.o if you have that) which crashed?
I need to know the EIP and objdump output for the file. Please send me the objdump output as an attachment directly.

My guess is 0x8c is the offset for the "skb" (based on reviewing code in tulip_refill_rx which also uses 0x8c offset). So it sounds like skb is deallocated but then we get an interrupt which wants to derefence that skb.
Again, a typical race in "driver shutdown" sequence.
Comment 12 kernelbugs 2007-10-01 04:47:40 UTC
Created attachment 13010 [details]
Photo of 2.6.22.9 oops
Comment 13 kernelbugs 2007-10-01 04:48:46 UTC
Created attachment 13011 [details]
objdump -rD of drivers/net/tulip/interrupt.o
Comment 14 kernelbugs 2007-10-01 04:52:07 UTC
The 29B offset was from kernel 2.6.21.1 and may not even have been right; the photo I just attached has the dump from 2.6.22.9. (I can transcribe it if needed.)

The EIP in the photo says "EIP is tulip_interrupt+0x2b4/0xc27 [tulip]".

I can attach the objdump for tulip.ko too, if you want.
Comment 15 kernelbugs 2007-10-14 06:18:28 UTC
This crash still occurs in 2.6.23; following is most of the oops dump (I left out some of the numbers that I've picked up probably don't matter much). I can attach an updated interrupt.o dump if desired.

swsusp: Basic memory bitmaps created
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000084
 printing eip:
d88429e1
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: lp ac battery sha1 arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core usbhid fan thermal snd_page_alloc ac97_bus 8250_pnp processor uhci_hcd 8250 tulip serial_core
CPU: 0
EIP: 0060:[<d88429e1>] Not tainted VLI
EFLAGS: 00010206 (2.6.23 #1)
EIP is at tulip_interrupt+0x318/0xc37 [tulip]
eax: 000000d8 ebx: 00000000 ecx: d6b3e380 edx: d6b3e380
esi: 00000007 edi: 00000070 ebp: 000000d8 esp: d5a5fcec
ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068
Process klogd [etc.]
Stack [omitted]
Call Trace:
__alloc_pages+0x54/0x28a
handle_IRQ_event+0x1a/0x3f
handle_level_irq+0x4f/0x85
do_IRQ+0x50/0x69
enqueue_entity+0x1d2/0x1f3
common_interrupt+0x23/0x28
hpet_set_mode+0x5a/0xaf
kmem_cache_alloc+0x3c/0x72
__aloc_skb+0x2a/0xf5
sock_alloc_send_skb+0x6e/0x18e
unix_dgram_sendmsg+0x14b/0x3b3
sock_aio_write+0xbc/0xc8
__dequeue_signal+0x10/0x11a
find_lock_page+0x12/0x62
do_sync_write+0xc6/0x109
__sched_text_start+0x212/0x26f
autoremove_wake_function+0x0/0x35
autoremove_wake_function+0x0/0x35 (NOT A TYPO)
vfs_write+0x9e/0x10c
sys_write+0x41/0x67
sysenter_past_esp+0x5f/0x85
xs_send_kvec+0x66/0x84
=====================
Comment 16 Natalie Protasevich 2007-11-08 05:47:07 UTC
Thanks, kernelbugs ;) for keeping status updated.
Adding Pavel to give a hand with suspend/resume.
Comment 17 Rafael J. Wysocki 2007-12-12 16:42:25 UTC
kernelbugs, can you see if the problem is present in the current mainline kernel?

Grant, what kind of information is needed to get an idea of what's wrong?
Comment 18 kernelbugs 2007-12-12 18:24:10 UTC
Rafael, what do you mean by current mainline? 2.6.23.9? The latest rc? Linus's git tree? I am pretty sure it's in 2.6.23.9 (it has frozen on suspend, but it doesn't always wait until a text console is visible for me to see the oops dump so I'm not 100% sure it's the same problem), but haven't tried the others.
Comment 19 Rafael J. Wysocki 2007-12-13 06:52:59 UTC
I meant the Linus' tree.
Comment 20 Grant Grundler 2007-12-13 10:52:11 UTC
(In reply to comment #17)
...
> Grant, what kind of information is needed to get an idea of what's wrong?

My guess is there is a race condition between the suspend and delivery of a pending IRQ. Ie the suspend path isn't waiting until either the interrupt has been delivered or possibly delivered but not completed handling of the interrupt.

I need to know the sequence of calls into the driver so I can work out the state of the driver when it's interrupt handler _should_ have been disabled. See comment #11. Something is freeing an skb that the tulip driver thinks it owns and has "in flight". I suspect it's either the netdev layer or some other part of the tulip driver.

thanks,
grant
Comment 21 Rafael J. Wysocki 2007-12-13 12:26:49 UTC
I understand.

So, would it be possible to instrument the code somehow to get that information?
Comment 22 kernelbugs 2007-12-13 20:49:26 UTC
I tried a kernel compiled from Linus's git tree today, and got it to hang three
times on a suspend (with the keyboard LEDs flashing), but unfortunately it hung
before switching out of X to a text console where I could see the oops report.
I might try getting it to dump to a serial port sometime, if necessary. I was
unable to reproduce it by starting a suspend from a text console, and I tried
15 or 20 times, so some change may have made it harder to reproduce, or I was
just unlucky.

The short answer: I think the bug exists in main line, but I don't have the
proof of an oops report, just a hang with the same LEDs flashing on suspend.
Comment 23 Grant Grundler 2007-12-14 21:26:58 UTC
>(In reply to comment #22)
> I tried a kernel compiled from Linus's git tree today, and got it to hang
> three
> times on a suspend (with the keyboard LEDs flashing), but unfortunately it
> hung
> before switching out of X to a text console where I could see the oops
> report.

Please be careful with the terminology: "hang" is different from "oops" (aka panic).

> I might try getting it to dump to a serial port sometime, if necessary. I was
> unable to reproduce it by starting a suspend from a text console, and I tried
> 15 or 20 times, so some change may have made it harder to reproduce, or I was
> just unlucky.

Are you accessing the X11 server through the network card somehow?

I'm wondering how X on console affects the tulip driver and don't see any connection other than PCI bus and difference in timing. It possible this is a different bug with a different symptom. If you could use serial port to dump the panic tombstone, that would probably help alot to determine if it is the same bug.


> The short answer: I think the bug exists in main line, but I don't have the
> proof of an oops report, just a hang with the same LEDs flashing on suspend.

I'm already convinced a real bug was reported. I'm just not sure where to look for it. And in this update, I'm thinking there is a second problem with X11 suspend/resume that is seperate from the original bug report.
Comment 24 kernelbugs 2007-12-15 09:51:54 UTC
Sorry about the hang/panic terminology, but I assume it's both as (1) the system completely stops responding and (2) the keyboard LEDs flash the same as when I actually get to see the oops/panic. Also, I remember in the past having other oops/panics (not sure which now) that didn't result in complete non-responsiveness  (they may have been related to a failing swap partition, if I remember correctly, and perhaps the only reason it's not responding in this case is because the problem occurs with interrupt handling).

I assumed the issue with X was merely a race condition concerning whether it got to the console before oopsing, since with previous kernel versions it also hung and blinked LEDs at me before switching to a console. However, it could be a bad assumption, since I could consequently not see the bug report.

My speculation in the original bug report, which no one seems to have commented on, was that the shared interrupt between the network card and the USB controller could have something to do with it; the mouse is hooked up through USB, although the keyboard is not. However, I am not doing accessing the X server across the network in the vast majority of cases where this crash happens.

I will see if I can get a null-modem cable working to see that panic information.

Also, on a related note, there is another hang problem with this machine where the LEDs do not blink, but I want to rule out a failing hard drive (since things that don't need to be loaded from disk seem to respond) before I file a bug report against the driver.
Comment 25 Rafael J. Wysocki 2007-12-15 14:32:04 UTC
It would be helpful to test if suspend works with the tulip driver unloaded, too.
Comment 26 kernelbugs 2007-12-15 19:50:13 UTC
This oops report appeared on the mainline from Thursday morning on both
the text console and the serial console (i.e., X switched out of
graphics mode before the kernel oopsed in this case):

Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Shrinking memory...  done (0 pages freed)
Freed 0 kbytes in 0.04 seconds (0.00 MB/s)
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008c
printing eip: d880f98b *pde = 00000000 
Oops: 0000 [#1] 
Modules linked in: lp ac battery sha1_generic arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core usbhid fan snd_page_alloc ac97_bus thermal uhci_hcd processor tulip

Pid: 10124, comm: echo Not tainted (2.6.24-rc5-gda8cadb3-dirty #1)
EIP: 0060:[<d880f98b>] EFLAGS: 00010286 CPU: 0
EIP is at tulip_interrupt+0x2c2/0xc13 [tulip]
EAX: 00000000 EBX: d7615cc0 ECX: 0000000f EDX: d793a360
ESI: 00000014 EDI: d7a82012 EBP: 0000003c ESP: d69e5cc0
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process echo (pid: 10124, ti=d69e4000 task=ce503050 task.ti=d69e4000)
Stack: ffffffff 00000000 00000001 003d09fa 00000086 c042ac70 c042ab80 0001e828 
       d793a000 d793a360 0001e800 fc69c8d0 00000000 00000000 00000000 00000019 
       c01247e4 00000000 0000007f d793a360 d670a720 00000000 00000000 0000000b 
Call Trace:
 [<c01247e4>] hrtimer_forward+0x70/0xce
 [<c01321a2>] handle_IRQ_event+0x1a/0x3f
 [<c0133240>] handle_level_irq+0x4f/0x87
 [<c010532e>] do_IRQ+0x50/0x68
 [<c0103db7>] common_interrupt+0x23/0x28
 [<c01188c3>] __do_softirq+0x2c/0x75
 [<c011892e>] do_softirq+0x22/0x26
 [<c0105333>] do_IRQ+0x55/0x68
 [<c0103db7>] common_interrupt+0x23/0x28
 [<c01159b3>] vprintk+0x23c/0x255
 [<c0103db7>] common_interrupt+0x23/0x28
 [<c01159b3>] vprintk+0x23c/0x255
 [<c01188cc>] __do_softirq+0x35/0x75
 [<c0105333>] do_IRQ+0x55/0x68
 [<c01159e7>] printk+0x1b/0x1f
 [<c012f27e>] swsusp_show_speed+0xe5/0xed
 [<c012f4ca>] swsusp_shrink_memory+0x10c/0x116
 [<c012f999>] hibernation_snapshot+0xd/0xbf
 [<c012f08a>] freeze_processes+0x71/0x76
 [<c012fbc7>] hibernate+0xc6/0x173
 [<c012ea7b>] state_store+0x42/0xa2
 [<c012ea39>] state_store+0x0/0xa2
 [<c0179174>] subsys_attr_store+0x24/0x28
 [<c017944f>] sysfs_write_file+0xa8/0xdb
 [<c01793a7>] sysfs_write_file+0x0/0xdb
 [<c014c4c2>] vfs_write+0x8a/0x10c
 [<c014c961>] sys_write+0x41/0x67
 [<c0103ba6>] sysenter_past_esp+0x5f/0x85
 [<c0310000>] svc_tcp_recvfrom+0x313/0x7ce
 =======================
Code: 00 00 00 89 e9 8b bb 8c 00 00 00 c1 e9 02 83 c0 12 83 c7 12 89 bb 8c 00 00 00 89 83 80 00 00 00 8b 54 24 4c 8b 84 f2 18 01 00 00 <8b> b0 8c 00 00 00 f3 a5 89 e9 83 e1 03 74 02 f3 a4 83 7b 54 00 
EIP: [<d880f98b>] tulip_interrupt+0x2c2/0xc13 [tulip] SS:ESP 0068:d69e5cc0
Kernel panic - not syncing: Fatal exception in interrupt

30 suspend/restart cycles after removing the tulip module worked fine.
(This isn't infallible, as an oops doesn't occur every time, but it
would usually occur after 30 tries.)

About 10 suspend/restart cycles after reinserting the tulip module
worked fine, except... it stopped working because /etc/sudoers became
corrupted in RAM; a clean reboot resulted in the original contents
reappearing, so go figure. These tests were still running from the same
in-memory image I had just tried 30 times without the tulip module;
there was no clean restart in between. However, since the oops itself
didn't happen (although something might have if I had been able to continue),
module load order could have something to do with the bug appearing.

After a clean restart, I got this oops report on the serial console
while X was still in graphics mode (this was on the 10th suspend/resume
cycle):

Vortex: shutdown...<6>done.
ACPI: PCI interrupt for device 0000:00:12.0 disabled
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008c
printing eip: d885198b *pde = 00000000 
Oops: 0000 [#1] 
Modules linked in: button lp ac battery sha1_generic arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core usbhid fan thermal processor snd_page_alloc ac97_bus tulip uhci_hcd

Pid: 0, comm: swapper Not tainted (2.6.24-rc5-gda8cadb3-dirty #1)
EIP: 0060:[<d885198b>] EFLAGS: 00010286 CPU: 0
EIP is at tulip_interrupt+0x2c2/0xc13 [tulip]
EAX: 00000000 EBX: d7acba20 ECX: 00000017 EDX: d7ba0360
ESI: 0000001f EDI: d6c27e12 EBP: 0000005c ESP: c03f5eec
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c03f4000 task=c03c92e0 task.ti=c03f4000)
Stack: ffffffff c031f720 d5b64030 c01120a5 d5b64030 c01115fe c042ac40 0001e828 
       d7ba0000 d7ba0360 0001e800 fc69c0d0 00000000 00000000 00000000 00000019 
       c011b1ab 00000000 0000007f d7ba0360 cf5895e0 00000000 00000000 0000000b 
Call Trace:
 [<c01120a5>] enqueue_task_fair+0x16/0x24
 [<c01115fe>] enqueue_task+0xa/0x14
 [<c011b1ab>] run_timer_softirq+0xe0/0x12f
 [<c01321a2>] handle_IRQ_event+0x1a/0x3f
 [<c0133240>] handle_level_irq+0x4f/0x87
 [<c010532e>] do_IRQ+0x50/0x68
 [<c0103db7>] common_interrupt+0x23/0x28
 [<d88667bc>] acpi_processor_idle+0x228/0x39c [processor]
 [<c010252d>] cpu_idle+0x3e/0x53
 [<c03f6994>] start_kernel+0x232/0x237
 [<c03f6303>] unknown_bootoption+0x0/0x195
 =======================
Code: 00 00 00 89 e9 8b bb 8c 00 00 00 c1 e9 02 83 c0 12 83 c7 12 89 bb 8c 00 00 00 89 83 80 00 00 00 8b 54 24 4c 8b 84 f2 18 01 00 00 <8b> b0 8c 00 00 00 f3 a5 89 e9 83 e1 03 74 02 f3 a4 83 7b 54 00 
EIP: [<d885198b>] tulip_interrupt+0x2c2/0xc13 [tulip] SS:ESP 0068:c03f5eec
Kernel panic - not syncing: Fatal exception in interrupt
Comment 27 Grant Grundler 2007-12-17 22:49:10 UTC
(In reply to comment #24)
> Sorry about the hang/panic terminology, but I assume it's both as (1) the
> system completely stops responding and (2) the keyboard LEDs flash the same
> as
> when I actually get to see the oops/panic.

It certainly can be both depending on how the kernel is configured and what goes wrong. Thanks for clarifying.

...
> I assumed the issue with X was merely a race condition concerning whether it
> got to the console before oopsing, since with previous kernel versions it
> also
> hung and blinked LEDs at me before switching to a console. However, it could
> be
> a bad assumption, since I could consequently not see the bug report.

I don't think it's a bad assumption - just would like to see evidence (and you collected very good evidence in the next post - awesome! :)

> My speculation in the original bug report, which no one seems to have
> commented
> on, was that the shared interrupt between the network card and the USB
> controller could have something to do with it; the mouse is hooked up through
> USB, although the keyboard is not.

Sorry - I re-read that now and agree - it could be causing spurious interrupts. But the tulip driver should be able to protect itself against that. In other words, more evidence this is a tulip bug but the race condition might be different than I thought.

> However, I am not doing accessing the X
> server across the network in the vast majority of cases where this crash
> happens.

I suggested this only because the X11 server was in use and it's possible X is generating additional traffic - different kind of race than with the shared IRQ.

> I will see if I can get a null-modem cable working to see that panic
> information.
> 
> Also, on a related note, there is another hang problem with this machine
> where
> the LEDs do not blink, but I want to rule out a failing hard drive (since
> things that don't need to be loaded from disk seem to respond) before I file
> a
> bug report against the driver.

If you can reproduce it without the tulip driver involved, that would probably help.

thanks again!
Comment 28 Grant Grundler 2008-03-23 23:01:32 UTC
(In reply to comment #27)
> (In reply to comment #24)
...
> > My speculation in the original bug report, which no one seems to have
> commented
> > on, was that the shared interrupt between the network card and the USB
> > controller could have something to do with it; the mouse is hooked up
> through
> > USB, although the keyboard is not.
> 
> Sorry - I re-read that now and agree - it could be causing spurious
> interrupts.
> But the tulip driver should be able to protect itself against that. In other
> words, more evidence this is a tulip bug but the race condition might be
> different than I thought.

Yes, It's not the same race that I thought it was. *sigh*
I think I found the problem - will propose a solution but don't know if it will work (or not).

tulip_suspend() calls tulip_down() if the interface is configured and "running". tulip_down() has this bit of code:
static void tulip_down (struct net_device *dev)
{
        struct tulip_private *tp = netdev_priv(dev);
        void __iomem *ioaddr = tp->base_addr;
        unsigned long flags;

        flush_scheduled_work();

#ifdef CONFIG_TULIP_NAPI
        napi_disable(&tp->napi);
#endif

        del_timer_sync (&tp->timer);
#ifdef CONFIG_TULIP_NAPI
        del_timer_sync (&tp->oom_timer);
#endif
        spin_lock_irqsave (&tp->lock, flags);

        /* Disable interrupts by clearing the interrupt mask. */
        iowrite32 (0x00000000, ioaddr + CSR7);

        /* Stop the Tx and Rx processes. */
        tulip_stop_rxtx(tp);

        /* prepare receive buffers */
        tulip_refill_rx(dev);

        /* release any unconsumed transmit buffers */
        tulip_clean_tx_ring(tp);
...


The intent of the iowrite32 to CSR7 is to prevent any additional interrupts. tulip_stop_rxtx() will flush that write out so we are sure that no new interrupts will be generated. Later code flushes in flight interrupts so we can be sure this tulip won't be generating interrupts.

But that completely ignores the case of a shared IRQ and polling of the tulip device. There are two ways to prevent a shared IRQ from being generated: free_irq() to unregister the Interrupt handler OR disable_irq() to block that IRQ from generating interrupts. The latter isn't what we want since USB is still active.

But I don't know offhand what hazards we will hit by moving the free_irq() call earlier in tulip_suspend(). Can someone look at this and/or test it as well?

It's a trivial change and I'll just inline it here:
--- a/drivers/net/tulip/tulip_core.c
+++ b/drivers/net/tulip/tulip_core.c
@@ -1733,11 +1738,12 @@ static int tulip_suspend (struct pci_dev *pdev, pm_message_t state)
        if (!dev)
                return -EINVAL;
 
+       free_irq(dev->irq, dev);
+
        if (netif_running(dev))
                tulip_down(dev);
 
        netif_device_detach(dev);
-       free_irq(dev->irq, dev);
 
        pci_save_state(pdev);
        pci_disable_device(pdev);


But like I said, this may break netif assumptions and possibly assumptions in tulip_down(). Perhaps just swapping netif_device_detach() and free_irq() would be sufficient too...

hth,
grant
Comment 29 kernelbugs 2008-03-29 11:00:01 UTC
Your suggested patch does not appear to help. (I didn't try the second alternative you mentioned.) I have three panic traces right now -- one with 2.6.24.4 unpatched, the other two with your patch. What interested me about the two patched ones is that they appear to have occurred at slightly different places in tulip_interrupt.

Let me know if you want any updated objdump output, etc., or if you would prefer me to test with an older kernel.
Comment 30 kernelbugs 2008-03-29 11:01:07 UTC
Created attachment 15500 [details]
Panic report from unpatched 2.6.24.4
Comment 31 kernelbugs 2008-03-29 11:01:47 UTC
Created attachment 15501 [details]
First panic report from patched 2.6.24.4
Comment 32 kernelbugs 2008-03-29 11:02:05 UTC
Created attachment 15503 [details]
Second panic report from patched 2.6.24.4
Comment 33 Grant Grundler 2008-03-30 22:56:54 UTC
Kernelbugs,
thanks for testing that out...I suspect I'm still confused about then the panic is occurring. Is it happening on the "suspend" event or the "resume" event?

I'm realizing now this might be happening on the "resume" since i'm seeing "starting disk" in messages a few lines (10 or so) above the panic.

My previous suggestion was for the "suspend" (tulip_down()) code path. I think I found a real problem in "down" code path and will have to think about how to fix that. BUt if the immediate problem is with resume, I need to look more closely at tulip_init_one(). Particularly where request_irq() is called since that will effectively enable when tulip_interrupt() will start getting called.

thanks,
grant
Comment 34 Grant Grundler 2008-03-30 22:59:03 UTC
sorry - in previous comment s/tulip_init_one/tulip_resume/
Comment 35 Grant Grundler 2008-03-30 23:34:39 UTC
Still _assuming_ this is about resume code path...

tulip_init_one() called before tulip_open() can be called.
That means the rough sequence normally on startup is:
    register_netdev()
    request_irq()
    tulip_up()
    netif_start_queue()

tulip_resume() doesn't call register_netdev but rather netif_device_attach().

It's late here and I need sleep...but the order of the calls is my first suspicion in tulip_resume(). I think netif_attach_device() will call netif_wake_queue(). This is the equivalent to calling netif_start_queue().

Can you try the patch below out?
Or experiment with the order between request_irq() and tulip_up() in tulip_resume() ?

grundler <503>git diff drivers/net/tulip/tulip_core.c
diff --git a/drivers/net/tulip/tulip_core.c b/drivers/net/tulip/tulip_core.c
index ed600bf..8770721 100644
--- a/drivers/net/tulip/tulip_core.c
+++ b/drivers/net/tulip/tulip_core.c
@@ -1768,11 +1768,11 @@ static int tulip_resume(struct pci_dev *pdev)
                return retval;
        }
 
-       netif_device_attach(dev);
-
        if (netif_running(dev))
                tulip_up(dev);
 
+       netif_device_attach(dev);
+
        return 0;
 }
Comment 36 kernelbugs 2008-03-31 05:50:43 UTC
I don't know which code path it's in, but this is occurring when I _suspend_. Resuming (once the image has been saved successfully) has never, to my recollection, had a problem. I suspect (from various reading) that the suspend process involves freezing devices, creating a system image, and then thawing the devices in order to be able to write the image out. Perhaps this is the source of the confusion?

I'll plan to try your suggestion later. Thanks for the time and effort you're putting into solving this.
Comment 37 kernelbugs 2008-04-02 06:03:36 UTC
OK, I tried the patch suggested in comment 35 (well, I applied it by hand) alone, in combination with the one in comment 28, and in combination with comment 28 and putting the if/tulip_up before request_irq in tulip_resume. All three combinations still crashed (I can provide the dumps on request, although I'm not sure it's worth it.)

However, I did notice this oddity, which may explain a lot. Before the first hibernate after a fresh boot with the standard (unpatched) tulip module, /proc/interrupts contains this line:
 11:       2140    XT-PIC-XT        uhci_hcd:usb1, eth0

After resuming from the first hibernate, it has:
 11:       2696    XT-PIC-XT        uhci_hcd:usb1, eth0, eth0

I haven't noticed whether it always takes at least one cycle before it crashed (actually, I thought it did sometimes crash on the first try), but with this last round of testing I noticed that it didn't crash until the second try for at least a couple of the patches. Hopefully this will help to pin down at least one of the bugs.
Comment 38 Grant Grundler 2008-04-02 16:17:40 UTC
(In reply to comment #37)
> OK, I tried the patch suggested in comment 35 (well, I applied it by hand)
> alone, in combination with the one in comment 28, and in combination with
> comment 28 and putting the if/tulip_up before request_irq in tulip_resume.
> All
> three combinations still crashed (I can provide the dumps on request,
> although
> I'm not sure it's worth it.)

Thanks!
I'm sure there are issues with shared IRQ that the code does NOT take into consideration today.

> However, I did notice this oddity, which may explain a lot. Before the first
> hibernate after a fresh boot with the standard (unpatched) tulip module,
> /proc/interrupts contains this line:
>  11:       2140    XT-PIC-XT        uhci_hcd:usb1, eth0
> 
> After resuming from the first hibernate, it has:
>  11:       2696    XT-PIC-XT        uhci_hcd:usb1, eth0, eth0

This is definitely a bug. It means either the IRQ didn't get properly unregister un suspend (not likely), got registered twice (possible I guess), or the IRQ mgt code has a bug where it's messing up when the same handler calls request_irq() again. I tend to like the latter theory. (ie "My driver is fine!" ;)

> I haven't noticed whether it always takes at least one cycle before it
> crashed
> (actually, I thought it did sometimes crash on the first try), but with this
> last round of testing I noticed that it didn't crash until the second try for
> at least a couple of the patches. Hopefully this will help to pin down at
> least
> one of the bugs.

definitely will help. Thanks again!
Give me something new to look at. :)
Comment 39 kernelbugs 2008-04-19 06:04:27 UTC
OK, here is something new to look at. (By the way, I'm not sure if by that comment, you meant that you wanted the crash dumps from the various combinations of patches I tried; if so, let me know.)

The machine in question actually has two network cards supported by the tulip driver, although the second one isn't actually in use. What's interesting about this, however, is that the second one also shows an extra interrupt after a hibernate cycle. And it doesn't appear to be on a shared interrupt.

Before hibernate, with eth1 down:
           CPU0
  0:      31705    XT-PIC-XT        timer
  1:        285    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  3:          9    XT-PIC-XT        serial
  4:          3    XT-PIC-XT
  5:          1    XT-PIC-XT        acpi
  6:          5    XT-PIC-XT        floppy
  7:          0    XT-PIC-XT        parport0
  8:          1    XT-PIC-XT        rtc
 10:       2391    XT-PIC-XT        au8830
 11:       5354    XT-PIC-XT        uhci_hcd:usb1, eth0
 14:       6937    XT-PIC-XT        libata
 15:       1053    XT-PIC-XT        libata

After hibernate, with eth1 still down:
           CPU0
  0:      41132    XT-PIC-XT        timer
  1:        516    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  3:         11    XT-PIC-XT        serial
  4:          3    XT-PIC-XT
  5:          1    XT-PIC-XT        acpi
  6:          5    XT-PIC-XT        floppy
  7:          0    XT-PIC-XT        parport0
  8:          3    XT-PIC-XT        rtc
 10:       2591    XT-PIC-XT        au8830
 11:       6264    XT-PIC-XT        uhci_hcd:usb1, eth0, eth0
 12:     100000    XT-PIC-XT        eth1
 14:       7313    XT-PIC-XT        libata
 15:       1431    XT-PIC-XT        libata

If I try to bring eth1 up before hibernating and then hibernate, the network doesn't seem to come back on resume -- the hibernate script doesn't even finish, although the system otherwise comes back to a usable desktop. This could be because eth1 isn't in hibernate's configuration as an interface to bring down on suspend and back up on resume, but my initial reason for adding eth0 to that list was so that DHCP would know when the system was on or off and DHCP hostnames would work properly.

If I bring eth1 up after a hibernate, it shows the doubled interrupt handler too:
 12:     100000    XT-PIC-XT        eth1, eth1
Comment 40 kernelbugs 2008-04-19 06:14:16 UTC
(In reply to comment #39)

By the way, this is still on 2.6.24.4; I haven't tried 2.6.25 yet.
Comment 41 Grant Grundler 2008-04-19 13:49:45 UTC
(In reply to comment #40)
> By the way, this is still on 2.6.24.4; I haven't tried 2.6.25 yet.

That's ok - I'm just as comfortable debugging this on 2.6.24.
Move to 2.6.25 whenever it's convenient for you. We can continue poking at this.

thanks,
grant
Comment 42 Grant Grundler 2008-04-19 14:29:58 UTC
(In reply to comment #39)
> OK, here is something new to look at. (By the way, I'm not sure if by that
> comment, you meant that you wanted the crash dumps from the various
> combinations of patches I tried; if so, let me know.)

Nope. I shouldn't need console dumps unless there is something different/new than from the previous ones.

Note that "crash dump" is also called a "core dump" - it's the memory contents and CPU state saved off when the machine oop's (aka panic).

> 
> The machine in question actually has two network cards supported by the tulip
> driver, although the second one isn't actually in use. What's interesting
> about
> this, however, is that the second one also shows an extra interrupt after a
> hibernate cycle. And it doesn't appear to be on a shared interrupt.

After reading the whole thing, you meant "interrupt handler" or "Interrupt Service Routine (ISR).

It's really not a different handler - it's just registering the same handler twice. The ISR should have been unregistered in "tulip_suspend" when "free_irq()" is called. But I now suspect we register the ISR once in tulip_resume() and again when we call tulip_open(). tulip_open() is called
when we "ifconfig" the interface and I assume that is done by udev or some other user space script on resume since the driver isn't saving the TCP state (it's TCP address/etc).
 
> Before hibernate, with eth1 down:
...
> After hibernate, with eth1 still down:
...
>  11:       6264    XT-PIC-XT        uhci_hcd:usb1, eth0, eth0
>  12:     100000    XT-PIC-XT        eth1

Yeah - this is the main clue that tulip_resume() should NOT be calling request_irq().  Good catch! :)

>  14:       7313    XT-PIC-XT        libata
>  15:       1431    XT-PIC-XT        libata
> 
> If I try to bring eth1 up before hibernating and then hibernate, the network
> doesn't seem to come back on resume -- the hibernate script doesn't even
> finish, although the system otherwise comes back to a usable desktop. This
> could be because eth1 isn't in hibernate's configuration as an interface to
> bring down on suspend and back up on resume, but my initial reason for adding
> eth0 to that list was so that DHCP would know when the system was on or off
> and
> DHCP hostnames would work properly.

Makes sense.

> If I bring eth1 up after a hibernate, it shows the doubled interrupt handler
> too:
>  12:     100000    XT-PIC-XT        eth1, eth1

Excellent! I'm pretty confident tulip_resume() is wrong.
tulip_suspend() will exit early if the interface is not configured.

Can you delete a bunch of stuff from tulip_resume() so it only has the following contents?

static int tulip_resume(struct pci_dev *pdev)
{
        struct net_device *dev = pci_get_drvdata(pdev);
        int retval;

        if (!dev)
                return -EINVAL;

        pci_set_power_state(pdev, PCI_D0);
        pci_restore_state(pdev);

        if ((retval = pci_enable_device(pdev))) {
                printk (KERN_ERR "tulip: pci_enable_device failed in resume\n");
                return retval;
        }

        return 0;
}

I'm guessing tulip_resume() should put the NIC in the same state as if tulip_init_one() had been called. Then user space will have to ifconfig the interface so tulip_open() gets called to register the interrupt handler.

Please also add "printk" entries at the top of each of the following routines to help us debug if the suggestion above isn't working:
   tulip_init_one()
   tulip_remove_one()
   tulip_open()
   tulip_close()
   tulip_suspend()
   tulip_resume()

and any other tulip functions you think we are calling.

If I am correct, it means the tulip_suspend() and tulip_resume() are not symetrical (code assumed they were).  But I'll remind that I don't know suspend/resume code path or what's expected from the NIC drivers for that to work.

thanks,
grant
Comment 43 kernelbugs 2008-04-19 15:43:20 UTC
(In reply to comment #42)
> Can you delete a bunch of stuff from tulip_resume() so it only has the
> following contents?

Snipping actual contents, but something in what was removed is apparently necessary for the interface to come back up; dhclient was failing to bring it back with that change, saying:

Apr 19 18:09:08 thursday dhclient: receive_packet failed on eth0: Network is down
Apr 19 18:09:12 thursday dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Apr 19 18:09:12 thursday dhclient: send_packet: Network is down
Apr 19 18:09:17 thursday dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 12
Apr 19 18:09:17 thursday dhclient: send_packet: Network is down

I didn't try another time to see if this also eliminated the crash, although I guess I should; if it did still crash with this change, that would also be revealing.

> Please also add "printk" entries at the top of each of the following routines
> to help us debug if the suggestion above isn't working:
>    tulip_init_one()
>    tulip_remove_one()
>    tulip_open()
>    tulip_close()
>    tulip_suspend()
>    tulip_resume()
> 
> and any other tulip functions you think we are calling.

I will attach the output from doing this. I prefixed all the printk lines I added with "TULIP DEBUG:", so you have something to scan for.

> If I am correct, it means the tulip_suspend() and tulip_resume() are not
> symetrical (code assumed they were).  But I'll remind that I don't know
> suspend/resume code path or what's expected from the NIC drivers for that to
> work.

Hmm... is there no documentation? :)
Comment 44 kernelbugs 2008-04-19 15:46:06 UTC
Created attachment 15822 [details]
Kernel message log, 2.6.24.4, with printks added

The printk lines specified in comment #42 are prefixed with "TULIP DEBUG:". This log is from a boot, hibernate, resume, and hibernate again cycle; the crash occurs on the second hibernate, so you can see what gets called before it crashes (assuming the functions all make it to their printks before crashing, which is not necessarily a safe assumption).
Comment 45 kernelbugs 2008-04-19 16:31:12 UTC
(In reply to comment #43)
> I didn't try another time to see if this also eliminated the crash, although
> I
> guess I should; if it did still crash with this change, that would also be
> revealing.

I've now tried a few more times with the shortened version of tulip_resume, and it hasn't crashed. (The standard 2.6.24.4 version crashes almost every time after the first hibernate/resume cycle.) But, as mentioned, the interface also doesn't come back up, so something isn't completely right with that change, and the printks only report tulip_close being called before the first hibernate, with no tulip_open or tulip_close thereafter, even after multiple cycles, only tulip_suspend and tulip_resume. The interface still shows up in ifconfig -a, but doesn't seem to be configurable after a hibernate.

As a possibly useful explanatory note: I believe the tulip_open and tulip_close calls seen in the printk log are caused by the user-space dhclient bringing the interface up and down, not by an in-kernel part of the suspend process.
Comment 46 kernelbugs 2008-04-19 17:50:33 UTC
(In reply to comment #42)
> If I am correct, it means the tulip_suspend() and tulip_resume() are not
> symetrical (code assumed they were).  But I'll remind that I don't know
> suspend/resume code path or what's expected from the NIC drivers for that to
> work.

Some guessing based on a cursory inspection of what a few other drivers (eepro100.c, r8169.c, typhoon.c, via-velocity.c) do suggested the patch below, which hasn't crashed yet (after about 10 hibernate/resume cycles) and brings the interface back on resume. I have no idea whether it is actually correct; probably you know more about the semantics of netif_running. (typhoon.c in particular contains a couple comments of the form "If we're down, resume when upped" and "If we're down, we're already suspended.")

--- tulip_core.c.standard       2008-03-31 18:57:39.000000000 -0400
+++ tulip_core.c        2008-04-19 20:26:37.000000000 -0400
@@ -1733,12 +1733,15 @@
        if (!dev)
                return -EINVAL;

-       if (netif_running(dev))
-               tulip_down(dev);
+    if (!netif_running(dev))
+        goto finish;
+
+    tulip_down(dev);

        netif_device_detach(dev);
        free_irq(dev->irq, dev);

+finish:
        pci_save_state(pdev);
        pci_disable_device(pdev);
        pci_set_power_state(pdev, pci_choose_state(pdev, state));
@@ -1763,6 +1766,9 @@
                return retval;
        }

+    if (!netif_running(dev))
+        return 0;
+
        if ((retval = request_irq(dev->irq, &tulip_interrupt, IRQF_SHARED, dev->name, dev))) {
                printk (KERN_ERR "tulip: request_irq failed in resume\n");
                return retval;
Comment 47 Anonymous Emailer 2008-04-21 02:28:45 UTC
Reply-To: pavel@ucw.cz

On Sat 2008-04-19 17:50:34, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=8952
> 
> 
> 
> 
> 
> ------- Comment #46 from kernelbugs@tap.homeip.net  2008-04-19 17:50 -------
> (In reply to comment #42)
> > If I am correct, it means the tulip_suspend() and tulip_resume() are not
> > symetrical (code assumed they were).  But I'll remind that I don't know
> > suspend/resume code path or what's expected from the NIC drivers for that
> to
> > work.
> 
> Some guessing based on a cursory inspection of what a few other drivers
> (eepro100.c, r8169.c, typhoon.c, via-velocity.c) do suggested the patch
> below,
> which hasn't crashed yet (after about 10 hibernate/resume cycles) and brings
> the interface back on resume. I have no idea whether it is actually correct;
> probably you know more about the semantics of netif_running. (typhoon.c in
> particular contains a couple comments of the form "If we're down, resume when
> upped" and "If we're down, we're already suspended.")
> 
> --- tulip_core.c.standard       2008-03-31 18:57:39.000000000 -0400
> +++ tulip_core.c        2008-04-19 20:26:37.000000000 -0400
> @@ -1733,12 +1733,15 @@
>         if (!dev)
>                 return -EINVAL;
> 
> -       if (netif_running(dev))
> -               tulip_down(dev);
> +    if (!netif_running(dev))
> +        goto finish;
> +
> +    tulip_down(dev);
> 
>         netif_device_detach(dev);
>         free_irq(dev->irq, dev);
> 
> +finish:
>         pci_save_state(pdev);
>         pci_disable_device(pdev);
>         pci_set_power_state(pdev, pci_choose_state(pdev, state));
> @@ -1763,6 +1766,9 @@
>                 return retval;
>         }
> 
> +    if (!netif_running(dev))
> +        return 0;
> +
>         if ((retval = request_irq(dev->irq, &tulip_interrupt, IRQF_SHARED,
> dev->name, dev))) {
>                 printk (KERN_ERR "tulip: request_irq failed in resume\n");
>                 return retval;
> 

Seems sane to me, actually. Replace spaces with tabs, and it looks
like mainline candidate...
Comment 48 Grant Grundler 2008-04-27 22:54:13 UTC
Created attachment 15941 [details]
PATCH suspend/resume rewrite...after checking other drivers.

kernelbugs, can you please test this patch?
If this works, I'd be happy to send this upstream.

Pavel,
The whole suspend/resume stuff for NICs is a mess. It would be good if you could look at several of the drivers and then suggest some guidelines in Documentation someplace. You might start with the following:

o NIC suspend must always call pci_save_state() and set the power to the requested state. It will also call pci_device_disable().
o NIC resume must always call pci_restore_state(). It will only call pci_enable_device() if the device was before suspend in netif_running() state. It will NOT call pci_set_power_state(D0) since pci_enable_device() is already doing that.
o In general, do not call free_irq/request_irq (most drivers don't..I'm assume low power state will stop interrupts).

etc. I don't know all the rules. After looking at several drivers (and submitting one small fix to typhoon driver), I'm convinced some guidelines will do wonders to help clean this up.

thanks,
grant
Comment 49 Rafael J. Wysocki 2008-04-28 14:22:08 UTC
Well, some time ago I added the suspend/resume support to uli526x.c .  It is a good representation of how I think that should be done, although I'm not really an experienced driver writer.

If you look at the code in there, you'll notice that it doesn't call pci_disable_device()/pci_enable_device() at all, since that just plain didn't work with that chip.

As far as the guidance is concerned, we are in the process of changing the framework for suspending/resuming devices and there surely will be some guidance for the new framework (the patches that introduce the new callbacks are targeted at 2.6.27).
Comment 50 kernelbugs 2008-04-29 05:13:44 UTC
(In reply to comment #48)
> Created an attachment (id=15941) [details]
> PATCH suspend/resume rewrite...after checking other drivers.
> 
> kernelbugs, can you please test this patch?
> If this works, I'd be happy to send this upstream.
> 

That patch eliminates the panic, but fails to bring the interface back up in a working state. dhclient3 just sends DHCPDISCOVERs repeatedly (after any suspend attempt), and at the beginning of a second suspend process, the kernel message log contains:

0000:00:13.0: tulip_stop_rxtx() failed (CSR5 0xffffffff CSR6 0xffffffff)

and

ADDRCONF(NETDEV_UP): eth0: link is not ready

at the very end (after resuming a second time, presumably when the hibernate script tries to bring the interface back up).

The messages above appear in the log for a second suspend process, after the first one has already failed to bring the interface back. (Yes, I am repeating myself to make sure the point isn't missed.)

Removing the module and reinserting allows the interface to be brought back up in a working state.
Comment 51 Grant Grundler 2008-05-31 21:15:46 UTC
jgarzik applied to his tree. Expect this to be fixed in 2.6.26-rc5 or so.

Syntax error in the submission to netdev was my fault:
    http://marc.info/?l=linux-netdev&m=121057430021417&w=2

Thanks to jgarzik for fixing the syntax error and applying it.

I'll track the other issues I found during code reviews (interrupts
being delivered despite disabling the chip) in my TODO list.
Comment 52 Adrian Bunk 2008-06-05 01:30:44 UTC
fixed by commit 56997fa838e333cea33ab641d4aeedd23aef0eb1