Most recent kernel where this bug did not occur: I think this may have appeared when I switched from 2.6.17.11 to 2.6.18.8. I was originally using suspend2 but switched to mainline swsusp to determine whether the problem occurred with it. Distribution: Debian sarge, x86 Hardware Environment: /proc/cpuinfo: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 8 model name : Pentium III (Coppermine) stepping : 1 cpu MHz : 599.825 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1200.83 clflush size : 32 /proc/ioports: 0000-001f : dma1 0020-0021 : pic1 0040-0043 : timer0 0050-0053 : timer1 0060-006f : keyboard 0070-0077 : rtc 0080-008f : dma page reg 00a0-00a1 : pic2 00c0-00df : dma2 00f0-00ff : fpu 0170-0177 : 0000:00:07.1 0170-0177 : ide1 01f0-01f7 : 0000:00:07.1 01f0-01f7 : ide0 02f8-02ff : serial 0376-0376 : 0000:00:07.1 0376-0376 : ide1 0378-037a : parport0 03c0-03df : vesafb 03f2-03f5 : floppy 03f6-03f6 : 0000:00:07.1 03f6-03f6 : ide0 03f7-03f7 : floppy DIR 03f8-03ff : serial 0cf8-0cff : PCI conf1 4000-4003 : ACPI PM1a_EVT_BLK 4004-4005 : ACPI PM1a_CNT_BLK 4008-400b : ACPI PM_TMR 4010-4015 : ACPI CPU throttle 4020-4023 : ACPI GPE0_BLK 5000-5007 : vt596_smbus d000-d00f : 0000:00:07.1 d000-d007 : ide0 d008-d00f : ide1 d400-d41f : 0000:00:07.2 d400-d41f : uhci_hcd dc00-dcff : 0000:00:0f.0 e000-e007 : 0000:00:12.0 e000-e007 : au8830 e400-e407 : 0000:00:12.0 e400-e407 : au8830 e800-e8ff : 0000:00:13.0 e800-e8ff : tulip ec00-ecff : 0000:00:14.0 ec00-ecff : tulip /proc/iomem: 00000000-0009ffff : System RAM 000a0000-000bffff : Video RAM area 000c0000-000c7fff : Video ROM 000c8000-000c9fff : pnp 00:00 000f0000-000fffff : System ROM 00100000-17feffff : System RAM 00100000-00309b3a : Kernel code 00309b3b-003d4f6b : Kernel data 17ff0000-17ff2fff : ACPI Non-volatile Storage 17ff3000-17ffffff : ACPI Tables 20000000-2001ffff : 0000:00:0f.0 d8000000-dbffffff : 0000:00:00.0 dc000000-dcffffff : 0000:00:0f.0 dc000000-dc3fffff : vesafb de000000-de01ffff : 0000:00:13.0 df000000-df01ffff : 0000:00:14.0 e0000000-e003ffff : 0000:00:12.0 e0000000-e003ffff : au8830 e0040000-e0040fff : 0000:00:0f.0 e0041000-e00413ff : 0000:00:14.0 e0041000-e00413ff : tulip e0042000-e00423ff : 0000:00:13.0 e0042000-e00423ff : tulip ffff0000-ffffffff : reserved /proc/interrupts: CPU0 0: 386801 XT-PIC-XT timer 1: 42 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 5: 1 XT-PIC-XT acpi 6: 5 XT-PIC-XT floppy 7: 0 XT-PIC-XT parport0 8: 1 XT-PIC-XT rtc 10: 30561 XT-PIC-XT au8830 11: 27583 XT-PIC-XT uhci_hcd:usb1, eth0 14: 138592 XT-PIC-XT ide0 15: 11574 XT-PIC-XT ide1 NMI: 0 ERR: 0 lspci -vvv: 00:00.0 Host bridge: VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x] (rev 82) Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR+ Latency: 0 Region 0: Memory at d8000000 (32-bit, prefetchable) [size=64M] Capabilities: [a0] AGP version 2.0 Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW- AGP3- Rate=x1,x2,x4 Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=<none> 00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP] (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- Latency: 0 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 I/O behind bridge: 0000f000-00000fff Memory behind bridge: fff00000-000fffff Prefetchable memory behind bridge: fff00000-000fffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B- Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:07.0 ISA bridge: VIA Technologies, Inc. VT82C596 ISA [Mobile South] (rev 12) Subsystem: VIA Technologies, Inc. VT82C596/A/B PCI to ISA Bridge Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 0 00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 Region 0: [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8] Region 1: [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1] Region 2: [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8] Region 3: [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1] Region 4: I/O ports at d000 [size=16] 00:07.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 08) (prog-if 00 [UHCI]) Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32, Cache Line Size: 32 bytes Interrupt: pin D routed to IRQ 11 Region 4: I/O ports at d400 [size=32] 00:07.3 Host bridge: VIA Technologies, Inc. VT82C596 Power Management (rev 20) Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- 00:0f.0 VGA compatible controller: ATI Technologies Inc 3D Rage II+ 215GTB [Mach64 GTB] (rev 9a) (prog-if 00 [VGA]) Subsystem: ATI Technologies Inc 3D Rage II+ 215GTB [Mach64 GTB] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (2000ns min), Cache Line Size: 32 bytes Region 0: Memory at dc000000 (32-bit, non-prefetchable) [size=16M] Region 1: I/O ports at dc00 [size=256] Region 2: Memory at e0040000 (32-bit, non-prefetchable) [size=4K] [virtual] Expansion ROM at 20000000 [disabled] [size=128K] 00:12.0 Multimedia audio controller: Aureal Semiconductor Vortex 2 (rev fe) Subsystem: Aureal Semiconductor AU8830 Vortex 3D Digital Audio Processor Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (1000ns min, 3000ns max), Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 10 Region 0: Memory at e0000000 (32-bit, non-prefetchable) [size=256K] Region 1: I/O ports at e000 [size=8] Region 2: I/O ports at e400 [size=8] Capabilities: [dc] Power Management version 1 Flags: PMEClk- DSI+ D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:13.0 Ethernet controller: ADMtek NC100 Network Everywhere Fast Ethernet 10/100 (rev 11) Subsystem: Accton Technology Corporation Unknown device 1216 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (63750ns min, 63750ns max), Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 11 Region 0: I/O ports at e800 [size=256] Region 1: Memory at e0042000 (32-bit, non-prefetchable) [size=1K] Expansion ROM at de000000 [disabled] [size=128K] Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:14.0 Ethernet controller: Microsoft Corporation Unknown device 0002 (rev 11) Subsystem: Microsoft Corporation Unknown device 0002 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (16000ns min, 32000ns max), Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 12 Region 0: I/O ports at ec00 [size=256] Region 1: Memory at e0041000 (32-bit, non-prefetchable) [size=1K] Expansion ROM at df000000 [disabled] [size=128K] Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=100mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Software Environment: Hibernate Script 1.93 Linux thursday 2.6.22.5 #1 Thu Aug 23 10:59:50 EDT 2007 i686 GNU/Linux Gnu C 4.1.2 Gnu make 3.81 binutils 2.17 util-linux 2.12r mount 2.12r module-init-tools 3.3-pre2 e2fsprogs 1.40-WIP Linux C Library 2.3.6 Dynamic linker (ldd) 2.3.6 Procps 3.2.7 Net-tools 1.60 Console-tools 0.2.3 Sh-utils 5.97 udev 105 Modules Loaded lp thermal fan button processor ac battery sha1 arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss usbhid snd_au8830 8250_pnp snd_ac97_codec 8250 snd_pcm snd_timer snd_page_alloc ac97_bus snd_mpu401_uart snd_rawmidi snd_seq_device serial_core tulip snd uhci_hcd Problem Description: The kernel reports an Oops in the tulip driver while hibernating, but not every time. (It occurs perhaps 30% to 50% of the time.) The system hangs, caps lock and scroll lock flash, and the computer has to be powered off by holding down the power button. The original bug report (before reproducing the problem with swsusp) is on the Suspend2 bugzilla at: http://bugzilla.suspend2.net/show_bug.cgi?id=294 I am using the hibernate shell script to suspend the system. I will attach a photo of the Oops report from 2.6.22.5 to this bug, but there are several additional ones from older kernels on the original bug page. Speculation: The network card is on a shared interrupt with the USB controller, so it may be receiving unexpected interrupts generated by USB. Steps to reproduce: 1. Try suspending to disk. 2. Experience kernel Oops (but not every time). Here is an Oops report from kernel 2.6.21.1 that I sent in a private mail concerning the bug: Clocksource tsc unstable (delta = 4144113840505 ns) pnp: Device 00:08 activated. pnp: Device 00:09 activated. pnp: Device 00:0a activated. pnp: Failed to activate device 00:0b. usbdev1.2_ep00: PM: resume from 0, parent 1-1 still 1 usbhid 1-1:1.0: PM: resume from 1, parent 1-1 still 1 usbdev1.2_ep01: PM: resume from 0, parent 1-1:1.0 still 1 usbdev1.2: PM: resume from 0, parent 1-1 still 1 Restarting tasks ... <6>usb 1-1: USB disconnect, address 2 done. Stopping tasks ... done. Shrinking memory... -<1>BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008c printing eip: d883e960 *pde = 00000000 Ooops: 0000 [#1] Modules linked in: lp thermal fan processor ac battery sha1 arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_isa i2c_viapro i2c_core usbhid 8250_pnp 8250 serial_core uhci_hcd snd_page_alloc ac97_bus tulip CPU: 0 EIP: 0060:[<d883e960>] Not tainted VLI EFLGS: 00010286 (2.6.21.1 #1) EIP is at tulip_interrupt+0x29b/0xb37 [tulip] eax: 00000000 ebx: cfde7aa0 ecx: 0000003c edx: d7c68000 esi: 00000035 edi: d768ca12 ebp: d7c68340 esp: c03c9ee4 ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 Process swapper (pid: 0, ti=c03c8000 task=c03a3380 task.ti=c03c8000) Stack: ffffffff c03a6b40 c0126d99 6c3a9c20 00003d3f 80750c34 6c776d00 0001e828 d7c68000 0001e800 fc69c0d0 00000000 00000000 00000000 00000019 c011a58f 0000003c 00000000 0000007f d7b28220 00000000 00000000 0000000b c012f8a8 Call Trace: [<c0126d99>] tick_program_event+0x2a/0x49 [<c011a58f>] run_timer_softirq+0x11/0x12f [<c012f8a8>] handle_IRQ_event+0x1a/0x3f [<c01308d4>] handl_level_irq+0x5c/0x94 [<c0105530>] do_IRQ+0x54/0x70 [<c0103cab>] common_interrupt+0x23/0x28 [<c02000d8>] acpi_ds_create_node+0x45/0x55 [<d8c1fe6a>] acpi_processor_idle+0x1a7/0x335 [processor] [<c0102327>] cpu_idle+0x39/0x4e [<c03caacb>] start_kernel+0x33b/0x343 [<c03ca42b>] unknown_bootoption+0x0/0x202 Code: 00 8b 80 90 00 00 00 8b 54 24 20 83 c7 12 83 c0 12 89 bb 8c 00 00 00 89 53 14 89 83 90 00 00 00 8b 84 f5 18 01 00 00 8b 4c 24 40 <8b> b0 8c 00 00 00 c1 e9 02 f3 a5 8b 4c 24 40 83 e1 03 74 02 f3 EIP: [<d883e960>] tulip_interrupt+0x29b/0xb37 [tulip] SS:ESP 0068:c03c9ee4 Kernel panic - not syncing: Fatal exception in interrupt
Created attachment 12582 [details] Photo of 2.6.22.5 oops report
that's a tulip bug, not a power-management bug. I updated the report appropriately, thanks.
Yeah, I agree with akpm that is most likely a tulip driver bug. I've several race conditions with interrupts and shutdown in the past. I'm not surprised there might be similar races in swsusp code path. I need a few more bits of info: 1) Console output (e.g. dmesg) after booting 2.6.22.5 (please attach) 2) tulip driver CONFIG options (e.g. "fgrep TULIP /boot/config-2.6.22.5" output) thanks, grant
Created attachment 12636 [details] dmesg output after boot (kernel 2.6.22.5) A couple lines of "vortex: IRQ fifo error" appear at the end immediately after login, presumably when the mixer is initialized by KDE.
fgrep TULIP config-2.6.22.5 gives: CONFIG_NET_TULIP=y CONFIG_TULIP=m # CONFIG_TULIP_MWI is not set # CONFIG_TULIP_MMIO is not set # CONFIG_TULIP_NAPI is not set I will attach the entire config file in case other parts prove relevant. By the way, I apologize for the huge information dump. I didn't realize how unreadable it was until after I had submitted the bug.
Created attachment 12637 [details] Config file for 2.6.22.5
(In reply to comment #5) > fgrep TULIP config-2.6.22.5 gives: > CONFIG_NET_TULIP=y > CONFIG_TULIP=m > # CONFIG_TULIP_MWI is not set > # CONFIG_TULIP_MMIO is not set > # CONFIG_TULIP_NAPI is not set Ok - so using IO Port space and NAPI isn't enabled...hrm. I'll stare at the code a bit and see where there might be races. > > I will attach the entire config file in case other parts prove relevant. > > By the way, I apologize for the huge information dump. I didn't realize how > unreadable it was until after I had submitted the bug. No problem. I don't mind. Attachments are sometime better (for long output like lspci) but I'm happy to get the info in the bug report. thanks, grant
Can you tell me what the current status of this is, please?
Still broken in 2.6.22.9. I haven't tried any of the 2.6.23 rcs to see if they randomly fix or hide the bug. I have no idea what's going on on the "developing a fix" side of things. -- Tap
I'm not sure if we know what exactly the problem is. In case we don't, it's better to try the newest kernel and check if the bug is gone, because it might be fixed as a result of seemingly unrelated changes. This often is the case with suspend bugs.
Apologies for dropping off the scene for a bit. I've looked at the tulip code and didn't see anything obvious. But I'm not familiar with the calling path of the suspend and resume events. I'll take another quick look tonight and more later this week. I built a 2.6.22.9 kernel on my Debian "testing" machine and then ran "objdump -rD tulip.ko" to see what is in tulip_interrupt(). But "tulip_interrupt+0x29b" doesn't look like a reasonable offset for my build: 97c: 89 e9 mov %ebp,%ecx 97e: c1 e9 02 shr $0x2,%ecx 981: 8b b0 8c 00 00 00 mov 0x8c(%eax),%esi 987: f3 a5 rep movsl %ds:(%esi),%es:(%edi) 989: 89 e9 mov %ebp,%ecx 98b: 83 e1 03 and $0x3,%ecx 98e: 74 02 je 992 <tulip_interrupt+0x2a9> 990: f3 a4 rep movsb %ds:(%esi),%es:(%edi) Hex math (tulip_interrupt starts at 0x6e9): 6E9+29B = 984 Can you "objdump -rD" the tulip.ko (or interrupt.o if you have that) which crashed? I need to know the EIP and objdump output for the file. Please send me the objdump output as an attachment directly. My guess is 0x8c is the offset for the "skb" (based on reviewing code in tulip_refill_rx which also uses 0x8c offset). So it sounds like skb is deallocated but then we get an interrupt which wants to derefence that skb. Again, a typical race in "driver shutdown" sequence.
Created attachment 13010 [details] Photo of 2.6.22.9 oops
Created attachment 13011 [details] objdump -rD of drivers/net/tulip/interrupt.o
The 29B offset was from kernel 2.6.21.1 and may not even have been right; the photo I just attached has the dump from 2.6.22.9. (I can transcribe it if needed.) The EIP in the photo says "EIP is tulip_interrupt+0x2b4/0xc27 [tulip]". I can attach the objdump for tulip.ko too, if you want.
This crash still occurs in 2.6.23; following is most of the oops dump (I left out some of the numbers that I've picked up probably don't matter much). I can attach an updated interrupt.o dump if desired. swsusp: Basic memory bitmaps created BUG: unable to handle kernel NULL pointer dereference at virtual address 00000084 printing eip: d88429e1 *pde = 00000000 Oops: 0000 [#1] Modules linked in: lp ac battery sha1 arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core usbhid fan thermal snd_page_alloc ac97_bus 8250_pnp processor uhci_hcd 8250 tulip serial_core CPU: 0 EIP: 0060:[<d88429e1>] Not tainted VLI EFLAGS: 00010206 (2.6.23 #1) EIP is at tulip_interrupt+0x318/0xc37 [tulip] eax: 000000d8 ebx: 00000000 ecx: d6b3e380 edx: d6b3e380 esi: 00000007 edi: 00000070 ebp: 000000d8 esp: d5a5fcec ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068 Process klogd [etc.] Stack [omitted] Call Trace: __alloc_pages+0x54/0x28a handle_IRQ_event+0x1a/0x3f handle_level_irq+0x4f/0x85 do_IRQ+0x50/0x69 enqueue_entity+0x1d2/0x1f3 common_interrupt+0x23/0x28 hpet_set_mode+0x5a/0xaf kmem_cache_alloc+0x3c/0x72 __aloc_skb+0x2a/0xf5 sock_alloc_send_skb+0x6e/0x18e unix_dgram_sendmsg+0x14b/0x3b3 sock_aio_write+0xbc/0xc8 __dequeue_signal+0x10/0x11a find_lock_page+0x12/0x62 do_sync_write+0xc6/0x109 __sched_text_start+0x212/0x26f autoremove_wake_function+0x0/0x35 autoremove_wake_function+0x0/0x35 (NOT A TYPO) vfs_write+0x9e/0x10c sys_write+0x41/0x67 sysenter_past_esp+0x5f/0x85 xs_send_kvec+0x66/0x84 =====================
Thanks, kernelbugs ;) for keeping status updated. Adding Pavel to give a hand with suspend/resume.
kernelbugs, can you see if the problem is present in the current mainline kernel? Grant, what kind of information is needed to get an idea of what's wrong?
Rafael, what do you mean by current mainline? 2.6.23.9? The latest rc? Linus's git tree? I am pretty sure it's in 2.6.23.9 (it has frozen on suspend, but it doesn't always wait until a text console is visible for me to see the oops dump so I'm not 100% sure it's the same problem), but haven't tried the others.
I meant the Linus' tree.
(In reply to comment #17) ... > Grant, what kind of information is needed to get an idea of what's wrong? My guess is there is a race condition between the suspend and delivery of a pending IRQ. Ie the suspend path isn't waiting until either the interrupt has been delivered or possibly delivered but not completed handling of the interrupt. I need to know the sequence of calls into the driver so I can work out the state of the driver when it's interrupt handler _should_ have been disabled. See comment #11. Something is freeing an skb that the tulip driver thinks it owns and has "in flight". I suspect it's either the netdev layer or some other part of the tulip driver. thanks, grant
I understand. So, would it be possible to instrument the code somehow to get that information?
I tried a kernel compiled from Linus's git tree today, and got it to hang three times on a suspend (with the keyboard LEDs flashing), but unfortunately it hung before switching out of X to a text console where I could see the oops report. I might try getting it to dump to a serial port sometime, if necessary. I was unable to reproduce it by starting a suspend from a text console, and I tried 15 or 20 times, so some change may have made it harder to reproduce, or I was just unlucky. The short answer: I think the bug exists in main line, but I don't have the proof of an oops report, just a hang with the same LEDs flashing on suspend.
>(In reply to comment #22) > I tried a kernel compiled from Linus's git tree today, and got it to hang > three > times on a suspend (with the keyboard LEDs flashing), but unfortunately it > hung > before switching out of X to a text console where I could see the oops > report. Please be careful with the terminology: "hang" is different from "oops" (aka panic). > I might try getting it to dump to a serial port sometime, if necessary. I was > unable to reproduce it by starting a suspend from a text console, and I tried > 15 or 20 times, so some change may have made it harder to reproduce, or I was > just unlucky. Are you accessing the X11 server through the network card somehow? I'm wondering how X on console affects the tulip driver and don't see any connection other than PCI bus and difference in timing. It possible this is a different bug with a different symptom. If you could use serial port to dump the panic tombstone, that would probably help alot to determine if it is the same bug. > The short answer: I think the bug exists in main line, but I don't have the > proof of an oops report, just a hang with the same LEDs flashing on suspend. I'm already convinced a real bug was reported. I'm just not sure where to look for it. And in this update, I'm thinking there is a second problem with X11 suspend/resume that is seperate from the original bug report.
Sorry about the hang/panic terminology, but I assume it's both as (1) the system completely stops responding and (2) the keyboard LEDs flash the same as when I actually get to see the oops/panic. Also, I remember in the past having other oops/panics (not sure which now) that didn't result in complete non-responsiveness (they may have been related to a failing swap partition, if I remember correctly, and perhaps the only reason it's not responding in this case is because the problem occurs with interrupt handling). I assumed the issue with X was merely a race condition concerning whether it got to the console before oopsing, since with previous kernel versions it also hung and blinked LEDs at me before switching to a console. However, it could be a bad assumption, since I could consequently not see the bug report. My speculation in the original bug report, which no one seems to have commented on, was that the shared interrupt between the network card and the USB controller could have something to do with it; the mouse is hooked up through USB, although the keyboard is not. However, I am not doing accessing the X server across the network in the vast majority of cases where this crash happens. I will see if I can get a null-modem cable working to see that panic information. Also, on a related note, there is another hang problem with this machine where the LEDs do not blink, but I want to rule out a failing hard drive (since things that don't need to be loaded from disk seem to respond) before I file a bug report against the driver.
It would be helpful to test if suspend works with the tulip driver unloaded, too.
This oops report appeared on the mainline from Thursday morning on both the text console and the serial console (i.e., X switched out of graphics mode before the kernel oopsed in this case): Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done. Shrinking memory... done (0 pages freed) Freed 0 kbytes in 0.04 seconds (0.00 MB/s) BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008c printing eip: d880f98b *pde = 00000000 Oops: 0000 [#1] Modules linked in: lp ac battery sha1_generic arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core usbhid fan snd_page_alloc ac97_bus thermal uhci_hcd processor tulip Pid: 10124, comm: echo Not tainted (2.6.24-rc5-gda8cadb3-dirty #1) EIP: 0060:[<d880f98b>] EFLAGS: 00010286 CPU: 0 EIP is at tulip_interrupt+0x2c2/0xc13 [tulip] EAX: 00000000 EBX: d7615cc0 ECX: 0000000f EDX: d793a360 ESI: 00000014 EDI: d7a82012 EBP: 0000003c ESP: d69e5cc0 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 Process echo (pid: 10124, ti=d69e4000 task=ce503050 task.ti=d69e4000) Stack: ffffffff 00000000 00000001 003d09fa 00000086 c042ac70 c042ab80 0001e828 d793a000 d793a360 0001e800 fc69c8d0 00000000 00000000 00000000 00000019 c01247e4 00000000 0000007f d793a360 d670a720 00000000 00000000 0000000b Call Trace: [<c01247e4>] hrtimer_forward+0x70/0xce [<c01321a2>] handle_IRQ_event+0x1a/0x3f [<c0133240>] handle_level_irq+0x4f/0x87 [<c010532e>] do_IRQ+0x50/0x68 [<c0103db7>] common_interrupt+0x23/0x28 [<c01188c3>] __do_softirq+0x2c/0x75 [<c011892e>] do_softirq+0x22/0x26 [<c0105333>] do_IRQ+0x55/0x68 [<c0103db7>] common_interrupt+0x23/0x28 [<c01159b3>] vprintk+0x23c/0x255 [<c0103db7>] common_interrupt+0x23/0x28 [<c01159b3>] vprintk+0x23c/0x255 [<c01188cc>] __do_softirq+0x35/0x75 [<c0105333>] do_IRQ+0x55/0x68 [<c01159e7>] printk+0x1b/0x1f [<c012f27e>] swsusp_show_speed+0xe5/0xed [<c012f4ca>] swsusp_shrink_memory+0x10c/0x116 [<c012f999>] hibernation_snapshot+0xd/0xbf [<c012f08a>] freeze_processes+0x71/0x76 [<c012fbc7>] hibernate+0xc6/0x173 [<c012ea7b>] state_store+0x42/0xa2 [<c012ea39>] state_store+0x0/0xa2 [<c0179174>] subsys_attr_store+0x24/0x28 [<c017944f>] sysfs_write_file+0xa8/0xdb [<c01793a7>] sysfs_write_file+0x0/0xdb [<c014c4c2>] vfs_write+0x8a/0x10c [<c014c961>] sys_write+0x41/0x67 [<c0103ba6>] sysenter_past_esp+0x5f/0x85 [<c0310000>] svc_tcp_recvfrom+0x313/0x7ce ======================= Code: 00 00 00 89 e9 8b bb 8c 00 00 00 c1 e9 02 83 c0 12 83 c7 12 89 bb 8c 00 00 00 89 83 80 00 00 00 8b 54 24 4c 8b 84 f2 18 01 00 00 <8b> b0 8c 00 00 00 f3 a5 89 e9 83 e1 03 74 02 f3 a4 83 7b 54 00 EIP: [<d880f98b>] tulip_interrupt+0x2c2/0xc13 [tulip] SS:ESP 0068:d69e5cc0 Kernel panic - not syncing: Fatal exception in interrupt 30 suspend/restart cycles after removing the tulip module worked fine. (This isn't infallible, as an oops doesn't occur every time, but it would usually occur after 30 tries.) About 10 suspend/restart cycles after reinserting the tulip module worked fine, except... it stopped working because /etc/sudoers became corrupted in RAM; a clean reboot resulted in the original contents reappearing, so go figure. These tests were still running from the same in-memory image I had just tried 30 times without the tulip module; there was no clean restart in between. However, since the oops itself didn't happen (although something might have if I had been able to continue), module load order could have something to do with the bug appearing. After a clean restart, I got this oops report on the serial console while X was still in graphics mode (this was on the 10th suspend/resume cycle): Vortex: shutdown...<6>done. ACPI: PCI interrupt for device 0000:00:12.0 disabled BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008c printing eip: d885198b *pde = 00000000 Oops: 0000 [#1] Modules linked in: button lp ac battery sha1_generic arc4 ecb blkcipher cryptomgr ppp_mppe ppp_generic slhc eeprom w83781d hwmon_vid i2c_viapro i2c_core usbhid fan thermal processor snd_page_alloc ac97_bus tulip uhci_hcd Pid: 0, comm: swapper Not tainted (2.6.24-rc5-gda8cadb3-dirty #1) EIP: 0060:[<d885198b>] EFLAGS: 00010286 CPU: 0 EIP is at tulip_interrupt+0x2c2/0xc13 [tulip] EAX: 00000000 EBX: d7acba20 ECX: 00000017 EDX: d7ba0360 ESI: 0000001f EDI: d6c27e12 EBP: 0000005c ESP: c03f5eec DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c03f4000 task=c03c92e0 task.ti=c03f4000) Stack: ffffffff c031f720 d5b64030 c01120a5 d5b64030 c01115fe c042ac40 0001e828 d7ba0000 d7ba0360 0001e800 fc69c0d0 00000000 00000000 00000000 00000019 c011b1ab 00000000 0000007f d7ba0360 cf5895e0 00000000 00000000 0000000b Call Trace: [<c01120a5>] enqueue_task_fair+0x16/0x24 [<c01115fe>] enqueue_task+0xa/0x14 [<c011b1ab>] run_timer_softirq+0xe0/0x12f [<c01321a2>] handle_IRQ_event+0x1a/0x3f [<c0133240>] handle_level_irq+0x4f/0x87 [<c010532e>] do_IRQ+0x50/0x68 [<c0103db7>] common_interrupt+0x23/0x28 [<d88667bc>] acpi_processor_idle+0x228/0x39c [processor] [<c010252d>] cpu_idle+0x3e/0x53 [<c03f6994>] start_kernel+0x232/0x237 [<c03f6303>] unknown_bootoption+0x0/0x195 ======================= Code: 00 00 00 89 e9 8b bb 8c 00 00 00 c1 e9 02 83 c0 12 83 c7 12 89 bb 8c 00 00 00 89 83 80 00 00 00 8b 54 24 4c 8b 84 f2 18 01 00 00 <8b> b0 8c 00 00 00 f3 a5 89 e9 83 e1 03 74 02 f3 a4 83 7b 54 00 EIP: [<d885198b>] tulip_interrupt+0x2c2/0xc13 [tulip] SS:ESP 0068:c03f5eec Kernel panic - not syncing: Fatal exception in interrupt
(In reply to comment #24) > Sorry about the hang/panic terminology, but I assume it's both as (1) the > system completely stops responding and (2) the keyboard LEDs flash the same > as > when I actually get to see the oops/panic. It certainly can be both depending on how the kernel is configured and what goes wrong. Thanks for clarifying. ... > I assumed the issue with X was merely a race condition concerning whether it > got to the console before oopsing, since with previous kernel versions it > also > hung and blinked LEDs at me before switching to a console. However, it could > be > a bad assumption, since I could consequently not see the bug report. I don't think it's a bad assumption - just would like to see evidence (and you collected very good evidence in the next post - awesome! :) > My speculation in the original bug report, which no one seems to have > commented > on, was that the shared interrupt between the network card and the USB > controller could have something to do with it; the mouse is hooked up through > USB, although the keyboard is not. Sorry - I re-read that now and agree - it could be causing spurious interrupts. But the tulip driver should be able to protect itself against that. In other words, more evidence this is a tulip bug but the race condition might be different than I thought. > However, I am not doing accessing the X > server across the network in the vast majority of cases where this crash > happens. I suggested this only because the X11 server was in use and it's possible X is generating additional traffic - different kind of race than with the shared IRQ. > I will see if I can get a null-modem cable working to see that panic > information. > > Also, on a related note, there is another hang problem with this machine > where > the LEDs do not blink, but I want to rule out a failing hard drive (since > things that don't need to be loaded from disk seem to respond) before I file > a > bug report against the driver. If you can reproduce it without the tulip driver involved, that would probably help. thanks again!
(In reply to comment #27) > (In reply to comment #24) ... > > My speculation in the original bug report, which no one seems to have > commented > > on, was that the shared interrupt between the network card and the USB > > controller could have something to do with it; the mouse is hooked up > through > > USB, although the keyboard is not. > > Sorry - I re-read that now and agree - it could be causing spurious > interrupts. > But the tulip driver should be able to protect itself against that. In other > words, more evidence this is a tulip bug but the race condition might be > different than I thought. Yes, It's not the same race that I thought it was. *sigh* I think I found the problem - will propose a solution but don't know if it will work (or not). tulip_suspend() calls tulip_down() if the interface is configured and "running". tulip_down() has this bit of code: static void tulip_down (struct net_device *dev) { struct tulip_private *tp = netdev_priv(dev); void __iomem *ioaddr = tp->base_addr; unsigned long flags; flush_scheduled_work(); #ifdef CONFIG_TULIP_NAPI napi_disable(&tp->napi); #endif del_timer_sync (&tp->timer); #ifdef CONFIG_TULIP_NAPI del_timer_sync (&tp->oom_timer); #endif spin_lock_irqsave (&tp->lock, flags); /* Disable interrupts by clearing the interrupt mask. */ iowrite32 (0x00000000, ioaddr + CSR7); /* Stop the Tx and Rx processes. */ tulip_stop_rxtx(tp); /* prepare receive buffers */ tulip_refill_rx(dev); /* release any unconsumed transmit buffers */ tulip_clean_tx_ring(tp); ... The intent of the iowrite32 to CSR7 is to prevent any additional interrupts. tulip_stop_rxtx() will flush that write out so we are sure that no new interrupts will be generated. Later code flushes in flight interrupts so we can be sure this tulip won't be generating interrupts. But that completely ignores the case of a shared IRQ and polling of the tulip device. There are two ways to prevent a shared IRQ from being generated: free_irq() to unregister the Interrupt handler OR disable_irq() to block that IRQ from generating interrupts. The latter isn't what we want since USB is still active. But I don't know offhand what hazards we will hit by moving the free_irq() call earlier in tulip_suspend(). Can someone look at this and/or test it as well? It's a trivial change and I'll just inline it here: --- a/drivers/net/tulip/tulip_core.c +++ b/drivers/net/tulip/tulip_core.c @@ -1733,11 +1738,12 @@ static int tulip_suspend (struct pci_dev *pdev, pm_message_t state) if (!dev) return -EINVAL; + free_irq(dev->irq, dev); + if (netif_running(dev)) tulip_down(dev); netif_device_detach(dev); - free_irq(dev->irq, dev); pci_save_state(pdev); pci_disable_device(pdev); But like I said, this may break netif assumptions and possibly assumptions in tulip_down(). Perhaps just swapping netif_device_detach() and free_irq() would be sufficient too... hth, grant
Your suggested patch does not appear to help. (I didn't try the second alternative you mentioned.) I have three panic traces right now -- one with 2.6.24.4 unpatched, the other two with your patch. What interested me about the two patched ones is that they appear to have occurred at slightly different places in tulip_interrupt. Let me know if you want any updated objdump output, etc., or if you would prefer me to test with an older kernel.
Created attachment 15500 [details] Panic report from unpatched 2.6.24.4
Created attachment 15501 [details] First panic report from patched 2.6.24.4
Created attachment 15503 [details] Second panic report from patched 2.6.24.4
Kernelbugs, thanks for testing that out...I suspect I'm still confused about then the panic is occurring. Is it happening on the "suspend" event or the "resume" event? I'm realizing now this might be happening on the "resume" since i'm seeing "starting disk" in messages a few lines (10 or so) above the panic. My previous suggestion was for the "suspend" (tulip_down()) code path. I think I found a real problem in "down" code path and will have to think about how to fix that. BUt if the immediate problem is with resume, I need to look more closely at tulip_init_one(). Particularly where request_irq() is called since that will effectively enable when tulip_interrupt() will start getting called. thanks, grant
sorry - in previous comment s/tulip_init_one/tulip_resume/
Still _assuming_ this is about resume code path... tulip_init_one() called before tulip_open() can be called. That means the rough sequence normally on startup is: register_netdev() request_irq() tulip_up() netif_start_queue() tulip_resume() doesn't call register_netdev but rather netif_device_attach(). It's late here and I need sleep...but the order of the calls is my first suspicion in tulip_resume(). I think netif_attach_device() will call netif_wake_queue(). This is the equivalent to calling netif_start_queue(). Can you try the patch below out? Or experiment with the order between request_irq() and tulip_up() in tulip_resume() ? grundler <503>git diff drivers/net/tulip/tulip_core.c diff --git a/drivers/net/tulip/tulip_core.c b/drivers/net/tulip/tulip_core.c index ed600bf..8770721 100644 --- a/drivers/net/tulip/tulip_core.c +++ b/drivers/net/tulip/tulip_core.c @@ -1768,11 +1768,11 @@ static int tulip_resume(struct pci_dev *pdev) return retval; } - netif_device_attach(dev); - if (netif_running(dev)) tulip_up(dev); + netif_device_attach(dev); + return 0; }
I don't know which code path it's in, but this is occurring when I _suspend_. Resuming (once the image has been saved successfully) has never, to my recollection, had a problem. I suspect (from various reading) that the suspend process involves freezing devices, creating a system image, and then thawing the devices in order to be able to write the image out. Perhaps this is the source of the confusion? I'll plan to try your suggestion later. Thanks for the time and effort you're putting into solving this.
OK, I tried the patch suggested in comment 35 (well, I applied it by hand) alone, in combination with the one in comment 28, and in combination with comment 28 and putting the if/tulip_up before request_irq in tulip_resume. All three combinations still crashed (I can provide the dumps on request, although I'm not sure it's worth it.) However, I did notice this oddity, which may explain a lot. Before the first hibernate after a fresh boot with the standard (unpatched) tulip module, /proc/interrupts contains this line: 11: 2140 XT-PIC-XT uhci_hcd:usb1, eth0 After resuming from the first hibernate, it has: 11: 2696 XT-PIC-XT uhci_hcd:usb1, eth0, eth0 I haven't noticed whether it always takes at least one cycle before it crashed (actually, I thought it did sometimes crash on the first try), but with this last round of testing I noticed that it didn't crash until the second try for at least a couple of the patches. Hopefully this will help to pin down at least one of the bugs.
(In reply to comment #37) > OK, I tried the patch suggested in comment 35 (well, I applied it by hand) > alone, in combination with the one in comment 28, and in combination with > comment 28 and putting the if/tulip_up before request_irq in tulip_resume. > All > three combinations still crashed (I can provide the dumps on request, > although > I'm not sure it's worth it.) Thanks! I'm sure there are issues with shared IRQ that the code does NOT take into consideration today. > However, I did notice this oddity, which may explain a lot. Before the first > hibernate after a fresh boot with the standard (unpatched) tulip module, > /proc/interrupts contains this line: > 11: 2140 XT-PIC-XT uhci_hcd:usb1, eth0 > > After resuming from the first hibernate, it has: > 11: 2696 XT-PIC-XT uhci_hcd:usb1, eth0, eth0 This is definitely a bug. It means either the IRQ didn't get properly unregister un suspend (not likely), got registered twice (possible I guess), or the IRQ mgt code has a bug where it's messing up when the same handler calls request_irq() again. I tend to like the latter theory. (ie "My driver is fine!" ;) > I haven't noticed whether it always takes at least one cycle before it > crashed > (actually, I thought it did sometimes crash on the first try), but with this > last round of testing I noticed that it didn't crash until the second try for > at least a couple of the patches. Hopefully this will help to pin down at > least > one of the bugs. definitely will help. Thanks again! Give me something new to look at. :)
OK, here is something new to look at. (By the way, I'm not sure if by that comment, you meant that you wanted the crash dumps from the various combinations of patches I tried; if so, let me know.) The machine in question actually has two network cards supported by the tulip driver, although the second one isn't actually in use. What's interesting about this, however, is that the second one also shows an extra interrupt after a hibernate cycle. And it doesn't appear to be on a shared interrupt. Before hibernate, with eth1 down: CPU0 0: 31705 XT-PIC-XT timer 1: 285 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 3: 9 XT-PIC-XT serial 4: 3 XT-PIC-XT 5: 1 XT-PIC-XT acpi 6: 5 XT-PIC-XT floppy 7: 0 XT-PIC-XT parport0 8: 1 XT-PIC-XT rtc 10: 2391 XT-PIC-XT au8830 11: 5354 XT-PIC-XT uhci_hcd:usb1, eth0 14: 6937 XT-PIC-XT libata 15: 1053 XT-PIC-XT libata After hibernate, with eth1 still down: CPU0 0: 41132 XT-PIC-XT timer 1: 516 XT-PIC-XT i8042 2: 0 XT-PIC-XT cascade 3: 11 XT-PIC-XT serial 4: 3 XT-PIC-XT 5: 1 XT-PIC-XT acpi 6: 5 XT-PIC-XT floppy 7: 0 XT-PIC-XT parport0 8: 3 XT-PIC-XT rtc 10: 2591 XT-PIC-XT au8830 11: 6264 XT-PIC-XT uhci_hcd:usb1, eth0, eth0 12: 100000 XT-PIC-XT eth1 14: 7313 XT-PIC-XT libata 15: 1431 XT-PIC-XT libata If I try to bring eth1 up before hibernating and then hibernate, the network doesn't seem to come back on resume -- the hibernate script doesn't even finish, although the system otherwise comes back to a usable desktop. This could be because eth1 isn't in hibernate's configuration as an interface to bring down on suspend and back up on resume, but my initial reason for adding eth0 to that list was so that DHCP would know when the system was on or off and DHCP hostnames would work properly. If I bring eth1 up after a hibernate, it shows the doubled interrupt handler too: 12: 100000 XT-PIC-XT eth1, eth1
(In reply to comment #39) By the way, this is still on 2.6.24.4; I haven't tried 2.6.25 yet.
(In reply to comment #40) > By the way, this is still on 2.6.24.4; I haven't tried 2.6.25 yet. That's ok - I'm just as comfortable debugging this on 2.6.24. Move to 2.6.25 whenever it's convenient for you. We can continue poking at this. thanks, grant
(In reply to comment #39) > OK, here is something new to look at. (By the way, I'm not sure if by that > comment, you meant that you wanted the crash dumps from the various > combinations of patches I tried; if so, let me know.) Nope. I shouldn't need console dumps unless there is something different/new than from the previous ones. Note that "crash dump" is also called a "core dump" - it's the memory contents and CPU state saved off when the machine oop's (aka panic). > > The machine in question actually has two network cards supported by the tulip > driver, although the second one isn't actually in use. What's interesting > about > this, however, is that the second one also shows an extra interrupt after a > hibernate cycle. And it doesn't appear to be on a shared interrupt. After reading the whole thing, you meant "interrupt handler" or "Interrupt Service Routine (ISR). It's really not a different handler - it's just registering the same handler twice. The ISR should have been unregistered in "tulip_suspend" when "free_irq()" is called. But I now suspect we register the ISR once in tulip_resume() and again when we call tulip_open(). tulip_open() is called when we "ifconfig" the interface and I assume that is done by udev or some other user space script on resume since the driver isn't saving the TCP state (it's TCP address/etc). > Before hibernate, with eth1 down: ... > After hibernate, with eth1 still down: ... > 11: 6264 XT-PIC-XT uhci_hcd:usb1, eth0, eth0 > 12: 100000 XT-PIC-XT eth1 Yeah - this is the main clue that tulip_resume() should NOT be calling request_irq(). Good catch! :) > 14: 7313 XT-PIC-XT libata > 15: 1431 XT-PIC-XT libata > > If I try to bring eth1 up before hibernating and then hibernate, the network > doesn't seem to come back on resume -- the hibernate script doesn't even > finish, although the system otherwise comes back to a usable desktop. This > could be because eth1 isn't in hibernate's configuration as an interface to > bring down on suspend and back up on resume, but my initial reason for adding > eth0 to that list was so that DHCP would know when the system was on or off > and > DHCP hostnames would work properly. Makes sense. > If I bring eth1 up after a hibernate, it shows the doubled interrupt handler > too: > 12: 100000 XT-PIC-XT eth1, eth1 Excellent! I'm pretty confident tulip_resume() is wrong. tulip_suspend() will exit early if the interface is not configured. Can you delete a bunch of stuff from tulip_resume() so it only has the following contents? static int tulip_resume(struct pci_dev *pdev) { struct net_device *dev = pci_get_drvdata(pdev); int retval; if (!dev) return -EINVAL; pci_set_power_state(pdev, PCI_D0); pci_restore_state(pdev); if ((retval = pci_enable_device(pdev))) { printk (KERN_ERR "tulip: pci_enable_device failed in resume\n"); return retval; } return 0; } I'm guessing tulip_resume() should put the NIC in the same state as if tulip_init_one() had been called. Then user space will have to ifconfig the interface so tulip_open() gets called to register the interrupt handler. Please also add "printk" entries at the top of each of the following routines to help us debug if the suggestion above isn't working: tulip_init_one() tulip_remove_one() tulip_open() tulip_close() tulip_suspend() tulip_resume() and any other tulip functions you think we are calling. If I am correct, it means the tulip_suspend() and tulip_resume() are not symetrical (code assumed they were). But I'll remind that I don't know suspend/resume code path or what's expected from the NIC drivers for that to work. thanks, grant
(In reply to comment #42) > Can you delete a bunch of stuff from tulip_resume() so it only has the > following contents? Snipping actual contents, but something in what was removed is apparently necessary for the interface to come back up; dhclient was failing to bring it back with that change, saying: Apr 19 18:09:08 thursday dhclient: receive_packet failed on eth0: Network is down Apr 19 18:09:12 thursday dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5 Apr 19 18:09:12 thursday dhclient: send_packet: Network is down Apr 19 18:09:17 thursday dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 12 Apr 19 18:09:17 thursday dhclient: send_packet: Network is down I didn't try another time to see if this also eliminated the crash, although I guess I should; if it did still crash with this change, that would also be revealing. > Please also add "printk" entries at the top of each of the following routines > to help us debug if the suggestion above isn't working: > tulip_init_one() > tulip_remove_one() > tulip_open() > tulip_close() > tulip_suspend() > tulip_resume() > > and any other tulip functions you think we are calling. I will attach the output from doing this. I prefixed all the printk lines I added with "TULIP DEBUG:", so you have something to scan for. > If I am correct, it means the tulip_suspend() and tulip_resume() are not > symetrical (code assumed they were). But I'll remind that I don't know > suspend/resume code path or what's expected from the NIC drivers for that to > work. Hmm... is there no documentation? :)
Created attachment 15822 [details] Kernel message log, 2.6.24.4, with printks added The printk lines specified in comment #42 are prefixed with "TULIP DEBUG:". This log is from a boot, hibernate, resume, and hibernate again cycle; the crash occurs on the second hibernate, so you can see what gets called before it crashes (assuming the functions all make it to their printks before crashing, which is not necessarily a safe assumption).
(In reply to comment #43) > I didn't try another time to see if this also eliminated the crash, although > I > guess I should; if it did still crash with this change, that would also be > revealing. I've now tried a few more times with the shortened version of tulip_resume, and it hasn't crashed. (The standard 2.6.24.4 version crashes almost every time after the first hibernate/resume cycle.) But, as mentioned, the interface also doesn't come back up, so something isn't completely right with that change, and the printks only report tulip_close being called before the first hibernate, with no tulip_open or tulip_close thereafter, even after multiple cycles, only tulip_suspend and tulip_resume. The interface still shows up in ifconfig -a, but doesn't seem to be configurable after a hibernate. As a possibly useful explanatory note: I believe the tulip_open and tulip_close calls seen in the printk log are caused by the user-space dhclient bringing the interface up and down, not by an in-kernel part of the suspend process.
(In reply to comment #42) > If I am correct, it means the tulip_suspend() and tulip_resume() are not > symetrical (code assumed they were). But I'll remind that I don't know > suspend/resume code path or what's expected from the NIC drivers for that to > work. Some guessing based on a cursory inspection of what a few other drivers (eepro100.c, r8169.c, typhoon.c, via-velocity.c) do suggested the patch below, which hasn't crashed yet (after about 10 hibernate/resume cycles) and brings the interface back on resume. I have no idea whether it is actually correct; probably you know more about the semantics of netif_running. (typhoon.c in particular contains a couple comments of the form "If we're down, resume when upped" and "If we're down, we're already suspended.") --- tulip_core.c.standard 2008-03-31 18:57:39.000000000 -0400 +++ tulip_core.c 2008-04-19 20:26:37.000000000 -0400 @@ -1733,12 +1733,15 @@ if (!dev) return -EINVAL; - if (netif_running(dev)) - tulip_down(dev); + if (!netif_running(dev)) + goto finish; + + tulip_down(dev); netif_device_detach(dev); free_irq(dev->irq, dev); +finish: pci_save_state(pdev); pci_disable_device(pdev); pci_set_power_state(pdev, pci_choose_state(pdev, state)); @@ -1763,6 +1766,9 @@ return retval; } + if (!netif_running(dev)) + return 0; + if ((retval = request_irq(dev->irq, &tulip_interrupt, IRQF_SHARED, dev->name, dev))) { printk (KERN_ERR "tulip: request_irq failed in resume\n"); return retval;
Reply-To: pavel@ucw.cz On Sat 2008-04-19 17:50:34, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=8952 > > > > > > ------- Comment #46 from kernelbugs@tap.homeip.net 2008-04-19 17:50 ------- > (In reply to comment #42) > > If I am correct, it means the tulip_suspend() and tulip_resume() are not > > symetrical (code assumed they were). But I'll remind that I don't know > > suspend/resume code path or what's expected from the NIC drivers for that > to > > work. > > Some guessing based on a cursory inspection of what a few other drivers > (eepro100.c, r8169.c, typhoon.c, via-velocity.c) do suggested the patch > below, > which hasn't crashed yet (after about 10 hibernate/resume cycles) and brings > the interface back on resume. I have no idea whether it is actually correct; > probably you know more about the semantics of netif_running. (typhoon.c in > particular contains a couple comments of the form "If we're down, resume when > upped" and "If we're down, we're already suspended.") > > --- tulip_core.c.standard 2008-03-31 18:57:39.000000000 -0400 > +++ tulip_core.c 2008-04-19 20:26:37.000000000 -0400 > @@ -1733,12 +1733,15 @@ > if (!dev) > return -EINVAL; > > - if (netif_running(dev)) > - tulip_down(dev); > + if (!netif_running(dev)) > + goto finish; > + > + tulip_down(dev); > > netif_device_detach(dev); > free_irq(dev->irq, dev); > > +finish: > pci_save_state(pdev); > pci_disable_device(pdev); > pci_set_power_state(pdev, pci_choose_state(pdev, state)); > @@ -1763,6 +1766,9 @@ > return retval; > } > > + if (!netif_running(dev)) > + return 0; > + > if ((retval = request_irq(dev->irq, &tulip_interrupt, IRQF_SHARED, > dev->name, dev))) { > printk (KERN_ERR "tulip: request_irq failed in resume\n"); > return retval; > Seems sane to me, actually. Replace spaces with tabs, and it looks like mainline candidate...
Created attachment 15941 [details] PATCH suspend/resume rewrite...after checking other drivers. kernelbugs, can you please test this patch? If this works, I'd be happy to send this upstream. Pavel, The whole suspend/resume stuff for NICs is a mess. It would be good if you could look at several of the drivers and then suggest some guidelines in Documentation someplace. You might start with the following: o NIC suspend must always call pci_save_state() and set the power to the requested state. It will also call pci_device_disable(). o NIC resume must always call pci_restore_state(). It will only call pci_enable_device() if the device was before suspend in netif_running() state. It will NOT call pci_set_power_state(D0) since pci_enable_device() is already doing that. o In general, do not call free_irq/request_irq (most drivers don't..I'm assume low power state will stop interrupts). etc. I don't know all the rules. After looking at several drivers (and submitting one small fix to typhoon driver), I'm convinced some guidelines will do wonders to help clean this up. thanks, grant
Well, some time ago I added the suspend/resume support to uli526x.c . It is a good representation of how I think that should be done, although I'm not really an experienced driver writer. If you look at the code in there, you'll notice that it doesn't call pci_disable_device()/pci_enable_device() at all, since that just plain didn't work with that chip. As far as the guidance is concerned, we are in the process of changing the framework for suspending/resuming devices and there surely will be some guidance for the new framework (the patches that introduce the new callbacks are targeted at 2.6.27).
(In reply to comment #48) > Created an attachment (id=15941) [details] > PATCH suspend/resume rewrite...after checking other drivers. > > kernelbugs, can you please test this patch? > If this works, I'd be happy to send this upstream. > That patch eliminates the panic, but fails to bring the interface back up in a working state. dhclient3 just sends DHCPDISCOVERs repeatedly (after any suspend attempt), and at the beginning of a second suspend process, the kernel message log contains: 0000:00:13.0: tulip_stop_rxtx() failed (CSR5 0xffffffff CSR6 0xffffffff) and ADDRCONF(NETDEV_UP): eth0: link is not ready at the very end (after resuming a second time, presumably when the hibernate script tries to bring the interface back up). The messages above appear in the log for a second suspend process, after the first one has already failed to bring the interface back. (Yes, I am repeating myself to make sure the point isn't missed.) Removing the module and reinserting allows the interface to be brought back up in a working state.
jgarzik applied to his tree. Expect this to be fixed in 2.6.26-rc5 or so. Syntax error in the submission to netdev was my fault: http://marc.info/?l=linux-netdev&m=121057430021417&w=2 Thanks to jgarzik for fixing the syntax error and applying it. I'll track the other issues I found during code reviews (interrupts being delivered despite disabling the chip) in my TODO list.
fixed by commit 56997fa838e333cea33ab641d4aeedd23aef0eb1