Bug 15843 (ath5k-wakeup)

| Summary: | ath5k can't properly resume/reset AR5004X (minipci) after sleep | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Tomas Mudrunka (harviecz) |
| Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
| Status: | CLOSED INSUFFICIENT_DATA | | |
| Severity: | blocking | CC: | bugzilla.kernel.org, giulio.genovese, harviecz, linville, me, mickflemm |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.33 - ArchLinux | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | grep ath /var/log/messages.log; Use pci save/restore; mugshot of crashed machine | | |
Description
Tomas Mudrunka
2010-04-24 14:32:17 UTC

"ath5k phy0: Invalid EEPROM checksum: 0xdb11 eep_max: 0x0340 (default size)" -- surely this is the issue.

Bob & Nick -- any suggestions to help Thomas correct this? Is it reasonably possible that ath5k corrupted the EEPROM?

(In reply to comment #0)
> Hello, i've bought NeWeb Wistron CM9 (Atheros AR5001X+) few days ago and it
> was working since that day, but today i booted up my pc and i saw that pidgin
> was connected when i was leaving to get a breakfast, when i returned, i found
> my laptop completely freezed up, so i've rebooted it and after that i was not

This is quite unfortunate...

> in lspci now i see:
> 00:0b.0 Ethernet controller: CastleNet Technology Inc. Device 0013 (rev 01)
>
> i am not sure, but i guess that before it was identyfiing itself as NeWeb
> Wistron CM9

Can you show lspci -vnn?

> so i have another piece of hardware bricked by Linux kernel (and second
> minipci
> wifi card).

What was the first bricked hardware? In the same system?

Hmm, this seems to be a common problem on these cards, with madwifi as well, although this is the first I've heard of it: http://www.mobilnews.cz/blog/?p=36

(In reply to comment #1)
> Bob & Nick -- any suggestions to help Thomas correct this? Is it reasonably
> possible that ath5k corrupted the EEPROM?

Well, I guess we can reload an EEPROM image with ath_info; it's a bit dangerous though.

Whether the driver can write the EEPROM depends on the hardware version (5001X is a 5212+?). On 5210 (which works for virtually no one anyway) any write to the EEPROM space basically works; for 5211+ you have to set the proper bit in an address register first, and we never do that in the driver. The pages are mapped writable in MMIO space, so some errant kernel bug could conceivably set the bit and then scribble on it. ath_info manages to change the EEPROM by mmap on /dev/mem as root.

> lspci -vnn

---CUT----
00:0b.0 Ethernet controller [0200]: CastleNet Technology Inc. Device [1688:0013] (rev 01)
Subsystem: Wistron NeWeb Corp. Device [185f:1012]
Flags: bus master, fast Back2Back, medium devsel, latency 0
Memory at e2010000 (32-bit, non-prefetchable) [size=64K]
Memory at 88020000 (64-bit, non-prefetchable) [size=64K]
Memory at <unassigned> (64-bit, non-prefetchable)
Memory at <invalid-64bit-slot> (64-bit, non-prefetchable)
Capabilities: [44] MSI: Enable- Count=16/2 Maskable+ 64bit+
---CUT----

NOTE: in fact it's NOT a "CastleNet Technology Inc. Device"; there was a different ID when the EEPROM was OK, and instead of "Memory at e2010000" there was "Memory at e2000000" or something else...

> Hmm, this seems to be a common problem on these cards, with madwifi as well
> although this is the first I heard of it:
> http://www.mobilnews.cz/blog/?p=36

I know that page, but I am not able to build such obsolete drivers even with the oldest kernel I have available :-( and it's quite impractical to do this every few hours just to get my internet connection back (and sometimes the kernel freezes up totally during ath5k problems, so I can't get any log of what's going on).

> Well, I guess we can reload an EEPROM image with ath_info, it's a bit
> dangerous though.

It would be cool to have a way to restore the card... BTW the EEPROM contains the MAC address and fine performance-tuning data that differ for each card, so these should be saved. Isn't there some way to lock the EEPROM of this card so that it is read-only and nothing can brick it?

I found another strange thing... When I left my laptop powered off (not suspended) for a few hours the EEPROM was back, but after a few minutes of operation it was gone again and lspci -vvv was again showing "CastleNet Technology Inc. Device", so I am a bit confused about what is happening. And the logs are a bit different each time. I will try leaving my computer powered off again when I have some time to do so.

BTW I have the same problem with the 2.6.27 kernel + ath5k driver.

> What was the first bricked hardware? In the same system?

I bricked my Broadcom 4318 (b43 driver), but that was caused by my own stupidity: I accidentally ran two rmmods and two modprobes of b43 at almost the same time in my script...

Right now I am using a few-years-old PCMCIA D-Link DWL-G650 (wifi b/g) with the ath5k driver and everything just works as expected (even with madwifi). The Wistron CM9 is a bit newer (and supports wifi a/b/g); it should work with madwifi, but there are some issues. It's very popular and is referred to as the "best linux minipci wifi card"; it's widely used in embedded Linux (OpenWRT) routers and MikroTik RouterBoards. The chipset used is AR5001X+, which contains a few different Atheros chips...

I left the computer off for ~30 minutes and the miracle happened again:

...
00:0b.0 Ethernet controller [0200]: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter [168c:0013] (rev 01)
Subsystem: Wistron NeWeb Corp. CM9 Wireless a/b/g MiniPCI Adapter [185f:1012]
Flags: bus master, medium devsel, latency 168, IRQ 17
Memory at e2000000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [44] Power Management version 2
Kernel driver in use: ath5k
Kernel modules: ath_pci, ath5k
...

OT: BTW when I use the rfkill button to disable wifi and then enable it again, I need to mess around for a while to get NetworkManager to use the card again (reloading the module was not enough...):

1 ;( root@harvie-ntb ~ # rmmod ath5k
0 ;) root@harvie-ntb ~ # modprobe ath5k
0 ;) root@harvie-ntb ~ # ifconfig wlan0 up
SIOCSIFFLAGS: Unknown error 132
255 ;( root@harvie-ntb ~ # iwconfig wlan0 txpower auto
0 ;) root@harvie-ntb ~ # ifconfig wlan0 up

but it's unrelated.

It seems that the EEPROM can be fixed by leaving the computer turned off for a while (1 minute is not enough), but it's not great to have the kernel crashing with ath5k and to need to turn the computer off for minutes when I want to do something. (The first time I experienced this issue was during a programming lecture when I needed to write code, which of course means using Google.)
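For reference, the bare numbers in the shell session above and in the driver messages quoted throughout this report (132, -5, -11) are Linux errno values. The following is a minimal sketch for decoding them, assuming the generic Linux errno numbering; the table and the `describe` helper are illustrative only and cover just the three codes that appear in this report.

```python
# Linux errno values seen in this report. Assumption: generic (asm-generic)
# numbering; a few architectures number some codes differently.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    11: ("EAGAIN", "Resource temporarily unavailable"),
    132: ("ERFKILL", "Operation not possible due to RF-kill"),
}


def describe(code):
    """Return 'NAME (text)' for an errno; kernel logs print it negated."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{name} ({text})"


if __name__ == "__main__":
    for code in (132, -5, -11):
        print(code, "->", describe(code))
```

Read this way, `SIOCSIFFLAGS: Unknown error 132` suggests that the interface was probably still rfkill-blocked when `ifconfig wlan0 up` was run, while the -5 in `can't reset hardware (-5)` is only a generic I/O error and says little about the underlying cause.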
Anyway, I am posting this update using the affected Wistron CM9 card. This is what I see now when the card and driver are properly loaded:

ath5k 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath5k 0000:00:0b.0: registered as 'phy1'
ath: EEPROM regdomain: 0x0
ath: EEPROM indicates default country code should be used
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x3a
ath: Country alpha2 being used: US
ath: Regpair used: 0x3a
ath5k phy1: Atheros AR5213A chip found (MAC: 0x59, PHY: 0x43)
ath5k phy1: RF5112B multiband radio found (0x36)

Here are a few errors I have seen in the past days (in many different combinations; they are not chronologically sorted!):

kernel: ath5k phy3: failed to warm reset the MAC Chip
kernel: ath5k phy3: failed to resume the MAC Chip
kernel: ath5k phy3: failed to wakeup the MAC Chip
kernel: ath5k phy3: can't reset hardware (-5)
kernel: ath5k phy0: POST Failed !!!
kernel: ath5k: probe of 0000:00:0b.0 failed with error -5
kernel: ath5k: probe of 0000:00:0b.0 failed with error -11
kernel: ath5k phy3: noise floor calibration failed (2427MHz)
kernel: ath5k phy3: noise floor calibration failed (2412MHz)
kernel: ath5k phy3: ath5k_chan_set: unable to reset channel (2412 Mhz)
kernel: ath5k phy3: ath5k_chan_set: unable to reset channel (2417 Mhz)
kernel: ath5k phy3: ath5k_chan_set: unable to reset channel (2422 Mhz)

Hope this will help.

Another failure just now! Note that the card looks OK, but after the next reboot I guess it will look like a corrupted card...

lspci -vvvnn:
00:0b.0 Ethernet controller [0200]: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter [168c:0013] (rev 01)
Subsystem: Wistron NeWeb Corp.
CM9 Wireless a/b/g MiniPCI Adapter [185f:1012] Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B+ DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+ INTx- Latency: 96 (2500ns min, 7000ns max), Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 17 Region 0: Memory at e2000000 (32-bit, non-prefetchable) [size=64K] Capabilities: [44] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME- Kernel modules: ath_pci, ath5k dmesg | grep ath: ath5k 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 ath5k 0000:00:0b.0: registered as 'phy0' ath: EEPROM regdomain: 0x0 ath: EEPROM indicates default country code should be used ath: doing EEPROM country->regdmn map search ath: country maps to regdmn code: 0x3a ath: Country alpha2 being used: US ath: Regpair used: 0x3a ath5k phy0: Atheros AR5213A chip found (MAC: 0x59, PHY: 0x43) ath5k phy0: RF5112B multiband radio found (0x36) ath5k 0000:00:0b.0: PCI INT A disabled ath5k 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 ath5k 0000:00:0b.0: registered as 'phy0' ath: EEPROM regdomain: 0x0 ath: EEPROM indicates default country code should be used ath: doing EEPROM country->regdmn map search ath: country maps to regdmn code: 0x3a ath: Country alpha2 being used: US ath: Regpair used: 0x3a ath5k phy0: Atheros AR5213A chip found (MAC: 0x59, PHY: 0x43) ath5k phy0: RF5112B multiband radio found (0x36) ath5k 0000:00:0b.0: PCI INT A disabled ath5k 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 ath5k 0000:00:0b.0: registered as 'phy1' ath: EEPROM regdomain: 0x0 ath: EEPROM indicates default country code should be used ath: doing EEPROM country->regdmn map search ath: country maps to regdmn code: 0x3a ath: Country alpha2 being used: US ath: Regpair used: 0x3a ath5k phy1: Atheros AR5213A chip found (MAC: 0x59, PHY: 
0x43) ath5k phy1: RF5112B multiband radio found (0x36) WARNING: at drivers/net/wireless/ath/ath5k/base.c:1166 ath5k_tasklet_rx+0x52c/0x560 [ath5k]() Modules linked in: ath5k led_class mac80211 ath cfg80211 rfkill ipv6 ext4 jbd2 crc16 cpufreq_ondemand cpufreq_stats cpufreq_conservative cpufreq_userspace cpufreq_powersave snd_seq_dummy arc4 ecb powernow_k8 freq_table tun snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device joydev snd_intel8x0m snd_intel8x0 snd_ac97_codec snd_pcm snd_timer thermal ac button snd pcmcia soundcore sis_agp battery snd_page_alloc processor ac97_bus usbhid yenta_socket amd64_agp i2c_sis96x sr_mod rsrc_nonstatic shpchp hid psmouse agpgart cdrom i2c_core sis900 mii sg pcmcia_core pci_hotplug k8temp loop serio_raw pcspkr evdev autofs4 tpm_tis tpm tpm_bios fuse rtc_cmos rtc_core rtc_lib ext3 jbd mbcache ohci_hcd ehci_hcd usbcore sisfb ata_generic sd_mod pata_sis pata_acpi libata scsi_mod [last unloaded: led_class] [<c104314d>] warn_slowpath_common+0x6d/0xa0 [<fa66ecfc>] ? ath5k_tasklet_rx+0x52c/0x560 [ath5k] [<fa66ecfc>] ? 
ath5k_tasklet_rx+0x52c/0x560 [ath5k] [<c10431c6>] warn_slowpath_fmt+0x26/0x30 [<fa66ecfc>] ath5k_tasklet_rx+0x52c/0x560 [ath5k] ath5k phy1: failed to warm reset the MAC Chip ath5k phy1: can't reset hardware (-5) ath5k 0000:00:0b.0: restoring config space at offset 0xf (was 0x1c0a0100, writing 0x1c0a0104) ath5k 0000:00:0b.0: restoring config space at offset 0x4 (was 0x0, writing 0xe2000000) ath5k 0000:00:0b.0: restoring config space at offset 0x3 (was 0x0, writing 0xa810) ath5k 0000:00:0b.0: restoring config space at offset 0x1 (was 0x2900000, writing 0x82900016) ath5k 0000:00:0b.0: restoring config space at offset 0xf (was 0x1c0a0100, writing 0x1c0a0104) ath5k 0000:00:0b.0: restoring config space at offset 0x4 (was 0x0, writing 0xe2000000) ath5k 0000:00:0b.0: restoring config space at offset 0x3 (was 0x0, writing 0xa810) ath5k 0000:00:0b.0: restoring config space at offset 0x1 (was 0x2900000, writing 0x2900016) WARNING: at drivers/net/wireless/ath/ath5k/base.c:1166 ath5k_tasklet_rx+0x52c/0x560 [ath5k]() Modules linked in: ath5k led_class mac80211 ath cfg80211 rfkill ipv6 ext4 jbd2 crc16 cpufreq_ondemand cpufreq_stats cpufreq_conservative cpufreq_userspace cpufreq_powersave snd_seq_dummy arc4 ecb powernow_k8 freq_table tun snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device joydev snd_intel8x0m snd_intel8x0 snd_ac97_codec snd_pcm snd_timer thermal ac button snd pcmcia soundcore sis_agp battery snd_page_alloc processor ac97_bus usbhid yenta_socket amd64_agp i2c_sis96x sr_mod rsrc_nonstatic shpchp hid psmouse agpgart cdrom i2c_core sis900 mii sg pcmcia_core pci_hotplug k8temp loop serio_raw pcspkr evdev autofs4 tpm_tis tpm tpm_bios fuse rtc_cmos rtc_core rtc_lib ext3 jbd mbcache ohci_hcd ehci_hcd usbcore sisfb ata_generic sd_mod pata_sis pata_acpi libata scsi_mod [last unloaded: led_class] [<c104314d>] warn_slowpath_common+0x6d/0xa0 [<fa66ecfc>] ? ath5k_tasklet_rx+0x52c/0x560 [ath5k] [<fa66ecfc>] ? 
ath5k_tasklet_rx+0x52c/0x560 [ath5k]
[<c10431c6>] warn_slowpath_fmt+0x26/0x30
[<fa66ecfc>] ath5k_tasklet_rx+0x52c/0x560 [ath5k]
[<fa667b78>] ? ath5k_hw_calibration_poll+0x18/0x70 [ath5k]
ath5k phy1: failed to warm reset the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k 0000:00:0b.0: PCI INT A disabled
ath5k 0000:00:0b.0: enabling device (0604 -> 0606)
ath5k 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath5k 0000:00:0b.0: setting latency timer to 64
ath5k 0000:00:0b.0: registered as 'phy2'
ath5k phy2: failed to wakeup the MAC Chip
ath5k 0000:00:0b.0: PCI INT A disabled
ath5k: probe of 0000:00:0b.0 failed with error -5

0 ;) root@harvie-ntb ~ # rmmod ath5k
Killed (SIGKILL)
137 ;( root@harvie-ntb ~ # rmmod ath5k
ERROR: Removing 'ath5k': Device or resource busy

I left the PC turned off for 15 minutes and it didn't help:

ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)

BTW I noticed that my problem happens when I wake the PC from suspend (to RAM, using pm-utils and the basic kernel sleep driver in my case). Maybe there's some problem with waking up... But in some cases I was using wifi for a few minutes after wakeup and only then got the error message (and lost the link), which seems to cause messages like this:

ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to resume the MAC Chip
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: failed to warm reset the MAC Chip

Kernel crashed again... The card suddenly started reconnecting several times (with the help of NetworkManager) and after a few

ath5k phy1: failed to warm reset the MAC Chip
ath5k phy1: can't reset hardware (-5)

the kernel finally crashed.

I think I'd rather try madwifi for a few days (even though the card has very lousy sensitivity with madwifi) because I can't do kernel debugging while studying for my exams... I hope madwifi will do the job for a while :(

I've noted a few errors I hadn't seen before:

ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800824b5
WARNING: at drivers/net/wireless/ath/ath5k/base.c:1166 ath5k_tasklet_rx+0x52c/0x560 [ath5k]()
[<f812bb78>] ? ath5k_hw_calibration_poll+0x18/0x70 [ath5k]
ath5k phy43: unsupported jumbo

(I will attach the whole log | grep ath)

Created attachment 26156 [details]
grep ath /var/log/messages.log
added some more logs... hope it will help ;o)
There are some patches... it looks like there really are problems with suspending...

http://article.gmane.org/gmane.linux.drivers.ath5k.devel/3419
http://www.mail-archive.com/ath5k-devel@lists.ath5k.org/msg03304.html
https://patchwork.kernel.org/patch/38550/
http://blog.gmane.org/gmane.linux.drivers.ath5k.devel

I've tried:

echo 1 > /sys/module/ath5k/drivers/pci\:ath5k/0000\:00\:0b.0/reset

and I was getting an endless loop of this error in dmesg:

ath5k 0000:00:0b.0: restoring config space at offset 0x1 (was 0x2900400, writing 0x2900016)

When I unloaded ath5k, echo returned. When I tried to load it back, the kernel got stuck.

Anyway... I've been using the Wistron CM9 happily with ath5k (BTW madwifi has the same issue) for some time and everything works just fine as long as I don't suspend my PC to RAM (which puts the card into power-saving mode), because the driver is not able to reset the card properly after wakeup for some reason... I believe this issue was not so obvious because many people use the Wistron CM9 in Linux routers, where there is no need to hibernate the system or enter any power-saving mode... Maybe the problem is in the ath driver, which is AFAIK used by both ath5k and madwifi.

BTW unloading the module before sleep and reloading it after wakeup does not help. Is there some way to prevent the card from being put into power saving (just as a workaround until the driver is fixed)?

I've just found that the NeWeb Wistron CM9 card has the AR5004X chipset (NOT AR5001X+), so there is also some problem with detection in lspci:

http://www.atheros.com/pt/bulletins/AR5004XBulletin.pdf (Wistron CM9)
http://www.atheros.com/pt/bulletins/AR5001X+Bulletin.pdf (seems to be an older chipset used in popular Orinoco cards)

But my Wistron CM9 shows this:

00:0b.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)

but according to dmesg it seems that the individual chips of this chipset are detected properly...
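The `restoring config space at offset ...` messages quoted in this report can be decoded by hand. The sketch below assumes, based on the PCI core of that era, that the printed offset is a 32-bit dword index into the standard 64-byte configuration header (so byte offset = index * 4); the register names follow the standard type 0 header layout. The `decode` helper and its regular expression are illustrative and not part of any kernel tool.

```python
import re

# Dword index -> registers of the standard PCI type 0 header (low byte first).
DWORD_REGS = {
    0x0: "Vendor ID / Device ID",
    0x1: "Command / Status",
    0x2: "Revision ID / Class code",
    0x3: "Cache line size / Latency timer / Header type / BIST",
    0xA: "CardBus CIS pointer",
    0xB: "Subsystem vendor ID / Subsystem ID",
    0xC: "Expansion ROM base address",
    0xD: "Capabilities pointer",
    0xE: "Reserved",
    0xF: "Interrupt line / Interrupt pin / Min_Gnt / Max_Lat",
}
DWORD_REGS.update({0x4 + n: f"BAR{n}" for n in range(6)})

# Bits of the 16-bit Command register.
COMMAND_BITS = {
    0: "I/O space",
    1: "Memory space",
    2: "Bus master",
    3: "Special cycles",
    4: "Memory write and invalidate",
    5: "VGA palette snoop",
    6: "Parity error response",
    8: "SERR# enable",
    9: "Fast back-to-back enable",
    10: "Interrupt disable",
}

LINE_RE = re.compile(
    r"restoring config space at offset (0x[0-9a-f]+) "
    r"\(was (0x[0-9a-f]+), writing (0x[0-9a-f]+)\)"
)


def decode(line):
    """Decode one log line; return None if it is not a restore message."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    index, was, new = (int(group, 16) for group in match.groups())
    info = {
        "byte_offset": index * 4,  # assumption: offset is a dword index
        "registers": DWORD_REGS.get(index, "outside the 64-byte header"),
        "changed_bits": was ^ new,
    }
    if index == 0x1:
        command = new & 0xFFFF  # Command is the low half, Status the high half
        info["command_bits_set"] = [
            name for bit, name in sorted(COMMAND_BITS.items())
            if command & (1 << bit)
        ]
    return info


if __name__ == "__main__":
    sample = ("ath5k 0000:00:0b.0: restoring config space at offset 0x1 "
              "(was 0x2900400, writing 0x2900016)")
    print(decode(sample))
```

For the line quoted above, this points at the Command/Status dword: the value being written back (command 0x0016) turns on memory space, bus mastering and memory-write-invalidate, while the value it replaces (command 0x0400) had only the interrupt-disable bit set. Under the same assumption, offset 0x4 is BAR0 and offset 0xf is the interrupt line/pin dword.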
BTW now I've had small similar problems (unable to reset chip, disconnects and kernel crash) with wifi after a reboot (I am not using sleep while I am using this card), but after an immediate reboot everything started to work again...

madwifi does not use the ath driver, just ath5k, ath9k, etc. It does have a lot of the same code as madwifi.

lspci information comes directly from PCI configuration space -- the driver just reads it and reports it as a string, there's no detection there, so no need to worry about seeing AR5001X. Those numbers are largely meaningless anyway.

The reason I asked for lspci -vnn before was to see if there were bit flips in the PCI config memory, and there are:

When the device misreports as CastleNet xxx, the PCI ID is: 1688:0013.
When wireless is working, you get: 168c:0013.

Note that a low bit of the real vendor ID (0x0004, the difference between 168c and 1688) got lost due to suspend to RAM. That, and many of the other normal values in the configuration space got lost.

So something bad is happening when power drops due to suspend; some values in the PCI memory disappear. I'm not sure what to debug next in this setup, but it's not the first case I've heard of the vendor ID changing; there were a couple of other reports in recent kernels... ath5k didn't change a lot in that cycle, so it could be platform related. Does it happen in 2.6.{31,32} by chance?

BTW here are a couple of patches to try reverting.

commit 0d0cd72fa1e6bfd419c99478ec70b4877ed0ef86
Author: Bob Copeland <me@bobcopeland.com>
Date: Sat Jul 4 12:59:54 2009 -0400

ath5k: do not release irq across suspend/resume

Paraphrasing Rafael J. Wysocki: "drivers should not release PCI IRQs in suspend." Doing so causes a warning during suspend/resume on some platforms.

Cc: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

commit baee1f3caa5a771880144358dd07d32e09ba4dcf
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date: Mon Oct 5 00:52:09 2009 +0200

Wireless / ath5k: Simplify suspend and resume callbacks

Simplify the suspend and resume callbacks of ath5k by converting the driver to struct dev_pm_ops and allowing the PCI PM core to do the PCI-specific suspend/resume handling.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

Bob: Are you sure this really matters? I've tried unloading the module before suspend and reloading it after wakeup and it didn't help... (I think the module should properly resume the card when it's loaded.) And BTW both patches were merged after 2.6.30, but I had problems even with 2.6.27-LTS...

BTW now I will try madwifi-hal for some time to see if things go better... So far I have found ath5k usable in 99% of cases when I do not use sleep (which is very uncomfortable for me).

I've tried the MadWifi-HAL version and the same problem is there... Maybe there's something wrong with suspend in the kernel or with something not directly related to ath5k...

PM: resume of devices complete after 2407.394 msecs
PM: Finishing wakeup.
Restarting tasks ... done.
ath_pci 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
MadWifi: ath_attach: Switching rfkill capability off.
wifi0: Atheros AR5213A chip found (MAC 5.9, PHY 2112 4.3, Radio 3.6)
ath_pci: wifi0: Atheros 5212: mem=0xe2000000, irq=17
ath0: no IPv6 routers present
wifi0: ath_rxorn_tasklet: Receive FIFO overrun; resetting.
wifi0: ath_reset: Unable to reset hardware: 'Hardware didn't respond as expected' (HAL status 3)
wifi0: FAILED verification of AR5K_PHY_AGCSIZE_DESIRED default value [found=0x2 (2) expected=0xde (-34)]. AR5K_PHY_AGCSIZE:0x9850:0x302d3130:..11.... ..1.11.1 ..11...1 ..11....:unknown
wifi0: FAILED verification of AR5K_PHY_AGCCOARSE_LO default value [found=0xae (-82) expected=0xcc (-52)]. AR5K_PHY_AGCCOARSE:0x985c:0x72695700:.111..1.
.11.1..1 .1.1.111 ........:unknown wifi0: FAILED verification of AR5K_PHY_AGCCOARSE_HI default value [found=0x52 (-46) expected=0x6e (-18)]. AR5K_PHY_AGCCOARSE:0x985c:0x72695700:.111..1. .11.1..1 .1.1.111 ........:unknown wifi0: FAILED verification of AR5K_PHY_SIG_FIRPWR default value [found=0xc (12) expected=0xba (-70)]. AR5K_PHY_SIG:0x9858:0x30303030:..11.... ..11.... ..11.... ..11....:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_LOW_M1 default value [found=0x49 (73) expected=0x32 (50)]. (unknown):0x986c:0x65726566:.11..1.1 .111..1. .11..1.1 .11..11.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_LOW_M2 default value [found=0x2b (43) expected=0x28 (40)]. (unknown):0x986c:0x65726566:.11..1.1 .111..1. .11..1.1 .11..11.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_LOW_M2_COUNT default value [found=0x25 (37) expected=0x30 (48)]. (unknown):0x986c:0x65726566:.11..1.1 .111..1. .11..1.1 .11..11.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_LOW_SELFCOR default value [found=0x0 (0) expected=0x1 (1)]. (unknown):0x986c:0x65726566:.11..1.1 .111..1. .11..1.1 .11..11.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_HIGH_M1 default value [found=0x29 (41) expected=0x4d (77)]. (unknown):0x9868:0x6552204e:.11..1.1 .1.1..1. ..1..... .1..111.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_HIGH_M2 default value [found=0x65 (101) expected=0x40 (64)]. (unknown):0x9868:0x6552204e:.11..1.1 .1.1..1. ..1..... .1..111.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_OFDM_HIGH_M2_COUNT default value [found=0xe (14) expected=0x10 (16)]. (unknown):0x9868:0x6552204e:.11..1.1 .1.1..1. ..1..... .1..111.:unknown wifi0: FAILED verification of AR5K_PHY_WEAK_CCK_THRESH default value [found=0x2 (2) expected=0x8 (8)]. (unknown):0xa208:0x04001202:.....1.. ........ ...1..1. ......1.:unknown wifi0: FAILED verification of AR5K_PHY_SIG_FIRSTEP default value [found=0x3 (3) expected=0x0 (0)]. AR5K_PHY_SIG:0x9858:0x30303030:..11.... ..11.... 
..11.... ..11....:unknown wifi0: FAILED verification of AR5K_PHY_SPUR_THRESH default value [found=0x3 (3) expected=0x2 (2)]. AR5K_PHY_SPUR:0x9924:0x00000106:........ ........ .......1 .....11.:unknown wifi0: ath_fatal_tasklet: Hardware error; resetting. wifi0: ath_chan_set: Unable to reset channel 1 (2412 MHz) flags 0xc0 'Hardware didn't respond as expected' (HAL status 3) any ideas what can this mean?: ------------[ cut here ]------------ WARNING: at drivers/net/wireless/ath/ath5k/base.c:1166 ath5k_tasklet_rx+0x565/0x590 [ath5k]() Hardware name: Aspire 3000 invalid hw_rix: 0 Modules linked in: ipv6 arc4 ecb cpufreq_ondemand cpufreq_stats cpufreq_conservative cpufreq_powersave powernow_k8 freq_table vboxnetadp vboxnetflt vboxdrv tun snd_seq_dummy ath5k snd_seq_oss snd_seq_midi_event mac80211 ath snd_seq snd_seq_device cfg80211 rfkill led_class snd_intel8x0 snd_intel8x0m joydev snd_ac97_codec sis_agp pcmcia snd_pcm snd_timer snd soundcore battery snd_page_alloc amd64_agp ac i2c_sis96x ac97_bus sr_mod agpgart i2c_core cdrom yenta_socket rsrc_nonstatic processor thermal button pcmcia_core psmouse usbhid hid pcspkr sg loop shpchp pci_hotplug sis900 mii k8temp serio_raw evdev autofs4 tpm_tis fuse tpm tpm_bios rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 ohci_hcd ehci_hcd usbcore sisfb ata_generic sd_mod pata_sis pata_acpi libata scsi_mod Pid: 3235, comm: sensors-applet Tainted: G W 2.6.33-ARCH #1 Call Trace: [<c1043b4d>] warn_slowpath_common+0x6d/0xa0 [<fa45a3d5>] ? ath5k_tasklet_rx+0x565/0x590 [ath5k] [<fa45a3d5>] ? ath5k_tasklet_rx+0x565/0x590 [ath5k] [<c1043bc6>] warn_slowpath_fmt+0x26/0x30 [<fa45a3d5>] ath5k_tasklet_rx+0x565/0x590 [ath5k] [<c1049a38>] tasklet_action+0x58/0xc0 [<c104a4bd>] __do_softirq+0x8d/0x1d0 [<c101dec6>] ? irq_complete_move+0x16/0x20 [<c101e7af>] ? ack_apic_level+0x5f/0x1f0 [<c104a63d>] do_softirq+0x3d/0x50 [<c104a9fd>] irq_exit+0x6d/0x70 [<c1005b00>] do_IRQ+0x50/0xc0 [<c10f4d9d>] ? 
sys_write+0x3d/0x70 [<c1003cb0>] common_interrupt+0x30/0x38 ---[ end trace 1b3facddd5b36d8e ]--- ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800824b5 ath5k phy0: failed to warm reset the MAC Chip ath5k phy0: can't reset hardware (-5) ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800824b5 ------------[ cut here ]------------ WARNING: at drivers/net/wireless/ath/ath5k/base.c:1166 ath5k_tasklet_rx+0x565/0x590 [ath5k]() Hardware name: Aspire 3000 invalid hw_rix: 0 Modules linked in: ipv6 arc4 ecb cpufreq_ondemand cpufreq_stats cpufreq_conservative cpufreq_powersave powernow_k8 freq_table vboxnetadp vboxnetflt vboxdrv tun snd_seq_dummy ath5k snd_seq_oss snd_seq_midi_event mac80211 ath snd_seq snd_seq_device cfg80211 rfkill led_class snd_intel8x0 snd_intel8x0m joydev snd_ac97_codec sis_agp pcmcia snd_pcm snd_timer snd soundcore battery snd_page_alloc amd64_agp ac i2c_sis96x ac97_bus sr_mod agpgart i2c_core cdrom yenta_socket rsrc_nonstatic processor thermal button pcmcia_core psmouse usbhid hid pcspkr sg loop shpchp pci_hotplug sis900 mii k8temp serio_raw evdev autofs4 tpm_tis fuse tpm tpm_bios rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 ohci_hcd ehci_hcd usbcore sisfb ata_generic sd_mod pata_sis pata_acpi libata scsi_mod Pid: 3098, comm: perl Tainted: G W 2.6.33-ARCH #1 Call Trace: [<c1043b4d>] warn_slowpath_common+0x6d/0xa0 [<fa45a3d5>] ? ath5k_tasklet_rx+0x565/0x590 [ath5k] [<fa45a3d5>] ? ath5k_tasklet_rx+0x565/0x590 [ath5k] [<c1043bc6>] warn_slowpath_fmt+0x26/0x30 [<fa45a3d5>] ath5k_tasklet_rx+0x565/0x590 [ath5k] [<c1049a38>] tasklet_action+0x58/0xc0 [<c104a4bd>] __do_softirq+0x8d/0x1d0 [<c101dec6>] ? irq_complete_move+0x16/0x20 [<c101e7af>] ? ack_apic_level+0x5f/0x1f0 [<c104a510>] ? __do_softirq+0xe0/0x1d0 [<c104a63d>] do_softirq+0x3d/0x50 [<c104a9fd>] irq_exit+0x6d/0x70 [<c1005b00>] do_IRQ+0x50/0xc0 [<c1003cb0>] common_interrupt+0x30/0x38 ---[ end trace 1b3facddd5b36d8f ]--- FWIW, I think those warnings are a separate issue from the resume-related failures... 
We need to just drop that warning, it's a nuisance. (Yes, I added it, I suck.)

It really seems like there's something wrong with the hardware (either platform or card), or with the order in which we are saving/restoring PCI data. You bought this device new?

Bob: anyway... I guess that removing the warning will not fix the suspend/resume issue :-)

Yeah, the warning just happens due to deferred rx processing after channel changes, and is harmless.

We can try pci_save/restore_state in the suspend/resume handlers to ensure config space is saved before the platform handlers run - that's my best guess on what to try next. I'll spin a patch this wknd.

Hey, Bob -- any word on that patch? :-)

And what if the problem is in the SiS chipset (minipci bus) or its driver?

Created attachment 26714 [details]
Use pci save/restore
Heh, oops slipped my mind. Ok try this one.
If the problem is not in the wireless driver, then I guess it needs to be reassigned to whomever handles the platform or bus driver, I don't know who that would be.
I've googled for pci save restore and I've found a few interesting things: http://software.itags.org/linux-unix/141535/ especially these lines:

1. Current PCI save/restore routines only cover first 64 bytes
2. No PCI bridge driver currently.
3. Some special devices can't or are difficult to save/restore config space with current model. Such as PCI link device, it's a sysdev, but its resume code can't be invoked with irq disabled.
4. ACPI possibly changes special devices' config space, such as host bridge or LPC bridge. The special devices generally are vendor specific, and possibly will not have a driver forever.

I can say this:

1.) wifi cards probably use more than 64 bytes of config space (lots of options)
2.) I have a pci2pci bridge in my laptop:

[22:15:05] 0 ;) harvie@harvie-ntb Shared $ lspci | grep -i pci
00:01.0 PCI bridge: Silicon Integrated Systems [SiS] SG86C202
...

It seems that there could be some workaround (or even a fix) for this... Can I at least save/restore the whole config space manually somehow?
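On the question of looking at config space by hand: Linux exposes each PCI device's configuration space as a binary file in sysfs (`/sys/bus/pci/devices/<address>/config`), and non-root readers typically get only the first 64 bytes. A minimal sketch that snapshots it and compares two snapshots follows; the helper names are hypothetical, the default device address is the one from this report, and the code only reads. It deliberately does not write anything back, which would be risky on a misbehaving device.

```python
from pathlib import Path


def read_config(address="0000:00:0b.0"):
    """Read the PCI config space of one device from sysfs (Linux only)."""
    return Path(f"/sys/bus/pci/devices/{address}/config").read_bytes()


def vendor_device(config):
    """Vendor ID and device ID are the first two little-endian 16-bit words."""
    vendor = int.from_bytes(config[0:2], "little")
    device = int.from_bytes(config[2:4], "little")
    return vendor, device


def diff_config(before, after):
    """List (byte offset, old value, new value) for every byte that changed."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b]


if __name__ == "__main__":
    # Offline example using the two IDs that lspci shows in this report.
    good = bytes.fromhex("8c161300")  # 168c:0013, card working
    bad = bytes.fromhex("88161300")   # 1688:0013, card misreporting
    for snapshot in (good, bad):
        print("%04x:%04x" % vendor_device(snapshot))
    for offset, old, new in diff_config(good, bad):
        print(f"byte {offset:#04x}: {old:#04x} -> {new:#04x} (xor {old ^ new:#04x})")
```

Taking one snapshot with `read_config()` while the card works and another after a failed resume would show exactly which bytes changed. For the two IDs that lspci reports in this bug, the whole difference is the single bit 0x04 in byte 0.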
Oh, I didn't notice the attachment :-) I've found this already in the LTS kernel 2.6.32 ( src/linux-2.6.32/drivers/net/wireless/ath/ath5k/base.c ):

pci_save_state(pdev);
pci_restore_state(pdev);

look:

static int
ath5k_pci_suspend(struct pci_dev *pdev, pm_message_t state)
{
        struct ieee80211_hw *hw = pci_get_drvdata(pdev);
        struct ath5k_softc *sc = hw->priv;

        ath5k_led_off(sc);

        pci_save_state(pdev);
        pci_disable_device(pdev);
        pci_set_power_state(pdev, PCI_D3hot);

        return 0;
}

static int
ath5k_pci_resume(struct pci_dev *pdev)
{
        struct ieee80211_hw *hw = pci_get_drvdata(pdev);
        struct ath5k_softc *sc = hw->priv;
        int err;

        pci_restore_state(pdev);

        err = pci_enable_device(pdev);
        if (err)
                return err;

        /*
         * Suspend/Resume resets the PCI configuration space, so we have to
         * re-disable the RETRY_TIMEOUT register (0x41) to keep
         * PCI Tx retries from interfering with C3 CPU state
         */
        pci_write_config_byte(pdev, 0x41, 0);

        ath5k_led_enable(sc);
        return 0;
}

I'll try it... or is pdev something different from to_pci_dev(dev)?

Yeah, we don't need to save the whole MMIO space though -- mac80211 saves the wifi state and we reprogram the card after resume. We need to preserve enough to recognize the card though.

pdev is not different from to_pci_dev(). The code from LTS is from an earlier kernel (one of the patches I suggested reverting removed it).

Bob: yeah, anyway the LTS kernel causes trouble too:

...
ath5k phy0: noise floor calibration failed (2417MHz) __ratelimit: 12 callbacks suppressed ath5k phy0: noise floor calibration timeout (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration timeout (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration failed (2417MHz) ath5k phy0: noise floor calibration timeout (2417MHz) __ratelimit: 42 callbacks suppressed ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) ath5k phy0: noise floor calibration failed (5745MHz) __ratelimit: 53 callbacks suppressed ath5k phy0: failed to resume the MAC Chip ath5k phy0: can't reset hardware (-5) No probe response from AP 00:04:e2:fc:b8:aa after 500ms, disconnecting. 
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
wlan0: direct probe to AP 00:04:e2:fc:b8:aa (try 1)
wlan0: direct probe to AP 00:04:e2:fc:b8:aa (try 2)
wlan0: direct probe to AP 00:04:e2:fc:b8:aa (try 3)
wlan0: direct probe to AP 00:04:e2:fc:b8:aa timed out
__ratelimit: 58 callbacks suppressed
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
__ratelimit: 56 callbacks suppressed
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k phy0: failed to wakeup the MAC Chip
ath5k phy0: can't reset hardware (-5)
...

Well, one problem at a time. The LTS kernel is old and not upstream, so please try the patch against a recent vanilla kernel and let us know if it fixes the restore problem.

Created attachment 26816
mugshot of crashed machine
Bob: problems even with 2.6.34, i have new screenshot of crashed box saying HARDWARE ERROR, so it looks to be even more complicated and only thing we can do is probably some kind of workaround...
*** Bug 14561 has been marked as a duplicate of this bug. ***

This report is very confusing... Did you investigate the hardware problem? Did you run 'mcelog --ascii' as the backtrace suggested? Did you report (or otherwise resolve) that bug? Did you try Bob's patch from comment 25 against a recent kernel (i.e. 2.6.35 or later)? Can you reproduce the problem with that?

John: i am looking at Bob's patch and base.c and i've found this piece of code:

	/*
	 * Suspend/Resume resets the PCI configuration space, so we have to
	 * re-disable the RETRY_TIMEOUT register (0x41) to keep
	 * PCI Tx retries from interfering with C3 CPU state
	 */
	pci_write_config_byte(pdev, 0x41, 0);

Maybe this is the actual problem. i've applied the patch, but i will also try removing this line... Maybe writing to some magic 0x41 offset is not a good idea with all chipsets...

John:
1.)
> Did you report (or otherwise resolve) that bug?
I think that it's what i am doing right now. it seems to me that its VERY much related to ath5k.
2.)
the mcelog output doesn't seem very satisfying to me:
[22:30:02] 0 ;) root@harvie-ntb mcelog # cat /home/harvie/Desktop/mcelog.txt
HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 789a2e7426
PROCESSOR 2:20fc2 TIME 1276714843 SOCKET 0 APIC 0
[22:30:13] 0 ;) root@harvie-ntb mcelog # cat /home/harvie/Desktop/mcelog.txt | mcelog --ascii
CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 789a2e7426
CPU 0 BANK 4 TSC 789a2e7426
TIME 1276714843 Wed Jun 16 21:00:43 2010
STATUS b200000000070f0f MCGSTATUS 4
PROCESSOR 2:20fc2 TIME 1276714843 SOCKET 0 APIC 0
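
For readers who want to read the STATUS value by hand: the top bits of an x86 machine-check status register are architectural flags. The following is only a minimal sketch, assuming the standard MCi_STATUS layout (bit 63 valid, bit 61 uncorrected, bit 57 processor context corrupt); the helper names are hypothetical and are not part of mcelog or the kernel.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits in the upper part of an x86 MCi_STATUS value. */
#define MCI_STATUS_VAL   (1ULL << 63) /* register holds a valid error    */
#define MCI_STATUS_OVER  (1ULL << 62) /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61) /* error was not corrected          */
#define MCI_STATUS_EN    (1ULL << 60) /* reporting was enabled            */
#define MCI_STATUS_MISCV (1ULL << 59) /* MISC register is valid           */
#define MCI_STATUS_ADDRV (1ULL << 58) /* ADDR register is valid           */
#define MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt        */

/* Hypothetical helper: valid, uncorrected and context-corrupting at once. */
static int mce_is_fatal(uint64_t status)
{
	uint64_t fatal = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_PCC;

	return (status & fatal) == fatal;
}

/* Hypothetical helper: print which of the flag bits are set. */
static void mce_print_flags(uint64_t status)
{
	printf("VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n",
	       !!(status & MCI_STATUS_VAL), !!(status & MCI_STATUS_OVER),
	       !!(status & MCI_STATUS_UC), !!(status & MCI_STATUS_EN),
	       !!(status & MCI_STATUS_MISCV), !!(status & MCI_STATUS_ADDRV),
	       !!(status & MCI_STATUS_PCC));
}
```

Calling mce_print_flags(0xb200000000070f0fULL) with the value from this report shows which flags are set; the low bits (the error code itself) are model specific and are left to mcelog.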
A bit more accurate output:

# cat /home/harvie/Desktop/mcelog.txt | mcelog --cpu k8 --dmi --ascii
WARNING: with --dmi mcelog --ascii must run on the same machine with the same BIOS/memory configuration as where the machine check occurred.
CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 789a2e7426
CPU 0 4 northbridge TSC 789a2e7426
TIME 1276714843 Wed Jun 16 21:00:43 2010
Northbridge Watchdog error
bit57 = processor context corrupt
bit61 = error uncorrected
bus error 'generic participation, request timed out
generic error mem transaction
generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4
PROCESSOR 2:20fc2 TIME 1276714843 SOCKET 0 APIC 0

FWIW we tried removing the write to configuration space and it broke some stuff. At the time Jouni from Atheros confirmed it needs to be there.

I would still like a confirmation on whether the patch does anything. It is not in any upstream kernel yet.

Bob: oh yes, i can definitely say that removing that line breaks something :-)

Right now i am building a kernel with these changes:
- your patch
- commented out the lines for disabling/enabling the LED during suspend (hope this will not break something)
- added CONFIG_PCIEASPM=y
- added pci_write_config_byte(pdev, 0x41, 0); to a few more PCI events (attach, detach) in the hope that i will be able to fix something by reloading the module

I'll let you know ASAP. You must understand that it takes some time before i am able to reproduce the bug, because i need to leave the computer turned off for a sufficient time, then i have to use it for a while, then i have to leave it suspended to RAM for a sufficient time (a few hours or a whole night), and then i have to turn it on and use it while watching the logs...

BTW the behaviour of the system without pci_write_config_byte(pdev, 0x41, 0); in the resume function is very similar to the behaviour of the system with that line once this bug shows up after resume. So we have two important questions:
1.)
How can the absence of that line (in the PCI resume handler) affect a system which hasn't been suspended yet?
2.)
Adding this line seems to fix the problem, but only partially. Haven't we forgotten something similar elsewhere? Or is it a completely different issue?

And why does the RETRY_TIMEOUT register (0x41) not survive the suspend/resume? Why does it not survive save/restore of the PCI state? How should other registers survive? Maybe we should store all known atheros registers manually, because they are lost too...

Are we uploading some firmware to the device? Maybe it should be reuploaded after resume...

Bob, John: It seems that the patch didn't help much... i saw this a few moments before the errors started to appear in the log:

ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800814b5

what does this mean? is it OK? Here is the whole log:

ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800814b5
------------[ cut here ]------------
WARNING: at drivers/net/wireless/ath/ath5k/base.c:1179 ath5k_tasklet_rx+0x5ab/0x670 [ath5k]()
Hardware name: Aspire 3000
invalid hw_rix: 0
Modules linked in: ath5k arc4 ecb mac80211 ath cfg80211 rfkill led_class ipv6 cpufreq_userspace cpufreq_ondemand cpufreq_stats cpufreq_conservative cpufreq_powersave powernow_k8 freq_table mperf snd_seq_dummy snd_seq_oss tun snd_seq_midi_event snd_seq snd_seq_device joydev snd_intel8x0m i2c_sis96x i2c_core usbhid hid sr_mod pcmcia cdrom snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ac battery thermal button processor amd64_agp pcspkr agpgart sg loop psmouse k8temp sis900 mii shpchp yenta_socket pci_hotplug pcmcia_rsrc pcmcia_core evdev serio_raw autofs4 fuse rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 ohci_hcd ehci_hcd usbcore sisfb ata_generic sd_mod pata_sis pata_acpi libata scsi_mod [last unloaded: ath5k]
Pid: 9031, comm: epiphany Not tainted 2.6.35-ARCH #3
Call Trace:
 [<c104373d>] warn_slowpath_common+0x6d/0xa0
 [<fa36d33b>] ? ath5k_tasklet_rx+0x5ab/0x670 [ath5k]
 [<fa36d33b>] ? ath5k_tasklet_rx+0x5ab/0x670 [ath5k]
 [<c10437ee>] warn_slowpath_fmt+0x2e/0x30
 [<fa36d33b>] ath5k_tasklet_rx+0x5ab/0x670 [ath5k]
 [<c1049778>] tasklet_action+0x58/0xc0
 [<c1049ca0>] __do_softirq+0x90/0x1e0
 [<c101fc36>] ? irq_complete_move+0x16/0x20
 [<c102027f>] ? ack_apic_level+0x5f/0x1f0
 [<c1049cf1>] ? __do_softirq+0xe1/0x1e0
 [<c1067f6f>] ? ktime_get_ts+0xff/0x130
 [<c1049e2d>] do_softirq+0x3d/0x50
 [<c104a1ed>] irq_exit+0x6d/0x70
 [<c1005a50>] do_IRQ+0x50/0xc0
 [<c1048951>] ? sys_gettimeofday+0x31/0x70
 [<c1003d30>] common_interrupt+0x30/0x38
---[ end trace 228776f33fb36589 ]---
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800814b5
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
net_ratelimit: 683 callbacks suppressed
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
No probe response from AP 00:04:e2:fc:b8:aa after 500ms, disconnecting.
cfg80211: Calling CRDA to update world regulatory domain
cfg80211: World regulatory domain updated:
(start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
(2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
(2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
(2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
(5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
(5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
net_ratelimit: 258 callbacks suppressed
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
net_ratelimit: 52 callbacks suppressed
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k phy1: failed to wakeup the MAC Chip
ath5k phy1: can't reset hardware (-5)
ath5k 0000:00:0b.0: PCI INT A disabled

(unloaded the module to prevent a crash of the whole system...)

And another question (i hope it can lead somewhere): after each failure i see the CastleNet NIC instead of Atheros:

00:0b.0 Ethernet controller: CastleNet Technology Inc. Device 0013 (rev 01)

Why is it always the same ID? If the memory was overwritten by some random data, it would be different each time it fails. This looks like it's almost intentionally overwritten somewhere in the code... Or maybe there is some other reason for it to act as a CastleNet NIC... what do you think?

So the PCI ID changes 168c:0013 -> 1688:0013...that is a single bit difference.
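
The single-bit claim is easy to check by hand. A minimal sketch, using a hypothetical helper name that does not appear in ath5k or the PCI core: XOR the two vendor IDs and count the set bits of the result.

```c
#include <stdint.h>

/* Hypothetical helper: number of bit positions in which a and b differ. */
static unsigned int differing_bits(uint16_t a, uint16_t b)
{
	uint16_t x = a ^ b;	/* bits set where the two values disagree */
	unsigned int n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

For the two vendor IDs above, 0x168c ^ 0x1688 is 0x0004, so differing_bits(0x168c, 0x1688) returns 1: only bit 2 of the vendor ID reads differently.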
You've got hardware failing to reset, PCI IDs changing (by a single bit), and you are getting MCE errors...is there some reason not to suspect that you have a dying laptop?

(In reply to comment #41)
> And why the RETRY_TIMEOUT register (0x41) does not survive the
> suspend/resume?
> Why it does not survive save/restore of PCI state? How should other registers
> survive? Maybe we should store all known atheros registers manualy, because
> they are lost too...

Only the first 0x40 registers are saved by the core; 0x41+ are device specific. We do reprogram the card after resume so no need to worry about the rest.

> is there some reason not to suspect that you have a dying laptop?

Besides that i can't afford a new one right now? :-D and a hacker should never throw away any piece of hardware which is at least partially working :-)

I've been running memtest and some CPU tests without errors (well, the problem can be somewhere else). Anyway i still think that my HW is OK.
- there are also other users with similar issues with this driver and hardware.
- there are no other problems and it just WORKS when i am not using suspend
- do we have some positive feedback like "hello i am lumberjack and i feel ok. i have my lappy suspended for all night and my AR5004X wifi is working flawlessly all day with your driver for weeks without reboot." ??

> We do reprogram the card after resume so no need to worry about the rest.

Can we try to disable this? or will it not work without it? I really think that manually saving and restoring all bytes can help (at least as a workaround). It seems to me that the card is reprogrammed incorrectly after wakeup, or some bytes are not touched at all.

i can try to write it myself, but i don't even know how to allocate memory in kernelspace... i can probably use some static array to save the config space, which will work for a single card only :-) but i am not a kernel hacker and i will probably screw it all up. btw how can i get the total number of config bytes? (i don't want to overflow somewhere and kill the card for good).

Can you point at "other users with similar issues with this driver and hardware"? As for happy lumberjacks or whatever, it can be difficult to collect "everything is ok" reports -- people generally don't bother with that.

> Can you point at "other users with similar issues with this driver and hardware"?

*** Bug 14561 has been marked as a duplicate of this bug. ***

And i've seen a few more similar wakeup problems here on bugzilla... Well, nobody complains if he's happy. This card seems to be popular, but it's more commonly used in routers/APs (which are actually NEVER suspended) than in laptops...

Well, I can at least say that suspend works fine here on my hardware (well, not in 2.6.36-rcX yet, but that's not ath5k's fault).

The part that reprograms the card after suspend is in the core - take a look at net/mac80211/pm.c and ieee80211_reconfig in util.c -- it really does the same thing as when you do ifconfig up (which includes a reset that reloads all the card's initvals). We just remember enough in the driver to power up the card again.

Anyway some of these issues seem unrelated to suspend (seems like there are multiple problems). Can you try enabling slab/slub debug on your kernel and see if anything triggers? Also, post the output of /proc/interrupts? The IMR/ISR printk seems to indicate interrupts firing when they are supposed to be disabled. Could just be sharing or something.

ok.
1.) i'll check whether something breaks after several ifconfig down/ups.
3.) i'll post /proc/interrupts from both the working and the failing state of the card
2.) can we write to the space where the vendor/product IDs are stored? if it's not physically impossible, maybe we are destroying it accidentally.
4.) is there some way i can avoid the IMR/ISR "just sharing or something" issue?

(Note that the list in comment 51 is numbered out of order...)

Thomas, can you provide the information from 1 & 3?

Bob, can you answer 2 & 4?
2.6.35.7 (maybe 2.6.35.6 too) seems to be even worse... i think now it fails even without suspending (while older kernels are still working - at least before suspending). I don't have much time to investigate it now, but i really think that something got worse...

John: sorry, i will look at it ASAP.

Thomas & Bob, ping?

2) Personally haven't tried it, but it should be difficult to accidentally write into the configuration space. Take a look at ath_info.c (http://www.sfr-fresh.com/linux/misc/madwifi-0.9.4.tar.gz:a/madwifi-0.9.4/tools/ath_info.c) -- it has to set a specific bit in a command register, as well as loading data and offset registers in order to write the pci id. And we don't (intentionally, at least) ever use the write bit within ath5k.

4) I'm afraid I don't know the apic that well; sometimes this is achieved by moving things around to different slots or changing bios settings.

I really think there is something wrong with your PCI controller: not only do you get MCEs and changed pci ids, but also interrupts from nowhere...

ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800824b5
ath5k phy0: failed to warm reset the MAC Chip
ath5k phy0: can't reset hardware (-5)
ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800824b5

Notice that IMR is active (has bits set). The card generates an interrupt when ISR & IMR is non-zero, but here ISR is zero, so the card is not generating any interrupts. That's why ath5k_hw_get_isr prints this message: this shouldn't happen! Your PCI controller messes things up and sends interrupts that trigger our interrupt handler, and get_isr is called only to find out that the hw never sent an interrupt :P

BTW when you read status with lspci you don't use ath5k (if you did, you wouldn't be able to read them without a driver being loaded). lspci reads the PCI registers directly, and pci ids etc are mapped to fixed eeprom locations; in this case it's the hw that does the EEPROM read, through the EEPROM controller.
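
The ISR/IMR argument can be written down as a small check. This is only an illustrative sketch with a hypothetical helper name; it is not the actual ath5k interrupt path, which does more work per chip revision. The assumption, taken from the explanation above, is that the card only raises an interrupt for status bits that are also enabled in the mask.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the card itself can be the source of the
 * interrupt, i.e. at least one pending status bit is enabled in the mask.
 */
static bool irq_raised_by_card(uint32_t isr, uint32_t imr)
{
	return (isr & imr) != 0;
}
```

With the values from the log, irq_raised_by_card(0x00000000, 0x800824b5) is false: the handler was entered although the card reports no pending status, which is what points at a shared or spurious interrupt line rather than at the card.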
Also, as you saw, ath5k didn't corrupt your EEPROM, because if it did, the pci ids etc wouldn't magically return to their initial values after power-off. The only corrupted EEPROMs I've seen on CM6/9 were due to extreme power failures on embedded wi-fi routers or lightning strikes (and that's because they are very popular for such setups), and those are the only cases I've ever seen! Actually I have a CM6 with a corrupted EEPROM and it can still work (i was thinking of a "default" set of settings in case of corrupted EEPROMs but Atheros ppl said it's too complex); at least it doesn't give any failed resets, POST tests etc.

A buggy PCI controller can explain everything: if it messes up data transfers (as it did with the pci ids) then we read crap when we try to reset/wakeup the card (btw you can use ath_info to check register values when your card works fine and when "it doesn't" to check this out), and it also messes up when we try to do the POST test...

kernel: ath5k phy0: POST Failed !!!

and BTW this can't be ath5k's fault because ath5k has just loaded (we run the POST test on attach)!

You mentioned that instead of "Memory at e2010000" there was "Memory at e2000000" or something else... ath5k can't change its pci base address!

Finally, here is a report of another wireless driver that fails on resume on the same SiS chipset...

https://bugs.launchpad.net/ubuntu/+source/acpi/+bug/67792

So please give me a reason why this is an ath5k bug...

Well, so you say it's definitely a hardware bug, OK? Or is it a software problem with the PCI controller driver?

I am not saying that it's impossible to have a hardware error in a 6 year old Acer laptop :-D I just think that ath5k should not freeze the whole kernel even in case of buggy hardware.

Why do the PCI ids not change if i don't load ath5k? Why does the kernel not crash when i unload ath5k?
Let me explain: if the module knows there is a strange hardware error and the system is going to crash unless the module is unloaded, why does ath5k not disable/suspend or completely unload itself to prevent a crash which can cause data loss and similar problems?

PCI ids don't change if ath5k is not loaded because the card isn't working; you can't have a suspend failure if there is nothing to suspend, or corruption when the card is inactive.

Ath5k is responsible for ar5k cards, it's not responsible for the whole system. It can't know eg. if your host PCI controller is failing because it doesn't know anything about it; it just tries to talk to the card and assumes that the PCI subsystem is doing its job. It can detect failures in card operation, but a reset fixes most of them; so far we haven't seen a card failure so hard that we need to unload the module or else the system hangs.

From what I've seen so far there is a problem with the PCI subsystem in your case. It might be the host's pci controller/driver or the card's pci controller. I can't guarantee there is no hw failure on your card's pci controller, although I've never seen such a failure -or report- and I have a CM9 right here in front of me that works with ath5k just fine (actually I've been using that card for development for 4 years and it was also working in my outdoor wireless router for a while, I've done everything on this card :P, unfortunately suspend/resume doesn't work on my old laptop with or without the card so I can't test it). However I'm sure we handle pci stuff correctly on our part in sw, and you can see that for yourself, since for your Dlink DWL-G650 you run exactly the same code as with the CM9; only the phy changes from RF5112 to RF2112, which is actually the same phy without 5GHz support. There is no difference in the mac/pci code (pcmcia cards are also pci cards to us, bridge control is handled by another driver, eg yenta-socket)!

Also what do you mean when you say "ath5k crashes"?
Except for the MCE you gave us (which indicates a bus error btw and explicitly says "This is not a software problem!") I didn't see any logs that indicate ath5k crashes. When you load/unload a driver for a PCI card the PCI subsystem also gets triggered, so how can we be sure which driver fails without logs of the crash? The only logs that you gave us indicate a problem in interrupt processing, with interrupts that are not triggered by the ar5k card (which clearly indicates a problem with PCI), and a slowpath warning in rx_tasklet (also related to interrupt processing), and they are just warnings, not errors. We still don't know what's going on when your system crashes; my guess is you get an interrupt storm and your system hangs, check this out...

ath5k_hw_get_isr: ISR: 0x00000000 IMR: 0x800814b5
ath5k phy1: too many interrupts, giving up for now
ath5k phy1: too many interrupts, giving up for now
net_ratelimit: 683 callbacks suppressed

This is not a bug in ath5k: either you have a hw failure on your card's/host PCI controller or your host PCI controller is not properly configured for suspend/resume (and that's a common thing, suspend/resume is not such a "standard" procedure, it has to do with BIOS support, ACPI stuff etc + not all vendors follow the standards). In both cases there is nothing we can do.

Just for the record try this...

a) Disable NetworkManager
b) Unload ath5k
c) Suspend
d) Resume
e) Reload ath5k
f) Enable NetworkManager

Also please load ath5k with DEBUG_RESET (0x00000001) and see if you get any fatal interrupts (that would be very interesting)...

Any news on this one?

with the new kernel it does not work at all... (no crash, but no wifi either)

$ uname -a
Linux insomnia 2.6.37-ARCH #1 SMP PREEMPT Sat Jan 29 19:40:04 UTC 2011 i686 Mobile AMD Sempron(tm) Processor 3000+ AuthenticAMD GNU/Linux

$ dmesg
cfg80211: Calling CRDA to update world regulatory domain
ath5k 0000:00:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath5k 0000:00:0b.0: registered as 'phy0'
cfg80211: World regulatory domain updated:
(start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
(2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
(2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
(2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
(5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
(5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211: Calling CRDA for country: CZ
cfg80211: Regulatory domain changed to country: CZ
(start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
(2400000 KHz - 2483500 KHz @ 40000 KHz), (N/A, 2000 mBm)
(5150000 KHz - 5250000 KHz @ 40000 KHz), (N/A, 2301 mBm)
(5250000 KHz - 5350000 KHz @ 40000 KHz), (N/A, 2000 mBm)
(5470000 KHz - 5725000 KHz @ 40000 KHz), (N/A, 2698 mBm)
ath5k phy0: Invalid EEPROM checksum: 0x082a eep_max: 0x0340 (default size)
ath5k phy0: unable to init EEPROM
ath5k 0000:00:0b.0: PCI INT A disabled
ath5k: probe of 0000:00:0b.0 failed with error -5

$ lspci -vvv
00:0b.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)
	Subsystem: Wistron NeWeb Corp. CM9 Wireless a/b/g MiniPCI Adapter
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+ INTx-
	Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 80 bytes
	Interrupt: pin A routed to IRQ 17
	Region 0: Memory at e2000000 (32-bit, non-prefetchable) [size=64K]
	Region 1: Memory at <unassigned> (64-bit, non-prefetchable)
	Region 3: Memory at <unassigned> (64-bit, non-prefetchable)
	Region 5: Memory at <invalid-64bit-slot> (64-bit, non-prefetchable)
	Capabilities: [44] Slot ID: 0 slots, First-, chassis 00
	Kernel modules: ath5k

and there's no device in ifconfig of course...

Have you tried what i mentioned in the previous post?

Closing due to lack of response...

With each kernel version everything got even worse: the card stopped working completely (and i thought i'd bricked my miniPCI controller), and after a few more kernel versions it ended up completely freezing my system until a key was pressed or the mouse was moved. This led me to another idea. NOHZ! (a tickless kernel is the default in ArchLinux) i've added nohz=off to the kernel options line in GRUB and the card magically started to work. It also survived the first suspend to RAM, so we'll see if the tickless mode was the only problem.