Bug 13987

Summary: Received NMI interrupt at resume
Product: Networking Reporter: Christian Casteyde (casteyde.christian)
Component: WirelessAssignee: networking_wireless (networking_wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: linville, rjw
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 2.6.31-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 13615    
Attachments: dmesg output for 2.6.32-rc1 after boot
final git bisect log

Description Christian Casteyde 2009-08-15 07:55:04 UTC
Kernel 2.6.31-rc2 through -rc6, 64bits mode
Athlon X64 3GHz single core
Bluewhite64 (64 bit native Slackware distro)
Running on Acer Aspire 1511Lmi laptop

I get the following at each resume from suspend to ram (echo -n mem > /sys/power/state):

Uhhuh. NMI received for unknown reason b0 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue

I already reported this warning in
http://bugzilla.kernel.org/show_bug.cgi?id=13781#c9

but this bug entry was for another resume problem and was since closed, so I'm opening this new one.

This error was not present with 2.6.30 (also compiled with debug/kernel hacking options, and this log was already present in kernel/traps.c in this version, so I think this is a regression.

Of course, I have no hardware problem I'm aware of. Moreover, except this warning, everything is working normally after the resume.

Portion of the dmesg log below:

Intel ICH 0000:00:06.0: PCI INT A -> Link[LACI] -> GSI 22 (level, low) -> IRQ 22
Intel ICH 0000:00:06.0: setting latency timer to 64
Intel ICH Modem 0000:00:06.1: PCI INT B -> Link[LMCI] -> GSI 21 (level, low) -> IRQ 21
Intel ICH Modem 0000:00:06.1: setting latency timer to 64
AMD_IDE 0000:00:08.0: setting latency timer to 64
amd74xx 0000:00:08.0: BIOS didn't set cable bits correctly. Enabling workaround.
pci 0000:00:0a.0: setting latency timer to 64
b43-pci-bridge 0000:02:08.0: PCI INT A -> Link[LNK4] -> GSI 19 (level, low) -> IRQ 19
firewire_core: skipped bus generations, destroying all nodes
pci 0000:01:00.0: PME# disabled
parport_pc 00:0b: activated
hda: host max PIO5 wanted PIO255(auto-tune) selected PIO4
hda: UDMA/100 mode selected
hdc: host max PIO5 wanted PIO255(auto-tune) selected PIO4
hdc: UDMA/33 mode selected
Uhhuh. NMI received for unknown reason b0 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
firewire_core: rediscovered device fw0
usb 2-3: reset full speed USB device using ohci_hcd and address 2
Restarting tasks ... done.
powernow-k8: error - out of sync, fix 0x0 0xa, vid 0x12 0x2
Comment 1 Rafael J. Wysocki 2009-08-15 14:06:14 UTC
Does it happen with 2.6.31-rc1?
Comment 2 Christian Casteyde 2009-08-16 13:06:04 UTC
Yes.
I rechecked it: it doesn't happen with 2.6.30.4 + debug options, only with 2.6.31-rc*. However, there are more debug options in 2.6.31 (for example, memcheck), which slows down the computer a lot and therefore can be the cause for the problem in case of a race condition at resume.
Comment 3 Christian Casteyde 2009-08-23 09:18:04 UTC
Update: still valid with -rc7
The kmemcheck option is not the cause of the problem, it happens with "classical" debug options.
Comment 4 Rafael J. Wysocki 2009-08-26 13:52:28 UTC
On Wednesday 26 August 2009, Christian Casteyde wrote:
> Hello,
> 
> Yes, this bug is sill valid.
> 
> Le mardi 25 août 2009, vous avez écrit :
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.30.  Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13987
> > Subject             : Received NMI interrupt at resume
> > Submitter   : Christian Casteyde <casteyde.christian@free.fr>
> > Date                : 2009-08-15 07:55 (11 days old)
Comment 5 Christian Casteyde 2009-09-11 13:39:52 UTC
Update: The final release 2.6.31 still issues the warnings (even without debugging options). The warnings are sometimes doubled:

Message from syslogd@athor at Fri Sep 11 15:32:19 2009 ...
athor kernel: Uhhuh. NMI received for unknown reason b1 on CPU 0.

Message from syslogd@athor at Fri Sep 11 15:32:19 2009 ...
athor kernel: You have some hardware problem, likely on the PCI bus.

Message from syslogd@athor at Fri Sep 11 15:32:19 2009 ...
athor kernel: Dazed and confused, but trying to continue

Message from syslogd@athor at Fri Sep 11 15:32:57 2009 ...
athor kernel: Uhhuh. NMI received for unknown reason a0 on CPU 0.

Message from syslogd@athor at Fri Sep 11 15:32:57 2009 ...
athor kernel: You have some hardware problem, likely on the PCI bus.

Message from syslogd@athor at Fri Sep 11 15:32:57 2009 ...
athor kernel: Dazed and confused, but trying to continue

Last note: these message always happens just after IDE resume: in dmesg log, the last info shown are always after that:

hda: host max PIO5 wanted PIO255(auto-tune) selected PIO4
hda: UDMA/100 mode selected
hdc: host max PIO5 wanted PIO255(auto-tune) selected PIO4
hdc: UDMA/33 mode selected
Comment 6 Christian Casteyde 2009-09-30 18:19:04 UTC
Update for 2.6.32-rc1

This kernel cannot suspend anymore the laptop, so I cannot give more info for now. However, I got the warning message once at boot (not at resume). This may be the same problem, or not, but I append the dmesg output if it can be usefull.

...
VFS: Mounted root (ext4 filesystem) readonly on device 3:2.
Freeing unused kernel memory: 440k freed
kjournald2 starting: pid 1391, dev hda2:8, commit interval 5 seconds
Write protecting the kernel read-only data: 7580k
udev: renamed network interface wlan0 to eth1
Adding 506008k swap on /dev/hda1.  Priority:-1 extents:1 across:506008k
EXT4-fs (hda2): internal journal on hda2:8
kjournald starting.  Commit interval 5 seconds
EXT3 FS on hda3, internal journal
EXT3-fs: mounted filesystem with writeback data mode.
b43 ssb0:0: firmware: requesting b43/ucode5.fw
b43 ssb0:0: firmware: requesting b43/pcm5.fw
b43 ssb0:0: firmware: requesting b43/b0g0initvals5.fw
b43 ssb0:0: firmware: requesting b43/b0g0bsinitvals5.fw
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
ADDRCONF(NETDEV_UP): eth1: link is not ready
Uhhuh. NMI received for unknown reason a1 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
...

(after that, there is a kmemcheck warning on b43 driver).
Maybe the unexpected NMI is related to the b43 driver when it is activated. I append the full dmesg output for more info.
Comment 7 Christian Casteyde 2009-09-30 18:20:02 UTC
Created attachment 23211 [details]
dmesg output for 2.6.32-rc1 after boot
Comment 8 Christian Casteyde 2009-10-11 22:30:27 UTC
Trying to bisect it, since I can reproduce at each suspend.
Still 260 revisions to check for now, will give results in 1 or 2 days.
Comment 9 Christian Casteyde 2009-10-12 18:29:01 UTC
I've just finished the bisection. Here are the results:

root@athor:/usr/src/linux-git# git bisect bad
f41f3f373dd72344c65d801d6381fe83ef3a2c54 is first bad commit
commit f41f3f373dd72344c65d801d6381fe83ef3a2c54
Author: Johannes Berg <johannes@sipsolutions.net>
Date:   Sun Jun 7 12:30:34 2009 -0500

    b43/legacy: port to cfg80211 rfkill

    This ports the b43/legacy rfkill code to the new API offered
    by cfg80211 and thus removes a lot of useless stuff.

    Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
    Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
    Cc: Michael Buesch <mb@bu3sch.de>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

:040000 040000 b5da3bc162b9f90b16511e5742f3e55bed3d77fd 1aff203801c849e6868f1ddd1d0a8bdb7c2822c8 M      drivers
root@athor:/usr/src/linux-git#

So this bug is due to the b43 port to rfkill.

Don't know if the warnings I get sometimes at boot are due to the same cause, but I effectively do net stop / net start sequence in my suspend script, and the warning indeed seems to occur while restoring the wireless link (KDE4 bells each time the warning occurs to notify the user (me), so I know it is there even before the screen is restored).

For completeness, my suspend script is the following one:
       # Stops network:
        /etc/rc.d/rc.inet1 eth1_stop

        # Flush all data to disk (just in case)
        LD_PRELOAD="" sync

        # Save the system time
        hwclock --systohc --utc

        # Suspend
        echo -n mem > /sys/power/state

        # 2.6.25 does not send lid event at resume:
        rm -rf /var/state/hibernating

        # Restore the system time
        hwclock --hctosys --utc

        # Reset the graphic card
        vbetool post

        # Reset the CD/DVD drive
        hdparm -w /dev/hdc

        # Reset audio
        alsactl restore

        # Starts network:
        /etc/rc.d/rc.inet1 eth1_start

        # Switch back to X
        chvt 7
Comment 10 Christian Casteyde 2009-10-12 18:30:35 UTC
Created attachment 23366 [details]
final git bisect log

Just in case, here is the git bisect log.
Comment 11 Rafael J. Wysocki 2009-10-12 20:58:11 UTC
First-Bad-Commit : f41f3f373dd72344c65d801d6381fe83ef3a2c54
Comment 12 John W. Linville 2009-10-13 15:26:21 UTC
Not obvious to me how that triggers an NMI.  Maybe something to do with the SSB bus operations?
Comment 13 Rafael J. Wysocki 2009-11-17 22:14:21 UTC
On Tuesday 17 November 2009, Christian Casteyde wrote:
> Hello,
> 
> yes, this bug is still present in -rc7.
> 
> CC
> Le mardi 17 novembre 2009, vous avez écrit :
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31.  Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13987
> > Subject             : Received NMI interrupt at resume
> > Submitter   : Christian Casteyde <casteyde.christian@free.fr>
> > Date                : 2009-08-15 07:55 (94 days old)
Comment 14 Johannes Berg 2009-11-17 22:18:06 UTC
So ... does reverting that patch actually fix it?
Comment 15 Christian Casteyde 2009-11-17 23:43:34 UTC
Before this patch, the problem wasn't there indeed. But I don't know if that patch can be reverted on -rc7, neither where to get the patch alone.

I could check on -rc7 if you tell me how to revert it.
Comment 16 Johannes Berg 2009-11-18 11:12:37 UTC
git show <commit ID> | patch -p1 -R

you can ignore the two failures in b43/led.c and b43/main.c
Comment 17 Johannes Berg 2009-11-18 11:14:01 UTC
Ah, no, you can't ignore them ... but since the LED isn't really important you can remove the b43_is_hw_radio_enabled() call from led.c.
Comment 18 Christian Casteyde 2009-11-18 18:53:18 UTC
Yes, reverting this patch as indicated fix this problem.
However, I get those warnings now:

hdc: ATAPI reset complete                                                         
------------[ cut here ]------------                                              
WARNING: at drivers/net/wireless/b43/rfkill.c:79 b43_rfkill_soft_set+0xa3/0xb0()  
Hardware name: Aspire 1510                                                        
Modules linked in:                                                                
Pid: 4, comm: events/0 Tainted: G        W  2.6.32-rc7 #4                         
Call Trace:                                                                       
 [<ffffffff81324013>] ? b43_rfkill_soft_set+0xa3/0xb0                             
 [<ffffffff81324013>] ? b43_rfkill_soft_set+0xa3/0xb0                             
 [<ffffffff810378c9>] ? warn_slowpath_common+0x79/0xd0                            
 [<ffffffff8154b8b0>] ? __mutex_lock_slowpath+0x90/0x140                          
 [<ffffffff81324013>] ? b43_rfkill_soft_set+0xa3/0xb0                             
 [<ffffffff81538a40>] ? rfkill_sync_work+0x0/0x60                                 
 [<ffffffff81538713>] ? rfkill_set_block+0x93/0x140                               
 [<ffffffff81538a7a>] ? rfkill_sync_work+0x3a/0x60                                
 [<ffffffff813350b0>] ? fw_device_update+0x0/0x30                                 
 [<ffffffff8104a3f6>] ? worker_thread+0x176/0x280                                 
 [<ffffffff8104e170>] ? autoremove_wake_function+0x0/0x30                         
 [<ffffffff8104a280>] ? worker_thread+0x0/0x280                                   
 [<ffffffff8104de7e>] ? kthread+0x8e/0xa0                                         
 [<ffffffff8100c00a>] ? child_rip+0xa/0x20                                        
 [<ffffffff8104ddf0>] ? kthread+0x0/0xa0                                          
 [<ffffffff8100c000>] ? child_rip+0x0/0x20                                        
---[ end trace b711db4f9b04b879 ]---                                              
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)                 
ADDRCONF(NETDEV_UP): eth1: link is not ready                                      
------------[ cut here ]------------                                              
WARNING: at drivers/net/wireless/b43/rfkill.c:79 b43_rfkill_soft_set+0xa3/0xb0()  
Hardware name: Aspire 1510                                                        
Modules linked in:                                                                
Pid: 4, comm: events/0 Tainted: G        W  2.6.32-rc7 #4                         
Call Trace:                                                                       
 [<ffffffff81324013>] ? b43_rfkill_soft_set+0xa3/0xb0                             
 [<ffffffff81324013>] ? b43_rfkill_soft_set+0xa3/0xb0                             
 [<ffffffff810378c9>] ? warn_slowpath_common+0x79/0xd0                            
 [<ffffffff8154b8b0>] ? __mutex_lock_slowpath+0x90/0x140                          
 [<ffffffff81324013>] ? b43_rfkill_soft_set+0xa3/0xb0
 [<ffffffff81538a40>] ? rfkill_sync_work+0x0/0x60
 [<ffffffff81538713>] ? rfkill_set_block+0x93/0x140
 [<ffffffff81538a7a>] ? rfkill_sync_work+0x3a/0x60
 [<ffffffff81507f50>] ? wdev_cleanup_work+0x0/0xf0
 [<ffffffff8104a3f6>] ? worker_thread+0x176/0x280
 [<ffffffff8104e170>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff8104a280>] ? worker_thread+0x0/0x280
 [<ffffffff8104de7e>] ? kthread+0x8e/0xa0
 [<ffffffff8100c00a>] ? child_rip+0xa/0x20
 [<ffffffff8104ddf0>] ? kthread+0x0/0xa0
 [<ffffffff8100c000>] ? child_rip+0x0/0x20
---[ end trace b711db4f9b04b87a ]---
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: deauthenticating from 00:07:cb:55:12:b0 by local choice (reason=3)
eth1: direct probe to AP 00:07:cb:55:12:b0 (try 1)
eth1: direct probe responded
eth1: authenticate with AP 00:07:cb:55:12:b0 (try 1)
eth1: authenticated
eth1: associate with AP 00:07:cb:55:12:b0 (try 1)
eth1: RX AssocResp from 00:07:cb:55:12:b0 (capab=0x411 status=0 aid=1)
eth1: associated
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

each time the network starts.

Maybe this is due to the hunk failed in reversing the patch, however, I don't get the NMI errors with this kernel (vanilla 32-rc7 with the commit reverted and line 264 modified as follow:

if (dev->phy.radio_on /* && b43_is_hw_radio_enabled(dev)*/) {

to remove the call to b43_is_hw_radio_enabled as you asked).

I can test partial reverts if you want (with a patch): this is easy only have to reboot and suspend/resume 2 or 3 times.
Comment 19 Christian Casteyde 2009-11-20 21:48:57 UTC
Update: still present in 2.6.32-rc8
Comment 20 Johannes Berg 2009-11-22 10:57:37 UTC
Your stack traces are useless, please enable CONFIG_FRAMEPOINTER.

Either way though, I think the warning is an indication that there's something wrong in both cases, but it manages to survive with a warning instead of dying or something.

However, since I don't even know what that NMI warning means, there's nothing I can do to fix the problem.
Comment 21 Christian Casteyde 2009-11-22 16:21:55 UTC
Sorry for omitting the frame pointer.
I also think I wasn't clear enough on the kernel used while testing this patch reverted.
I'm compelled to also revert the commit from http://bugzilla.kernel.org/show_bug.cgi?id=14538 since I'm unable to restart the network otherwise. And btw when the network don't restart, I don't have the NMI error.

So here is the results with -rc8 with the suspect commit reverted, with the led call commented out, and the commit of faulty commit of 14538 also reverted:
- resuming works ;
- the warning call stack is the following (I get it immediatly at boot) :
------------[ cut here ]------------                                              
WARNING: at drivers/net/wireless/b43/rfkill.c:78 b43_rfkill_soft_set+0xa6/0xb0()  
Hardware name: Aspire 1510                                                        
Modules linked in:                                                                
Pid: 4, comm: events/0 Not tainted 2.6.32-rc8 #4                                  
Call Trace:                                                                       
 [<ffffffff8132df86>] ? b43_rfkill_soft_set+0xa6/0xb0                             
 [<ffffffff81038348>] warn_slowpath_common+0x78/0xd0                              
 [<ffffffff810383af>] warn_slowpath_null+0xf/0x20                                 
 [<ffffffff8132df86>] b43_rfkill_soft_set+0xa6/0xb0                               
 [<ffffffff81549170>] ? rfkill_sync_work+0x0/0x60                                 
 [<ffffffff81548e35>] rfkill_set_block+0x95/0x140                                 
 [<ffffffff815491ad>] rfkill_sync_work+0x3d/0x60                                  
 [<ffffffff8104b254>] worker_thread+0x174/0x290                                   
 [<ffffffff8104f150>] ? autoremove_wake_function+0x0/0x40                         
 [<ffffffff8104b0e0>] ? worker_thread+0x0/0x290                                   
 [<ffffffff8104b0e0>] ? worker_thread+0x0/0x290                                   
 [<ffffffff8104ee4e>] kthread+0x8e/0xa0                                           
 [<ffffffff8100c08a>] child_rip+0xa/0x20                                          
 [<ffffffff8104edc0>] ? kthread+0x0/0xa0                                          
 [<ffffffff8100c080>] ? child_rip+0x0/0x20                                        
---[ end trace 741371b95cbb4c00 ]---                                              
b43 ssb0:0: firmware: requesting b43/ucode5.fw                                    
b43 ssb0:0: firmware: requesting b43/pcm5.fw                                      
b43 ssb0:0: firmware: requesting b43/b0g0initvals5.fw                             
b43 ssb0:0: firmware: requesting b43/b0g0bsinitvals5.fw                           
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)                 
ADDRCONF(NETDEV_UP): eth1: link is not ready                                      
------------[ cut here ]------------                                              
WARNING: at drivers/net/wireless/b43/rfkill.c:78 b43_rfkill_soft_set+0xa6/0xb0()  
Hardware name: Aspire 1510                                                        
Modules linked in:                                                                
Pid: 4, comm: events/0 Tainted: G        W  2.6.32-rc8 #4                         
Call Trace:                                                                       
 [<ffffffff8132df86>] ? b43_rfkill_soft_set+0xa6/0xb0                             
 [<ffffffff81038348>] warn_slowpath_common+0x78/0xd0                              
 [<ffffffff810383af>] warn_slowpath_null+0xf/0x20                                 
 [<ffffffff8132df86>] b43_rfkill_soft_set+0xa6/0xb0                               
 [<ffffffff81549170>] ? rfkill_sync_work+0x0/0x60
 [<ffffffff81548e35>] rfkill_set_block+0x95/0x140
 [<ffffffff815491ad>] rfkill_sync_work+0x3d/0x60
 [<ffffffff8104b254>] worker_thread+0x174/0x290
 [<ffffffff8104f150>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8104b0e0>] ? worker_thread+0x0/0x290
 [<ffffffff8104b0e0>] ? worker_thread+0x0/0x290
 [<ffffffff8104ee4e>] kthread+0x8e/0xa0
 [<ffffffff8100c08a>] child_rip+0xa/0x20
 [<ffffffff8104edc0>] ? kthread+0x0/0xa0
 [<ffffffff8100c080>] ? child_rip+0x0/0x20
---[ end trace 741371b95cbb4c01 ]---
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: deauthenticating from 00:07:cb:55:12:b0 by local choice (reason=3)
eth1: direct probe to AP 00:07:cb:55:12:b0 (try 1)
eth1: direct probe responded
eth1: authenticate with AP 00:07:cb:55:12:b0 (try 1)
eth1: authenticated
eth1: associate with AP 00:07:cb:55:12:b0 (try 1)
eth1: RX AssocResp from 00:07:cb:55:12:b0 (capab=0x411 status=0 aid=1)
eth1: associated
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

I didn't applied the proposed patch from bug #14538 while testing this, to keep problems separated.
Comment 22 Christian Casteyde 2009-12-03 21:03:47 UTC
Update: 2.6.32 still cannot associate anymore, as reported in bug #14538 which hopefully has a a fix that also fixes the current bug.
Comment 23 Christian Casteyde 2009-12-29 17:00:33 UTC
Fixed by mainline commit c2ff581acab16c6af56d9e8c1a579bf041ec00b1.