Bug 15824 - iwlagn - BUG: unable to handle kernel NULL pointer dereference at (null)
Summary: iwlagn - BUG: unable to handle kernel NULL pointer dereference at (null)
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Reinette Chatre
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-21 09:19 UTC by rocko
Modified: 2010-06-03 21:39 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.34-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel log showing bug info (16.02 KB, text/plain)
2010-04-21 09:19 UTC, rocko
Details
Do not access pointer if null (564 bytes, patch)
2010-04-21 23:00 UTC, Reinette Chatre
Details | Diff
kernel log during crash (403.28 KB, application/octet-stream)
2010-04-23 04:14 UTC, rocko
Details
log showing iwlagn oops (5.93 KB, text/plain)
2010-04-23 04:19 UTC, rocko
Details
Do not reset radio on plcp errors (518 bytes, patch)
2010-05-04 20:37 UTC, Reinette Chatre
Details | Diff
kernel log before oops (308.19 KB, application/octet-stream)
2010-05-04 23:47 UTC, rocko
Details
kernel log with iwlagn debugging before oops (62.48 KB, application/octet-stream)
2010-05-05 06:32 UTC, rocko
Details
No RF reset in max time a scan could take to finish. (12.00 KB, patch)
2010-05-05 21:23 UTC, Reinette Chatre
Details | Diff
No RF reset in max time a scan could take to finish (v2) (1.05 KB, patch)
2010-05-05 23:39 UTC, Reinette Chatre
Details | Diff
Fix internal scan race (3.44 KB, patch)
2010-05-06 18:35 UTC, Reinette Chatre
Details | Diff
kernel with patch - log part 1 (690.88 KB, application/x-bzip2)
2010-05-10 10:26 UTC, rocko
Details

Description rocko 2010-04-21 09:19:26 UTC
Created attachment 26076 [details]
kernel log showing bug info

This happened while connected to an 802.11N network (the PC froze completely) and looks to me like it's a problem in iwl. I've included the kernel log which shows the oops at the end: there are several warnings preceding it.
Comment 1 Reinette Chatre 2010-04-21 22:28:00 UTC
From what I can tell the NULL pointer is from the channel information in the scan request. Does this happen regularly? How do you trigger it? Is it possible to do two things to help get some more information:

- please add a dump_stack() at the beginning of ieee80211_scan_completed
- please load driver with "modprobe iwlagn debug=0x43fff"
Comment 2 Reinette Chatre 2010-04-21 23:00:10 UTC
Created attachment 26085 [details]
Do not access pointer if null

In addition to the previous requested information, could you please also try out this patch? If my analysis is correct this will prevent the NULL pointer access ... but why it is a NULL pointer in the first place is still unclear.
Comment 3 rocko 2010-04-21 23:47:11 UTC
Early feedback: this is the first time I've seen this particular bug. It happened an hour or so after installing and booting into 2.6.34-rc5, and about 30 mins after connecting to the 11n network. After that I rebooted into 2.6.33 (to avoid losing more work!) so I can't say if it happens frequently.

Note that I have had trouble in the past with the 11n network and have sometimes had to disable 11n (ie fallback to 11g) using the modprobe conf option to get it to connect reliably, although I don't think it has locked up the PC before.

I have now added in the patch, dump_stack(), and debug option and I should get a chance to report back tomorrow when I get the opportunity to access the 11n network again (I haven't seen this problem with my home 11g network).
Comment 4 rocko 2010-04-22 02:08:01 UTC
Actually, I have to admit that I didn't read what I was doing closely enough the first time. Where exactly should I put dump_stack? ieee80211_scan_completed is defined in iwlcore.mod.c in a struct modversion_info structure as what looks like an offset, and I don't understand enough about the code to figure out where the function is actually located.
Comment 5 Reinette Chatre 2010-04-22 15:55:26 UTC
(In reply to comment #4)
> Actually, I have to admit that I didn't read what I was doing closely enough
> the first time. Where exactly should I put dump_stack?
> ieee80211_scan_completed
> is defined in iwlcore.mod.c in a struct modversion_info structure as what
> looks
> like an offset, and I don't understand enough about the code to figure out
> where the function is actually located.

Thank you for helping to debug this. 

ieee80211_scan_completed is a mac80211 function that can be found in net/mac80211/scan.c
Comment 6 rocko 2010-04-23 04:14:39 UTC
Created attachment 26107 [details]
kernel log during crash

The PC froze again while connected to the 11n network: unfortunately it looks like it wasn't captured in the log this time (there's some truncation). I've attached the log anyway in case it helps. You can see messages stopped at 10:38 and I rebooted around 10:40 (search for imklog).
Comment 7 rocko 2010-04-23 04:19:33 UTC
Created attachment 26108 [details]
log showing iwlagn oops

I rebooted into 2.6.33.2 (not 2.6.34-rc5) and had another oops while connected to the 11n network. iwlagn debugging was on. Might it be relevant to this bug as well?
Comment 8 rocko 2010-05-03 08:52:03 UTC
Were there any changes in 2.6.34-rc6 that might have addressed this? I ran for 6 hours or so on the 11n network today without seeing this issue.
Comment 9 John W. Linville 2010-05-03 13:23:35 UTC
Dunno if these address your specific issue or not...

commit f2fa1b015e9c199e45c836c769d94db595150731
Author: Shanyu Zhao <shanyu.zhao@intel.com>
Date:   Wed Apr 7 18:37:52 2010 -0700

    iwlwifi: correct 6000 EEPROM regulatory address
    
    For 6000 series, the 2.4G HT40 band regulatory settings address in EEPROM
    was off by 2.
    
    Before the fix, you'll see this in dmesg:
    [79535.788877] ieee80211 phy8: U iwl_mod_ht40_chan_info HT40 Ch. 7 [2.4GHz]
    WIDE (0x61 0dBm): Ad-Hoc not supported
    [79535.788880] ieee80211 phy8: U iwl_mod_ht40_chan_info HT40 Ch. 11 [2.4GHz]
    WIDE (0x61 0dBm): Ad-Hoc not supported
    
    And after the fix:
    [91132.688706] ieee80211 phy14: U iwl_mod_ht40_chan_info HT40 Ch. 7 [2.4GHz]
    IBSS ACTIVE WIDE (0x6f 0dBm): Ad-Hoc supported
    [91132.688709] ieee80211 phy14: U iwl_mod_ht40_chan_info HT40 Ch. 11 [2.4GHz
    IBSS ACTIVE WIDE (0x6f 0dBm): Ad-Hoc supported
    
    Signed-off-by: Shanyu Zhao <shanyu.zhao@intel.com>
    Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

commit 88be026490ed89c2ffead81a52531fbac5507e01
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Wed Apr 7 00:21:36 2010 -0700

    iwlwifi: fix scan races
    
    When an internal scan is started, nothing protects the
    is_internal_short_scan variable which can cause crashes,
    cf. https://bugzilla.kernel.org/show_bug.cgi?id=15667.
    Fix this by making the short scan request use the mutex
    for locking, which requires making the request go to a
    work struct so that it can sleep.
    
    Reported-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Comment 10 Reinette Chatre 2010-05-03 21:45:27 UTC
Johannes's "fix scan races" is definitely related to this problem. This problem looks very similar to https://bugzilla.kernel.org/show_bug.cgi?id=15667, which is fixed by that patch.
Comment 11 rocko 2010-05-04 03:46:39 UTC
-rc6 is certainly performing much better, but I had the kernel "freeze" again today approx one hour and fifty minutes after connecting to the 11n network.

Unfortunately there is no log information that appears relevant, just a bunch of "wpa_supplicant[3666]: Failed to initiate AP scan." messages (the last one approx twenty minutes before the freeze).

My guess though is that it is still related to the iwlwifi driver/11n network, because it hasn't ever frozen in either -rc5 or -rc6 while connected to 11g (or no) networks in the last week or so.
Comment 12 Reinette Chatre 2010-05-04 16:21:43 UTC
(In reply to comment #7)
> Created an attachment (id=26108) [details]
> log showing iwlagn oops
> 
> I rebooted into 2.6.33.2 (not 2.6.34-rc5) and had another oops while
> connected
> to the 11n network. iwlagn debugging was on. Might it be relevant to this bug
> as well?

Looks like you had debugging on at the time you encountered the oops ... any chance you still have the logs so that we can see what all happened before the oops?
Comment 13 Reinette Chatre 2010-05-04 20:37:05 UTC
Created attachment 26220 [details]
Do not reset radio on plcp errors

(In reply to comment #6)
> Created an attachment (id=26107) [details]
> kernel log during crash
> 
> The PC froze again while connected to the 11n network: unfortunately it looks
> like it wasn't captured in the log this time (there's some truncation). I've
> attached the log anyway in case it helps. You can see messages stopped at
> 10:38
> and I rebooted around 10:40 (search for imklog).

One interesting thing I see in this log is that often when there are a lot of plcp errors the radio is reset with a scan on a single channel, but the scan never completes and needs to be aborted. I am not sure if the 4965 plcp thresholds are correct though so we can try running without them. Could you please try this patch?
Comment 14 rocko 2010-05-04 23:47:45 UTC
Created attachment 26222 [details]
kernel log before oops

Here's a log from about 10:25 to when the oops occurred at 11:11. There's more if you would like. I see there was a microcode error at 10:33:51.

The kernel froze a second time yesterday, by the way, but still no produced no log information about it.

I have applied the patch now and hopefully will get another chance today to access the 11n network.
Comment 15 rocko 2010-05-05 06:32:22 UTC
Created attachment 26223 [details]
kernel log with iwlagn debugging before oops

2.6.34-rc6 froze again on the 11n network (with the latest patch included). I've attached about 8 minutes of log (freeze occurred near the end at approx 14:15). Again, it doesn't show the oops but iwlagn debugging was on.
Comment 16 rocko 2010-05-05 08:27:07 UTC
I've also noticed that when the kernel hangs, the wireless light is flashing. Sometimes the caps log and numlock lights are flashing, sometimes they are off.
Comment 17 Reinette Chatre 2010-05-05 21:23:12 UTC
Created attachment 26246 [details]
No RF reset in max time a scan could take to finish.

ok, now I got it.

For some reason your device sometimes fails scan requests. That is, a scan command is sent to it successfully, but the device never responds with actual scan data. This causes a watchdog time to abort the scan after 7 seconds.

Now, we are using a scan request of one channel to reset the radio if too many plcp errors are detected, which also happens to your device. 

The race is as follows:
Too many plcp errors cause a rf reset, which trigger a scan.
This scan request is sent to device successfully.

seven seconds go by ...
No response from device so scan is aborted.
At the same time, the issue causing the scan request in the first place (too many plcp errors) is still present and a new scan request is made and immediately queued. Before it is queued the internal_short_scan variable is set to true so that it will be dealt with as a rf reset and then it actually queues the scanning work
At this point the first scan's "completed" routine sets the internal_short_scan variable to false. Now the actual scanning of the second request is run and since internal_short_scan is false at this point it is treated as a regular scan, for which there is no data.

In 2.6.35 this should not be an issue since the scanning was made synchronous. This is a significant patch though and I doubt John would like to take it at this point for 2.6.34.

I attached a workaround for this issue ... do not do a rf sooner than 8 seconds after the previous (the scan abort is sent after 7 seconds).

Could you please try this patch out? Also, as a sanity check, could you please try out compat-wireless to ensure that this issue will not be in 2.6.35 (please test _without_ this patch).
Comment 18 rocko 2010-05-05 23:29:57 UTC
Would that cause the kernel to freeze?

I can't open the patch (it says it's a vim swap file?), by the way. Could you re-post it, please?
Comment 19 Reinette Chatre 2010-05-05 23:39:25 UTC
Created attachment 26250 [details]
No RF reset in max time a scan could take to finish (v2)

(In reply to comment #18)
> Would that cause the kernel to freeze?

yes ... since as a consequence of race condition it will access memory it shouldn't
 
> I can't open the patch (it says it's a vim swap file?), by the way. Could you
> re-post it, please?

I'm very sorry ... here it is.
Comment 20 Reinette Chatre 2010-05-06 16:05:08 UTC
If you have not done it already ... please hold off on the compat-wireless testing. I think it is possible to encounter this race in our latest also (although bit harder) so a proper fix is needed instead of the proposed workaround.
Comment 21 Reinette Chatre 2010-05-06 18:35:59 UTC
Created attachment 26265 [details]
Fix internal scan race

Please try this patch as a real fix to this issue. It was created against 2.6.34-rc6.
Comment 22 rocko 2010-05-06 23:42:56 UTC
OK, I have reversed the previous patch and applied the new one. I should have a chance to access the AP that causes the fault early next week.
Comment 23 rocko 2010-05-10 10:26:31 UTC
Created attachment 26313 [details]
kernel with patch - log part 1

rc6 with the patch ran fine today while I was connected to the 11n network (for a bit under 4 hours). I'm attaching the log in case it shows anything useful.
Comment 24 rocko 2010-05-10 10:30:03 UTC
The log attached above is for 12:29 until I suspended at 15:18. I've got another hour of logs from 11:35 until 12:29 if you need it.
Comment 25 Reinette Chatre 2010-05-10 18:28:34 UTC
Thank you for the extra log. In it I notice that you ran into a few cases where previously you might have encountered the race condition, "May 10 12:29:16", "May 10 12:30:29", and "May 10 12:32:54" are some examples. The patch is thus doing it's job. I'll submit it upstream. Thank you very much for the testing.
Comment 26 Reinette Chatre 2010-06-03 21:39:53 UTC
Patch can now be found in 2.6.35-rc1. The patch was also sent to stable and will hopefully be in next 2.6.34 stable release

Note You need to log in before you can comment on or make changes to this bug.