Created attachment 26076 [details] kernel log showing bug info This happened while connected to an 802.11N network (the PC froze completely) and looks to me like it's a problem in iwl. I've included the kernel log which shows the oops at the end: there are several warnings preceding it.
From what I can tell the NULL pointer is from the channel information in the scan request. Does this happen regularly? How do you trigger it? Is it possible to do two things to help get some more information: - please add a dump_stack() at the beginning of ieee80211_scan_completed - please load driver with "modprobe iwlagn debug=0x43fff"
Created attachment 26085 [details] Do not access pointer if null In addition to the previous requested information, could you please also try out this patch? If my analysis is correct this will prevent the NULL pointer access ... but why it is a NULL pointer in the first place is still unclear.
Early feedback: this is the first time I've seen this particular bug. It happened an hour or so after installing and booting into 2.6.34-rc5, and about 30 mins after connecting to the 11n network. After that I rebooted into 2.6.33 (to avoid losing more work!) so I can't say if it happens frequently. Note that I have had trouble in the past with the 11n network and have sometimes had to disable 11n (ie fallback to 11g) using the modprobe conf option to get it to connect reliably, although I don't think it has locked up the PC before. I have now added in the patch, dump_stack(), and debug option and I should get a chance to report back tomorrow when I get the opportunity to access the 11n network again (I haven't seen this problem with my home 11g network).
Actually, I have to admit that I didn't read what I was doing closely enough the first time. Where exactly should I put dump_stack? ieee80211_scan_completed is defined in iwlcore.mod.c in a struct modversion_info structure as what looks like an offset, and I don't understand enough about the code to figure out where the function is actually located.
(In reply to comment #4) > Actually, I have to admit that I didn't read what I was doing closely enough > the first time. Where exactly should I put dump_stack? > ieee80211_scan_completed > is defined in iwlcore.mod.c in a struct modversion_info structure as what > looks > like an offset, and I don't understand enough about the code to figure out > where the function is actually located. Thank you for helping to debug this. ieee80211_scan_completed is a mac80211 function that can be found in net/mac80211/scan.c
Created attachment 26107 [details] kernel log during crash The PC froze again while connected to the 11n network: unfortunately it looks like it wasn't captured in the log this time (there's some truncation). I've attached the log anyway in case it helps. You can see messages stopped at 10:38 and I rebooted around 10:40 (search for imklog).
Created attachment 26108 [details] log showing iwlagn oops I rebooted into 2.6.33.2 (not 2.6.34-rc5) and had another oops while connected to the 11n network. iwlagn debugging was on. Might it be relevant to this bug as well?
Were there any changes in 2.6.34-rc6 that might have addressed this? I ran for 6 hours or so on the 11n network today without seeing this issue.
Dunno if these address your specific issue or not... commit f2fa1b015e9c199e45c836c769d94db595150731 Author: Shanyu Zhao <shanyu.zhao@intel.com> Date: Wed Apr 7 18:37:52 2010 -0700 iwlwifi: correct 6000 EEPROM regulatory address For 6000 series, the 2.4G HT40 band regulatory settings address in EEPROM was off by 2. Before the fix, you'll see this in dmesg: [79535.788877] ieee80211 phy8: U iwl_mod_ht40_chan_info HT40 Ch. 7 [2.4GHz] WIDE (0x61 0dBm): Ad-Hoc not supported [79535.788880] ieee80211 phy8: U iwl_mod_ht40_chan_info HT40 Ch. 11 [2.4GHz] WIDE (0x61 0dBm): Ad-Hoc not supported And after the fix: [91132.688706] ieee80211 phy14: U iwl_mod_ht40_chan_info HT40 Ch. 7 [2.4GHz] IBSS ACTIVE WIDE (0x6f 0dBm): Ad-Hoc supported [91132.688709] ieee80211 phy14: U iwl_mod_ht40_chan_info HT40 Ch. 11 [2.4GHz IBSS ACTIVE WIDE (0x6f 0dBm): Ad-Hoc supported Signed-off-by: Shanyu Zhao <shanyu.zhao@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> commit 88be026490ed89c2ffead81a52531fbac5507e01 Author: Johannes Berg <johannes.berg@intel.com> Date: Wed Apr 7 00:21:36 2010 -0700 iwlwifi: fix scan races When an internal scan is started, nothing protects the is_internal_short_scan variable which can cause crashes, cf. https://bugzilla.kernel.org/show_bug.cgi?id=15667. Fix this by making the short scan request use the mutex for locking, which requires making the request go to a work struct so that it can sleep. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Johannes's "fix scan races" is definitely related to this problem. This problem looks very similar to https://bugzilla.kernel.org/show_bug.cgi?id=15667, which is fixed by that patch.
-rc6 is certainly performing much better, but I had the kernel "freeze" again today approx one hour and fifty minutes after connecting to the 11n network. Unfortunately there is no log information that appears relevant, just a bunch of "wpa_supplicant[3666]: Failed to initiate AP scan." messages (the last one approx twenty minutes before the freeze). My guess though is that it is still related to the iwlwifi driver/11n network, because it hasn't ever frozen in either -rc5 or -rc6 while connected to 11g (or no) networks in the last week or so.
(In reply to comment #7) > Created an attachment (id=26108) [details] > log showing iwlagn oops > > I rebooted into 2.6.33.2 (not 2.6.34-rc5) and had another oops while > connected > to the 11n network. iwlagn debugging was on. Might it be relevant to this bug > as well? Looks like you had debugging on at the time you encountered the oops ... any chance you still have the logs so that we can see what all happened before the oops?
Created attachment 26220 [details] Do not reset radio on plcp errors (In reply to comment #6) > Created an attachment (id=26107) [details] > kernel log during crash > > The PC froze again while connected to the 11n network: unfortunately it looks > like it wasn't captured in the log this time (there's some truncation). I've > attached the log anyway in case it helps. You can see messages stopped at > 10:38 > and I rebooted around 10:40 (search for imklog). One interesting thing I see in this log is that often when there are a lot of plcp errors the radio is reset with a scan on a single channel, but the scan never completes and needs to be aborted. I am not sure if the 4965 plcp thresholds are correct though so we can try running without them. Could you please try this patch?
Created attachment 26222 [details] kernel log before oops Here's a log from about 10:25 to when the oops occurred at 11:11. There's more if you would like. I see there was a microcode error at 10:33:51. The kernel froze a second time yesterday, by the way, but still no produced no log information about it. I have applied the patch now and hopefully will get another chance today to access the 11n network.
Created attachment 26223 [details] kernel log with iwlagn debugging before oops 2.6.34-rc6 froze again on the 11n network (with the latest patch included). I've attached about 8 minutes of log (freeze occurred near the end at approx 14:15). Again, it doesn't show the oops but iwlagn debugging was on.
I've also noticed that when the kernel hangs, the wireless light is flashing. Sometimes the caps log and numlock lights are flashing, sometimes they are off.
Created attachment 26246 [details] No RF reset in max time a scan could take to finish. ok, now I got it. For some reason your device sometimes fails scan requests. That is, a scan command is sent to it successfully, but the device never responds with actual scan data. This causes a watchdog time to abort the scan after 7 seconds. Now, we are using a scan request of one channel to reset the radio if too many plcp errors are detected, which also happens to your device. The race is as follows: Too many plcp errors cause a rf reset, which trigger a scan. This scan request is sent to device successfully. seven seconds go by ... No response from device so scan is aborted. At the same time, the issue causing the scan request in the first place (too many plcp errors) is still present and a new scan request is made and immediately queued. Before it is queued the internal_short_scan variable is set to true so that it will be dealt with as a rf reset and then it actually queues the scanning work At this point the first scan's "completed" routine sets the internal_short_scan variable to false. Now the actual scanning of the second request is run and since internal_short_scan is false at this point it is treated as a regular scan, for which there is no data. In 2.6.35 this should not be an issue since the scanning was made synchronous. This is a significant patch though and I doubt John would like to take it at this point for 2.6.34. I attached a workaround for this issue ... do not do a rf sooner than 8 seconds after the previous (the scan abort is sent after 7 seconds). Could you please try this patch out? Also, as a sanity check, could you please try out compat-wireless to ensure that this issue will not be in 2.6.35 (please test _without_ this patch).
Would that cause the kernel to freeze? I can't open the patch (it says it's a vim swap file?), by the way. Could you re-post it, please?
Created attachment 26250 [details] No RF reset in max time a scan could take to finish (v2) (In reply to comment #18) > Would that cause the kernel to freeze? yes ... since as a consequence of race condition it will access memory it shouldn't > I can't open the patch (it says it's a vim swap file?), by the way. Could you > re-post it, please? I'm very sorry ... here it is.
If you have not done it already ... please hold off on the compat-wireless testing. I think it is possible to encounter this race in our latest also (although bit harder) so a proper fix is needed instead of the proposed workaround.
Created attachment 26265 [details] Fix internal scan race Please try this patch as a real fix to this issue. It was created against 2.6.34-rc6.
OK, I have reversed the previous patch and applied the new one. I should have a chance to access the AP that causes the fault early next week.
Created attachment 26313 [details] kernel with patch - log part 1 rc6 with the patch ran fine today while I was connected to the 11n network (for a bit under 4 hours). I'm attaching the log in case it shows anything useful.
The log attached above is for 12:29 until I suspended at 15:18. I've got another hour of logs from 11:35 until 12:29 if you need it.
Thank you for the extra log. In it I notice that you ran into a few cases where previously you might have encountered the race condition, "May 10 12:29:16", "May 10 12:30:29", and "May 10 12:32:54" are some examples. The patch is thus doing it's job. I'll submit it upstream. Thank you very much for the testing.
Patch can now be found in 2.6.35-rc1. The patch was also sent to stable and will hopefully be in next 2.6.34 stable release