Bug 15667

Summary: general protection fault from iwlagn
Product: Drivers Reporter: Peter Zijlstra (a.p.zijlstra)
Component: network-wirelessAssignee: Johannes Berg (johannes)
Status: CLOSED CODE_FIX    
Severity: normal CC: johannes, linville, reinette.chatre
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34-rc3-tip+ Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output
dmesg output with printk patch
dmesg with printk output for real now
patch to fix internal scan race

Description Peter Zijlstra 2010-04-01 14:09:30 UTC
iwlagn makes my t500 misserable ever since early .34

Johill made me run with debug=0x802, output in attachment.
the GPF splat also here:

[19442.770407] general protection fault: 0000 [#1] SMP 
[19442.770415] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
[19442.770422] CPU 0 
[19442.770426] Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc cryptd aes_x86_64 aes_generic rfcomm binfmt_misc sco bridge stp ppdev bnep l2cap snd_hda_codec_
conexant joydev fbcon tileblit font bitblit softcursor kvm_intel autofs4 kvm arc4 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi pcmcia snd_rawmidi snd_seq_midi_event iwlagn snd_seq thinkpad_acpi iwlcore snd_timer i915 snd_seq_device drm_kms_helper mac80211 drm yenta_socket sdhci_pci sdhci psmouse btusb rsrc_nonstatic i2c_algo_bit bluetooth nvram snd serio_raw led_class pcmcia_core video cfg80211 soundcore snd_page_alloc output lp intel_agp parport xfs exportfs ohci1394 ieee1394 ahci e1000e
[19442.770555] 
[19442.770563] Pid: 863, comm: iwlagn Not tainted 2.6.34-rc3-tip+ #5 2242CTO/2242CTO
[19442.770569] RIP: 0010:[<ffffffff812a4d20>]  [<ffffffff812a4d20>] memcpy+0x10/0xb0
[19442.770584] RSP: 0018:ffff880135215cf8  EFLAGS: 00010202
[19442.770590] RAX: ffff880130628316 RBX: ffff8801354c97a0 RCX: 0000000000000001
[19442.770596] RDX: 0000000000000001 RSI: 000c0199000020de RDI: ffff880130628316
[19442.770602] RBP: ffff880135215d10 R08: 00000000000000ea R09: 00000000ffffffff
[19442.770608] R10: 000000000a6a2100 R11: 00000000000000c8 R12: ffff8801354c1360
[19442.770614] R13: ffff8801354c1db8 R14: 0000000000000070 R15: ffff880130628000
[19442.770621] FS:  0000000000000000(0000) GS:ffff880001e00000(0000) knlGS:0000000000000000
[19442.770628] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[19442.770634] CR2: 00007ffc65782a18 CR3: 0000000001a29000 CR4: 00000000000006f0
[19442.770640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19442.770646] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[19442.770653] Process iwlagn (pid: 863, threadinfo ffff880135214000, task ffff880134008000)
[19442.770658] Stack:
[19442.770662]  ffffffffa0370e21 ffffffffa037eec0 0000000000000001 ffff880135215e30
[19442.770671] <0> ffffffffa0371acf ffffe8ff0000000a ffff880135215fd8 ffff880135215d70
[19442.770682] <0> ffffffffa035c32b ffff880135215d50 ffffffff02800000 ffff880100000055
[19442.770693] Call Trace:
[19442.770727]  [<ffffffffa0370e21>] ? iwl_fill_probe_req+0x81/0xb0 [iwlcore]
[19442.770755]  [<ffffffffa0371acf>] iwl_bg_request_scan+0x63f/0x1390 [iwlcore]
[19442.770783]  [<ffffffffa035c32b>] ? iwl_send_cmd_async+0x5b/0xe0 [iwlcore]
[19442.770809]  [<ffffffffa035c3ea>] ? iwl_send_cmd_pdu_async+0x3a/0x40 [iwlcore]
[19442.770836]  [<ffffffffa035c3f0>] ? iwl_generic_cmd_callback+0x0/0x150 [iwlcore]
[19442.770858]  [<ffffffffa03f95b8>] ? iwl5000_send_tx_power+0x58/0x60 [iwlagn]
[19442.770884]  [<ffffffffa0351096>] ? iwl_set_tx_power+0x96/0x160 [iwlcore]
[19442.770910]  [<ffffffffa0371490>] ? iwl_bg_request_scan+0x0/0x1390 [iwlcore]
[19442.770921]  [<ffffffff810770dd>] worker_thread+0x17d/0x2b0
[19442.770932]  [<ffffffff8107be70>] ? autoremove_wake_function+0x0/0x40
[19442.770941]  [<ffffffff81076f60>] ? worker_thread+0x0/0x2b0
[19442.770948]  [<ffffffff8107b936>] kthread+0x96/0xa0
[19442.770957]  [<ffffffff8100ae24>] kernel_thread_helper+0x4/0x10
[19442.770965]  [<ffffffff8107b8a0>] ? kthread+0x0/0xa0
[19442.770972]  [<ffffffff8100ae20>] ? kernel_thread_helper+0x0/0x10
[19442.770977] Code: c0 49 89 70 58 41 c6 40 4c 04 83 e0 fc 83 c0 08 41 88 40 4d c9 c3 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 <f3> a4 c3 8b 1e 4c 8b 46 08 4c 89 1f 4c 89 47 08 4c 8b 4e 10 4c 
[19442.771060] RIP  [<ffffffff812a4d20>] memcpy+0x10/0xb0
[19442.771068]  RSP <ffff880135215cf8>
[19442.771084] ---[ end trace d26f7b1d5bd2cfb3 ]---
Comment 1 Peter Zijlstra 2010-04-01 14:14:07 UTC
Created attachment 25800 [details]
dmesg output

dmesg output
Comment 2 Johannes Berg 2010-04-01 14:16:03 UTC
Please add http://paste.pocoo.org/raw/196316/


--- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-scan.c	2010-04-01 16:04:48.000000000 +0200
+++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-scan.c	2010-04-01 16:06:15.000000000 +0200
@@ -264,6 +264,7 @@ static void iwl_rx_scan_complete_notif(s
 
 reschedule:
 	priv->scan_pass_start = jiffies;
+	printk(KERN_DEBUG "queueing request_scan in complete, bands=%x\n", priv->scan_bands);
 	queue_work(priv->workqueue, &priv->request_scan);
 }
 
@@ -475,6 +476,7 @@ static int iwl_scan_initiate(struct iwl_
 	priv->scan_start = jiffies;
 	priv->scan_pass_start = priv->scan_start;
 
+	printk(KERN_DEBUG "queueing request_scan in initiate, bands=%x\n", priv->scan_bands);
 	queue_work(priv->workqueue, &priv->request_scan);
 
 	return 0;
@@ -575,6 +577,7 @@ int iwl_internal_short_hw_scan(struct iw
 	IWL_DEBUG_SCAN(priv, "Start internal short scan...\n");
 	set_bit(STATUS_SCANNING, &priv->status);
 	priv->is_internal_short_scan = true;
+	printk(KERN_DEBUG "queueing request_scan in internal, bands=%x\n", priv->scan_bands);
 	queue_work(priv->workqueue, &priv->request_scan);
 
 out:
Comment 3 Peter Zijlstra 2010-04-02 15:28:57 UTC
Created attachment 25820 [details]
dmesg output with printk patch

Another splat, with debug=0x802 and the printk patch applied. Please let me know if you need more information.
Comment 4 Johannes Berg 2010-04-03 15:09:33 UTC
Either bugzilla or I am confused, or you attached the same file again?
Comment 5 Peter Zijlstra 2010-04-03 15:28:21 UTC
Created attachment 25836 [details]
dmesg with printk output for real now
Comment 6 Johannes Berg 2010-04-03 16:38:51 UTC
Well, ok, clear now what's going on...

is_internal_short_scan starts out as true

ieee80211 phy0: U iwl_bg_scan_completed SCAN complete scan
ieee80211 phy0: I iwl_internal_short_hw_scan Start internal short scan...
queueing request_scan in internal, bands=1

still set to true

ieee80211 phy0: U iwl_bg_scan_completed internal short scan completed

this sets it to false, but the next scan is already in progress!!

Now iwl_bg_request_scan will check the variable and find it false, even though it should be true.

is_internal_short_scan needs to be protected by a lock, probably priv->mutex.
Comment 7 Johannes Berg 2010-04-03 16:43:35 UTC
No, it's not that simple ... all this is really racy.
Comment 8 Johannes Berg 2010-04-06 11:23:32 UTC
Created attachment 25881 [details]
patch to fix internal scan race

Let me know what happens.
Comment 9 Peter Zijlstra 2010-04-08 19:43:02 UTC
Yep, that seems to have cured it, machine has been happy for days now, whereas it used to die several times a day before this.