Bug 13584 - iwlagn breaks after a few hours and causes other processes to block
iwlagn breaks after a few hours and causes other processes to block
Status: CLOSED CODE_FIX
Product: Networking
Classification: Unclassified
Component: Wireless
All Linux
: P1 normal
Assigned To: David Woodhouse
:
: 12222 (view as bug list)
Depends on:
Blocks: 13070
  Show dependency treegraph
 
Reported: 2009-06-19 21:45 UTC by Michael Olbrich
Modified: 2009-07-06 23:52 UTC (History)
4 users (show)

See Also:
Kernel Version: 2.6.30
Tree: Mainline
Regression: Yes


Attachments
dmesg section with the bug (4.40 KB, text/plain)
2009-06-19 21:45 UTC, Michael Olbrich
Details
Testing patch (1.66 KB, patch)
2009-06-22 22:55 UTC, David Woodhouse
Details | Diff
kernel config (61.70 KB, text/plain)
2009-06-25 21:52 UTC, Michael Olbrich
Details
kernel config (61.64 KB, text/plain)
2009-06-25 22:09 UTC, Michael Olbrich
Details
potential fix (678 bytes, patch)
2009-07-01 13:48 UTC, David Woodhouse
Details | Diff

Description Michael Olbrich 2009-06-19 21:45:28 UTC
Created attachment 22014 [details]
dmesg section with the bug

Hardware: Lenovo Thinkpad x200s with a Intel Corporation Wireless WiFi Link 5300
kernel: 2.6.30 (no problems with 2.6.29)

Problem: After several hours of working with wireless lan I get the message attached. wlan stops working. "ifconfig down" or "modprobe -r" hangs and cannot be killed. Reboot is not possible. And some other processes start to hang too. Others continue to work just fine. I could not detect a pattern here.

This happens with both iwlwifi-5000-1.ucode and iwlwifi-5000-2.ucode

The problems takes too long for a random git-bisect but I could test specific versions or a patch.
Comment 1 David Woodhouse 2009-06-22 22:55:45 UTC
Created attachment 22056 [details]
Testing patch

I've seen this elsewhere but not been able to reproduce it on a server board; I'll try again on an x200s. Please could you attach your .config?

Please could you also enable CONFIG_DMA_API_DEBUG and try the attached patch? It should hopefully shed some light on the issue if you can reproduce this before I manage to...
Comment 2 David Woodhouse 2009-06-23 11:33:13 UTC
Can you also try booting with 'intel_iommu=strict' and see if that makes the problem go away?
Comment 3 Michael Olbrich 2009-06-25 21:52:58 UTC
Created attachment 22096 [details]
kernel config

config of the kernel with the bug
Comment 4 Michael Olbrich 2009-06-25 22:03:16 UTC
I have applied the patch and enabled CONFIG_DMA_API_DEBUG
However for some reason the bug has not reappeared yet.
I am trying to find out what else I changed....
This my take some time.
Comment 5 Michael Olbrich 2009-06-25 22:07:47 UTC
Comment on attachment 22096 [details]
kernel config

that was the wrong config file
Comment 6 Michael Olbrich 2009-06-25 22:09:11 UTC
Created attachment 22097 [details]
kernel config

config of the kernel with the bug
Comment 7 Reinette Chatre 2009-06-29 22:49:11 UTC
Why was the recent assignment change made? This bug was originally assigned to networking_wireless@kernel-bugs.osdl.org but was changed when determined that it is a problem related to iommu. Why assign it back now? I see nothing in the comments to indicate this is needed.
Comment 8 Rafael J. Wysocki 2009-06-29 22:54:21 UTC
Sorry, my bad.  Restored.
Comment 9 Reinette Chatre 2009-06-29 23:02:17 UTC
(In reply to comment #8)
> Sorry, my bad.  Restored.

No problem ... but this is not the correct assignment - David Woodhouse (dwmw2@infradead.org ) is currently looking at this issue. This bug was assigned to him before you changed it.
Comment 10 Rafael J. Wysocki 2009-06-29 23:08:44 UTC
That's fixable too.
Comment 11 David Woodhouse 2009-06-30 00:50:09 UTC
Still haven't managed to reproduce it with the debugging patch? Perhaps the extra printks are changing the timing. Can you instead try with the code in
  git://git.infradead.org/iommu-2.6.git
Comment 12 Michael Olbrich 2009-06-30 07:50:02 UTC
iommu-2.6.git without CONFIG_DMA_API_DEBUG and without 'intel_iommu=strict', right?
Comment 13 David Woodhouse 2009-06-30 10:24:34 UTC
Preferably with CONFIG_DMA_API_DEBUG, if you can get it to trigger that way.
And I don't mind whether you have 'intel_iommu=strict' -- whichever makes it easier to trigger, so probably without.

And is there any particular reason you're using a 32-bit kernel? I hadn't noticed that before -- I just ran 'make oldconfig' and the stupid x86 makefiles turned it into a 64-bit kernel for me even though CONFIG_X86_32 was set.
Comment 14 Michael Olbrich 2009-06-30 12:23:53 UTC
I just got a kernel panic with iommu-2.6.git without CONFIG_DMA_API_DEBUG and without 'intel_iommu=strict'

No serial console so i have no backtrace for you. From what I can see on my screen it looks very similar to the original bug with a few extra lines.
It doesn't stop at __intel_map_single:

__domain_mapping
warn_slow_path_common
__domain_mapping
warn_slow_path_null
__domain_mapping
__intel_map_single
__alloc_skb
iwl_rx_allocate
...

I'll try to catch this with a serial console.
Comment 15 David Woodhouse 2009-06-30 12:41:31 UTC
In iommu-2.6.git I've changed it to a warning, not a BUG() -- so it ought to continue after it complains.

However, this is a symptom of something _else_ which is wrong, so that's probably why it subsequently crashes. Maybe if you turn that WARN_ON(1) back into a BUG() (about line 1684 of intel-iommu.c) it'll behave as before and make it to syslog without panicking?

I can't reproduce with a 32-bit kernel on another machine; I'll reinstall the x200s in 32-bit mode...

Using a digital camera and a framebuffer with an 8x8 font can sometimes be a useful substitute for a serial console.

What were you doing when you triggered it?
Comment 16 David Woodhouse 2009-06-30 17:05:02 UTC
I think I've managed to reproduce it. If I run 'yum update' from a NFS-mounted repository while flood-pinging it, both over wireless, I get:

Message from syslogd@x200s-32 at Jun 30 16:33:36 ...
 kernel:[  221.468266] Uhhuh. NMI received for unknown reason b0 on CPU 0.

Message from syslogd@x200s-32 at Jun 30 16:33:36 ...
 kernel:[  221.468266] You have some hardware problem, likely on the PCI bus.

Message from syslogd@x200s-32 at Jun 30 16:33:36 ...
 kernel:[  221.468266] Dazed and confused, but trying to continue

I switched it back from WARN_ON() to BUG(), and it looks more like what you were originally reporting:

<2>[  326.588207] ERROR: DMA PTE for vPFN 0xcfff8 already set (to 34adc002)
<0>[  326.588304] ------------[ cut here ]------------
<2>[  326.588350] kernel BUG at drivers/pci/intel-iommu.c:1691!
<0>[  326.588400] invalid opcode: 0000 [#1] SMP 
<0>[  326.588450] last sysfs file: /sys/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/ieee80211/phy0/rfkill0/state
<4>[  326.588533] Modules linked in: nfs lockd nfs_acl auth_rpcgss sco bridge stp llc bnep l2cap bluetooth sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 acpi_cpufreq binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput iwlagn snd_hda_codec_conexant snd_hda_intel iwlcore snd_hda_codec mac80211 snd_pcm cfg80211 snd_timer snd soundcore rfkill snd_page_alloc psmouse e1000e sg led_class pcspkr battery processor i2c_i801 nvram thermal ac evdev button ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ehci_hcd usbcore nls_base [last unloaded: scsi_wait_scan]
<4>[  326.589008] 
<4>[  326.589008] Pid: 2244, comm: iwlagn Not tainted (2.6.31-rc1 #17) 7465CTO
<4>[  326.589008] EIP: 0060:[<c1147a21>] EFLAGS: 00010202 CPU: 0
<4>[  326.589008] EIP is at __domain_mapping+0x187/0x1ab
<4>[  326.589008] EAX: 00000004 EBX: 000cfff8 ECX: ffffffb4 EDX: 0182e000
<4>[  326.589008] ESI: 00000000 EDI: 00000004 EBP: f61f0ecc ESP: f61f0e98
<4>[  326.589008]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<0>[  326.589008] Process iwlagn (pid: 2244, ti=f61f0000 task=f62a8bf0 task.ti=f61f0000)
<0>[  326.589008] Stack:
<4>[  326.589008]  c134fb94 000cfff8 34adc002 00000000 f5c05a80 00000002 f5c05a80 00000000
<4>[  326.589008] <0> f4b20fc0 000cfff8 f7132800 34808000 f527b780 f61f0f08 c1149300 00034808
<4>[  326.589008] <0> 00000003 00000002 f61f0f04 c108caa4 c11f6972 00000000 f71e50c0 f5c05a80
<0>[  326.589008] Call Trace:
<4>[  326.589008]  [<c1149300>] ? __intel_map_single+0xe6/0x174
<4>[  326.589008]  [<c108caa4>] ? __kmalloc_track_caller+0xc8/0xca
<4>[  326.589008]  [<c11f6972>] ? __alloc_skb+0x29/0x109
<4>[  326.589008]  [<c11493bf>] ? intel_map_page+0x31/0x37
<4>[  326.589008]  [<f931def0>] ? iwl_rx_allocate+0xdf/0x192 [iwlcore]
<4>[  326.589008]  [<f931ef9a>] ? iwl_rx_replenish+0x12/0x38 [iwlcore]
<4>[  326.589008]  [<f93b8b95>] ? iwl_bg_rx_replenish+0x28/0x33 [iwlagn]
<4>[  326.589008]  [<f93b8b6d>] ? iwl_bg_rx_replenish+0x0/0x33 [iwlagn]
<4>[  326.589008]  [<c103c771>] ? worker_thread+0xd8/0x196
<4>[  326.589008]  [<c103fc73>] ? autoremove_wake_function+0x0/0x36
<4>[  326.589008]  [<c103c699>] ? worker_thread+0x0/0x196
<4>[  326.589008]  [<c103fa13>] ? kthread+0x6c/0x6e
<4>[  326.589008]  [<c103f9a7>] ? kthread+0x0/0x6e
<4>[  326.589008]  [<c100389f>] ? kernel_thread_helper+0x7/0x18
<0>[  326.589008] Code: 24 1c 89 4c 24 0c 8b 5c 24 24 89 5c 24 04 c7 04 24 94 fb 34 c1 e8 8c 11 12 00 a1 a4 42 39 c1 85 c0 74 08 83 e8 01 a3 a4 42 39 c1 <0f> 0b eb fe b8 f4 ff ff ff eb 8d 8b 7c 24 24 8b 5d 0c 8d 44 1f 
<0>[  326.589008] EIP: [<c1147a21>] __domain_mapping+0x187/0x1ab SS:ESP 0068:f61f0e98
<4>[  326.591597] ---[ end trace abcadcd058632ed7 ]---


It looks suspicious that only 32 bits of the PTE are set, not the full 64. We should probably try to make the code set PTEs atomically. The machine check may be a result of a half-set PMD, maybe...
Comment 17 David Woodhouse 2009-06-30 20:49:35 UTC
I'm using cmpxchg64() now, and still see the 'pte already set' problem.

It's always at virtual PFN values of just under 0xd0000. I wonder if the iova code is broken somehow and dishing out duplicates...
Comment 18 David Woodhouse 2009-07-01 06:18:58 UTC
I'm starting to suspect a DMA leak in the iwlwifi driver -- we end up with megabytes and megabytes of memory mapped for DMA. But there's definitely an iommu bug which is triggered by this, which I'm still trying to work out.

I also saw http://david.woodhou.se/cimg2750.jpg which is pure iwlwifi brokenness.
Comment 19 David Woodhouse 2009-07-01 06:57:03 UTC
OK, I think I caught the iova code at it. It works down from PFN 0xfffff, but seems to jump from 0xe0000 to 0xcffff, then it gives out 0xcfff6 and 0xcfff7 twice:

<4>[  236.344652] New low water e02b4 (domain f61ded10)
<4>[  236.636322] New low water e01b0 (domain f61ded10)
<4>[  248.724516] New low water e00ac (domain f61ded10)
<4>[  260.732357] Domain f61ded10 give iova cfffc-cffff
<4>[  260.733788] New low water cfffc (domain f61ded10)
<4>[  260.735267] iov_pfn cfffc, pte f4ba0fe0
<4>[  260.736690] iov_pfn cfffd, pte f4ba0fe8
<4>[  260.738096] iov_pfn cfffe, pte f4ba0ff0
<4>[  261.730755] Domain f61ded10 give iova cfff8-cfffb
<4>[  261.732273] iov_pfn cfff8, pte f4ba0fc0
<4>[  261.733784] iov_pfn cfff9, pte f4ba0fc8
<4>[  261.735260] iov_pfn cfffa, pte f4ba0fd0
<4>[  261.736979] Domain f61ded10 give iova cfff7-cfff7
<4>[  261.737712] iov_pfn cfff7, pte f4ba0fb8
<4>[  261.737712] Domain f61ded10 give iova cfff6-cfff6
<4>[  261.737712] iov_pfn cfff6, pte f4ba0fb0
<4>[  262.729558] Domain f61ded10 give iova cfff4-cfff7
<4>[  262.731290] Domain f61ded10 give iova cfff3-cfff3
<4>[  262.732123] iov_pfn cfff3, pte f4ba0f98
<4>[  262.732123] Domain f61ded10 give iova cfff2-cfff2
<4>[  262.732123] iov_pfn cfff2, pte f4ba0f90
<4>[  262.737511] Domain f61ded10 give iova cfff1-cfff1
<4>[  262.738488] iov_pfn cfff1, pte f4ba0f88
<4>[  262.738488] Domain f61ded10 give iova cfff0-cfff0
<4>[  262.738488] iov_pfn cfff0, pte f4ba0f80
<4>[  262.744048] Domain f61ded10 give iova cffef-cffef
<4>[  262.745023] iov_pfn cffef, pte f4ba0f78
<4>[  262.745023] Domain f61ded10 give iova cffee-cffee
<4>[  262.745023] iov_pfn cffee, pte f4ba0f70
<4>[  262.750842] Domain f61ded10 give iova cffed-cffed
<4>[  262.751820] iov_pfn cffed, pte f4ba0f68
<4>[  262.751820] Domain f61ded10 give iova cffec-cffec
<4>[  262.751820] iov_pfn cffec, pte f4ba0f60
<4>[  262.757893] Domain f61ded10 give iova cffeb-cffeb
<4>[  262.758871] iov_pfn cffeb, pte f4ba0f58
<4>[  262.758871] Domain f61ded10 give iova cffea-cffea
<4>[  262.758871] iov_pfn cffea, pte f4ba0f50
<4>[  262.765232] Domain f61ded10 give iova cffe9-cffe9
<4>[  262.766209] iov_pfn cffe9, pte f4ba0f48
<4>[  262.766209] Domain f61ded10 give iova cffe8-cffe8
<4>[  262.766209] iov_pfn cffe8, pte f4ba0f40
<4>[  262.772878] Domain f61ded10 give iova cffe7-cffe7
<4>[  262.773857] iov_pfn cffe7, pte f4ba0f38
<4>[  262.773857] Domain f61ded10 give iova cffe6-cffe6
<4>[  262.773857] iov_pfn cffe6, pte f4ba0f30
<4>[  262.780786] Domain f61ded10 give iova cffe5-cffe5
<4>[  262.781765] iov_pfn cffe5, pte f4ba0f28
<4>[  262.781765] Domain f61ded10 give iova cffe4-cffe4
<4>[  262.781765] iov_pfn cffe4, pte f4ba0f20
<4>[  262.788987] Domain f61ded10 give iova cffe3-cffe3
<4>[  262.789966] iov_pfn cffe3, pte f4ba0f18
<4>[  262.789966] Domain f61ded10 give iova cffe2-cffe2
<4>[  262.789966] iov_pfn cffe2, pte f4ba0f10
<4>[  262.797461] Domain f61ded10 give iova cffe1-cffe1
<4>[  262.798440] iov_pfn cffe1, pte f4ba0f08
<4>[  262.798440] Domain f61ded10 give iova cffe0-cffe0
<4>[  262.804065] Domain f61ded10 give iova cffdf-cffdf
<4>[  262.805045] Domain f61ded10 give iova cffde-cffde
<4>[  262.808486] Domain f61ded10 give iova cffdd-cffdd
<4>[  262.809466] Domain f61ded10 give iova cffdc-cffdc
<4>[  262.812902] Domain f61ded10 give iova cffdb-cffdb
<4>[  262.813882] Domain f61ded10 give iova cffda-cffda
<4>[  262.817318] Domain f61ded10 give iova cffd9-cffd9
<4>[  262.818297] Domain f61ded10 give iova cffd8-cffd8
<4>[  262.821861] iov_pfn cfff4, pte f4ba0fa0
<4>[  262.824094] iov_pfn cfff5, pte f4ba0fa8
<4>[  262.826305] iov_pfn cfff6, pte f4ba0fb0
<2>[  262.828516] ERROR: DMA PTE for vPFN 0xcfff6 already set (to 34c24001 not 234b0e002)
<0>[  262.830829] ------------[ cut here ]------------
<2>[  262.831794] kernel BUG at drivers/pci/intel-iommu.c:1722!
<0>[  262.831794] invalid opcode: 0000 [#1] SMP 
<4>[  262.831794] Pid: 1879, comm: iwlagn Not tainted (2.6.31-rc1 #33) 7465CTO
<4>[  262.831794] EIP: 0060:[<c11478b3>] EFLAGS: 00010202 CPU: 0
<4>[  262.831794] EIP is at __domain_mapping+0x1c0/0x202
<4>[  262.831794] EAX: 00000004 EBX: 000cfff6 ECX: fffff55a EDX: 00000046
<4>[  262.831794] ESI: f4ba0fb0 EDI: 00000000 EBP: f63abecc ESP: f63abe98
<4>[  262.831794]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<0>[  262.831794] Process iwlagn (pid: 1879, ti=f63aa000 task=f60f17e0 task.ti=f63aa000)
<4>[  262.831794] 
<0>[  262.831794] EIP: [<c11478b3>] __domain_mapping+0x1c0/0x202 SS:ESP 0068:f63abe98
<4>[  262.860075] ---[ end trace 10284dc5c7c5facf ]---
<4>[  264.736081] Domain f61ded10 free iova cfffc-cffff
<4>[  265.741066] Domain f61ded10 free iova cfff8-cfffb
Comment 20 David Woodhouse 2009-07-01 13:48:46 UTC
Created attachment 22162 [details]
potential fix

This patch should fix the problem in the IOMMU code.

Still chasing the apparent leak in iwlwifi -- that might be because it's passing the wrong DMA address when it unmaps the buffers.
Comment 21 David Woodhouse 2009-07-01 20:10:44 UTC
The 'leak' is fixed by http://lkml.org/lkml/2009/7/1/389 btw.
Comment 22 David Woodhouse 2009-07-04 23:48:16 UTC
*** Bug 12222 has been marked as a duplicate of this bug. ***
Comment 23 Rafael J. Wysocki 2009-07-06 23:52:35 UTC
Fixed by commit a15a519ed6e5e644f5a33c213c00b0c1d3cfe683 .

Note You need to log in before you can comment on or make changes to this bug.