Bug 111781

Summary: Macbook Apple firmware boot leaves wifi DMA on resulting in chaos
Product: Platform Specific/Hardware
Reporter: Chris Bainbridge (chris.bainbridge)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX
Severity: normal
CC: bugzilla, lukas
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.4
Subsystem:
Regression: No
Bisected commit-id:

Description Chris Bainbridge 2016-02-02 19:15:32 UTC
MacBookPro 10,2/Mac 
BIOS MBP102.88Z.0106.B0A.1509130955 09/13/2015
03:00.0 Network controller: Broadcom Corporation BCM4331 802.11a/b/g/n (rev 02)


Using the latest Apple OS X and firmware results in memory corruption. Symptoms include many errors, such as:

 - 2 file system corruptions in 1 month, leaving the root fs unbootable
 - processes randomly segfault
 - memory errors, especially at shutdown or reboot, resulting in hangs (postgresql "PANIC:  stuck spinlock" etc.)
 - BUG: Bad page map in process sshd (and others)
 - BUG: Bad rss-counter state
 - BUG: unable to handle kernel NULL pointer dereference
 - swap_free: Bad swap file entry
 - Oops: 0000 [#1] SMP DEBUG_PAGEALLOC


After enabling the IOMMU (iommu=force intel_iommu=on), the kernel reports DMAR errors on most boots:

[   41.971617] DMAR: DRHD: handling fault status reg 3
[   41.971629] DMAR: DMAR:[DMA Write] Request device [03:00.0] fault addr 85229000 
DMAR:[fault reason 01] Present bit in root entry is clear
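
To check several boot logs for the same pattern, the fault lines above can be picked out mechanically. The following is a minimal Python sketch, assuming the message layout is exactly as in the excerpt above; the names `DmarFault` and `parse_dmar_faults` are hypothetical helpers for illustration, not part of any kernel tool, and other kernel versions may format these messages differently.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class DmarFault(NamedTuple):
    direction: str          # "Read" or "Write"
    device: str             # PCI address as printed, e.g. "03:00.0"
    addr: int               # faulting DMA address
    reason: Optional[str]   # text of the following "fault reason" line, if any


# Patterns modelled on the log excerpt in this report (an assumption about
# the format, not a specification of it).
_REQUEST = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-fA-F:.]+)\] "
    r"fault addr ([0-9a-fA-F]+)"
)
_REASON = re.compile(r"DMAR:\[fault reason (\d+)\]\s*(.*)")


def parse_dmar_faults(lines: Iterable[str]) -> List[DmarFault]:
    """Collect DMAR faults, attaching each reason line to the fault before it."""
    faults: List[DmarFault] = []
    for line in lines:
        req = _REQUEST.search(line)
        if req:
            faults.append(
                DmarFault(req.group(1), req.group(2), int(req.group(3), 16), None)
            )
            continue
        reason = _REASON.search(line)
        if reason and faults and faults[-1].reason is None:
            faults[-1] = faults[-1]._replace(reason=reason.group(2).strip())
    return faults


SAMPLE = [
    "[   41.971617] DMAR: DRHD: handling fault status reg 3",
    "[   41.971629] DMAR: DMAR:[DMA Write] Request device [03:00.0] fault addr 85229000 ",
    "DMAR:[fault reason 01] Present bit in root entry is clear",
]

if __name__ == "__main__":
    for fault in parse_dmar_faults(SAMPLE):
        print(fault.device, fault.direction, hex(fault.addr), fault.reason)
```

Filtering the result on `device == "03:00.0"` shows whether the faulting requests come from the BCM4331 listed above.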


One boot log shows this error occurring 0.7 seconds into the boot, so there is a strong possibility of memory corruption occurring before the wifi driver can turn off the DMA.

Apparently this bug was found before in 2012 - https://mjg59.dreamwidth.org/11235.html - comments suggest that Apple fixed it since then, so this could be a regression. On the other hand, maybe nobody noticed it since the bugs are so random and Intel IOMMU isn't enabled by default.
Comment 1 Chris Bainbridge 2016-02-02 19:26:43 UTC
Obviously this is a bug in the firmware so the options are limited, though enabling IOMMU by default would help. Maybe there should be a policy to taint the kernel when a known bad firmware version is detected.
Comment 2 Chris Murphy 2016-02-04 20:24:23 UTC
It would be a new problem, though, if this is no longer true: "Yes, this seems to stop once the driver is loaded."
Comment 3 Chris Bainbridge 2016-02-04 20:29:40 UTC
It stops when the driver is loaded, but that can take several seconds, and until then memory can be corrupted. There is also the issue of people who blacklist the driver or don't install it for whatever reason.
Comment 5 Lukas Wunner 2016-08-10 14:04:13 UTC
Fixed in stable kernels 4.6.6, 4.4.17, 4.1.30, 3.18.39
Comment 6 Lukas Wunner 2016-11-14 10:04:43 UTC
Fixed in upcoming stable kernels 3.16.39, 3.2.84
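
The stable releases listed in comments 5 and 6 can be compared against a running kernel's version string. Below is a rough Python sketch, assuming an upstream-style version such as `4.4.17` or `4.6.6-1-generic`; the names `FIXED_IN` and `has_fix` are hypothetical. Series not named in this report return `None` rather than a guess, and distribution kernels may carry backports that a version comparison cannot see.

```python
import re
from typing import Dict, Optional, Tuple

# First fixed patch level per stable series, as listed in comments 5 and 6.
FIXED_IN: Dict[Tuple[int, int], int] = {
    (4, 6): 6,
    (4, 4): 17,
    (4, 1): 30,
    (3, 18): 39,
    (3, 16): 39,
    (3, 2): 84,
}

_VERSION = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def has_fix(release: str) -> Optional[bool]:
    """Return True/False for series listed in this report, None otherwise."""
    match = _VERSION.match(release)
    if not match:
        return None
    series = (int(match.group(1)), int(match.group(2)))
    patch = int(match.group(3)) if match.group(3) else 0
    if series not in FIXED_IN:
        return None  # not covered by the comments in this report
    return patch >= FIXED_IN[series]


if __name__ == "__main__":
    import platform
    print(platform.release(), has_fix(platform.release()))
```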