Bug 14003

Summary: Infinite loop on bootup while handling DMAR
Product: Drivers
Reporter: Bernhard Rosenkraenzer (bero)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: CLOSED DOCUMENTED
Severity: normal
CC: dwmw2, linux-bugs, Matt_Domsch, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31-rc6
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 13615

Description Bernhard Rosenkraenzer 2009-08-18 14:54:20 UTC
Booting 64-bit 2.6.31-rc6 on an HP xw4600 results in an infinite loop of

DMAR:[DMA READ] Request device [ff:1f.7] fault addr fffffffffffff000
DMAR:[fault reason 255] Unknown

even with iommu=soft (2.6.30 could boot with iommu=soft, but produced the same error with the hardware IOMMU enabled).
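
The values in the two log lines above are exactly what all-ones data would decode to. A minimal sketch, assuming the source ID is a standard 16-bit PCI requester ID (8-bit bus, 5-bit device, 3-bit function), the fault reason is an 8-bit field, and the fault address is reported aligned to 4 KiB. The helper names are hypothetical and this is not the kernel's code:

```python
# Hypothetical helpers for illustration only; not taken from the kernel source.

def format_bdf(source_id: int) -> str:
    """Format a 16-bit PCI requester ID as bus:device.function."""
    bus = (source_id >> 8) & 0xFF       # upper 8 bits
    device = (source_id >> 3) & 0x1F    # next 5 bits
    function = source_id & 0x07         # low 3 bits
    return f"{bus:02x}:{device:02x}.{function:x}"


def page_align_4k(addr: int) -> int:
    """Clear the low 12 bits of a 64-bit address."""
    return addr & 0xFFFFFFFFFFFFF000


# Assumption: every field of the fault record read back as all ones.
ALL_ONES_16 = 0xFFFF
ALL_ONES_8 = 0xFF
ALL_ONES_64 = 0xFFFFFFFFFFFFFFFF

print(format_bdf(ALL_ONES_16))             # device field of the log line
print(ALL_ONES_8)                          # fault reason
print(f"{page_align_4k(ALL_ONES_64):x}")   # fault address
```

The three printed values match the log (ff:1f.7, 255, fffffffffffff000), which would be consistent with a fault record that never contained real data.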

2.6.30 reports this about the DMAR ACPI tables:

ACPI: DMAR 00000000defc247f 00158 (v01 COMPAQ BEARLX38 00000001      00000000)
DMAR:Host address width 36
DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
DMAR:RMRR base: 0x00000000defd0000 end: 0x00000000defd0fff
DMAR:RMRR base: 0x00000000defd1000 end: 0x00000000defd1fff
DMAR:RMRR base: 0x00000000defd2000 end: 0x00000000defd2fff
DMAR:RMRR base: 0x00000000defd3000 end: 0x00000000defd3fff
DMAR:RMRR base: 0x00000000defd4000 end: 0x00000000defd4fff
DMAR:RMRR base: 0x00000000defd5000 end: 0x00000000defd5fff
DMAR:RMRR base: 0x00000000defd6000 end: 0x00000000defd6fff
DMAR:RMRR base: 0x00000000defd7000 end: 0x00000000defd7fff
Not all IO-APIC's listed under remapping hardware

Device ff:1f.7 doesn't actually exist; lspci -n says:

00:00.0 0600: 8086:29e0
00:01.0 0604: 8086:29e1
00:1a.0 0c03: 8086:2937 (rev 02)
00:1a.1 0c03: 8086:2938 (rev 02)
00:1a.2 0c03: 8086:2939 (rev 02)
00:1a.7 0c03: 8086:293c (rev 02)
00:1b.0 0403: 8086:293e (rev 02)
00:1c.0 0604: 8086:2940 (rev 02)
00:1c.4 0604: 8086:2948 (rev 02)
00:1c.5 0604: 8086:294a (rev 02)
00:1d.0 0c03: 8086:2934 (rev 02)
00:1d.1 0c03: 8086:2935 (rev 02)
00:1d.2 0c03: 8086:2936 (rev 02)
00:1d.7 0c03: 8086:293a (rev 02)
00:1e.0 0604: 8086:244e (rev 92)
00:1f.0 0601: 8086:2916 (rev 02)
00:1f.2 0106: 8086:2922 (rev 02)
01:00.0 0300: 1002:7187
01:00.1 0380: 1002:71a7
10:0b.0 0c00: 11c1:5811 (rev 61)
3f:00.0 0200: 14e4:167b (rev 02)
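
The check that the faulting address is absent from this listing can be sketched as follows. The listing is abbreviated to a few of the lines above, and the function name is hypothetical:

```python
# Abbreviated copy of the `lspci -n` output from this report.
LSPCI_N_OUTPUT = """\
00:00.0 0600: 8086:29e0
00:1f.0 0601: 8086:2916 (rev 02)
00:1f.2 0106: 8086:2922 (rev 02)
01:00.0 0300: 1002:7187
3f:00.0 0200: 14e4:167b (rev 02)
"""


def listed_devices(lspci_text: str) -> set:
    """Collect the bus:device.function field (first column) of each line."""
    return {line.split()[0] for line in lspci_text.splitlines() if line.strip()}


devices = listed_devices(LSPCI_N_OUTPUT)
print("ff:1f.7" in devices)   # the address from the DMAR fault message
print("00:1f.2" in devices)   # a device that really is present
```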
Comment 1 Andrew Morton 2009-08-19 21:26:42 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 18 Aug 2009 14:54:22 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14003
> 
>            Summary: Infinite loop on bootup while handling DMAR

That's a box-killing post-2.6.30 regression.

>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.31-rc6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: PCI
>         AssignedTo: drivers_pci@kernel-bugs.osdl.org
>         ReportedBy: bero@arklinux.org
>         Regression: Yes
> 
> 
> Booting 64bit 2.6.31-rc6 on a hp xw4600 results in an infinite loop of
> 
> DMAR:[DMA READ] Request device [ff:1f.7] fault addr fffffffffffff000
> DMAR:[fault reason 255] Unknown
> 
> even with iommu=soft (2.6.30 could boot with iommu=soft, but would result in
> the same error with hardware iommu enabled).
> 
> 2.6.30 reports this about the DMAR ACPI tables:
> 
> ACPI: DMAR 00000000defc247f 00158 (v01 COMPAQ BEARLX38 00000001     
> 00000000)
> DMAR:Host address width 36
> DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
> DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
> DMAR:RMRR base: 0x00000000defd0000 end: 0x00000000defd0fff
> DMAR:RMRR base: 0x00000000defd1000 end: 0x00000000defd1fff
> DMAR:RMRR base: 0x00000000defd2000 end: 0x00000000defd2fff
> DMAR:RMRR base: 0x00000000defd3000 end: 0x00000000defd3fff
> DMAR:RMRR base: 0x00000000defd4000 end: 0x00000000defd4fff
> DMAR:RMRR base: 0x00000000defd5000 end: 0x00000000defd5fff
> DMAR:RMRR base: 0x00000000defd6000 end: 0x00000000defd6fff
> DMAR:RMRR base: 0x00000000defd7000 end: 0x00000000defd7fff
> Not all IO-APIC's listed under remapping hardware
> 
> Device ff:1f.7 doesn't actually exist, lspci -n says
> 
> 00:00.0 0600: 8086:29e0
> 00:01.0 0604: 8086:29e1
> 00:1a.0 0c03: 8086:2937 (rev 02)
> 00:1a.1 0c03: 8086:2938 (rev 02)
> 00:1a.2 0c03: 8086:2939 (rev 02)
> 00:1a.7 0c03: 8086:293c (rev 02)
> 00:1b.0 0403: 8086:293e (rev 02)
> 00:1c.0 0604: 8086:2940 (rev 02)
> 00:1c.4 0604: 8086:2948 (rev 02)
> 00:1c.5 0604: 8086:294a (rev 02)
> 00:1d.0 0c03: 8086:2934 (rev 02)
> 00:1d.1 0c03: 8086:2935 (rev 02)
> 00:1d.2 0c03: 8086:2936 (rev 02)
> 00:1d.7 0c03: 8086:293a (rev 02)
> 00:1e.0 0604: 8086:244e (rev 92)
> 00:1f.0 0601: 8086:2916 (rev 02)
> 00:1f.2 0106: 8086:2922 (rev 02)
> 01:00.0 0300: 1002:7187
> 01:00.1 0380: 1002:71a7
> 10:0b.0 0c00: 11c1:5811 (rev 61)
> 3f:00.0 0200: 14e4:167b (rev 02)
>
Comment 2 David Woodhouse 2009-08-19 22:01:57 UTC
Looking at the log from 2.6.30, it seems that the iommu wasn't being used. I
don't think this is a regression -- it's just another BIOS written by
incompetents and shipped without any QA.

There should be a workaround for this in
http://git.infradead.org/iommu-2.6.git/commit/0815565a
but it can't _work_ -- it'll just disable the IOMMU after bitching a bit.
Comment 3 Suresh B Siddha 2009-08-20 01:16:25 UTC
On Wed, 2009-08-19 at 14:26 -0700, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Tue, 18 Aug 2009 14:54:22 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=14003
> > 
> >            Summary: Infinite loop on bootup while handling DMAR
> 
> That's a box-killing post-2.6.30 regression.
> 
> > 
> > Booting 64bit 2.6.31-rc6 on a hp xw4600 results in an infinite loop of
> > 
> > DMAR:[DMA READ] Request device [ff:1f.7] fault addr fffffffffffff000
> > DMAR:[fault reason 255] Unknown

I think David has a fix queued up for this already. Please check
http://git.infradead.org/iommu-2.6.git/commit/0815565a

thanks,
suresh
Comment 4 David Woodhouse 2009-08-20 07:53:09 UTC
On Wed, 2009-08-19 at 14:26 -0700, Andrew Morton wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=14003
> >            Summary: Infinite loop on bootup while handling DMAR
> 
> That's a box-killing post-2.6.30 regression.

It's a BIOS bug -- the user's BIOS is written by idiots who obviously
shipped it without any QA whatsoever.

As far as I can tell, 2.6.30 aborted early because of a _different_ BIOS
bug, but now we cope with that particular bug and we fall over the next
bug. Or just come across them in a different order.

The IOMMU on this board can _never_ have worked. Just disable it.

Or use a board with open source firmware available, and this kind of
crap won't happen. (At least if it does, you'll be able to fix it).
Comment 5 Bernhard Rosenkraenzer 2009-08-20 08:16:23 UTC
Confirmed that the patch fixes it (as far as "fixing" it by turning off the iommu goes).
BIOS developers should be forced to use real operating systems...
Comment 6 Anonymous Emailer 2009-08-20 08:41:28 UTC
Reply-To: paravoid@debian.org

David Woodhouse wrote:
> On Wed, 2009-08-19 at 14:26 -0700, Andrew Morton wrote:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=14003
>>>            Summary: Infinite loop on bootup while handling DMAR
>> That's a box-killing post-2.6.30 regression.
> 
> It's a BIOS bug -- the user's BIOS is written by idiots who obviously
> shipped it without any QA whatsoever.

Matt may be wondering why he was addressed, since the bugzilla entry
mentions only a bug in HP's BIOS.

I'm experiencing the same bug on a newly bought Dell Optiplex 760 with
BIOS version A03, as explained in my mail in lkml,
<4A89CB52.4030008@debian.org>, subject "[regression, bisected] fails to
boot on Dell Optiplex 760 with VT-d enabled".

Thanks,
Faidon
Comment 7 David Woodhouse 2009-08-20 08:46:00 UTC
No, using a real operating system isn't relevant. This would break VT-d under Windows too -- I don't think there's anything the OS can do other than just ignore the broken DMAR tables and run without the IOMMU.

So the offending BIOS teams obviously didn't bother to test this AT ALL.
Comment 8 Rafael J. Wysocki 2009-08-20 14:35:19 UTC
Closing, since this is not a kernel bug.