Bug 108251

Summary: tpm_crb / DMAR: DRHD: handling fault status reg 3
Product: Drivers
Reporter: Pierre Chifflier (chifflier)
Component: Other
Assignee: drivers_other
Status: NEW
Severity: normal
CC: bigon, bordjukov, dwmw2, jarkko.sakkinen, klondike+kernel
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.2.6
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg
  DMAR table
  DMAR.dsl
  Patch to alias function 0 to function 7 on the HECI

Description Pierre Chifflier 2015-11-21 16:32:43 UTC
Hi,

On a 4.2.6 kernel with the IOMMU enabled (intel_iommu=on,igfx_off), the TPM 2 device is not detected by the tpm_crb module. It works if the IOMMU is disabled.

dmesg when running 'modprobe -v tpm_crb':
[47849.248992] DMAR: DRHD: handling fault status reg 3
[47849.249001] DMAR: DMAR:[DMA Read] Request device [00:16.7] fault addr ccdff000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[47849.252147] DMAR: DRHD: handling fault status reg 3
[47849.252157] DMAR: DMAR:[DMA Write] Request device [00:16.7] fault addr ccdff000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[47851.251366] tpm_crb MSFT0101:00: Operation Timed out
[47851.251382] tpm_crb: probe of MSFT0101:00 failed with error -62
[47851.252014] DMAR: DRHD: handling fault status reg 3
[47851.252018] DMAR: DMAR:[DMA Read] Request device [00:16.7] fault addr ccdff000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[47851.252021] DMAR: DRHD: handling fault status reg 3
[47851.252024] DMAR: DMAR:[DMA Read] Request device [00:16.7] fault addr ccdff000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[47851.254488] DMAR: DRHD: handling fault status reg 3
[47851.254494] DMAR: DMAR:[DMA Write] Request device [00:16.7] fault addr ccdff000 
               DMAR:[fault reason 02] Present bit in context entry is clear
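
To see at a glance which requester and address the faults involve, the log lines above can be condensed with a small shell sketch. It assumes the message format shown in this report (4.x kernels; newer kernels word the line differently), and `dmar_fault_summary` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: condense DMAR fault lines read from stdin
# (for example, piped from `dmesg`) into one line per requester,
# fault address and access type, followed by a count.
# Assumed input format (4.x-era kernels):
#   DMAR:[DMA Read] Request device [bb:dd.f] fault addr <hex>
dmar_fault_summary() {
    sed -n 's/.*DMAR:\[DMA \([A-Za-z]*\)] Request device \[\([0-9a-f:.]*\)] fault addr \([0-9a-f]*\).*/\2 \3 \1/p' |
        sort | uniq -c |
        awk '{ printf "%s addr=0x%s %s count=%d\n", $2, $3, $4, $1 }'
}
```

Typical use would be `dmesg | dmar_fault_summary`. For the excerpt above it should report three reads and two writes, all from 00:16.7 to 0xccdff000.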


The hardware is a Lenovo X250, with the TPM 2.0 option selected under Security Chip in the UEFI menu.

# lspci
00:00.0 Host bridge: Intel Corporation Broadwell-U Host Bridge -OPI (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Broadwell-U Integrated Graphics (rev 09)
00:03.0 Audio device: Intel Corporation Broadwell-U Audio Controller (rev 09)
00:14.0 USB controller: Intel Corporation Wildcat Point-LP USB xHCI Controller (rev 03)
00:16.0 Communication controller: Intel Corporation Wildcat Point-LP MEI Controller #1 (rev 03)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (3) I218-LM (rev 03)
00:1b.0 Audio device: Intel Corporation Wildcat Point-LP High Definition Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #6 (rev e3)
00:1c.1 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3)
00:1d.0 USB controller: Intel Corporation Wildcat Point-LP USB EHCI Controller (rev 03)
00:1f.0 ISA bridge: Intel Corporation Wildcat Point-LP LPC Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03)
00:1f.3 SMBus: Intel Corporation Wildcat Point-LP SMBus Controller (rev 03)
00:1f.6 Signal processing controller: Intel Corporation Wildcat Point-LP Thermal Management Controller (rev 03)
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5227 PCI Express Card Reader (rev 01)
03:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)

# for x in /sys/bus/acpi/devices/MSFT0101\:00/*; do echo $x; cat $x; echo ------; done
/sys/bus/acpi/devices/MSFT0101:00/description
TPM 2.0 Device
------
/sys/bus/acpi/devices/MSFT0101:00/hid
MSFT0101
------
/sys/bus/acpi/devices/MSFT0101:00/modalias
acpi:MSFT0101:
------
/sys/bus/acpi/devices/MSFT0101:00/path
\_SB_.TPM_
------
/sys/bus/acpi/devices/MSFT0101:00/physical_node
cat: /sys/bus/acpi/devices/MSFT0101:00/physical_node: Is a directory
------
/sys/bus/acpi/devices/MSFT0101:00/power
cat: /sys/bus/acpi/devices/MSFT0101:00/power: Is a directory
------
/sys/bus/acpi/devices/MSFT0101:00/status
15
------
/sys/bus/acpi/devices/MSFT0101:00/subsystem
cat: /sys/bus/acpi/devices/MSFT0101:00/subsystem: Is a directory
------
/sys/bus/acpi/devices/MSFT0101:00/uevent
MODALIAS=acpi:MSFT0101:
------
Comment 1 Pierre Chifflier 2016-02-09 08:31:42 UTC
Quick update: the bug is still present in 4.3.5.
Also, adding the TPM 2 driver author to CC.
Comment 2 jarkko.sakkinen 2016-02-09 09:24:32 UTC
Thanks for reporting this. We have some luck since I happen to have the very same laptop.
Comment 3 David Woodhouse 2016-02-10 08:05:08 UTC
Please could I see the full dmesg? I'd like to see what the ACPI tables tell the IOMMU code about the mapping of ACPI devices to PCI dev/fn. And also I'd like to see what is at address 0xccdff000 — is that reserved memory?
Comment 4 Pierre Chifflier 2016-02-10 10:32:22 UTC
Created attachment 203281
dmesg

Dmesg, kernel 4.3.5

Boot is in UEFI mode only (CSM disabled)
Comment 5 David Woodhouse 2016-02-10 10:59:08 UTC
Thanks. So...

[    0.000000] BIOS-e820: [mem 0x00000000ccdfe000-0x00000000ccdfefff] usable
[    0.000000] BIOS-e820: [mem 0x00000000f80f8000-0x00000000f80f8fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved

A chunk of usable memory ends at 0xccdfefff, and our faulting address 0xccdff000 is right after that. I think this is the TPM 'control area'.
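
The same check can be repeated against a full boot log with a short shell sketch. It scans `BIOS-e820` lines on stdin and prints every reported range that contains a given address; printing nothing means the address falls in a hole that the firmware did not describe. `e820_lookup` is a hypothetical helper name, and the line format is assumed to match the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print each BIOS-e820 range (read from stdin, e.g.
# dmesg output) that contains ADDR. No output means ADDR is not covered
# by any range the firmware reported.
e820_lookup() {
    addr=$1
    sed -n 's/.*BIOS-e820: \[mem \(0x[0-9a-f]*\)-\(0x[0-9a-f]*\)] \(.*\)$/\1 \2 \3/p' |
        while read -r start end kind; do
            # Ranges in the e820 map are inclusive on both ends.
            if [ "$(( $addr >= $start && $addr <= $end ))" -eq 1 ]; then
                echo "$start-$end $kind"
            fi
        done
}
```

For example, `dmesg | e820_lookup 0xccdff000` would show whether the faulting address is covered by any entry at all.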

(Btw, isn't that a BIOS bug? This memory should be explicitly marked as reserved, rather than just left out of the E820 map; otherwise, we might attempt to map PCI BARs over it. But I suspect that's harmless in this case.)

It looks like the DMA from the TPM appears as if it comes from (non-existent) PCI device 00:16.7. That's normal: the firmware is supposed to provide that information in ANDD records in the DMAR table, mapping each DMA-capable ACPI device to the PCI bus/dev/fn that it appears as. Your BIOS doesn't seem to be providing that, though.

I also suspect that the BIOS is supposed to ask the OS to set up a 1:1 mapping for the control area with an RMRR record in the DMAR table. It doesn't look like it's doing so. Can you attach the contents of /sys/firmware/acpi/tables/DMAR please?

It's possible that the BIOS *is* doing the right thing, but that we don't obey because there isn't actually a PCI device at 00:16.7. I'd just like to check.
Comment 6 Pierre Chifflier 2016-02-10 12:37:43 UTC
Created attachment 203291
DMAR table

/sys/firmware/acpi/tables/DMAR

Device 16 is the MEI, which provides the vTPM

I don't know how to parse the DMAR table, so I provided the binary.
Comment 7 jarkko.sakkinen 2016-02-10 13:04:13 UTC
Communication with PTT goes through an ACPI function that transfers control to SMM and communicates with the ME.

From /sys/firmware/acpi/tables/TPM2 it is easy to check the address of the control area.
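
A minimal shell sketch of that check follows. It assumes the usual TPM2 ACPI table layout (36-byte ACPI header, 2-byte platform class, 2 reserved bytes, then the 64-bit control area address at byte offset 40) and a little-endian host, since `od` decodes in host byte order. `tpm2_control_area` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the control area address stored in a TPM2
# ACPI table. Assumes the address is the 8-byte little-endian field at
# byte offset 40 and that the host is little-endian (x86).
tpm2_control_area() {
    table=${1:-/sys/firmware/acpi/tables/TPM2}
    printf '0x%s\n' "$(od -An -v -tx8 -j40 -N8 "$table" | tr -d ' \n')"
}
```

Reading the table from sysfs normally requires root. The printed value can then be compared with the fault address in the DMAR messages.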
Comment 8 jarkko.sakkinen 2016-02-10 13:27:37 UTC
On my laptop (x250), if I set intel_iommu=on, the driver times out when it first tries to communicate with the TPM and waits for the response. I do not see the above errors.

BTW, have you ever updated your BIOS for that laptop?
Comment 9 jarkko.sakkinen 2016-02-10 13:49:03 UTC
> Created attachment 203291
> DMAR table
> 
> /sys/firmware/acpi/tables/DMAR
> 
> Device 16 is the MEI, which provides the vTPM
> 
> I don't know how to parse the DMAR table, so I provided the binary.

There is a tool called iasl that you can use to disassemble ACPI tables.

I can't find an entry for that device in the DMAR table.
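
Besides disassembling with `iasl -d`, the remapping structures in a binary DMAR dump can be listed directly. The sketch below assumes the layout from the VT-d specification (48 bytes of header fields, then a sequence of structures that each start with a 16-bit type and a 16-bit length) and a little-endian host. `read_uint` and `dmar_list` are hypothetical helper names.

```shell
#!/bin/sh
# Read an unsigned integer of SIZE bytes at OFFSET from FILE.
# od decodes in host byte order, so a little-endian host is assumed.
read_uint() {  # usage: read_uint FILE OFFSET SIZE
    od -An -v -tu"$3" -j"$2" -N"$3" "$1" | tr -d ' \n'
}

# Hypothetical helper: list the remapping structures in a DMAR table dump.
dmar_list() {
    file=$1
    total=$(read_uint "$file" 4 4)   # table length from the ACPI header
    [ -n "$total" ] || return 1
    off=48                           # first remapping structure
    while [ "$off" -lt "$total" ]; do
        stype=$(read_uint "$file" "$off" 2)
        len=$(read_uint "$file" $((off + 2)) 2)
        case $stype in
            0) name=DRHD ;;
            1) name=RMRR ;;
            2) name=ATSR ;;
            3) name=RHSA ;;
            4) name=ANDD ;;
            *) name="type $stype" ;;
        esac
        echo "$name offset=$off length=$len"
        [ "$len" -gt 0 ] || break    # avoid looping forever on a corrupt table
        off=$((off + len))
    done
}
```

Running `dmar_list` on a copy of /sys/firmware/acpi/tables/DMAR shows quickly whether any RMRR or ANDD structures are present at all; the disassembly is still needed to see which devices and address ranges they cover.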
Comment 10 jarkko.sakkinen 2016-02-10 13:49:57 UTC
Created attachment 203301
DMAR.dsl
Comment 11 David Woodhouse 2016-02-10 14:39:44 UTC
So the BIOS isn't *asking* us to permit DMA from 00:16.7 to 0xccdff000. And thus we don't configure the IOMMU to permit it, and so stuff doesn't work.

Having diagnosed a BIOS bug, I probably now have about three years until it gets fixed (and then only if you buy new hardware). During which time I can fix the Linux bug that if the BIOS *had* asked for such a mapping, we wouldn't have honoured it anyway because there is no *actual* PCI device at 00:16.7 :)

Jarkko, no idea why you don't see the same faults; perhaps some deep chipset/firmware magic hides them from you. But the symptoms for the TPM look the same; that DMA to the control area is blocked.
Comment 12 Pierre Chifflier 2016-02-10 16:21:16 UTC
(In reply to jarkko.sakkinen from comment #8)
> On my laptop (x250), if I set intel_iommu=on, the driver times out when it
> first tries to communicate with the TPM and waits for the response. I do not
> see the above errors.
> 
> BTW, have you ever updated your BIOS for that laptop?

Yes, current version is 1.17
Comment 13 jarkko.sakkinen 2016-08-22 13:53:03 UTC
(In reply to David Woodhouse from comment #11)
> So the BIOS isn't *asking* us to permit DMA from 00:16.7 to 0xccdff000. And
> thus we don't configure the IOMMU to permit it, and so stuff doesn't work
> 
> Having diagnosed a BIOS bug, I probably now have about three years until it
> gets fixed (and then only if you buy new hardware). During which time I can
> fix the Linux bug that if the BIOS *had* asked for such a mapping, we
> wouldn't have honoured it anyway because there is no *actual* PCI device at
> 00:16.7 :)
> 
> Jarkko, no idea why you don't see the same faults; perhaps some deep
> chipset/firmware magic hides them from you. But the symptoms for the TPM
> look the same; that DMA to the control area is blocked.

I know why now. Starting with Skylake, DMA is not used; the
communication is MMIO-based. That's why I didn't have such a problem.

Should this be worked around somehow? Just thinking what I should do
with this bug.
Comment 14 Laurent Bigonville 2021-11-15 15:41:53 UTC
Hello,

I'm still experiencing a similar issue with my Lenovo T550, even with the latest firmware.

Can something be done in the kernel for this? Could it be reported to Lenovo somehow?

I'm sure this will cause trouble for people who try to dual-boot with Windows 11 (as it requires TPM 2.0).
Comment 15 klondike 2022-10-23 20:37:05 UTC
(In reply to David Woodhouse from comment #11)
> Having diagnosed a BIOS bug, I probably now have about three years until it
> gets fixed (and then only if you buy new hardware). During which time I can
> fix the Linux bug that if the BIOS *had* asked for such a mapping, we
> wouldn't have honoured it anyway because there is no *actual* PCI device at
> 00:16.7 :)

The firmware bug can be addressed by overriding the DMAR ACPI table using an initrd (as I have already done). This still does not address the issue that Linux is not honouring the BIOS request, so... How can this second issue be solved?
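
For readers who want to try the override, here is a minimal sketch of the initrd step. It relies on the kernel's ACPI table upgrade mechanism (CONFIG_ACPI_TABLE_UPGRADE), which looks for tables under kernel/firmware/acpi/ in an uncompressed cpio archive placed at the start of the initrd. `make_override_initrd` is a hypothetical helper name, all file names are placeholders, and the corrected table is assumed to have been compiled to .aml with iasl already.

```shell
#!/bin/sh
# Hypothetical helper: build NEW_INITRD as an uncompressed cpio archive
# holding the replacement table, followed by the original initrd.
make_override_initrd() {  # usage: make_override_initrd TABLE.aml OLD_INITRD NEW_INITRD
    table=$1 old=$2 new=$3
    work=$(mktemp -d) || return 1
    mkdir -p "$work/kernel/firmware/acpi" &&
        cp "$table" "$work/kernel/firmware/acpi/" &&
        (cd "$work" && find kernel | cpio -H newc --create) > "$new" &&
        cat "$old" >> "$new"
    status=$?
    rm -rf "$work"
    return $status
}
```

After booting with the new initrd, the kernel log should mention that the DMAR table was overridden. As noted above, this only supplies the missing firmware data; the kernel still has to act on it.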
Comment 16 klondike 2022-11-20 23:25:21 UTC
I patched the tables using an ACPI device definition, along with a patch that added support for RMRRs using them.

Baolu instead suggested using a PCI quirk to allow the RMRR to apply to the (hidden) function used by the HECI. The resulting patch is significantly smaller, both in the kernel and in the ACPI tables.

If anybody here wants to test the patch, I can provide the patched DMAR table for a ThinkPad X240. But it might be better if you patch the table yourself (I'll try to write up a post on how to do so).
Comment 17 klondike 2022-11-21 14:08:30 UTC
Created attachment 303249
Patch to alias function 0 to function 7 on the HECI