Bug 203485

Summary: PCI can't map correct memory resource if the BAR is 64-bit and then leads to system hang during booting up
Product: Drivers Reporter: AceLan Kao (acelan)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: RESOLVED INVALID    
Severity: normal CC: andy.shevchenko, bjorn, kai.heng.feng, leho, mika.westerberg, zaid2knet, ztuowen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: v5.1-rc7 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg.log
Debug message and modifications
MTRR
intel-lpss change
mfd: intel-lpss: add quirk for Dell XPS 13 7390 2-in-1

Description AceLan Kao 2019-05-03 08:31:25 UTC
Created attachment 282597 [details]
dmesg.log

This issue has been found on loading intel-lpss-pci driver, the pdev->resource passing to its probe function is not correct. And it leads to system hangs while loading the driver.

This is the memory addresses driver got, the start and end are more like the region declared in BAR

[    0.718591] pci 0000:00:15.0: BAR 0: assigned [mem 0x4010000000-0x4010000fff 64bit]
[    0.718613] pci 0000:00:15.1: BAR 0: assigned [mem 0x4010001000-0x4010001fff 64bit]
[    0.718626] pci 0000:00:1d.0: BAR 13: assigned [io  0x4000-0x6fff]
[    0.718634] pci 0000:00:1e.0: BAR 0: assigned [mem 0x4010002000-0x4010002fff 64bit]

[  154.417550] [/home/acelan/linux/drivers/mfd/intel-lpss-pci.c:31] intel_lpss_pci_probe(): 0000:00:15.0 - mem = 0x0000000047753398, mem->start = 0x0000000010000000, mem->end = 0x0000000010000FFF 
[  154.417571] [/home/acelan/linux/drivers/mfd/intel-lpss-pci.c:31] intel_lpss_pci_probe(): 0000:00:15.1 - mem = 0x0000000047754398, mem->start = 0x0000000010001000, mem->end = 0x0000000010001FFF 
[  154.417587] [/home/acelan/linux/drivers/mfd/intel-lpss-pci.c:31] intel_lpss_pci_probe(): 0000:00:1e.0 - mem = 0x000000004775A398, mem->start = 0x0000000010002000, mem->end = 0x0000000010002FFF 

Below is the correct one I copy from 32-Bit BAR
[    3.511100] [/home/acelan/bionic/drivers/mfd/intel-lpss-pci.c:47] intel_lpss_pci_probe(): 0000:00:15.0 - mem = 0x00000000AB282380, mem->start = 0x00000000ED137000, mem->end = 0x00000000ED137FFF
Comment 1 Bjorn Helgaas 2019-05-03 12:36:50 UTC
Thanks for the report!

The addresses from your log ("mem->start = 0x0000000010000000") look like the upper 32 bits are truncated.  I don't see that printk in the upstream source so I assume it's debugging you added.

What is the 32-bit BAR you mention, and where did you get its value?  From the dmesg log, 0000:00:15.0 only has one BAR, and it is a 64-bit BAR:

  pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]
  pci 0000:00:15.0: reg 0x10: [mem 0x00000000-0x00000fff 64bit]
  pci 0000:00:15.0: BAR 0: assigned [mem 0x4010000000-0x4010000fff 64bit]

It looks like BIOS didn't assign that BAR, but Linux assigned space that appears to be valid (it's inside a host bridge window).

Please attach "sudo lspci -vvxxx" output and attach or point me to the differences between an upstream kernel and yours.
Comment 2 Mika Westerberg 2019-05-03 12:56:40 UTC
We have seen similar issue with some systems and it looks like it is the case here. The MTRR registers have this line (from dmesg):

[    0.001681]   6 base 4000000000 mask 6000000000 write-combining

Making the area for LPSS devices write-combining (assuming I'm reading it right). It was solved by a BIOS fix (since BIOS programs these AFAIK).
Comment 3 Bjorn Helgaas 2019-05-03 14:06:46 UTC
Is there anything we could do in Linux to recognize or work around this so we wouldn't require a BIOS update?

From [1], "If a device cannot tolerate memory write combining, it has been designed incorrectly," which suggests that the LPSS device might be broken in this regard.

Maybe we could have some sort of quirk that forces putting it in non-prefetchable, non-write-combining space?

[1] Conventional PCI Spec, r3.0, sec 3.2.6
Comment 4 Mika Westerberg 2019-05-03 14:45:10 UTC
I think MTRR registers are about CPU caches not how the PCI bus itself handles transactions. If the CPU cache write-combines writes to the LPSS device registers, it may certainly cause problems.

IIRC this issue does not happen in Windows because it allocates MMIO space from the host bridge window top to bottom whereas Linux does the opposite. One option is to do the same.
Comment 5 Bjorn Helgaas 2019-05-03 15:38:31 UTC
Right, MTRR is indeed about CPU caches, but from the device's point of view, all it sees is a write transaction.  It can't tell whether that transaction includes writes combined by a CPU cache or anything else in the path to the device.

Write-combining is critical for good performance of frame buffers on PCI, so I don't think the fact that MTRR configures the range to be WC is the problem. 
 The only things a BAR can tell us about its requirements are its width and prefetchability [1].  There's no way for it to say "I can't tolerate write-combining," which fits with the assertion that all PCI devices must tolerate write combining.

Changing the allocation order might accidentally avoid the problem, but it wouldn't be an actual *fix*.  As an aside (since I don't think it would be a robust fix), we did try that several years ago, but reverted it because of some regressions:

b126b4703afa ("PCI: allocate bus resources from the top down")
6db45b76eaa0 ("Revert "PCI: allocate bus resources from the top down"")
https://bugzilla.kernel.org/show_bug.cgi?id=22062
https://lore.kernel.org/lkml/20101102170335.GC24927@aftab/T/#u

[1] PCIe r4.0, sec 7.5.1.2.1
Comment 6 Mika Westerberg 2019-05-03 16:13:30 UTC
Looking at the Intel software developer's manual the MTRR write-combining attribute allows things such speculative reads, writes may be delayed and re-ordered. So yes, it is good for frame buffers but for devices such as LPSS where you depend on the order of reads and writes it is not.

The driver calls devm_ioremap() for the BAR and AFAICT it resolves to ioremap_nocache() so it expects to get uncached memory. We could try to change it to call devm_ioremap_nocache() instead but I think it still ends up doing the same thing.
Comment 7 Mika Westerberg 2019-05-03 16:24:43 UTC
AceLan, can you try to change the following line:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mfd/intel-lpss.c#n397

to call devm_ioremap_nocache() instead of devm_ioremap() and see if it makes any difference?
Comment 8 Mika Westerberg 2019-05-03 16:50:40 UTC
Forgot to say the BIOS fix I referred was to change the MTRR range to be uncachable instead of write-combining.
Comment 9 Bjorn Helgaas 2019-05-06 14:02:51 UTC
I'm sure changing the MTRR range for the LPSS BAR from write-combining to uncacheable would avoid the problem.  I'm not sure why the BIOS would program that MTRR range at all: I don't think the BIOS itself uses that range starting at 0x4000000000, and this seems to be a recent CPU that supports PAT, so the OS should be able to control the memory type of a per-page basis.

My question is whether it is legal for LPSS to rely on the stronger ordering of UC.  The PCI spec suggests to me that it is not, but I'd like to hear from a PCI spec expert.

If LPSS is designed incorrectly, a BIOS change will avoid the problem on this machine, but of course we might see the problem again on another machine.  It would be nice to work around or at least notice that so we don't have to debug it again.
Comment 10 Bjorn Helgaas 2019-05-06 17:39:20 UTC
I guess a more likely explanation is that the LPSS hardware works correctly even with write-combining, but the *driver* is missing a barrier or synchronization.
Comment 11 Andy Shevchenko 2019-05-06 21:16:59 UTC
I do not understand how CPU caches are related to PCI device programming sequence? Devices, by their nature, may require specific sequence of events (bits set or cleared or flipped or whatever), if CPU cache is enabled for the register file, the driver will not see device responses in time. It sounds to me as fundamental thing about hardware. I can't imagine anything, except framebuffer (and perhaps some similar cases), where cache has no side effects on the hardware in question. Bjorn, can you elaborate how some generic hardware, which requires some registers to be set prior to others, is supposed to handle CPU cache side effects?
Comment 12 Andy Shevchenko 2019-05-06 21:21:55 UTC
(In reply to AceLan Kao from comment #0)
> Created attachment 282597 [details]
> dmesg.log
> 
> This issue has been found on loading intel-lpss-pci driver, the
> pdev->resource passing to its probe function is not correct. And it leads to
> system hangs while loading the driver.

It's a BIOS sighting where MTRR is set to wrong value for the region in question.
Comment 13 Bjorn Helgaas 2019-05-06 22:04:16 UTC
WC memory is actually not cacheable (I'm looking at [1], probably the same as what Mika referenced in comment #6), but I think you are correct that the MTRR setting is a BIOS bug.

I mentioned the statement in the PCI spec that requires devices to tolerate memory write combining [2], but after reading it again, I think that refers to a more restrictive version of combining than the CPU WC attribute.  PCI write combining is permitted only when "the target sees the data in the same order as the original master generated it."  Writes to DWORDs 4, 3, 1 could not be combined because the target would see the writes in the order 1, 3, 4.  The CPU WC memory type *would* allow those to be combined because it doesn't require that the order of writes be preserved.

The PCI specs don't provide any way for hardware to communicate whether it can tolerate the CPU WC, so I think the implication is that PCI MMIO must be mapped UC unless the driver, which presumably *does* know what its device can tolerate, requests something else.

I wonder if there's a way for Linux to detect this sort of invalid MTRR setting and complain about it?

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, May 2011, Volume 3, Section 11.3
[2] Conventional PCI Spec, r3.0, sec 3.2.6
Comment 14 AceLan Kao 2019-05-07 07:43:44 UTC
Mika, devm_ioremap_nocache() doesn't work, it hangs immediately when loading intel-lpss-pci driver.

Bjorn, right, you remind me, I've added mb(). It hangs on 4.15 kernel without mb() while calling lo_hi_writeq().
Do you still need me to provide more info?
Comment 15 AceLan Kao 2019-05-07 07:44:19 UTC
Created attachment 282665 [details]
Debug message and modifications
Comment 16 Bjorn Helgaas 2019-05-07 13:23:56 UTC
AceLan, I don't think so, thanks.

There's a graphics device with a framebuffer at 0x4000000000, which is probably why the BIOS set that area to be WC:

  pci 0000:00:02.0: [8086:3ea0] type 00 class 0x030000
  pci 0000:00:02.0: reg 0x18: [mem 0x4000000000-0x400fffffff 64bit pref]
  efifb: framebuffer at 0x4000000000, using 32448k, total 32448k

Linux put these devices (all appear to be related to LPSS) in that same area:

  pci 0000:00:15.0: BAR 0: assigned [mem 0x4010000000-0x4010000fff 64bit]
  pci 0000:00:15.1: BAR 0: assigned [mem 0x4010001000-0x4010001fff 64bit]
  pci 0000:00:1e.0: BAR 0: assigned [mem 0x4010002000-0x4010002fff 64bit]

and I suspect they probably all have similar problems.
Comment 17 AceLan Kao 2019-05-08 02:36:48 UTC
Thanks, I understand the issue clearly now, and we'll ask ODM BIOS to do the necessary change.
Do you think we have any chance to handle this issue better in kernel?
BTW, We encountered the issue on multiple platforms, and using "pci=nobar" sometimes can workaround it.
Comment 18 Bjorn Helgaas 2019-05-08 13:56:37 UTC
"pci=nobar" (an x86-specific option) might avoid a hang, but I bet it
also keeps the related driver, e.g., lpss, from working at all.  The driver probe should fail because we zeroed out the device resources in pcibios_fixup_device_resources().  This actually looks quite dangerous because the device may still be enabled and its BARs still contain something, but we no longer know where it is, so we may assign that space to another device and cause a conflict.

I think it's quite possible that we can handle this better in the kernel.  I assume Windows works fine on these systems, so in principle, at least, we could make Linux work fine as well.

We would have to figure out whether it's safe for us to change or override those MTRR settings.  I don't know what the contract between the BIOS and the kernel is with regard to MTRRs.

If you want to pursue this, I would start by opening a new bugzilla so it's not cluttered with all the stuff in this one, collecting a list of affected systems, and attaching dmesg output and Windows AIDA64 output for a couple.  The AIDA64 output should show us what Windows did with the MTRRs.
Comment 19 Kai-Heng Feng 2019-05-13 05:15:13 UTC
Created attachment 282733 [details]
MTRR

I can't find MTRR in AIDA64.

Here's MTRR collected in WinDbg.
Comment 20 AceLan Kao 2019-06-11 02:05:22 UTC
This issue has been fixed by BIOS by changing the write-combining region to uncachable.
Comment 21 Bjorn Helgaas 2019-06-11 02:37:56 UTC
You said that you encountered this issue on several platforms (comment #17).  Are there BIOS updates for all of them?

Are any of these platforms in the field already?  I think requiring the user to update the BIOS because of this problem would be a poor solution.
Comment 22 AceLan Kao 2019-06-11 03:08:02 UTC
We asked BIOS to disable using MMIO above 4G when we encountered this issue. So, I think there is no machine with MMIO above 4G option is enabled in the field.

When this issue happens, the machine even can't boot up, so I believe there is no platform in the field has this issue.
Comment 23 Tuowen Zhao 2019-09-04 16:01:32 UTC
Created attachment 284843 [details]
intel-lpss change

Dell XPS 2-in-1 7390 suffers from this issue and can't boot up. Changing to devm_ioremap_nocache still causes boot issue similarly reported by AceLan. Using the strongly uncachable version, ioremap_uc, works, however.
Comment 24 Andy Shevchenko 2019-09-04 17:19:03 UTC
(In reply to Tuowen Zhao from comment #23)
> Created attachment 284843 [details]
> intel-lpss change
> 
> Dell XPS 2-in-1 7390 suffers from this issue and can't boot up. Changing to
> devm_ioremap_nocache still causes boot issue similarly reported by AceLan.
> Using the strongly uncachable version, ioremap_uc, works, however.

Contact with Dell and ask them to give you a fixed firmware.
I guess behind the scenes it comes to Intel or so. Here we had that bug and it was successfully fixed in the firmware (BIOS).
Comment 25 AceLan Kao 2019-09-27 02:28:28 UTC
Created attachment 285207 [details]
mfd: intel-lpss: add quirk for Dell XPS 13 7390 2-in-1

Confirmed Tuowen Zhao's patch on comment #23 works.
There are already some platforms with this write-combining MTRR on the market, we can't only rely on vendor to provide BIOS updates, user may struggle on booting up the machines, especially we have a solution to workaround it in the kernel.
How about we add a quirk to list all those machines, and apply devm_ioremap_uc() on those machines only.
Comment 26 Andy Shevchenko 2019-09-27 06:20:30 UTC
(In reply to AceLan Kao from comment #25)
> Created attachment 285207 [details]
> mfd: intel-lpss: add quirk for Dell XPS 13 7390 2-in-1
> 
> Confirmed Tuowen Zhao's patch on comment #23 works.
> There are already some platforms with this write-combining MTRR on the
> market, we can't only rely on vendor to provide BIOS updates, user may
> struggle on booting up the machines, especially we have a solution to
> workaround it in the kernel.

> How about we add a quirk to list all those machines, and apply
> devm_ioremap_uc() on those machines only.

Rather warning that there is a firmware bug, but this can be easily checked by reading mttr settings run-time, correct?

(In reply to Tuowen Zhao from comment #23)
> Created attachment 284843 [details]
> intel-lpss change
> 
> Dell XPS 2-in-1 7390 suffers from this issue and can't boot up. Changing to
> devm_ioremap_nocache still causes boot issue similarly reported by AceLan.
> Using the strongly uncachable version, ioremap_uc, works, however.

Please, submit the patch thru mailing list.
https://www.kernel.org/doc/html/latest/process/submitting-patches.html