Bug 206175 - Fedora >= 5.4 kernels instantly freeze on boot without producing any display output
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL: https://bugzilla.redhat.com/show_bug....
Keywords:
Depends on:
Blocks:
 
Reported: 2020-01-12 13:51 UTC by Artem S. Tashkinov
Modified: 2020-03-11 16:17 UTC
CC List: 8 users

See Also:
Kernel Version: >= 5.4.0-rc1
Tree: Fedora
Regression: Yes


Attachments
grub `set debug=all` + kernel debug options (581.55 KB, image/jpeg)
2020-01-12 23:50 UTC, Artem S. Tashkinov
[kernel 5.3] dmesg | grep -i dma (2.08 KB, text/plain)
2020-03-07 02:39 UTC, Artem S. Tashkinov
Fedora 31 kernel config (210.99 KB, text/plain)
2020-03-07 02:40 UTC, Artem S. Tashkinov
attachment-16112-0.html (2.08 KB, text/html)
2020-03-10 16:56 UTC, Linus Torvalds
attachment-24927-0.html (1.19 KB, text/html)
2020-03-10 17:45 UTC, Linus Torvalds
clean up platform device dma_mask handling (1.88 KB, patch)
2020-03-10 20:15 UTC, Linus Torvalds
kernel panic on boot, kernel-5.5.8-200.fc31 (638.54 KB, image/jpeg)
2020-03-11 08:29 UTC, Artem S. Tashkinov
fix dma_mask handling in platform_device_register_full.patch (1.71 KB, text/plain)
2020-03-11 16:16 UTC, Artem S. Tashkinov

Description Artem S. Tashkinov 2020-01-12 13:51:40 UTC
I'm unable to boot any Fedora x86-64 kernel >= 5.4.8, and probably 5.4.0 as well (I can't try 5.4.0 because it's not available in the Fedora repos).

Right after selecting the kernel in GRUB2 (Secure UEFI: on) I see a single message on the screen:

EFI stub: UEFI secure boot is enabled

At this point the system has frozen hard: it doesn't respond to Ctrl + Alt + Del, it doesn't respond to Caps Lock, nothing. Only holding the power button for 4 seconds shuts it down.

All the pertinent info is here: https://bugzilla.redhat.com/show_bug.cgi?id=1790115

TLDR of the existing bug report:

1) grub2 is unlikely to be at fault as rEFInd exhibits the same issue
2) rawhide 5.5 kernel does not boot
3) kernels 5.4.8 and 5.4.10 do not boot
4) 5.3 kernels (and earlier) boot
5) secure boot doesn't change anything (I've tried to disable it unsuccessfully)
6) nomodeset i915.modeset=0 options don't help
7) there are no kernel messages on boot - I wonder if I can force some sort of text mode. It's unlikely there's a serial port on this laptop.

My hardware:

lspci:

00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL810xE PCI Express Fast Ethernet controller (rev 0a)
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
00:1f.7 Non-Essential Instrumentation [1300]: Intel Corporation Device 9d26 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 08)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Skylake GT2 [HD Graphics 520] (rev 07)

CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz

product: HP Pavilion x360 Convertible (P1F10UA#ABA)
BIOS: F.28 (the latest released one)
Comment 1 Artem S. Tashkinov 2020-01-12 23:49:51 UTC
Linus,

I need help, please!

Even `efi=debug earlyprintk=efi,keep` produces no output at all. I've no idea what to do except bisect, but I don't want to go through this. Besides, it's a distro kernel, whereas with bisection I would be debugging my own kernel build.
Comment 2 Artem S. Tashkinov 2020-01-12 23:50:27 UTC
Created attachment 286775 [details]
grub `set debug=all` + kernel debug options
Comment 3 Artem S. Tashkinov 2020-01-12 23:53:28 UTC
The last four lines of GRUB2 debug log are:

script/lexer.c:321: token 0 text []
loader/efi/linux.c:82: kernel_addr: 0x1000000 handover_offset: 0x190 params: 0x7af0c000
loader/efi/linux.c:85: handover_func() = 0x1000390
_ (not blinking)

At this point the system is dead.
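
The last address in that log can be cross-checked by hand. A minimal sketch follows, assuming the x86 boot protocol rule that the 64-bit EFI handover entry lies 512 bytes (0x200) past `handover_offset`; the helper name `handover_entry` is hypothetical and is not GRUB's own code.

```c
#include <stdint.h>

/* Assumption: on x86-64 the EFI handover entry point is located
 * 512 bytes after the offset stored in the handover_offset field. */
#define EFI_HANDOVER_64BIT_OFFSET 0x200u

/* Hypothetical helper: computes the address the boot loader jumps to,
 * given the load address of the kernel and the handover_offset field. */
uint64_t handover_entry(uint64_t kernel_addr, uint32_t handover_offset)
{
    return kernel_addr + handover_offset + EFI_HANDOVER_64BIT_OFFSET;
}
```

With the values from the log, `handover_entry(0x1000000, 0x190)` gives 0x1000390, which matches the `handover_func()` line. So GRUB appears to have computed the entry point consistently, and the freeze happens after control is handed to the kernel.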
Comment 4 Linus Torvalds 2020-01-13 00:48:09 UTC
There really is nothing at all to go by.

I see no better idea than trying to bisect things. Annoying, but that's exactly what bisection is great at - bugs where you can't even begin to guess what the problem is.

If you can't see it with a normal self-built kernel, try to build with the Fedora config file.
Comment 5 Linus Torvalds 2020-01-13 00:54:13 UTC
The fact that it prints nothing at all makes me suspect it's some random but very core thing. But it's presumably very specific to your machine - some particular e820 memory layout issue or something. I'm not seeing anything particularly odd in your devices.

One thing to do is to perhaps remove the "rhgb quiet" that Fedora puts at the end of the boot line. You can do that just from the grub shell line.
Comment 6 Artem S. Tashkinov 2020-01-13 00:58:00 UTC
(In reply to Linus Torvalds from comment #5)
> The fact that it prints nothing at all makes me suspect it's some random but
> very core thing. But it's presumably very specific to your machine - some
> particular e820 memory layout issue or something. I'm not seeing anything
> particularly odd in your devices.
> 
> One thing to do is to perhaps remove the "rhgb quiet" that Fedora puts at
> the end of the boot line. You can do that just from the grub shell line.

I've tried booting with no arguments at all (not even initrd) - and with just debugging flags turned on (efi=debug earlyprintk=efi,keep), i.e.

vmlinuz-5.4.8-300.fc31.x86_64 efi=debug earlyprintk=efi,keep

- the result is always the same: an instant freeze and no messages on the screen.

I will try with various Fedora kernels first starting with 5.4-rc1. I really don't want to bisect yet.
Comment 7 Matt Yates 2020-01-14 06:31:41 UTC
I have the same problem on an Acer Aspire 5 laptop (model A515-43-R19L).  The laptop isn't completely as originally sold, because I installed a second 4GB ram stick and a second hard drive.  

Debian testing was working fine until the kernel was upgraded from the 5.3 to the 5.4 version series.  Tonight, I installed Fedora 31 that comes with a 5.3 kernel.  Fedora was working until I ran updates and the kernel was upgraded to 5.4.8.

In every case, the 5.4 kernel results in a hard freeze early in the boot process.  The 5.3 kernel works fine, and downgrading back to that kernel works.

In Debian, I tried compiling my own 5.4.11 kernel using Debian's 5.3 kernel config as a starting point.  I ran "make oldconfig" and selected all the defaults.  The self-compiled kernel had the same issue.

My lspci output:

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:01.7 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 7
02:00.0 Non-Volatile memory controller: SK hynix Device 1327
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
04:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c4)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Raven/Raven2/Fenghuang HDMI/DP Audio Controller
05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven2 USB 3.1
05:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor
05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller
06:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)
Comment 8 Artem S. Tashkinov 2020-01-14 09:48:31 UTC
(In reply to Matt Yates from comment #7)
> I have the same problem on an Acer Aspire 5 laptop (model A515-43-R19L). 
> The laptop isn't completely as originally sold, because I installed a second
> 4GB ram stick and a second hard drive.  
> 
> Debian testing was working fine until the kernel was upgraded from the 5.3
> to the 5.4 version series.  Tonight, I installed Fedora 31 that comes with a
> 5.3 kernel.  Fedora was working until I ran updates and the kernel was
> upgraded to 5.4.8.
> 
> In every case, the 5.4 kernel results in a hard freeze early in the boot
> process.  The 5.3 kernel works fine, and downgrading back to that kernel
> works.
> 
> In Debian, I tried compiling my own 5.4.11 kernel using Debian's 5.3 kernel
> config as a starting point.  I ran "make oldconfig" and selected all the
> defaults.  The self-compiled kernel had the same issue.
> 
> My lspci output:
> 

That's weird and puzzling - our laptops are completely different aside from sharing the same architecture.

What's your BIOS vendor (can be checked using sudo lshw)? Mine is American Megatrends Inc.

Also, could you check your BIOS for a TPM module? Could you try disabling it and see if it helps?
Comment 9 Matt Yates 2020-01-14 21:17:14 UTC
My BIOS vendor is "Insyde Corp.".  There is a TPM module.  When I disabled it, it caused my EFI boot entry to disappear, so I couldn't test it.

However, I think we may have two separate problems.  I switched back from Fedora to Debian Testing, and the Debian installer upgraded the kernel from the 5.3 to the 5.4 series prior to the first boot.  The 5.4 kernel booted up on first boot.  I could see boot messages scrolling, but the screen went black while trying to load lightdm because I did not have the "firmware-amd-graphics" package installed, which is required for graphics.  After installing the amd graphics package, the 5.4 kernel freezes as before (right at the start of the boot process).  The 5.3 kernel boots as normal, and graphics work.

The "firmware-amd-graphics" package (version 20190717-2) was the only thing I changed, so I guess the problem must be some sort of conflict with the amd graphics firmware and the 5.4 kernel.
Comment 10 Alex Deucher 2020-01-14 22:25:19 UTC
I have a similar picasso based raven laptop and it's working fine with F31 and the latest 5.4.8 kernel.  I think the firmware is a red herring since the driver won't load without it.  If you don't have it installed the driver won't be loaded.  Sounds more like some regression between 5.3 and 5.4 for your laptop.  It probably makes sense to open a new bug for the amdgpu issue to avoid mixing issues in this report.
Comment 11 Artem S. Tashkinov 2020-01-15 12:57:05 UTC
(In reply to Matt Yates from comment #9)
> 
> The "firmware-amd-graphics" package (version 20190717-2) was the only thing
> I changed, so I guess the problem must be some sort of conflict with the amd
> graphics firmware and the 5.4 kernel.

This is definitely a whole different issue not related to this one. You could open a separate bug report.
Comment 12 Artem S. Tashkinov 2020-01-18 14:45:21 UTC
Rafael J. Wysocki,

I see some pretty heavy changes merged with kernel after 5.4 release: https://github.com/torvalds/linux/commit/782b59711e1561ee0da06bc478ca5e8249aa8d09

Could they cause this issue?

Are there any extra boot flags that I could try to understand what's going on in my case?
Comment 13 Artem S. Tashkinov 2020-01-23 13:21:47 UTC
kernel-5.4.0-0.rc0.git1.1.fc32 boots
kernel-5.4.0-0.rc0.git7.1.fc32 fails to boot (hard freeze)

And all the kernels in between give 

Forbidden

You don't have permission to access this resource.

Which means some changes between git1 and git7 rendered my system unbootable.

Looks like I have to bisect :-(

__________________________________________________________

Again, Rafael, could you respond please?
Comment 15 Artem S. Tashkinov 2020-03-06 23:36:40 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=1790115#c44

The bad commit is between these two snapshots:

* Thu Sep 19 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git3.1
- Linux v5.3-7639-gb41dae061bbd

* Wed Sep 18 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git2.1
- Linux v5.3-3839-g35f7a9526615

Maybe kernel developers have an idea what it could be?
Comment 16 Artem S. Tashkinov 2020-03-06 23:39:00 UTC
Again, the kernel freezes VERY early during boot - as nothing is printed on the screen even with all possible debug flags. Someone surely knows what it is and I don't want to spam LKML with that.
Comment 17 Linus Torvalds 2020-03-06 23:52:33 UTC
(In reply to Artem S. Tashkinov from comment #15)
> 
> The bad commit is between these two snapshots:
> 
> * Thu Sep 19 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git3.1
> - Linux v5.3-7639-gb41dae061bbd
> 
> * Wed Sep 18 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git2.1
> - Linux v5.3-3839-g35f7a9526615

That's still 3800 commits...

Looking at my merge log in that range, we have

    Pull xfs updates from Darrick Wong
    Pull swap access updates from Darrick Wong
    Pull overlayfs fixes from Miklos Szeredi
    Pull btrfs updates from David Sterba
    Pull AFS updates from David Howells
    Pull fs-verity support from Eric Biggers
    Pull fscrypt updates from Eric Biggers
    Pull file locking updates from Jeff Layton
    Pull vfs mount API infrastructure updates from Al Viro
    Pull d_path fix from Al Viro
    Pull vfs namei updates from Al Viro
    Pull networking updates from David Miller
    Pull crypto updates from Herbert Xu
    Pull char/misc driver updates from Greg KH
    Pull staging and IIO driver updates from Greg KH
    Pull tty/serial driver updates from Greg KH
    Pull USB updates from Greg KH
    Pull driver core updates from Greg Kroah-Hartman
    Pull KVM updates from Paolo Bonzini
    Pull KVM fix from Paolo Bonzini

and none of them look at all obvious.

There's no low-level x86 boot code, for example. And the filesystem stuff shouldn't trigger early in boot.

Sure, there's some low-level driver updates that might cause it. tty in particular. But I don't see why it would impact you and nobody else.

I _really_ think you should bisect. Trust me, it's faster than whatever else you're doing.
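
A rough sketch of why bisection is fast even over 3800 commits: every build-and-boot test halves the remaining range. The code below models the history as a simple linear list, which real git history is not (merges make it a graph), so the count is only an estimate. The names `bisect_steps`, `find_first_bad` and `example_is_bad` are hypothetical, and the threshold 2741 is an arbitrary illustration, not the real position of the bad commit.

```c
/* Worst-case number of tests needed to isolate one bad commit
 * among n candidates when each test halves the remaining range. */
unsigned bisect_steps(unsigned long n)
{
    unsigned steps = 0;
    while (n > 1) {
        n = (n + 1) / 2;   /* worst case: the larger half remains */
        steps++;
    }
    return steps;
}

typedef int (*is_bad_fn)(unsigned long commit_index);

/* Binary search for the first bad commit in the range (good, bad].
 * Assumes commit `good` boots, commit `bad` does not, and every
 * commit after the first bad one is also bad. */
unsigned long find_first_bad(unsigned long good, unsigned long bad,
                             is_bad_fn is_bad, unsigned *tests)
{
    while (bad - good > 1) {
        unsigned long mid = good + (bad - good) / 2;
        if (is_bad(mid))
            bad = mid;     /* first bad commit is at mid or earlier */
        else
            good = mid;    /* first bad commit is after mid */
        if (tests)
            (*tests)++;    /* one build-and-boot cycle */
    }
    return bad;
}

/* Hypothetical stand-in for "build this commit and try to boot it". */
int example_is_bad(unsigned long commit_index)
{
    return commit_index >= 2741;
}
```

Under these assumptions `bisect_steps(3800)` is 12, so a range of roughly 3800 commits needs about a dozen build-and-boot cycles rather than thousands.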
Comment 18 Artem S. Tashkinov 2020-03-07 00:57:10 UTC
(In reply to Linus Torvalds from comment #17)
> I _really_ think you should bisect. Trust me, it's faster than whatever else
> you're doing.

Thank you Linus! I'm thinking about it.

The issue is, with my custom very simplistic .config the kernel boots just fine. It's the options Fedora enables that cause it not to boot. And their config includes pretty much everything on Earth.

To add insult to injury, Fedora's kernel is built with some patches on top of vanilla, and those patches could interact with the kernel code in strange ways, so it's not obvious what exactly I should bisect. Should I apply Fedora's patches after each bisect step before actually building the kernel?
Comment 19 Artem S. Tashkinov 2020-03-07 02:10:43 UTC
Damn. 14 reboots.

# git bisect bad
cdfee5623290bc893f595636b44fa28e8207c5b3 is the first bad commit
commit cdfee5623290bc893f595636b44fa28e8207c5b3
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Aug 16 08:24:35 2019 +0200

    driver core: initialize a default DMA mask for platform device
    
    We still treat devices without a DMA mask as defaulting to 32-bits for
    both mask, but a few releases ago we've started warning about such
    cases, as they require special cases to work around this sloppyness.
    Add a dma_mask field to struct platform_device so that we can initialize
    the dma_mask pointer in struct device and initialize both masks to
    32-bits by default, replacing similar functionality in m68k and
    powerpc.  The arch_setup_pdev_archdata hooks is now unused and removed.
    
    Note that the code looks a little odd with the various conditionals
    because we have to support platform_device structures that are
    statically allocated.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Link: https://lore.kernel.org/r/20190816062435.881-7-hch@lst.de
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 arch/m68k/kernel/dma.c               |  9 ---------
 arch/powerpc/kernel/setup-common.c   |  6 ------
 arch/sh/boards/mach-ap325rxa/setup.c |  1 -
 arch/sh/boards/mach-ecovec24/setup.c |  2 --
 arch/sh/boards/mach-kfr2r09/setup.c  |  1 -
 arch/sh/boards/mach-migor/setup.c    |  1 -
 arch/sh/boards/mach-se/7724/setup.c  |  2 --
 drivers/base/platform.c              | 37 ++++++++++++++++--------------------
 include/linux/platform_device.h      |  2 +-
 9 files changed, 17 insertions(+), 44 deletions(-)

# git bisect log
git bisect start
# good: [35f7a95266153b1cf0caca3aa9661cb721864527] Merge tag 'devprop-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 35f7a95266153b1cf0caca3aa9661cb721864527
# bad: [b41dae061bbd722b9d7fa828f35d22035b218e18] Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad b41dae061bbd722b9d7fa828f35d22035b218e18
# good: [2f2fa16e23816bded4b97117faf6e97a95ba9056] Merge branch 'devlink-unknown'
git bisect good 2f2fa16e23816bded4b97117faf6e97a95ba9056
# bad: [e6874fc29410fabfdbc8c12b467f41a16cbcfd2b] Merge tag 'staging-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad e6874fc29410fabfdbc8c12b467f41a16cbcfd2b
# good: [56a583d264b97e34d3c6009f50f113450e63a1de] Staging: exfat: Avoid use of strcpy
git bisect good 56a583d264b97e34d3c6009f50f113450e63a1de
# bad: [fb9617edf6c0e1b86a6595cd92dd3f84595221d9] Merge tag 'usb-ci-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb into usb-next
git bisect bad fb9617edf6c0e1b86a6595cd92dd3f84595221d9
# bad: [ecd55e367f3d706788632e176ec6b94e1a72a07c] usb: chipidea: msm: Use device-managed registration API
git bisect bad ecd55e367f3d706788632e176ec6b94e1a72a07c
# good: [3e2cb866b2b1497a2246c8d222cd672694ac9f15] USB: phy: mv-usb: convert platform driver to use dev_groups
git bisect good 3e2cb866b2b1497a2246c8d222cd672694ac9f15
# good: [eceddc4071e39807f080913656252e4526227647] usb: typec: fusb302: Remove unused properties
git bisect good eceddc4071e39807f080913656252e4526227647
# good: [a599e48662b4b505bef45d4831061f9d50703e17] usb: usb-skeleton: make comment block in line with coding style
git bisect good a599e48662b4b505bef45d4831061f9d50703e17
# good: [bd5defaee872da9b81e3c72045eb6794445cd2e6] dma-mapping: remove is_device_dma_capable
git bisect good bd5defaee872da9b81e3c72045eb6794445cd2e6
# bad: [58fb8beda201f00b139e8993cbda937cd9a4e603] dt-binding: usb: ci-hdrc-usb2: add imx7ulp compatible
git bisect bad 58fb8beda201f00b139e8993cbda937cd9a4e603
# bad: [cdfee5623290bc893f595636b44fa28e8207c5b3] driver core: initialize a default DMA mask for platform device
git bisect bad cdfee5623290bc893f595636b44fa28e8207c5b3
# first bad commit: [cdfee5623290bc893f595636b44fa28e8207c5b3] driver core: initialize a default DMA mask for platform device
Comment 20 Artem S. Tashkinov 2020-03-07 02:31:07 UTC
To further confirm it:

5.4 vanilla with Fedora config FAILs to boot.

5.4 vanilla with Fedora config *without* 0001-driver-core-initialize-a-default-DMA-mask-for-platform.patch (commit cdfee5623290bc893f595636b44fa28e8207c5b3) boots successfully.

Now I'm totally confused as this patch looks relatively innocuous, so I'm leaving further judgement to people who can understand what's going on. This patch might affect multiple other built-in kernel drivers which causes a boot failure.

This is what gets recompiled after reverting this patch - looks like a lot of stuff:

  UPD     include/config/kernel.release
  DESCEND  objtool
  UPD     include/generated/utsrelease.h
  CALL    scripts/atomic/check-atomics.sh
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  CC      init/version.o
  AR      init/built-in.a
  CC      kernel/sys.o
  CC      drivers/ata/libata-core.o
  CC      arch/x86/kernel/rtc.o
  CC      drivers/acpi/glue.o
  CC      drivers/base/core.o
  CC      drivers/base/platform.o
  CC      arch/x86/kernel/cpu/microcode/core.o
  CC      drivers/char/ipmi/ipmi_dmi.o
  CC      drivers/char/ipmi/ipmi_plat_data.o
  CC      drivers/clk/clk-fixed-factor.o
  CC      drivers/acpi/dock.o
  CC      drivers/char/tpm/tpm-chip.o
  CC      drivers/char/tpm/tpm-dev-common.o
  AR      drivers/char/ipmi/built-in.a
  CC      drivers/clk/clk-fixed-rate.o
  CC      lib/genalloc.o
  AR      arch/x86/kernel/cpu/microcode/built-in.a
  AR      arch/x86/kernel/cpu/built-in.a
  CC      arch/x86/kernel/early-quirks.o
  CC      drivers/acpi/acpi_lpss.o
  CC      drivers/acpi/acpi_apd.o
  CC      drivers/char/tpm/tpm-dev.o
  CC      drivers/char/tpm/tpm-interface.o
  CC      drivers/char/tpm/tpm1-cmd.o
  CC      drivers/char/tpm/tpm2-cmd.o
  CC      drivers/clk/clk-gpio.o
  CC      drivers/char/tpm/tpmrm-dev.o
  CC      drivers/char/tpm/tpm2-space.o
  CC      drivers/base/firmware_loader/main.o
  CC      drivers/acpi/acpi_platform.o
  CC      drivers/char/tpm/tpm-sysfs.o
  CC      arch/x86/kernel/early_printk.o
  CC      drivers/clk/x86/clk-pmc-atom.o
  CC      drivers/char/tpm/eventlog/common.o
  CC      drivers/char/tpm/eventlog/tpm1.o
  AR      lib/built-in.a
  CC      drivers/char/tpm/eventlog/tpm2.o
  CC      drivers/char/tpm/tpm_ppi.o
  CC      drivers/base/power/domain.o
  CC      kernel/dma/mapping.o
  CC      drivers/char/tpm/eventlog/acpi.o
  CC      drivers/clk/x86/clk-st.o
  CC      drivers/clk/x86/clk-lpt.o
  AR      drivers/base/firmware_loader/built-in.a
  CC      drivers/acpi/acpi_watchdog.o
  CC      drivers/char/tpm/eventlog/efi.o
  CC      drivers/char/tpm/tpm_tis_core.o
  CC      drivers/char/tpm/tpm_tis.o
  CC      drivers/char/tpm/tpm_crb.o
  CC      arch/x86/kernel/pmem.o
  CC      arch/x86/kernel/pcspeaker.o
  AR      kernel/dma/built-in.a
  CC      kernel/power/qos.o
  AR      drivers/clk/x86/built-in.a
  CC      drivers/acpi/apei/hest.o
  AR      drivers/clk/built-in.a
  CC      drivers/devfreq/devfreq.o
  AR      drivers/ata/built-in.a
  CC      drivers/acpi/apei/ghes.o
  CC      drivers/acpi/ac.o
  CC      drivers/dma/dmaengine.o
  CC      arch/x86/kernel/sysfb.o
  CC      kernel/time/alarmtimer.o
  CC      drivers/acpi/fan.o
  AR      drivers/base/power/built-in.a
  AR      drivers/base/built-in.a
  CC      kernel/trace/trace.o
  CC      drivers/edac/edac_mc.o
  CC      drivers/edac/edac_device.o
  CC      drivers/edac/edac_mc_sysfs.o
  AR      drivers/char/tpm/built-in.a
  CC      drivers/edac/edac_module.o
  AR      arch/x86/kernel/built-in.a
  AR      drivers/char/built-in.a
  AR      arch/x86/built-in.a
  CC      drivers/gpio/gpio-crystalcove.o
  CC      drivers/gpu/drm/drm_mipi_dsi.o
  CC      drivers/firmware/efi/efi.o
  CC      drivers/acpi/pmic/intel_pmic_crc.o
  AR      kernel/power/built-in.a
  CC      drivers/gpio/gpio-tps68470.o
  CC      drivers/gpio/gpio-wcove.o
  AR      drivers/dma/built-in.a
  AR      drivers/acpi/apei/built-in.a
  AR      drivers/devfreq/built-in.a
  CC      drivers/acpi/pmic/intel_pmic_xpower.o
  CC      kernel/module.o
  CC      drivers/edac/edac_device_sysfs.o
  CC      drivers/edac/wq.o
  CC      drivers/edac/edac_pci.o
  AR      kernel/time/built-in.a
  CC      drivers/edac/edac_pci_sysfs.o
  CC      drivers/edac/ghes_edac.o
  CC      drivers/i2c/i2c-core-base.o
  CC      drivers/acpi/pmic/intel_pmic_bxtwc.o
  CC      drivers/acpi/pmic/intel_pmic_chtwc.o
  CC      drivers/acpi/pmic/intel_pmic_chtdc_ti.o
  AR      drivers/gpio/built-in.a
  CC      drivers/i2c/busses/i2c-designware-platdrv.o
  AR      drivers/firmware/efi/built-in.a
  AR      drivers/firmware/built-in.a
  AR      drivers/acpi/built-in.a
  CC      drivers/input/serio/i8042.o
  AR      drivers/gpu/drm/built-in.a
  AR      drivers/gpu/built-in.a
  CC      drivers/iommu/amd_iommu.o
  CC      drivers/mailbox/pcc.o
  CC      drivers/input/mouse/elantech.o
  CC      drivers/media/cec/cec-notifier.o
  AR      drivers/edac/built-in.a
  CC      drivers/mfd/tps68470.o
  CC      drivers/mfd/mfd-core.o
  CC      drivers/mfd/axp20x.o
  AR      drivers/i2c/busses/built-in.a
  CC      drivers/mfd/intel_soc_pmic_crc.o
  CC      drivers/mfd/intel_soc_pmic_core.o
  AR      drivers/mailbox/built-in.a
  CC      drivers/mfd/intel_soc_pmic_bxtwc.o
  CC      drivers/mfd/intel_soc_pmic_chtwc.o
  AR      drivers/media/cec/built-in.a
  AR      drivers/media/built-in.a
  AR      drivers/input/serio/built-in.a
  CC      drivers/net/phy/mdio_bus.o
  CC      drivers/pci/probe.o
  AR      drivers/i2c/built-in.a
  AR      drivers/input/mouse/built-in.a
  AR      drivers/input/built-in.a
  AR      drivers/mfd/built-in.a
  CC      drivers/pci/pci-driver.o
  CC      drivers/pinctrl/intel/pinctrl-baytrail.o
  CC      drivers/pinctrl/intel/pinctrl-cherryview.o
  CC      drivers/platform/x86/intel_pmc_ipc.o
  CC      drivers/platform/x86/intel_pmc_core.o
  CC      drivers/platform/x86/intel_pmc_core_pltdrv.o
  CC      drivers/platform/x86/pmc_atom.o
  CC      drivers/net/phy/fixed_phy.o
  CC      drivers/pwm/pwm-crc.o
  CC      drivers/regulator/dummy.o
  CC      drivers/regulator/fixed-helper.o
  AR      kernel/trace/built-in.a
  AR      drivers/pwm/built-in.a
  AR      kernel/built-in.a
  CC      drivers/rtc/rtc-cmos.o
  AR      drivers/iommu/built-in.a
  CC      drivers/spi/spi.o
  AR      drivers/pinctrl/intel/built-in.a
  AR      drivers/pinctrl/built-in.a
  AR      drivers/platform/x86/built-in.a
  AR      drivers/platform/built-in.a
  CC      drivers/usb/common/common.o
  AR      drivers/net/phy/built-in.a
  AR      drivers/net/built-in.a
  CC      drivers/tty/serdev/core.o
  AR      drivers/regulator/built-in.a
  CC      drivers/usb/core/hcd.o
  CC      drivers/video/fbdev/vesafb.o
  CC      drivers/video/fbdev/efifb.o
  CC      drivers/usb/host/xhci-ext-caps.o
  CC      drivers/tty/serial/8250/8250_core.o
  CC      drivers/tty/serial/8250/8250_pnp.o
  CC      drivers/scsi/hosts.o
  CC      drivers/tty/serial/8250/8250_port.o
  CC      drivers/tty/serial/8250/8250_dma.o
  AR      drivers/usb/common/built-in.a
  CC      drivers/tty/serial/8250/8250_dwlib.o
  AR      drivers/rtc/built-in.a
  AR      drivers/pci/built-in.a
  CC      drivers/tty/serial/8250/8250_pci.o
  CC      drivers/tty/serial/8250/8250_early.o
  CC      drivers/tty/serial/8250/8250_dw.o
  CC      drivers/tty/serial/8250/8250_mid.o
  AR      drivers/tty/serdev/built-in.a
  AR      drivers/usb/host/built-in.a
  AR      drivers/scsi/built-in.a
  AR      drivers/video/fbdev/built-in.a
  AR      drivers/video/built-in.a
  AR      drivers/usb/core/built-in.a
  AR      drivers/usb/built-in.a
  AR      drivers/tty/serial/8250/built-in.a
  AR      drivers/tty/serial/built-in.a
  AR      drivers/tty/built-in.a
  AR      drivers/spi/built-in.a
  AR      drivers/built-in.a
  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  AR      init/built-in.a
  LD      vmlinux.o
  MODPOST vmlinux.o
  MODINFO modules.builtin.modinfo
  LD      .tmp_vmlinux1
  KSYM    .tmp_kallsyms1.o
  LD      .tmp_vmlinux2
  KSYM    .tmp_kallsyms2.o
  LD      vmlinux
  SORTEX  vmlinux
  SYSMAP  System.map
  TEST    posttest
Comment 21 Artem S. Tashkinov 2020-03-07 02:39:31 UTC
Created attachment 287811 [details]
[kernel 5.3] dmesg | grep -i dma
Comment 22 Artem S. Tashkinov 2020-03-07 02:40:16 UTC
Created attachment 287813 [details]
Fedora 31 kernel config
Comment 23 Artem S. Tashkinov 2020-03-07 02:41:13 UTC
As of kernel 5.6-rc4-git the bug is still present.
Comment 24 Linus Torvalds 2020-03-07 02:47:46 UTC
Just bringing in more people.

Christoph, I'm not seeing where we initialize the new 'dma_mask' in all cases.

So it looks like what happens is:

 - platform_device_alloc() uses kzalloc() to allocate the platform object.

 - it then calls setup_pdev_dma_masks, which has

        if (!pdev->dma_mask)
                pdev->dma_mask = DMA_BIT_MASK(32);

   so now it's set to a 32-bit mask.

 - nothing else ever seems to set it to anything else.

So the code looks a bit odd. What's going on? The logic of this all
entirely escapes me.

That said, I don't see why it would break anything either.
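
The allocation path described above can be modelled in userspace as a minimal sketch. The structures are heavily simplified stand-ins for the kernel's `struct device` and `struct platform_device`, `calloc()` stands in for `kzalloc()`, and `platform_device_alloc_model` is a hypothetical name. Only the `pdev->dma_mask` check is quoted in this thread; the coherent-mask and pointer assignments are an assumption based on the commit message ("initialize the dma_mask pointer in struct device and initialize both masks to 32-bits by default").

```c
#include <stdint.h>
#include <stdlib.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Simplified stand-ins: only the fields relevant to this discussion. */
struct device {
    uint64_t *dma_mask;          /* pointer, as in the real struct device */
    uint64_t coherent_dma_mask;
};

struct platform_device {
    struct device dev;
    uint64_t dma_mask;           /* field introduced by the bisected commit */
};

/* Fill in 32-bit defaults for anything still zero (assumed logic). */
void setup_pdev_dma_masks(struct platform_device *pdev)
{
    if (!pdev->dev.coherent_dma_mask)
        pdev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
    if (!pdev->dma_mask)
        pdev->dma_mask = DMA_BIT_MASK(32);
    if (!pdev->dev.dma_mask)
        pdev->dev.dma_mask = &pdev->dma_mask;
}

/* Model of the allocation path: zeroed memory, then default masks. */
struct platform_device *platform_device_alloc_model(void)
{
    struct platform_device *pdev = calloc(1, sizeof(*pdev));
    if (pdev)
        setup_pdev_dma_masks(pdev);
    return pdev;
}
```

In this model a freshly allocated device always ends up with both masks at 0xffffffff and `dev.dma_mask` pointing at the embedded field, which is consistent with the observation that nothing on this path ever sets the mask to another value.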

                    Linus

On Fri, Mar 6, 2020 at 8:31 PM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=206175
>
> --- Comment #20 from Artem S. Tashkinov (aros@gmx.com) ---
> To further confirm it:
>
> 5.4 vanilla with Fedora config FAILs to boot.
>
> 5.4 vanilla with Fedora config *without*
> 0001-driver-core-initialize-a-default-DMA-mask-for-platform.patch (commit
> cdfee5623290bc893f595636b44fa28e8207c5b3) boots successfully.
>
> Now I'm totally confused as this patch looks relatively innocuous, so I'm
> leaving further judgement to people who can understand what's going on. This
> patch might affect multiple other built-in kernel drivers which causes a boot
> failure.
>
Comment 25 Christoph Hellwig 2020-03-07 14:16:26 UTC
On Fri, Mar 06, 2020 at 08:47:24PM -0600, Linus Torvalds wrote:
> Just bringing in more people.
> 
> Christoph, I'm not seeing where we initialize the new 'dma_mask' in all
> cases.
> 
> So it looks like what happens is:
> 
>  - platform_device_alloc() uses kzalloc() to allocate the platform object.
> 
>  - it then calls setup_pdev_dma_masks, which has
> 
>         if (!pdev->dma_mask)
>                 pdev->dma_mask = DMA_BIT_MASK(32);
> 
>    so now it's set to a 32-bit mask.
> 
>  - nothing else ever seems to set it to anything else.
> 
> So the code looks a bit odd. What's going on? The logic of this all
> entirely escapes me.

In addition to modern dynamically allocated platform devices, various
architectures (most notably arm and sh, but also some weirdo x86 code)
allocate platform_devices in .bss, and some of them initialize the
dma mask (or at least did when I added this code).

What platform doesn't boot here?  arm, x86, something else?  What board?
Comment 26 Artem S. Tashkinov 2020-03-07 14:25:07 UTC
(In reply to Christoph Hellwig from comment #25)
> 
> What platform doesn't boot here?  arm, x86, something else?  What board?

All the information is provided in the bug report. It's a bog standard x86-64 UEFI laptop with a Skylake CPU.
Comment 27 Hans de Goede 2020-03-07 14:27:30 UTC
Artem is talking about this Fedora bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1790115

Let me copy and paste some detailed hwinfo from that bugreport:

lspci:

00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL810xE PCI Express Fast Ethernet controller (rev 0a)
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
00:1f.7 Non-Essential Instrumentation [1300]: Intel Corporation Device 9d26 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 08)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Skylake GT2 [HD Graphics 520] (rev 07)

CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz

product: HP Pavilion x360 Convertible (P1F10UA#ABA)
BIOS: F.28 (the latest released one)
Comment 28 Artem S. Tashkinov 2020-03-10 15:08:23 UTC
(In reply to Christoph Hellwig from comment #25)
> 
> What platform doesn't boot here?  arm, x86, something else?  What board?

Any updates on this one? All the requested information has been provided and I can give you SSH access to my laptop if necessary.
Comment 29 Linus Torvalds 2020-03-10 16:02:19 UTC
On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> Any updates on this one? All the requested information has been provided and
> I
> can give you SSH access to my laptop if necessary.

I'm inclined to just revert that commit.

It seems to still revert cleanly, and I don't see a single use of the
new "dma_mask" in "struct platform_device" apart from the
initialization done in that commit. The only use of that new dma_mask
field seems to literally be this:

        if (!pdev->dma_mask)
                pdev->dma_mask = DMA_BIT_MASK(32);
        if (!pdev->dev.dma_mask)
                pdev->dev.dma_mask = &pdev->dma_mask;

which is entirely nonsensical to begin with. It might make more sense
if the code did something more along the lines of:

        if (!pdev->dev.dma_mask) {
                pdev->dma_mask = DMA_BIT_MASK(32);
                pdev->dev.dma_mask = &pdev->dma_mask;
        }

and we'd make it clear that the "platform_device->dma_mask" thing is
purely a local allocation. It should be renamed to indicate that.

Btw, in platform_device_register_full(), we have this:

        if (pdevinfo->dma_mask) {
                /*
                 * This memory isn't freed when the device is put,
                 * I don't have a nice idea for that though.  Conceptually
                 * dma_mask in struct device should not be a pointer.
                 * See http://thread.gmane.org/gmane.linux.kernel.pci/9081
                 */
                pdev->dev.dma_mask =
                        kmalloc(sizeof(*pdev->dev.dma_mask), GFP_KERNEL);
                if (!pdev->dev.dma_mask)
                        goto err;

                kmemleak_ignore(pdev->dev.dma_mask);

which looks like more garbage. It _would_ have made sense to actually
use the new 'dma_mask' field here, and make this all be

        if (pdevinfo->dma_mask) {
                pdev->dma_mask = pdevinfo->dma_mask;
                pdev->dev.dma_mask = &pdev->dma_mask;
        }

and then we'd have a *sensible* use of that pdev field and we'd have
simplified the code and avoided that memory leak, but that's not how
the code worked.

I do have one question: since the minimal config works for you and
it's just the Fedora one that does not, could you try to see what
modules get loaded with the Fedora one that you might be lacking?

In fact, what happens if you do this:

 - build with the fedora config with the revert in place (so that you
get a working kernel)

 - boot into that kernel

 - then do a "make localmodconfig" to generate a minimal config based
on what modules you have loaded

 - see how that differs from your manually generated minimal config?

Anyway, I do think that that commit that you bisected down to is
nonsensical. It does too much, or too little, and using a name like
"dma_mask" is much too generic for that one special case and makes it
impossible to grep for it.

If it was called "platform_dma_mask", unconditionally initialized to
32 bits at alloc time, and then the 'pdevinfo->dma_mask' thing would
use it too, then it might be greppable, sensible, and useful. As it
is, the thing adds no value and clearly causes problems, which is why
I'd be inclined to revert it..

Christoph?

              Linus
Comment 30 Christoph Hellwig 2020-03-10 16:23:46 UTC
On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> >
> > Any updates on this one? All the requested information has been provided
> and I
> > can give you SSH access to my laptop if necessary.

Whoever sent that mail should reply to my last request..

> I'm inclined to just revert that commit.

Please don't.  This area is a complete mess (see more below), and we
need to make gradual steps to sort this crap out.  And this was one
important step.

> It seems to still revert cleanly, and I don't see a single use of the
> new "dma_mask" in "struct platform_device" apart from the
> initialization done in that commit. The only use of that new dma_mask
> field seems to literally be this:
> 
>         if (!pdev->dma_mask)
>                 pdev->dma_mask = DMA_BIT_MASK(32);
>         if (!pdev->dev.dma_mask)
>                 pdev->dev.dma_mask = &pdev->dma_mask;
> 
> which is entirely nonsensical to begin with. It might make more sense
> if the code did something more along the lines of:
> 
>         if (!pdev->dev.dma_mask) {
>                 pdev->dma_mask = DMA_BIT_MASK(32);
>                 pdev->dev.dma_mask = &pdev->dma_mask;
>         }
> 
> and we'd make it clear that the "platform_device->dma_mask" thing is
> purely a local allocation. It should be renamed to indicate that.

pdev->dev.dma_mask is a pointer that needs to point to a real DMA mask,
which was a really really bad decision back when it was made, and
something I plan to fix soon.  So if we want a working dma mask we
need to make that point to something.  And since 5.4 we no longer
silently treat a NULL pointer as a 32-bit DMA mask, so it needs to
be initialized for DMA to work.  We have a bunch of drivers initializing
pdev->dev.dma_mask, often to things like static variables, which we
need to clean up next, and allowing the dma_mask in the platform_device
to be set directly should help with that.  That being said, if your
suggested change fixes the reporter's problem (which would surprise me)
I'm all for it for now.
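
A rough userspace model of the NULL-pointer point made above. This is only a sketch of the described behavior, not the kernel's DMA code: model_device, model_effective_mask_legacy and model_dma_usable are hypothetical names, and the real checks live in the DMA mapping layer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for struct device: the mask is a pointer. */
struct model_device {
	uint64_t *dma_mask;
};

/* Older behavior as described: a NULL pointer is read as a 32-bit mask. */
uint64_t model_effective_mask_legacy(const struct model_device *dev)
{
	return dev->dma_mask ? *dev->dma_mask : DMA_BIT_MASK(32);
}

/* Behavior since 5.4 as described: without a mask pointer, no DMA. */
bool model_dma_usable(const struct model_device *dev)
{
	return dev->dma_mask != NULL;
}
```

Under this model, a device whose mask pointer was never set used to behave like a 32-bit DMA device and now cannot do DMA at all, which is why something has to make the pointer point at real storage.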

> 
> Btw, in platform_device_register_full(), we have this:
> 
>         if (pdevinfo->dma_mask) {
>                 /*
>                  * This memory isn't freed when the device is put,
>                  * I don't have a nice idea for that though.  Conceptually
>                  * dma_mask in struct device should not be a pointer.
>                  * See http://thread.gmane.org/gmane.linux.kernel.pci/9081
>                  */
>                 pdev->dev.dma_mask =
>                         kmalloc(sizeof(*pdev->dev.dma_mask), GFP_KERNEL);
>                 if (!pdev->dev.dma_mask)
>                         goto err;
> 
>                 kmemleak_ignore(pdev->dev.dma_mask);
> 
> which looks like more garbage. It _would_ have made sense to actually
> use the new 'dma_mask' field here, and make this all be
> 
>         if (pdevinfo->dma_mask) {
>                 pdev->dma_mask = pdevinfo->dma_mask;
>                 pdev->dev.dma_mask = &pdev->dma_mask;
>         }
> 
> and then we'd have a *sensible* use of that pdev field and we'd have
> simplified the code and avoided that memory leak, but that's not how
> the code worked.
> 

Yes, but that whole platform_device_register_full mess predates this
change and is just an awful hack used by a tiny minority of the drivers
using platform devices.  Both the DMA mask code and platform devices are
full of messes like this, which slowly need to be sorted out.

> If it was called "platform_dma_mask", unconditionally initialized to
> 32 bits at alloc time, and then the 'pdevinfo->dma_mask' thing would
> use it too, then it might be greppable, sensible, and useful. As it
> is, the thing adds no value and clearly causes problems, which is why
> I'd be inclined to revert it..

Reverting it will break DMA for most platform device users.
Comment 31 Artem S. Tashkinov 2020-03-10 16:30:10 UTC
(In reply to Christoph Hellwig from comment #30)
> 
> Whoever sent that mail should reply to my last request..

I don't know whether you're serious or not, but the kernel bug report has all the data, and Hans de Goede emailed it to you again three days ago just in case. It's worth remembering that this is the kernel bugzilla, not just a mailing list; you might want to check it once in a while.

Anyways, here it is for the third time:

lspci:

00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL810xE PCI Express Fast Ethernet controller (rev 0a)
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
00:1f.7 Non-Essential Instrumentation [1300]: Intel Corporation Device 9d26 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 08)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Skylake GT2 [HD Graphics 520] (rev 07)

CPU: Intel Core i5-6200U
Product: HP Pavilion x360 Convertible (P1F10UA#ABA)
BIOS: F.28 (the latest released one)
Comment 32 Linus Torvalds 2020-03-10 16:56:54 UTC
Created attachment 287855 [details]
attachment-16112-0.html

On Tue, Mar 10, 2020, 09:23 Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> > On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> > >
> > > Any updates on this one? All the requested information has been
> provided and I
> > > can give you SSH access to my laptop if necessary.
>
> Whoever sent that mail should reply to my last request..
>

They did. Days ago. It's all there in the bugzilla.

> I'm inclined to just revert that commit.
>
> Please don't.  This area is a complete mess (see more below), and we
> need to make gradual steps to sort this crap out.  And this was one
> important step.
>

Christoph, it's broken. "An important step" is completely irrelevant if it
causes regressions.

It's that simple. It needs to get reverted or fixed, and you can't just
ignore the regression and continue to ask for information that was already
there before you asked the first time.

If you're not taking this regression seriously, then reverting is
happening today.

      Linus
Comment 33 Christoph Hellwig 2020-03-10 17:05:39 UTC
On Tue, Mar 10, 2020 at 09:56:38AM -0700, Linus Torvalds wrote:
> On Tue, Mar 10, 2020, 09:23 Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> > > On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org>
> > wrote:
> > > >
> > > > Any updates on this one? All the requested information has been
> > provided and I
> > > > can give you SSH access to my laptop if necessary.
> >
> > Whoever sent that mail should reply to my last request..
> >
> 
> They did. Days ago. It's all there in the bugzilla.

I never got it.  I've also not seen any report on any list.  I can
try to dig into the freaking bugzilla, but that is not how bug reports
normally work.

> > I'm inclined to just revert that commit.
> >
> > Please don't.  This area is a complete mess (see more below), and we
> > need to make gradual steps to sort this crap out.  And this was one
> > important step.
> >
> 
> Christoph, it's broken. "An important step" is completely irrelevant if it
> causes regressions
> 
> It's that simple. It needs to get reverted or fixed, and you can't just
> ignore the regression and continue to ask for information that was already
> there before you asked the first time.
> 
> If you're not taking this regression seriously, then reverting is
> happening today.

I _am_ taking this seriously.  But if I sent a mail on Saturday and the
first answer that actually comes back to me is you complaining on Tuesday,
there isn't much I can do.  And as I said, just reverting this commit
will break things left, right and center, as lots of incremental changes
depend on it.  So I think we need to track this down and fix it for real,
and a submitter who actually answers email would really help with
that.

Comment 34 Artem S. Tashkinov 2020-03-10 17:23:04 UTC
(In reply to Christoph Hellwig from comment #33)
> On Tue, Mar 10, 2020 at 09:56:38AM -0700, Linus Torvalds wrote:
> > On Tue, Mar 10, 2020, 09:23 Christoph Hellwig <hch@lst.de> wrote:
> > 
> > > On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> > > > On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org>
> > > wrote:
> > > > >
> > > > > Any updates on this one? All the requested information has been
> > > provided and I
> > > > > can give you SSH access to my laptop if necessary.
> > >
> > > Whoever sent that mail should reply to my last request..
> > >
> > 
> > They did. Days ago. It's all there in the bugzilla.
> 
> I never got it.  I've also not seen any report on any list.  I can
> try to dig into the freaking bugzilla, but that is not how bug reports
> normally work.

I'm deeply sorry, but Hans de Goede sent my hardware config to you at 2020-03-07 14:27:30 UTC, right after you asked for it. Anyway, let's not dig deeper into this; it makes no sense whatsoever. Again, my platform is x86-64.

> 
> > > I'm inclined to just revert that commit.
> > >
> > > Please don't.  This area is a complete mess (see more below), and we
> > > need to make gradual steps to sort this crap out.  And this was one
> > > important step.
> > >
> > 
> > Christoph, it's broken. "An important step" is completely irrelevant if it
> > causes regressions
> > 
> > It's that simple. It needs to get reverted or fixed, and you can't just
> > ignore the regression and continue to ask for information that was already
> > there before you asked the first time.
> > 
> > If you're not taking this regression seriously, then reverting is
> > happening today.
> 
> I _am_ taking this seriously.  But if I sent a mail on Saturday and the
> first answer that actually comes back to me is you complaining on Tuesday,
> there isn't much I can do.  And as I said, just reverting this commit
> will break things left, right and center, as lots of incremental changes
> depend on it.  So I think we need to track this down and fix it for real,
> and a submitter who actually answers email would really help with
> that.
> 

I'm on it and I'm ready to test any patches if you have them. I guess the patch itself doesn't really break things, but some drivers that depend on it do, and these drivers get enabled very early during boot, since I'm unable to get any output from the kernel at all even with

efi=debug earlyprintk=efi,keep

options.
Comment 35 Linus Torvalds 2020-03-10 17:45:01 UTC
Created attachment 287857 [details]
attachment-24927-0.html

On Tue, Mar 10, 2020, 10:23 <bugzilla-daemon@bugzilla.kernel.org> wrote:

> >
> > I _am_ taking this seriously.  But if I sent a mail on Saturday and the
> > first answer that actually comes back to me is you complaining on Tuesday,
> > there isn't much I can do.


You could just have read the email you got on Saturday.

Seriously - it had all the information you asked for, and more. You just
didn't bother to look.

Then people followed up today, because nothing was happening. You had asked
for stuff that was right there in the bugzilla.

Do you really wonder that I then think you're not taking this seriously?

     Linus
Comment 36 Arvind Sankar 2020-03-10 19:02:01 UTC
(In reply to Artem S. Tashkinov from comment #34)

> 
> I'm on it and I'm ready to test any patches if you have them. I guess the
> patch itself doesn't really break things but some drivers which depend on it
> do and these drivers get enabled very early on boot since I'm unable to get
> any output from the kernel at all even with
> 
> efi=debug earlyprintk=efi,keep
> 
> options.

Not sure if this will help produce any output, but note that earlyprintk=efi was removed in kernel v5.1. The equivalent of earlyprintk=efi,keep is now "earlycon=efifb keep_bootcon".
Comment 37 Linus Torvalds 2020-03-10 20:15:12 UTC
Created attachment 287859 [details]
clean up platform device dma_mask handling

I don't think this should make any real difference, but at least I personally find it easier to see what the dma_mask logic is, and there's more of a point to the new platform-device dma mask field.

This basically does three small changes:

 (1) rename the field from "dma_mask" to "platform_dma_mask". 

 (2) use the new field instead of the dynamic allocation for the platform_device_register_full() use case

 (3) every initialization of the field is matched with "use this field by making the device dma_mask pointer point to it".

Honestly, (1) is entirely syntactic, and comes from the field not having been greppable at all.

(2) shouldn't matter, but avoids a pointless allocation of a single word when we have this field.

And (3) shouldn't matter either, but makes the logic much more obvious: instead of randomly initializing the field even when it doesn't get used, we now clearly link initialization and use of the field for the actual dma_mask pointer.
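
The three changes can be sketched in a small userspace model. This is based only on the description above, not on the attached patch, which remains authoritative. The struct and function names (model_device, model_pdev, model_setup_pdev_dma_masks, model_apply_pdevinfo_mask) are hypothetical; only the field name platform_dma_mask is taken from point (1).

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for struct device: the mask is a pointer. */
struct model_device {
	uint64_t *dma_mask;
};

/* Hypothetical stand-in for struct platform_device after change (1). */
struct model_pdev {
	struct model_device dev;
	uint64_t platform_dma_mask;	/* renamed, greppable field */
};

/* Change (3): the field is only initialized when the pointer uses it. */
void model_setup_pdev_dma_masks(struct model_pdev *pdev)
{
	if (!pdev->dev.dma_mask) {
		pdev->platform_dma_mask = DMA_BIT_MASK(32);
		pdev->dev.dma_mask = &pdev->platform_dma_mask;
	}
}

/* Change (2): the register_full path reuses the field, no allocation. */
void model_apply_pdevinfo_mask(struct model_pdev *pdev, uint64_t info_mask)
{
	if (info_mask) {
		pdev->platform_dma_mask = info_mask;
		pdev->dev.dma_mask = &pdev->platform_dma_mask;
	}
}
```

In the model, every write to platform_dma_mask is paired with pointing dev.dma_mask at it, and a zero info_mask leaves the default untouched.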
Comment 38 Linus Torvalds 2020-03-10 20:41:56 UTC
On Tue, Mar 10, 2020 at 10:05 AM Christoph Hellwig <hch@lst.de> wrote:
>
> > They did. Days ago. It's all there in the bugzilla.
>
> I never got it.

The email I added you to the cc on had a very clear link to the
bugzilla, which had all the information. So all you needed to do was
to look at the entry that was pointed at.

And no, you don't seem to get the actual emails that bugzilla sends
out, despite bugzilla showing you as being part of the cc list.

That is probably because you need to do that "add to the cc" yourself,
so that people can't use bugzilla as a way to spam random people.

Anyway, I've uploaded a patch for Artem to try to the bugzilla. I
seriously doubt it makes any difference at all. But I don't see how
the dma_mask change would make any difference in the first place, so
at least clarifying the code and the logic of when that dma_mask field
is initialized shouldn't hurt.

I do note that before that commit cdfee5623290 ("driver core:
initialize a default DMA mask for platform device"), only powerpc and
m68k had that "arch_setup_pdev_archdata()" function that does that
"default 32-bit dma mask".

On x86, we have mainly ACPI, which will do that

        if (acpi_dma_supported(adev))
                pdevinfo.dma_mask = DMA_BIT_MASK(32);
        else
                pdevinfo.dma_mask = 0;

        pdev = platform_device_register_full(&pdevinfo);

which basically should be a no-op - 0 means that
platform_device_register_full() doesn't do anything and we use the
default value, and DMA_BIT_MASK(32) _is_ that default value.

Did x86 perhaps use to have some "no dma_mask means that it's the
full 64 bits" rule? Because that's one of the effects of that commit that
Artem bisected things to: it will now always have a dma_mask pointer,
and it will default to that 32-bit value if it didn't have one before.

            Linus
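
The "basically a no-op" reasoning above can be checked with a small userspace model of the pre-cleanup flow, assuming the default is set up first and a non-zero pdevinfo mask is then copied into a separately allocated word. All names here (model_device, model_pdev, model_acpi_info_mask, model_register_full) are hypothetical, and malloc() stands in for kmalloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-ins for struct device and struct platform_device. */
struct model_device {
	uint64_t *dma_mask;
};

struct model_pdev {
	struct model_device dev;
	uint64_t dma_mask;
};

/* Models the ACPI caller: 32-bit mask if DMA is supported, else 0. */
uint64_t model_acpi_info_mask(int dma_supported)
{
	return dma_supported ? DMA_BIT_MASK(32) : 0;
}

/* Models the register_full flow; returns 0 on success, -1 on failure. */
int model_register_full(struct model_pdev *pdev, uint64_t info_mask)
{
	/* Default installed at allocation time. */
	pdev->dma_mask = DMA_BIT_MASK(32);
	pdev->dev.dma_mask = &pdev->dma_mask;

	/* A non-zero pdevinfo mask gets its own allocated word. */
	if (info_mask) {
		uint64_t *mask = malloc(sizeof(*mask));

		if (!mask)
			return -1;
		*mask = info_mask;
		pdev->dev.dma_mask = mask;
	}
	return 0;
}
```

Both ACPI outcomes leave the device with an effective 32-bit mask in this model; the only difference is whether the pointer refers to the embedded field or to the extra allocation.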
Comment 39 Linus Torvalds 2020-03-10 20:46:14 UTC
Artem,

On Tue, Mar 10, 2020 at 9:01 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I do have one question: the minimal config that works for you, and
> it's just the Fedora one that does not, could you try to see what
> modules get loaded with the Fedora one, that you might be lacking?

In fact, maybe the easiest thing to do would be to use
"config-bisect.pl" to figure out what config option it is that breaks
things for you.

See

    tools/testing/ktest/config-bisect.pl

which has a "this is how to use it" description in the comments at the top...

            Linus
Comment 40 Arvind Sankar 2020-03-10 21:15:24 UTC
(In reply to Linus Torvalds from comment #38)
> On Tue, Mar 10, 2020 at 10:05 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > > They did. Days ago. It's all there in the bugzilla.
> >
> > I never got it.
> 
> The email I added you to the cc on had a very clear link to the
> bugzilla, which had all the information. So all you needed to do was
> to look at the entry that was pointed at.
> 
> And no, you don't seem to get the actual emails that bugzilla sends
> out, despite bugzilla showing you as being part of the cc list.
> 
> That is probably because you need to do that "add to the cc" yourself,
> so that people can't use bugzilla as a way to spam random people.
> 
> Anyway, I've uploaded a patch for Artem to try to the bugzilla. I
> seriously doubt it makes any difference at all. But I don't see how
> the dma_mask change would make any difference in the first place, so
> at least clarifying the code and the logic of when that dma_mask field
> is initialized shouldn't hurt.
> 
> I do note that before that commit cdfee5623290 ("driver core:
> initialize a default DMA mask for platform device"), only powerpc and
> m68k had that "arch_setup_pdev_archdata()" function that does that
> "default 32-bit dma mask".
> 
> On x86, we have mainly ACPI, which will do that
> 
>         if (acpi_dma_supported(adev))
>                 pdevinfo.dma_mask = DMA_BIT_MASK(32);
>         else
>                 pdevinfo.dma_mask = 0;
> 
>         pdev = platform_device_register_full(&pdevinfo);
> 
> which basically should be a no-op - 0 means that
> platform_device_register_full() doesn't do anything and we use the
> default value, and DMA_BIT_MASK(32) _is_ that default value.
> 
> Did x86 perhaps use to have some "no dma_mask means that it's the
> full 64 bits" behavior? Because that's one of the effects of that commit that
> Artem bisected things to: it will now always have a dma_mask pointer,
> and it will default to that 32-bit value if it didn't have one before.
> 
>             Linus

I think on x86, acpi_dma_supported(adev) is equivalent to adev != NULL, so the problem shouldn't come from here?
Comment 41 Artem S. Tashkinov 2020-03-10 21:18:22 UTC
(In reply to Linus Torvalds from comment #39)
> 
> In fact, maybe the easiest thing to do would be to use
> "config-bisect.pl" to figure out what config option it is that breaks
> things for you.
> 
> See
> 
>     tools/testing/ktest/config-bisect.pl
> 
> which has a "this is how to use it" description in the comments at the top...
> 

This will take some time, so please bear with me. I now have two patches to test, and then I will try to figure out which config option makes the kernel unbootable.
Comment 42 Linus Torvalds 2020-03-10 21:30:50 UTC
(In reply to Arvind Sankar from comment #40)
> 
> I think on x86, acpi_dma_supported(adev) is equivalent to adev != NULL, so
> the problem shouldn't come from here?

I doubt it's an ACPI device anyway, since I don't see why one config would work for Artem (his small one), but not the F31 config.

Of course, it's possible that the F31 config ends up having some debug option that triggers early during boot and then that kills the boot.

It's also possible that Artem's small config doesn't have some driver enabled. It looks like a very common hw setup, though. The only thing that isn't in a ton of other machines is that 

  Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)

thing, which is handled by that MISC_RTSX_PCI driver. Maybe Artem doesn't normally use it, and his working config doesn't have any of that support in it (including the MFD_CORE etc that it selects).

The MFD files _are_ some of the files that get re-built with that commit reverted. So if there is some MFD thing going on..

Hmm. The MFD code does do things like this:

        pdev = platform_device_alloc(cell->name, platform_id);
        if (!pdev)
                goto fail_alloc;
    ...
        pdev->dev.dma_mask = parent->dma_mask;


so it really does play games with that dev.dma_mask pointer.


Artem? Does your minimal config that worked when you built it yourself (as opposed to the Fedora build) perhaps not have that MFD code enabled?
Comment 43 Linus Torvalds 2020-03-10 21:41:22 UTC
(In reply to Artem S. Tashkinov from comment #41)
> This will take some time, please. I now have two patches to test and then I
> will try to figure out which config option makes the kernel unbootable.

I'm starting to suspect it's some interaction with that MFD dma_mask parent assignment.

In comment #18 you say

 "The issue is, with my custom very simplistic .config the kernel boots just fine. It's the options which Fedora enables causes it not to boot. And their config includes pretty much everything on Earth."

and I'm now starting to think that your custom very simplistic .config that works just avoids the whole issue with MFD.

Can you attach your minimal config that worked for you?

Is my theory perhaps believable, or complete garbage?

If I'm right, then if you enable the MISC_RTSX_PCI driver in your minimal config, then it will perhaps show the problem too...

Because rtsx_pcr.c does do that mfd_add_devices(&pcidev->dev) -> mfd_add_device(), which takes the DMA mask of the parent (PCI) device, and uses it for the newly allocated platform device.

I'm not seeing why that wouldn't work, but that's the one piece of hardware you have that I can see that isn't _hugely_ common, I think.
Comment 44 Christoph Hellwig 2020-03-11 06:13:26 UTC
On Tue, Mar 10, 2020 at 01:41:34PM -0700, Linus Torvalds wrote:
> Did x86 perhaps use to have some "no dma_mask means that it's the
> full 64 bits" behavior? Because that's one of the effects of that commit that
> Artem bisected things to: it will now always have a dma_mask pointer,
> and it will default to that 32-bit value if it didn't have one before.

So looking at the exact history - x86 already uses dma-direct in 5.4,
which failed DMA map requests without a dma mask; it was just arm that
still allowed that and relied on it.  The same goes for intel-iommu.
Comment 45 Artem S. Tashkinov 2020-03-11 07:35:50 UTC
(In reply to Linus Torvalds from comment #43)
> 
> and I'm now starting to think that your custom very simplistic .config that
> works just avoids the whole issue with MFD.
> 
> Can you attach your minimal config that worked for you?
> 
> Is my theory perhaps believable, or complete garbage?
> 
> If I'm right, then if you enable the MISC_RTSX_PCI driver in your minimal
> config, then it will perhaps show the problem too...
> 
> Because rtsx_pcr.c does do that mfd_add_devices(&pcidev->dev) ->
> mfd_add_device(), which takes the DMA mask of the parent (PCI) device, and
> uses it for the newly allocated platform device.
> 
> I'm not seeing why that wouldn't work, but that's the one piece of hardware
> you have that I can see that isn't _hugely_ common, I think.

The config can be downloaded from here: https://bugzilla.redhat.com/attachment.cgi?id=1654871

$ grep -i MFD .config  | grep -v "not set"
CONFIG_MFD_CORE=m
CONFIG_MEMFD_CREATE=y
Comment 46 Artem S. Tashkinov 2020-03-11 08:29:52 UTC
Created attachment 287863 [details]
kernel panic on boot, kernel-5.5.8-200.fc31

Here's something totally unexpected.

The kernel (any kernel for that matter) produces output with, hooray, these options (should have tried them earlier):

efi=debug earlycon=efifb keep_bootcon

I've made a photo of the screen.
Comment 47 Artem S. Tashkinov 2020-03-11 09:06:11 UTC
The patches by Linus and Christoph haven't changed anything - the kernel continues to die on boot.

Kernel 5.4.24 _without_ any patches produces a much longer boot trace which unfortunately doesn't fit on one screen, so I had to make a video of it.

I can try to write it all down if anyone's interested.

RIP: 0010:kmem_cache_alloc_trace
...

? acpi_ds_create_walk_state
acpi_ds_create_walk_state
acpi_ds_call_control_method
acpi_ds_parse_aml
acpi_ps_execute_method
acpi_ns_evaluate
acpi_ut_evaluate_object
Comment 48 Artem S. Tashkinov 2020-03-11 16:16:58 UTC
Created attachment 287875 [details]
fix dma_mask handling in platform_device_register_full.patch

Fixed by:

[PATCH] device core: fix dma_mask handling in platform_device_register_full

https://lkml.org/lkml/2020/3/11/717
