Bug 206175
Summary: | Fedora >= 5.4 kernels instantly freeze on boot without producing any display output | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Artem S. Tashkinov (aros) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | alexdeucher, hch, jwrdegoede, matt, nivedita, postix, rjw, torvalds |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
URL: | https://bugzilla.redhat.com/show_bug.cgi?id=1790115 | ||
Kernel Version: | >= 5.4.0-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
grub `set debug=all` + kernel debug options
[kernel 5.3] dmesg | grep -i dma
Fedora 31 kernel config
attachment-16112-0.html
attachment-24927-0.html
clean up platform device dma_mask handling
kernel panic on boot, kernel-5.5.8-200.fc31
fix dma_mask handling in platform_device_register_full.patch |
Description
Artem S. Tashkinov
2020-01-12 13:51:40 UTC
Linus, I need help, please! Even `efi=debug earlyprintk=efi,keep` produces no output at all. I've no idea what to do except bisecting but I don't want to go through this. Besides, it's a distro kernel and I will be debugging my own kernel.
Created attachment 286775 [details]
grub `set debug=all` + kernel debug options
The last four lines of GRUB2 debug log are:

script/lexer.c:321: token 0 text []
loader/efi/linux.c:82: kernel_addr: 0x1000000 handover_offset: 0x190 params: 0x7af0c000
loader/efi/linux.c:85: handover_func() = 0x1000390
_ (not blinking)

At this point the system is dead.

There really is nothing at all to go by. I see no better idea than trying to bisect things. Annoying, but that's exactly what bisection is great at - bugs where you can't even begin to guess what the problem is. If you can't see it with a normal self-built kernel, try to build with the Fedora config file.

The fact that it prints nothing at all makes me suspect it's some random but very core thing. But it's presumably very specific to your machine - some particular e820 memory layout issue or something. I'm not seeing anything particularly odd in your devices.

One thing to do is to perhaps remove the "rhgb quiet" that Fedora puts at the end of the boot line. You can do that just from the grub shell line.

(In reply to Linus Torvalds from comment #5)
> The fact that it prints nothing at all makes me suspect it's some random but
> very core thing. But it's presumably very specific to your machine - some
> particular e820 memory layout issue or something. I'm not seeing anything
> particularly odd in your devices.
>
> One thing to do is to perhaps remove the "rhgb quiet" that Fedora puts at
> the end of the boot line. You can do that just from the grub shell line.

I've tried booting with no arguments at all (not even initrd) - and with just debugging flags turned on (efi=debug earlyprintk=efi,keep), i.e.

vmlinuz-5.4.8-300.fc31.x86_64 efi=debug earlyprintk=efi,keep

- the result is always the same: an instant freeze and no messages on the screen.

I will try with various Fedora kernels first starting with 5.4-rc1. I really don't want to bisect yet.

I have the same problem on an Acer Aspire 5 laptop (model A515-43-R19L).
The laptop isn't completely as originally sold, because I installed a second 4GB ram stick and a second hard drive.

Debian testing was working fine until the kernel was upgraded from the 5.3 to the 5.4 version series. Tonight, I installed Fedora 31 that comes with a 5.3 kernel. Fedora was working until I ran updates and the kernel was upgraded to 5.4.8.

In every case, the 5.4 kernel results in a hard freeze early in the boot process. The 5.3 kernel works fine, and downgrading back to that kernel works.

In Debian, I tried compiling my own 5.4.11 kernel using Debian's 5.3 kernel config as a starting point. I ran "make oldconfig" and selected all the defaults. The self-compiled kernel had the same issue.

My lspci output:

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:01.7 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 7
02:00.0 Non-Volatile memory controller: SK hynix Device 1327
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
04:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c4)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Raven/Raven2/Fenghuang HDMI/DP Audio Controller
05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven2 USB 3.1
05:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor
05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller
06:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)

(In reply to Matt Yates from comment #7)
> I have the same problem on an Acer Aspire 5 laptop (model A515-43-R19L).
> The laptop isn't completely as originally sold, because I installed a second
> 4GB ram stick and a second hard drive.
>
> Debian testing was working fine until the kernel was upgraded from the 5.3
> to the 5.4 version series. Tonight, I installed Fedora 31 that comes with a
> 5.3 kernel. Fedora was working until I ran updates and the kernel was
> upgraded to 5.4.8.
>
> In every case, the 5.4 kernel results in a hard freeze early in the boot
> process. The 5.3 kernel works fine, and downgrading back to that kernel
> works.
>
> In Debian, I tried compiling my own 5.4.11 kernel using Debian's 5.3 kernel
> config as a starting point. I ran "make oldconfig" and selected all the
> defaults. The self-compiled kernel had the same issue.
>
> My lspci output:
>

That's weird and puzzling - our laptops are completely different aside from sharing the same architecture.

What's your BIOS vendor (can be checked using sudo lshw)? Mine is American Megatrends Inc.

Also, could you check your BIOS for a TPM module? Could you try disabling it and see if it helps?

My BIOS vendor is "Insyde Corp.". There is a TPM module. When I disabled it, it caused my EFI boot entry to disappear, so I couldn't test it.

However, I think we may have two separate problems. I switched back from Fedora to Debian Testing, and the Debian installer upgraded the kernel from 5.3 to 5.4 series prior to the first boot. The 5.4 kernel booted up on first boot. I could see boot messages scrolling, but the screen went black while trying to load lightdm because I did not have the "firmware-amd-graphics" package installed, which is required for graphics. After installing the amd graphics package, the 5.4 kernel freezes as before (right at the start of the boot process). The 5.3 kernel boots as normal, and graphics work.

The "firmware-amd-graphics" package (version 20190717-2) was the only thing I changed, so I guess the problem must be some sort of conflict with the amd graphics firmware and the 5.4 kernel.

I have a similar picasso based raven laptop and it's working fine with F31 and the latest 5.4.8 kernel. I think the firmware is a red herring since the driver won't load without it. If you don't have it installed the driver won't be loaded.
Sounds more like some regression between 5.3 and 5.4 for your laptop. It probably makes sense to open a new bug for the amdgpu issue to avoid mixing issues in this report.

(In reply to Matt Yates from comment #9)
>
> The "firmware-amd-graphics" package (version 20190717-2) was the only thing
> I changed, so I guess the problem must be some sort of conflict with the amd
> graphics firmware and the 5.4 kernel.

This is definitely a whole different issue not related to this one. You could open a separate bug report.

Rafael J. Wysocki, I see some pretty heavy changes merged with kernel after 5.4 release:

https://github.com/torvalds/linux/commit/782b59711e1561ee0da06bc478ca5e8249aa8d09

Could they cause this issue? Are there any extra boot flags that I could try to understand what's going on in my case?

kernel-5.4.0-0.rc0.git1.1.fc32 boots
kernel-5.4.0-0.rc0.git7.1.fc32 fails to boot (hard freeze)

And all the kernels in between give

Forbidden
You don't have permission to access this resource.

Which means some changes between git1 and git7 rendered my system unbootable. Looks like I have to bisect :-(

__________________________________________________________

Again, Rafael, could you respond please?

This is a note for myself:

https://koji.fedoraproject.org/koji/buildinfo?buildID=1379422
https://kojipkgs.fedoraproject.org/vol/fedora_koji_archive03/packages/kernel/5.4.0/
https://kojipkgs.fedoraproject.org/vol/fedora_koji_archive05/packages/kernel/5.4.0/

Still trying to see what I can do without bisecting.

From https://bugzilla.redhat.com/show_bug.cgi?id=1790115#c44

The bad commit is between these two snapshots:

* Thu Sep 19 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git3.1
- Linux v5.3-7639-gb41dae061bbd

* Wed Sep 18 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git2.1
- Linux v5.3-3839-g35f7a9526615

Maybe kernel developers have an idea what it could be?

Again, the kernel freezes VERY early during boot - as nothing is printed on the screen even with all possible debug flags. Someone surely knows what it is and I don't want to spam LKML with that.

(In reply to Artem S. Tashkinov from comment #15)
>
> The bad commit is between these two snapshots:
>
> * Thu Sep 19 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git3.1
> - Linux v5.3-7639-gb41dae061bbd
>
> * Wed Sep 18 2019 Jeremy Cline <jcline@redhat.com> - 5.4.0-0.rc0.git2.1
> - Linux v5.3-3839-g35f7a9526615

That's still 3800 commits...

Looking at my merge log in that range, we have

    Pull xfs updates from Darrick Wong
    Pull swap access updates from Darrick Wong
    Pull overlayfs fixes from Miklos Szeredi
    Pull btrfs updates from David Sterba
    Pull AFS updates from David Howells
    Pull fs-verity support from Eric Biggers
    Pull fscrypt updates from Eric Biggers
    Pull file locking updates from Jeff Layton
    Pull vfs mount API infrastructure updates from Al Viro
    Pull d_path fix from Al Viro
    Pull vfs namei updates from Al Viro
    Pull networking updates from David Miller
    Pull crypto updates from Herbert Xu
    Pull char/misc driver updates from Greg KH
    Pull staging and IIO driver updates from Greg KH
    Pull tty/serial driver updates from Greg KH
    Pull USB updates from Greg KH
    Pull driver core updates from Greg Kroah-Hartman
    Pull KVM updates from Paolo Bonzini
    Pull KVM fix from Paolo Bonzini

and none of them look at all obvious. There's no low-level x86 boot code, for example. And the filesystem stuff shouldn't trigger early in boot.

Sure, there's some low-level driver updates that might cause it. tty in particular. But I don't see why it would impact you and nobody else.

I _really_ think you should bisect. Trust me, it's faster than whatever else you're doing.

(In reply to Linus Torvalds from comment #17)
> I _really_ think you should bisect. Trust me, it's faster than whatever else
> you're doing.

Thank you Linus! I'm thinking about it.

The issue is, with my custom very simplistic .config the kernel boots just fine. It's the options that Fedora enables that cause it not to boot. And their config includes pretty much everything on Earth.

To add insult to injury, Fedora's kernel is built with some patches on top of vanilla and those patches could interact with the kernel code in strange ways, and then it's not so obvious what I should bisect exactly. Should I apply Fedora's patches after each bisect step before actually building the kernel?

Damn. 14 reboots.

# git bisect bad
cdfee5623290bc893f595636b44fa28e8207c5b3 is the first bad commit
commit cdfee5623290bc893f595636b44fa28e8207c5b3
Author: Christoph Hellwig <hch@lst.de>
Date: Fri Aug 16 08:24:35 2019 +0200

    driver core: initialize a default DMA mask for platform device

    We still treat devices without a DMA mask as defaulting to 32-bits for
    both mask, but a few releases ago we've started warning about such
    cases, as they require special cases to work around this sloppyness.
    Add a dma_mask field to struct platform_device so that we can initialize
    the dma_mask pointer in struct device and initialize both masks to
    32-bits by default, replacing similar functionality in m68k and
    powerpc. The arch_setup_pdev_archdata hooks is now unused and removed.

    Note that the code looks a little odd with the various conditionals
    because we have to support platform_device structures that are
    statically allocated.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Link: https://lore.kernel.org/r/20190816062435.881-7-hch@lst.de
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 arch/m68k/kernel/dma.c | 9 ---------
 arch/powerpc/kernel/setup-common.c | 6 ------
 arch/sh/boards/mach-ap325rxa/setup.c | 1 -
 arch/sh/boards/mach-ecovec24/setup.c | 2 --
 arch/sh/boards/mach-kfr2r09/setup.c | 1 -
 arch/sh/boards/mach-migor/setup.c | 1 -
 arch/sh/boards/mach-se/7724/setup.c | 2 --
 drivers/base/platform.c | 37 ++++++++++++++++--------------------
 include/linux/platform_device.h | 2 +-
 9 files changed, 17 insertions(+), 44 deletions(-)

# git bisect log
git bisect start
# good: [35f7a95266153b1cf0caca3aa9661cb721864527] Merge tag 'devprop-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 35f7a95266153b1cf0caca3aa9661cb721864527
# bad: [b41dae061bbd722b9d7fa828f35d22035b218e18] Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad b41dae061bbd722b9d7fa828f35d22035b218e18
# good: [2f2fa16e23816bded4b97117faf6e97a95ba9056] Merge branch 'devlink-unknown'
git bisect good 2f2fa16e23816bded4b97117faf6e97a95ba9056
# bad: [e6874fc29410fabfdbc8c12b467f41a16cbcfd2b] Merge tag 'staging-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad e6874fc29410fabfdbc8c12b467f41a16cbcfd2b
# good: [56a583d264b97e34d3c6009f50f113450e63a1de] Staging: exfat: Avoid use of strcpy
git bisect good 56a583d264b97e34d3c6009f50f113450e63a1de
# bad: [fb9617edf6c0e1b86a6595cd92dd3f84595221d9] Merge tag 'usb-ci-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb into usb-next
git bisect bad fb9617edf6c0e1b86a6595cd92dd3f84595221d9
# bad: [ecd55e367f3d706788632e176ec6b94e1a72a07c] usb: chipidea: msm: Use device-managed registration API
git bisect bad ecd55e367f3d706788632e176ec6b94e1a72a07c
# good: [3e2cb866b2b1497a2246c8d222cd672694ac9f15] USB: phy: mv-usb: convert platform driver to use dev_groups
git bisect good 3e2cb866b2b1497a2246c8d222cd672694ac9f15
# good: [eceddc4071e39807f080913656252e4526227647] usb: typec: fusb302: Remove unused properties
git bisect good eceddc4071e39807f080913656252e4526227647
# good: [a599e48662b4b505bef45d4831061f9d50703e17] usb: usb-skeleton: make comment block in line with coding style
git bisect good a599e48662b4b505bef45d4831061f9d50703e17
# good: [bd5defaee872da9b81e3c72045eb6794445cd2e6] dma-mapping: remove is_device_dma_capable
git bisect good bd5defaee872da9b81e3c72045eb6794445cd2e6
# bad: [58fb8beda201f00b139e8993cbda937cd9a4e603] dt-binding: usb: ci-hdrc-usb2: add imx7ulp compatible
git bisect bad 58fb8beda201f00b139e8993cbda937cd9a4e603
# bad: [cdfee5623290bc893f595636b44fa28e8207c5b3] driver core: initialize a default DMA mask for platform device
git bisect bad cdfee5623290bc893f595636b44fa28e8207c5b3
# first bad commit: [cdfee5623290bc893f595636b44fa28e8207c5b3] driver core: initialize a default DMA mask for platform device

To further confirm it:

5.4 vanilla with Fedora config FAILs to boot.

5.4 vanilla with Fedora config *without* 0001-driver-core-initialize-a-default-DMA-mask-for-platform.patch (commit cdfee5623290bc893f595636b44fa28e8207c5b3) boots successfully.

Now I'm totally confused as this patch looks relatively innocuous, so I'm leaving further judgement to people who can understand what's going on. This patch might affect multiple other built-in kernel drivers which causes a boot failure.
This is what gets recompiled after reverting this patch - looks like a lot of stuff:

UPD include/config/kernel.release
DESCEND objtool
UPD include/generated/utsrelease.h
CALL scripts/atomic/check-atomics.sh
CALL scripts/checksyscalls.sh
CHK include/generated/compile.h
CC init/version.o
AR init/built-in.a
CC kernel/sys.o
CC drivers/ata/libata-core.o
CC arch/x86/kernel/rtc.o
CC drivers/acpi/glue.o
CC drivers/base/core.o
CC drivers/base/platform.o
CC arch/x86/kernel/cpu/microcode/core.o
CC drivers/char/ipmi/ipmi_dmi.o
CC drivers/char/ipmi/ipmi_plat_data.o
CC drivers/clk/clk-fixed-factor.o
CC drivers/acpi/dock.o
CC drivers/char/tpm/tpm-chip.o
CC drivers/char/tpm/tpm-dev-common.o
AR drivers/char/ipmi/built-in.a
CC drivers/clk/clk-fixed-rate.o
CC lib/genalloc.o
AR arch/x86/kernel/cpu/microcode/built-in.a
AR arch/x86/kernel/cpu/built-in.a
CC arch/x86/kernel/early-quirks.o
CC drivers/acpi/acpi_lpss.o
CC drivers/acpi/acpi_apd.o
CC drivers/char/tpm/tpm-dev.o
CC drivers/char/tpm/tpm-interface.o
CC drivers/char/tpm/tpm1-cmd.o
CC drivers/char/tpm/tpm2-cmd.o
CC drivers/clk/clk-gpio.o
CC drivers/char/tpm/tpmrm-dev.o
CC drivers/char/tpm/tpm2-space.o
CC drivers/base/firmware_loader/main.o
CC drivers/acpi/acpi_platform.o
CC drivers/char/tpm/tpm-sysfs.o
CC arch/x86/kernel/early_printk.o
CC drivers/clk/x86/clk-pmc-atom.o
CC drivers/char/tpm/eventlog/common.o
CC drivers/char/tpm/eventlog/tpm1.o
AR lib/built-in.a
CC drivers/char/tpm/eventlog/tpm2.o
CC drivers/char/tpm/tpm_ppi.o
CC drivers/base/power/domain.o
CC kernel/dma/mapping.o
CC drivers/char/tpm/eventlog/acpi.o
CC drivers/clk/x86/clk-st.o
CC drivers/clk/x86/clk-lpt.o
AR drivers/base/firmware_loader/built-in.a
CC drivers/acpi/acpi_watchdog.o
CC drivers/char/tpm/eventlog/efi.o
CC drivers/char/tpm/tpm_tis_core.o
CC drivers/char/tpm/tpm_tis.o
CC drivers/char/tpm/tpm_crb.o
CC arch/x86/kernel/pmem.o
CC arch/x86/kernel/pcspeaker.o
AR kernel/dma/built-in.a
CC kernel/power/qos.o
AR drivers/clk/x86/built-in.a
CC drivers/acpi/apei/hest.o
AR drivers/clk/built-in.a
CC drivers/devfreq/devfreq.o
AR drivers/ata/built-in.a
CC drivers/acpi/apei/ghes.o
CC drivers/acpi/ac.o
CC drivers/dma/dmaengine.o
CC arch/x86/kernel/sysfb.o
CC kernel/time/alarmtimer.o
CC drivers/acpi/fan.o
AR drivers/base/power/built-in.a
AR drivers/base/built-in.a
CC kernel/trace/trace.o
CC drivers/edac/edac_mc.o
CC drivers/edac/edac_device.o
CC drivers/edac/edac_mc_sysfs.o
AR drivers/char/tpm/built-in.a
CC drivers/edac/edac_module.o
AR arch/x86/kernel/built-in.a
AR drivers/char/built-in.a
AR arch/x86/built-in.a
CC drivers/gpio/gpio-crystalcove.o
CC drivers/gpu/drm/drm_mipi_dsi.o
CC drivers/firmware/efi/efi.o
CC drivers/acpi/pmic/intel_pmic_crc.o
AR kernel/power/built-in.a
CC drivers/gpio/gpio-tps68470.o
CC drivers/gpio/gpio-wcove.o
AR drivers/dma/built-in.a
AR drivers/acpi/apei/built-in.a
AR drivers/devfreq/built-in.a
CC drivers/acpi/pmic/intel_pmic_xpower.o
CC kernel/module.o
CC drivers/edac/edac_device_sysfs.o
CC drivers/edac/wq.o
CC drivers/edac/edac_pci.o
AR kernel/time/built-in.a
CC drivers/edac/edac_pci_sysfs.o
CC drivers/edac/ghes_edac.o
CC drivers/i2c/i2c-core-base.o
CC drivers/acpi/pmic/intel_pmic_bxtwc.o
CC drivers/acpi/pmic/intel_pmic_chtwc.o
CC drivers/acpi/pmic/intel_pmic_chtdc_ti.o
AR drivers/gpio/built-in.a
CC drivers/i2c/busses/i2c-designware-platdrv.o
AR drivers/firmware/efi/built-in.a
AR drivers/firmware/built-in.a
AR drivers/acpi/built-in.a
CC drivers/input/serio/i8042.o
AR drivers/gpu/drm/built-in.a
AR drivers/gpu/built-in.a
CC drivers/iommu/amd_iommu.o
CC drivers/mailbox/pcc.o
CC drivers/input/mouse/elantech.o
CC drivers/media/cec/cec-notifier.o
AR drivers/edac/built-in.a
CC drivers/mfd/tps68470.o
CC drivers/mfd/mfd-core.o
CC drivers/mfd/axp20x.o
AR drivers/i2c/busses/built-in.a
CC drivers/mfd/intel_soc_pmic_crc.o
CC drivers/mfd/intel_soc_pmic_core.o
AR drivers/mailbox/built-in.a
CC drivers/mfd/intel_soc_pmic_bxtwc.o
CC drivers/mfd/intel_soc_pmic_chtwc.o
AR drivers/media/cec/built-in.a
AR drivers/media/built-in.a
AR drivers/input/serio/built-in.a
CC drivers/net/phy/mdio_bus.o
CC drivers/pci/probe.o
AR drivers/i2c/built-in.a
AR drivers/input/mouse/built-in.a
AR drivers/input/built-in.a
AR drivers/mfd/built-in.a
CC drivers/pci/pci-driver.o
CC drivers/pinctrl/intel/pinctrl-baytrail.o
CC drivers/pinctrl/intel/pinctrl-cherryview.o
CC drivers/platform/x86/intel_pmc_ipc.o
CC drivers/platform/x86/intel_pmc_core.o
CC drivers/platform/x86/intel_pmc_core_pltdrv.o
CC drivers/platform/x86/pmc_atom.o
CC drivers/net/phy/fixed_phy.o
CC drivers/pwm/pwm-crc.o
CC drivers/regulator/dummy.o
CC drivers/regulator/fixed-helper.o
AR kernel/trace/built-in.a
AR drivers/pwm/built-in.a
AR kernel/built-in.a
CC drivers/rtc/rtc-cmos.o
AR drivers/iommu/built-in.a
CC drivers/spi/spi.o
AR drivers/pinctrl/intel/built-in.a
AR drivers/pinctrl/built-in.a
AR drivers/platform/x86/built-in.a
AR drivers/platform/built-in.a
CC drivers/usb/common/common.o
AR drivers/net/phy/built-in.a
AR drivers/net/built-in.a
CC drivers/tty/serdev/core.o
AR drivers/regulator/built-in.a
CC drivers/usb/core/hcd.o
CC drivers/video/fbdev/vesafb.o
CC drivers/video/fbdev/efifb.o
CC drivers/usb/host/xhci-ext-caps.o
CC drivers/tty/serial/8250/8250_core.o
CC drivers/tty/serial/8250/8250_pnp.o
CC drivers/scsi/hosts.o
CC drivers/tty/serial/8250/8250_port.o
CC drivers/tty/serial/8250/8250_dma.o
AR drivers/usb/common/built-in.a
CC drivers/tty/serial/8250/8250_dwlib.o
AR drivers/rtc/built-in.a
AR drivers/pci/built-in.a
CC drivers/tty/serial/8250/8250_pci.o
CC drivers/tty/serial/8250/8250_early.o
CC drivers/tty/serial/8250/8250_dw.o
CC drivers/tty/serial/8250/8250_mid.o
AR drivers/tty/serdev/built-in.a
AR drivers/usb/host/built-in.a
AR drivers/scsi/built-in.a
AR drivers/video/fbdev/built-in.a
AR drivers/video/built-in.a
AR drivers/usb/core/built-in.a
AR drivers/usb/built-in.a
AR drivers/tty/serial/8250/built-in.a
AR drivers/tty/serial/built-in.a
AR drivers/tty/built-in.a
AR drivers/spi/built-in.a
AR drivers/built-in.a
GEN .version
CHK include/generated/compile.h
UPD include/generated/compile.h
CC init/version.o
AR init/built-in.a
LD vmlinux.o
MODPOST vmlinux.o
MODINFO modules.builtin.modinfo
LD .tmp_vmlinux1
KSYM .tmp_kallsyms1.o
LD .tmp_vmlinux2
KSYM .tmp_kallsyms2.o
LD vmlinux
SORTEX vmlinux
SYSMAP System.map
TEST posttest

Created attachment 287811 [details]
[kernel 5.3] dmesg | grep -i dma
Created attachment 287813 [details]
Fedora 31 kernel config
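A rough sanity check on the bisection effort discussed above, as a minimal C sketch. It assumes the two Fedora snapshot strings are `git describe` output (tag, commits since the tag, then `g` plus an abbreviated hash) and that the difference of the two counts approximates the size of the range, which is only roughly true for non-linear history. The function names `describe_commit_count` and `bisect_steps` are made up for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Extract the commit count from a `git describe` style string such as
 * "v5.3-7639-gb41dae061bbd". Returns -1 if the string lacks that shape.
 */
long describe_commit_count(const char *desc)
{
    const char *g = strrchr(desc, '-');   /* last '-' should start "-g<hash>" */
    if (!g || g == desc || g[1] != 'g')
        return -1;

    /* Walk back to the first character after the preceding '-'. */
    const char *p = g;
    while (p > desc && p[-1] != '-')
        p--;
    if (p == desc || p == g)
        return -1;

    char *end;
    long n = strtol(p, &end, 10);
    return (end == g && n >= 0) ? n : -1;
}

/*
 * Worst-case number of kernels to build and boot when each test
 * halves the remaining candidate range (rounding up).
 */
unsigned bisect_steps(unsigned long commits)
{
    unsigned steps = 0;
    while (commits > 1) {
        commits = (commits + 1) / 2;
        steps++;
    }
    return steps;
}
```

With the two snapshots quoted earlier the counts are 7639 and 3839, the difference is the 3800 commits Linus mentions, and `bisect_steps(3800)` gives 12, which is in the same ballpark as the 14 reboots reported.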
As of kernel 5.6-rc4-git the bug is still present.

Just bringing in more people.

Christoph, I'm not seeing where we initialize the new 'dma_mask' in all cases.

So it looks like what happens is:

 - platform_device_alloc() use kzalloc() to allocate the platform object.

 - it then calls setup_pdev_dma_masks, which has

        if (!pdev->dma_mask)
                pdev->dma_mask = DMA_BIT_MASK(32);

   so now it's set to a 32-bit mask.

 - nothing else ever seems to set it to anything else.

So the code looks a bit odd. What's going on? The logic of this all entirely escapes me.

That said, I don't see why it would break anything either.

        Linus

On Fri, Mar 6, 2020 at 8:31 PM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=206175
>
> --- Comment #20 from Artem S. Tashkinov (aros@gmx.com) ---
> To further confirm it:
>
> 5.4 vanilla with Fedora config FAILs to boot.
>
> 5.4 vanilla with Fedora config *without*
> 0001-driver-core-initialize-a-default-DMA-mask-for-platform.patch (commit
> cdfee5623290bc893f595636b44fa28e8207c5b3) boots successfully.
>
> Now I'm totally confused as this patch looks relatively innocuous, so I'm
> leaving further judgement to people who can understand what's going on. This
> patch might affect multiple other built-in kernel drivers which causes a boot
> failure.

On Fri, Mar 06, 2020 at 08:47:24PM -0600, Linus Torvalds wrote:
> Just bringing in more people.
>
> Christoph, I'm not seeing where we initialize the new 'dma_mask' in all
> cases.
>
> So it looks like what happens is:
>
>  - platform_device_alloc() uses kzalloc() to allocate the platform object.
>
>  - it then calls setup_pdev_dma_masks, which has
>
>         if (!pdev->dma_mask)
>                 pdev->dma_mask = DMA_BIT_MASK(32);
>
>    so now it's set to a 32-bit mask.
>
>  - nothing else ever seems to set it to anything else.
>
> So the code looks a bit odd. What's going on? The logic of this all
> entirely escapes me.
In addition to the modern, dynamically allocated platform devices, various
architectures (most notably arm and sh, but also some weirdo x86 code)
allocate platform_devices in .bss, and some of them initialize the
dma mask (or at least did when I added this code).
What platform doesn't boot here? arm, x86, something else? What board?
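For readers following along, the default-mask logic quoted above can be modeled in user space. This is only a minimal sketch under stated assumptions, not the kernel source: `struct device` and `struct platform_device` are cut down to the two fields under discussion, `DMA_BIT_MASK` is redefined locally after the kernel macro of the same name, and `setup_pdev_dma_masks()` reproduces just the two checks quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel macro of the same name. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Simplified stand-in: only the field discussed in the thread. */
struct device {
	uint64_t *dma_mask;		/* a pointer, so it may be NULL */
};

/* Simplified stand-in for the platform device after the bisected commit. */
struct platform_device {
	uint64_t dma_mask;		/* storage the pointer can refer to */
	struct device dev;
};

/* Mirrors the two checks quoted in the thread. */
static void setup_pdev_dma_masks(struct platform_device *pdev)
{
	assert(pdev != NULL);

	if (!pdev->dma_mask)
		pdev->dma_mask = DMA_BIT_MASK(32);
	if (!pdev->dev.dma_mask)
		pdev->dev.dma_mask = &pdev->dma_mask;
}
```

With a zeroed device (the kzalloc() case), both fields get filled in and the pointer ends up referring to the embedded 32-bit mask. With a device that already set `dev.dma_mask` itself (the statically allocated case), the pointer is left alone but the embedded field is still initialized even though nothing reads it, which is the oddity being asked about.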
(In reply to Christoph Hellwig from comment #25)
>
> What platform doesn't boot here? arm, x86, something else? What board?

All the information is provided in the bug report. It's a bog-standard x86-64 UEFI laptop with a Skylake CPU.

Artem is talking about this Fedora bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1790115

Let me copy and paste some detailed hwinfo from that bug report:

lspci:
00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL810xE PCI Express Fast Ethernet controller (rev 0a)
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
00:1f.7 Non-Essential Instrumentation [1300]: Intel Corporation Device 9d26 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 08)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Skylake GT2 [HD Graphics 520] (rev 07)

CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
product: HP Pavilion x360 Convertible (P1F10UA#ABA)
BIOS: F.28 (the latest released one)

(In reply to Christoph Hellwig from comment #25)
>
> What platform doesn't boot here? arm, x86, something else? What board?

Any updates on this one? All the requested information has been provided and I can give you SSH access to my laptop if necessary.

On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> Any updates on this one? All the requested information has been provided and I
> can give you SSH access to my laptop if necessary.

I'm inclined to just revert that commit.

It seems to still revert cleanly, and I don't see a single use of the new "dma_mask" in "struct platform_device" apart from the initialization done in that commit. The only use of that new dma_mask field seems to literally be this:

        if (!pdev->dma_mask)
                pdev->dma_mask = DMA_BIT_MASK(32);
        if (!pdev->dev.dma_mask)
                pdev->dev.dma_mask = &pdev->dma_mask;

which is entirely nonsensical to begin with. It might make more sense if the code did something more along the lines of:

        if (!pdev->dev.dma_mask) {
                pdev->dma_mask = DMA_BIT_MASK(32);
                pdev->dev.dma_mask = &pdev->dma_mask;
        }

and we'd make it clear that the "platform_device->dma_mask" thing is purely a local allocation. It should be renamed to indicate that.

Btw, in platform_device_register_full(), we have this:

        if (pdevinfo->dma_mask) {
                /*
                 * This memory isn't freed when the device is put,
                 * I don't have a nice idea for that though. Conceptually
                 * dma_mask in struct device should not be a pointer.
                 * See http://thread.gmane.org/gmane.linux.kernel.pci/9081
                 */
                pdev->dev.dma_mask =
                        kmalloc(sizeof(*pdev->dev.dma_mask), GFP_KERNEL);
                if (!pdev->dev.dma_mask)
                        goto err;

                kmemleak_ignore(pdev->dev.dma_mask);

which looks like more garbage. It _would_ have made sense to actually use the new 'dma_mask' field here, and make this all be

        if (pdevinfo->dma_mask) {
                pdev->dma_mask = pdevinfo->dma_mask;
                pdev->dev.dma_mask = &pdev->dma_mask;
        }

and then we'd have a *sensible* use of that pdev field and we'd have simplified the code and avoided that memory leak, but that's not how the code worked.

I do have one question: the minimal config that works for you, and it's just the Fedora one that does not, could you try to see what modules get loaded with the Fedora one, that you might be lacking?

In fact, what happens if you do this:

 - build with the fedora config with the revert in place (so that you get a working kernel)

 - boot into that kernel

 - then do a "make localmodconfig" to generate a minimal config based on what modules you have loaded

 - see how that differs from your manually generated minimal config?

Anyway, I do think that that commit that you bisected down to is nonsensical. It does too much, or too little, and using a name like "dma_mask" is much too generic for that one special case and makes it impossible to grep for it.

If it was called "platform_dma_mask", unconditionally initialized to 32 bits at alloc time, and then the 'pdevinfo->dma_mask' thing would use it too, then it might be greppable, sensible, and useful. As it is, the thing adds no value and clearly causes problems, which is why I'd be inclined to revert it..

Christoph?

Linus

On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> >
> > Any updates on this one? All the requested information has been provided and I
> > can give you SSH access to my laptop if necessary.
Whoever sent that mail should reply to my last request..

> I'm inclined to just revert that commit.

Please don't. This area is a complete mess (see more below), and we need to make gradual steps to sort this crap out. And this was one important step.

> It seems to still revert cleanly, and I don't see a single use of the
> new "dma_mask" in "struct platform_device" apart from the
> initialization done in that commit. The only use of that new dma_mask
> field seems to literally be this:
>
>         if (!pdev->dma_mask)
>                 pdev->dma_mask = DMA_BIT_MASK(32);
>         if (!pdev->dev.dma_mask)
>                 pdev->dev.dma_mask = &pdev->dma_mask;
>
> which is entirely nonsensical to begin with. It might make more sense
> if the code did something more along the lines of:
>
>         if (!pdev->dev.dma_mask) {
>                 pdev->dma_mask = DMA_BIT_MASK(32);
>                 pdev->dev.dma_mask = &pdev->dma_mask;
>         }
>
> and we'd make it clear that the "platform_device->dma_mask" thing is
> purely a local allocation. It should be renamed to indicate that.

pdev->dev.dma_mask is a pointer that needs to point to a real DMA mask, which was a really really bad decision back when it was made, and something I plan to fix soon. So if we want a working dma mask we need to make that point to something. And since 5.4 we no longer silently treat a NULL pointer as a 32-bit DMA mask, so it needs to be initialized for DMA to work. We have a bunch of drivers initializing pdev->dev.dma_mask, often to things like static variables, which we need to clean up next, and allowing the dma_mask in the platform_device to be set directly should help with that.

That being said, if your suggested change fixes the reporter's problem (which would surprise me) I'm all for it for now.

>
> Btw, in platform_device_register_full(), we have this:
>
>         if (pdevinfo->dma_mask) {
>                 /*
>                  * This memory isn't freed when the device is put,
>                  * I don't have a nice idea for that though. Conceptually
>                  * dma_mask in struct device should not be a pointer.
>                  * See http://thread.gmane.org/gmane.linux.kernel.pci/9081
>                  */
>                 pdev->dev.dma_mask =
>                         kmalloc(sizeof(*pdev->dev.dma_mask), GFP_KERNEL);
>                 if (!pdev->dev.dma_mask)
>                         goto err;
>
>                 kmemleak_ignore(pdev->dev.dma_mask);
>
> which looks like more garbage. It _would_ have made sense to actually
> use the new 'dma_mask' field here, and make this all be
>
>         if (pdevinfo->dma_mask) {
>                 pdev->dma_mask = pdevinfo->dma_mask;
>                 pdev->dev.dma_mask = &pdev->dma_mask;
>         }
>
> and then we'd have a *sensible* use of that pdev field and we'd have
> simplified the code and avoided that memory leak, but that's not how
> the code worked.
>

Yes, but that whole platform_device_register_full mess predates this change and is just an awful hack by a tiny minority of the drivers using platform devices. Both the DMA mask code and platform devices are full of mess like this, which slowly needs to be sorted out.

> If it was called "platform_dma_mask", unconditionally initialized to
> 32 bits at alloc time, and then the 'pdevinfo->dma_mask' thing would
> use it too, then it might be greppable, sensible, and useful. As it
> is, the thing adds no value and clearly causes problems, which is why
> I'd be inclined to revert it..

Reverting it will break DMA for most platform device users.

(In reply to Christoph Hellwig from comment #30)
>
> Whoever sent that mail should reply to my last request..

I don't know whether you're serious or not but the kernel bug report has all the data and Hans de Goede emailed it to you again three days ago just in case. It's worth remembering it's the kernel bugzilla - not just a mailing list - you might want to check it once in a while.

Anyways, here it is for the third time:

lspci:
00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL810xE PCI Express Fast Ethernet controller (rev 0a)
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
00:1f.7 Non-Essential Instrumentation [1300]: Intel Corporation Device 9d26 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 08)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Skylake GT2 [HD Graphics 520] (rev 07)

CPU: Intel Core i5-6200U
Product: HP Pavilion x360 Convertible (P1F10UA#ABA)
BIOS: F.28 (the latest released one)

Created attachment 287855 [details]
attachment-16112-0.html

On Tue, Mar 10, 2020, 09:23 Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> > On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> > >
> > > Any updates on this one? All the requested information has been provided and I
> > > can give you SSH access to my laptop if necessary.
>
> Whoever sent that mail should reply to my last request..
>

They did. Days ago. It's all there in the bugzilla.

> I'm inclined to just revert that commit.
>
> Please don't. This area is a complete mess (see more below), and we
> need to make gradual steps to sort this crap out. And this was one
> important step.
>

Christoph, it's broken. "An important step" is completely irrelevant if it causes regressions.

It's that simple. It needs to get reverted or fixed, and you can't just ignore the regression and continue to ask for information that was already there before you asked the first time.

If you're not taking this regression seriously, then reverting is happening today.

Linus

On Tue, Mar 10, 2020 at 09:56:38AM -0700, Linus Torvalds wrote:
> On Tue, Mar 10, 2020, 09:23 Christoph Hellwig <hch@lst.de> wrote:
>
> > On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> > > On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> > > >
> > > > Any updates on this one? All the requested information has been provided and I
> > > > can give you SSH access to my laptop if necessary.
> >
> > Whoever sent that mail should reply to my last request..
> >
>
> They did. Days ago. It's all there in the bugzilla.

I never got it. I've also not seen any report on any list. I can try to dig into the freaking bugzilla, but that is not how bug reports normally work.

> > I'm inclined to just revert that commit.
> >
> > Please don't. This area is a complete mess (see more below), and we
> > need to make gradual steps to sort this crap out. And this was one
> > important step.
> >
>
> Christoph, it's broken. "An important step" is completely irrelevant if it
> causes regressions.
>
> It's that simple. It needs to get reverted or fixed, and you can't just
> ignore the regression and continue to ask for information that was already
> there before you asked the first time.
>
> If you're not taking this regression seriously, then reverting is
> happening today.

I _am_ taking this seriously. But if I sent a mail on Saturday and the first answer that actually comes back to me is you complaining on Tuesday there isn't much I can do. And as I said just reverting this commit will break things left right and center as lots of incremental changes depend on it. So I think we need to track this down and fix it for real, and a submitter that actually answers email would really help with that.

>
> Linus
>
>
---end quoted text---

(In reply to Christoph Hellwig from comment #33)
> On Tue, Mar 10, 2020 at 09:56:38AM -0700, Linus Torvalds wrote:
> > On Tue, Mar 10, 2020, 09:23 Christoph Hellwig <hch@lst.de> wrote:
> >
> > > On Tue, Mar 10, 2020 at 09:01:57AM -0700, Linus Torvalds wrote:
> > > > On Tue, Mar 10, 2020 at 8:08 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
> > > > >
> > > > > Any updates on this one? All the requested information has been provided and I
> > > > > can give you SSH access to my laptop if necessary.
> > >
> > > Whoever sent that mail should reply to my last request..
> > >
> >
> > They did. Days ago. It's all there in the bugzilla.
>
> I never got it. I've also not seen any report on any list. I can
> try to dig into the freaking bugzilla, but that is not how bug reports
> normally work.

I'm deeply sorry but Hans de Goede sent my hardware config to you at 2020-03-07 14:27:30 UTC right after you asked for it. Anyways, let's not dig deeper into this - makes no sense whatsoever. Again my platform is x86-64.

> > > I'm inclined to just revert that commit.
> > >
> > > Please don't. This area is a complete mess (see more below), and we
> > > need to make gradual steps to sort this crap out. And this was one
> > > important step.
> > >
> >
> > Christoph, it's broken. "An important step" is completely irrelevant if it
> > causes regressions.
> >
> > It's that simple. It needs to get reverted or fixed, and you can't just
> > ignore the regression and continue to ask for information that was already
> > there before you asked the first time.
> >
> > If you're not taking this regression seriously, then reverting is
> > happening today.
>
> I _am_ taking this seriously. But if I sent a mail on Saturday and the
> first answer that actually comes back to me is you complaining on Tuesday
> there isn't much I can do. And as I said just reverting this commit
> will break things left right and center as lots of incremental changes
> depend on it. So I think we need to track this down and fix it for real,
> and a submitter that actually answers email would really help with
> that.
>

I'm on it and I'm ready to test any patches if you have them. I guess the patch itself doesn't really break things but some drivers which depend on it do and these drivers get enabled very early on boot since I'm unable to get any output from the kernel at all even with

efi=debug earlyprintk=efi,keep

options.

Created attachment 287857 [details]
attachment-24927-0.html

On Tue, Mar 10, 2020, 10:23 <bugzilla-daemon@bugzilla.kernel.org> wrote:

> >
> > I _am_ taking this seriously. But if I sent a mail on Saturday and the
> > first answer that actually comes back to me is you complaining on Tuesday
> > there isn't much I can do.

You could just have read the email you got on Saturday.

Seriously - it had all the information you asked for, and more. You just didn't bother to look.

Then people followed up today, because nothing was happening. You had asked for stuff that was right there in the bugzilla.

Do you really wonder that I then think you're not taking this seriously?

Linus

(In reply to Artem S. Tashkinov from comment #34)
>
> I'm on it and I'm ready to test any patches if you have them. I guess the
> patch itself doesn't really break things but some drivers which depend on it
> do and these drivers get enabled very early on boot since I'm unable to get
> any output from the kernel at all even with
>
> efi=debug earlyprintk=efi,keep
>
> options.

Not sure if this will help produce some output, but note that earlyprintk=efi was removed in kernel v5.1. The equivalent to earlyprintk=efi,keep is now "earlycon=efifb keep_bootcon".

Created attachment 287859 [details]
clean up platform device dma_mask handling
I don't think this should make any real difference, but at least I personally find it easier to see what the dma_mask logic is, and there's more of a point to the new platform-device dma mask field.
This basically does three small changes:
(1) rename the field from "dma_mask" to "platform_dma_mask".
(2) use the new field instead of the dynamic allocation for the platform_device_register_full() use case
(3) every initialization of the field is matched with "use this field by making the device dma_mask pointer point to it".
Honestly, (1) is entirely syntactic, and comes from the field not having been greppable at all.
(2) shouldn't matter, but avoids a pointless allocation of a single word when we have this field.
And (3) shouldn't matter either, but makes the logic much more obvious: instead of randomly initializing the field even when it doesn't get used, we now clearly link initialization and use of the field for the actual dma_mask pointer.
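The three changes listed above can be illustrated with a small user-space model. This is a sketch under assumptions, not the attached patch: the structs are cut down to the fields discussed here, `DMA_BIT_MASK` is redefined locally after the kernel macro, and `apply_pdevinfo_dma_mask()` is a hypothetical helper name standing in for the relevant branch of platform_device_register_full().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel macro of the same name. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

struct device {
	uint64_t *dma_mask;
};

struct platform_device {
	uint64_t platform_dma_mask;	/* (1) renamed so it can be grepped for */
	struct device dev;
};

/* Cut-down stand-in for struct platform_device_info. */
struct platform_device_info {
	uint64_t dma_mask;
};

/* (3) the field is only initialized when the pointer is made to use it. */
static void setup_pdev_dma_masks(struct platform_device *pdev)
{
	assert(pdev != NULL);

	if (!pdev->dev.dma_mask) {
		pdev->platform_dma_mask = DMA_BIT_MASK(32);
		pdev->dev.dma_mask = &pdev->platform_dma_mask;
	}
}

/*
 * (2) hypothetical helper: reuse the embedded field instead of
 * allocating a separate word for the mask.
 */
static void apply_pdevinfo_dma_mask(struct platform_device *pdev,
				    const struct platform_device_info *info)
{
	if (info->dma_mask) {
		pdev->platform_dma_mask = info->dma_mask;
		pdev->dev.dma_mask = &pdev->platform_dma_mask;
	}
}
```

In this model there is no separately allocated mask at all, so there is nothing to leak or to free later: the device's mask pointer either refers to the embedded field or to storage the caller set up beforehand.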
On Tue, Mar 10, 2020 at 10:05 AM Christoph Hellwig <hch@lst.de> wrote:
>
> > They did. Days ago. It's all there in the bugzilla.
>
> I never got it.

The email I added you to the cc on had a very clear link to the bugzilla, which had all the information. So all you needed to do was to look at the entry that was pointed at.

And no, you don't seem to get the actual emails that bugzilla sends out, despite bugzilla showing you as being part of the cc list.

That is probably because you need to do that "add to the cc" yourself, so that people can't use bugzilla as a way to spam random people.

Anyway, I've uploaded a patch for Artem to try to the bugzilla. I seriously doubt it makes any difference at all. But I don't see how the dma_mask change would make any difference in the first place, so at least clarifying the code and the logic of when that dma_mask field is initialized shouldn't hurt.

I do note that before that commit cdfee5623290 ("driver core: initialize a default DMA mask for platform device"), only powerpc and m68k had that "arch_setup_pdev_archdata()" function that does that "default 32-bit dma mask".

On x86, we have mainly ACPI, which will do that

        if (acpi_dma_supported(adev))
                pdevinfo.dma_mask = DMA_BIT_MASK(32);
        else
                pdevinfo.dma_mask = 0;

        pdev = platform_device_register_full(&pdevinfo);

which basically should be a no-op - 0 means that platform_device_register_full() doesn't do anything and we use the default value, and DMA_BIT_MASK(32) _is_ that default value.

Did x86 perhaps used to have some "no dma_mask means that it's the full 64 bits?" Because that's one of the effects of that commit that Artem bisected things to: it will now always have a dma_mask pointer, and it will default to that 32-bit value if it didn't have one before.

Linus

Artem,

On Tue, Mar 10, 2020 at 9:01 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I do have one question: the minimal config that works for you, and
> it's just the Fedora one that does not, could you try to see what
> modules get loaded with the Fedora one, that you might be lacking?

In fact, maybe the easiest thing to do would be to use "config-bisect.pl" to figure out what config option it is that breaks things for you.

See

        tools/testing/ktest/config-bisect.pl

which has a "this is how to use it" description in the comments at the top...

Linus

(In reply to Linus Torvalds from comment #38)
> On Tue, Mar 10, 2020 at 10:05 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > > They did. Days ago. It's all there in the bugzilla.
> >
> > I never got it.
>
> The email I added you to the cc on had a very clear link to the
> bugzilla, which had all the information. So all you needed to do was
> to look at entry that was pointed at.
>
> And no, you don't seem to get the actual emails that bugzilla sends
> out, despite bugzilla showing you as being part of the cc list.
>
> That is probably because you need to do that "add to the cc" yourself,
> so that people can't use bugzilla as a way to spam random people.
>
> Anyway, I've uploaded a patch for Artem to try to the bugzilla. I
> seriously doubt it makes any difference at all. But I don't see how
> the dma_mask change would make any difference in the first place, so
> at least clarifying the code and the logic of when that dma_mask field
> is initialized shouldn't hurt.
>
> I do note that before that commit cdfee5623290 ("driver core:
> initialize a default DMA mask for platform device"), only powerpc and
> m68k had that "arch_setup_pdev_archdata()" function that does that
> "default 32-bit dma mask".
>
> On x86, we have mainly ACPI, which will do that
>
>         if (acpi_dma_supported(adev))
>                 pdevinfo.dma_mask = DMA_BIT_MASK(32);
>         else
>                 pdevinfo.dma_mask = 0;
>
>         pdev = platform_device_register_full(&pdevinfo);
>
> which basically should be a no-op - 0 means that
> platform_device_register_full() doesn't do anything and we use the
> default value, and DMA_BIT_MASK(32) _is_ that default value.
>
> Did x86 perhaps used to have some "no dma_mask means that it's the
> full 64 bits?" Because that's one of the effects of that commit that
> Artem bisected things to: it will now always have a dma_mask pointer,
> and it will default to that 32-bit value if it didn't have one before.
>
> Linus

I think on x86, acpi_dma_supported(adev) is equivalent to adev != NULL, so the problem shouldn't come from here?

(In reply to Linus Torvalds from comment #39)
>
> In fact, maybe the easiest thing to do would be to use
> "config-bisect.pl" to figure out what config option it is that breaks
> things for you.
>
> See
>
>         tools/testing/ktest/config-bisect.pl
>
> which has a "this is how to use it" description in the comments at the top...
>

This will take some time, please. I now have two patches to test and then I will try to figure out which config option makes the kernel unbootable.

(In reply to Arvind Sankar from comment #40)
>
> I think on x86, acpi_dma_supported(adev) is equivalent to adev != NULL, so
> the problem shouldn't come from here?

I doubt it's an ACPI device anyway, since I don't see why one config would work for Artem (his small one), but not the F31 config.

Of course, it's possible that the F31 config ends up having some debug option that triggers early during boot and then that kills the boot.

It's also possible that Artem's small config doesn't have some driver enabled. It looks like a very common hw setup, though. The only thing that isn't in a ton of other machines is that

   Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)

thing, which is handled by that MISC_RTSX_PCI driver.

Maybe Artem doesn't normally use it, and his working config doesn't have any of that support in it (including the MFD_CORE etc that it selects).

The MFD files _are_ some of the files that get re-built with that commit reverted. So if there is some MFD thing going on..

Hmm. The MFD code does do things like this:

        pdev = platform_device_alloc(cell->name, platform_id);
        if (!pdev)
                goto fail_alloc;
        ...
        pdev->dev.dma_mask = parent->dma_mask;

so it really does play games with that dev.dma_mask pointer.

Artem? Does your minimal config that worked when you built it yourself (as opposed to the Fedora build) perhaps not have that MFD code enabled?

(In reply to Artem S. Tashkinov from comment #41)
> This will take some time, please. I now have two patches to test and then I
> will try to figure out which config option makes the kernel unbootable.

I'm starting to suspect it's some interaction with that MFD dma_mask parent assignment.

In comment #18 you say

  "The issue is, with my custom very simplistic .config the kernel boots just fine. It's the options which Fedora enables causes it not to boot. And their config includes pretty much everything on Earth."

and I'm now starting to think that your custom very simplistic .config that works just avoids the whole issue with MFD.

Can you attach your minimal config that worked for you?

Is my theory perhaps believable, or complete garbage?

If I'm right, then if you enable the MISC_RTSX_PCI driver in your minimal config, then it will perhaps show the problem too...

Because rtsx_pcr.c does do that mfd_add_devices(&pcidev->dev) -> mfd_add_device(), which takes the DMA mask of the parent (PCI) device, and uses it for the newly allocated platform device.

I'm not seeing why that wouldn't work, but that's the one piece of hardware you have that I can see that isn't _hugely_ common, I think.

On Tue, Mar 10, 2020 at 01:41:34PM -0700, Linus Torvalds wrote:
> Did x86 perhaps used to have some "no dma_mask means that it's the
> full 64 bits?" Because that's one of the effects of that commit that
> Artem bisected things to: it will now always have a dma_mask pointer,
> and it will default to that 32-bit value if it didn't have one before.
So looking at the exact history - x86 already uses dma-direct in 5.4,
which failed DMA map requests without a dma mask, it was just arm that
still allowed that and relied on it. Same for intel-iommu.
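The difference being described, between tolerating a missing mask and failing the mapping, can be shown with a toy model. This is purely illustrative and assumes nothing about the real dma-direct code: `can_map_lenient()` and `can_map_strict()` are hypothetical names invented for this sketch, and `struct device` is reduced to the one pointer discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel macro of the same name. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

struct device {
	uint64_t *dma_mask;	/* NULL means the driver never set one up */
};

/* Hypothetical lenient policy: a missing mask is treated as 32 bits. */
static bool can_map_lenient(const struct device *dev, uint64_t addr)
{
	uint64_t mask;

	assert(dev != NULL);
	mask = dev->dma_mask ? *dev->dma_mask : DMA_BIT_MASK(32);
	return addr <= mask;
}

/* Hypothetical strict policy: a missing mask fails every mapping. */
static bool can_map_strict(const struct device *dev, uint64_t addr)
{
	assert(dev != NULL);
	if (!dev->dma_mask)
		return false;
	return addr <= *dev->dma_mask;
}
```

Under the strict policy a platform device whose mask pointer was never set cannot map anything, which is the situation the default 32-bit mask in the bisected commit was meant to avoid.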
(In reply to Linus Torvalds from comment #43)
>
> and I'm now starting to think that your custom very simplistic .config that
> works just avoids the whole issue with MFD.
>
> Can you attach your minimal config that worked for you?
>
> Is my theory perhaps believable, or complete garbage?
>
> If I'm right, then if you enable the MISC_RTSX_PCI driver in your minimal
> config, then it will perhaps show the problem too...
>
> Because rtsx_pcr.c does do that mfd_add_devices(&pcidev->dev) ->
> mfd_add_device(), which takes the DMA mask of the parent (PCI) device, and
> uses it for the newly allocated platform device.
>
> I'm not seeing why that wouldn't work, but that's the one piece of hardware
> you have that I can see that isn't _hugely_ common, I think.

The config can be downloaded from here: https://bugzilla.redhat.com/attachment.cgi?id=1654871

$ grep -i MFD .config | grep -v "not set"
CONFIG_MFD_CORE=m
CONFIG_MEMFD_CREATE=y

Created attachment 287863 [details]
kernel panic on boot, kernel-5.5.8-200.fc31
Here's something totally unexpected.
The kernel (any kernel for that matter) produces output with, hooray, these options (should have tried them earlier):
efi=debug earlycon=efifb keep_bootcon
I've made a photo of the screen.
The patches by Linus and Christoph haven't changed anything - the kernel continues to die on boot.

Kernel 5.4.24 _without_ any patches produces a much longer boot trace which unfortunately doesn't fit on one screen, so I had to make a video of it. I can try to write it all down if anyone's interested.

RIP: 0010:kmem_cache_alloc_trace
...
? acpi_ds_create_walk_state
acpi_ds_create_walk_state
acpi_ds_call_control_method
acpi_ds_parse_aml
acpi_ps_execute_method
acpi_ns_evaluate
acpi_ut_evaluate_object

Created attachment 287875 [details]
fix dma_mask handling in platform_device_register_full.patch

Fixed by:

[PATCH] device core: fix dma_mask handling in platform_device_register_full
https://lkml.org/lkml/2020/3/11/717 |