Bug 194521
Description
Vaclav Ovsik
2017-02-08 20:32:58 UTC
Created attachment 254621 [details]
nomodeset - no crash
Created attachment 254631 [details]
iommu=off - no crash
Created attachment 254641 [details]
lspci -vvv
Created attachment 254651 [details]
/proc/cpuinfo
Created attachment 254661 [details]
crash logged using netconsole
Do you have the amdgpu firmware installed? When you create bugs against the amdgpu driver, use the latest kernel and mesa code:
https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.11-wip
https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers
Problems might be fixed in the latest code.
Latest Polaris firmware: https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/
For how to build a custom kernel, see: https://bugzilla.kernel.org/show_bug.cgi?id=193651

I doubt the problem is in the amdgpu driver. What about a bug in amd_iommu? I think so because I tried to switch off the external GPU using the acpi_call module. The following command was successful:
echo '\_SB_.PCI0.VGA.PX02' > /proc/acpi/call
while running the kernel without KMS (no amdgpu). The fan really went silent after this, but the kernel crashed within several seconds in a similar way as with amdgpu and the IOMMU active. The filesystem is corrupted after every crash. I'm afraid the storage controller goes through the IOMMU too and the crash causes some random writes to disk :(. But maybe I am wrong, this ACPI call is in reality illegal, and amdgpu does something wrong regarding the IOMMU too. Nevertheless, amdgpu works fine with iommu=off. Maybe the problem is some buggy BIOS/firmware from the vendor.

I will try a newer kernel.

I tried kernel 4.10.0-rc6-amd64 from the Debian experimental archive and the result is very similar. To minimize harm to the filesystem, I booted into emergency mode with a read-only filesystem and tried to switch off the GPU using the ACPI call. There is a warning during the call, but something happened :)
ACPI Warning: \_SB.PCI0.VGA.PX02: Insufficient arguments - Caller passed 0, method requires 1 (20160930/nsarguments-256)
I did this twice, once with and once without iommu=off. I'm attaching the netconsole logs... Is this proof that the problem is in amd_iommu.c?
Created attachment 254695 [details]
logged using netconsole - 4.10.0rc6 with iommu=off, gpu turned off using acpi_call
Created attachment 254697 [details]
logged using netconsole - 4.10.0rc6, gpu turned off using acpi_call -> crash
Stock kernels have very little amdgpu driver code; see kernel.org and click diff. You have a very new AMD GPU, so use the command:
git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux
The kernel configuration files of the official Debian kernels are available in /boot, named after the kernel release. Copy that file into the linux directory as .config. Connect all your devices and run the command:
make localmodconfig
You can also use make defconfig to create an initial .config file. Use the command:
make xconfig
and check that you have enabled: Reroute Broken IRQ, Virtualization KVM and the 300Hz CPU timer. I also disabled Swap, Kernel Debug, CPU Freq scaling and CPU handling in ACPI, and used the BIOS to control the CPU and devices. In Drivers -> Graphics -> amdgpu, enable CIK support for a GCN 1.1 GPU and SI support for a GCN 1.0 GPU. Create a Debian kernel package:
export CONCURRENCY_LEVEL=4
fakeroot make-kpkg --initrd kernel_image
Install the kernel package with Gdebi. To make the custom kernel boot, add this line to /etc/initramfs-tools/modules:
unix
and run:
sudo update-initramfs -u
Reboot.
Created attachment 254739 [details]
drm-next-4.11-wip: boot into emergency mode - crash after modprobe amdgpu
Comment on attachment 254739 [details]
drm-next-4.11-wip: boot into emergency mode - crash after modprobe amdgpu
You have Carrizo and Topaz GPUs. Can you disable one of them in the BIOS? The Linux driver does not support AMD dual graphics to speed up fps. In the kernel configuration you can try to disable the IOMMU and vga_switcheroo. From the kernel command line you can blacklist PCI devices.
The BIOS is really primitive; there is nearly nothing regarding the HW that can be changed :-/. I can continue to use iommu=off, it seems to be fine. Thanks

The same issue persists on an HP 15-ba028ur using the latest mainline kernel (4.12-rc7). iommu=off and amd_iommu=fullflush make booting possible.

Hi,
I still encounter the same issue with ext4 filesystem corruption using Linux kernels 4.19.16 ... 4.20.27 on the HP laptop 17-ak041ur (2 units on hand).
The laptop configuration is: A6-9220 with Radeon R4 (5 compute cores, 2C+3G), 4 GB RAM, 200 GB Intel SSD (1st laptop) or 500 GB Toshiba HDD (2nd laptop).
I'm using the ALT Linux distribution (www.altlinux.org).
Booting and installing the system from the LiveCD works flawlessly if the LAN cable is NOT attached. dmesg shows plenty of "AMD-Vi: Completion-Wait loop timed out" errors. Connecting the LAN cable during LiveCD boot results in a graphical target boot failure or a kernel panic. After the first reboot the system won't boot regardless, and ext4 filesystem corruption occurs.
Investigation revealed that switching the IOMMU off (amd_iommu=off and/or iommu=soft) solves the issue. "amd_iommu=fullflush" doesn't work for me.
I've found several patches addressing (amd_)iommu issues in the linux-kernel mailing list archive, but they are either already applied to the kernels mentioned above or applying them doesn't solve the issue for me. The patch mentioned above (https://patchwork.freedesktop.org/patch/157327/) no longer applies to these kernel versions.
Therefore my question is: am I missing some patch that already solved the issue, or should I provide a more specific bug report?
Created attachment 281945 [details]
HP laptop dmesg output with plenty of errors while LAN cable present
Created attachment 281947 [details]
HP laptop dmesg output while IOMMU turned on, but no LAN cable
Created attachment 281949 [details]
HP laptop HW config via dmidecode
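To compare boots with and without the LAN cable (or with different IOMMU parameters), it can help to count how often the error quoted above shows up in a saved dmesg log. The following is a minimal Python sketch and not part of the original report: the script name check_amdvi.py and the function count_iommu_timeouts are hypothetical, and it assumes the message appears verbatim as "AMD-Vi: Completion-Wait loop timed out", as reported here; other kernel versions may word it differently.

```python
import sys

# Message text as reported in this thread; assumed to appear verbatim in dmesg.
TIMEOUT_MSG = "AMD-Vi: Completion-Wait loop timed out"


def count_iommu_timeouts(lines):
    """Return how many log lines contain the AMD-Vi completion-wait timeout message."""
    return sum(1 for line in lines if TIMEOUT_MSG in line)


def main(argv):
    # Expect exactly one argument: a saved log file, or "-" to read from stdin.
    if len(argv) != 2:
        print("usage: check_amdvi.py LOGFILE   (use '-' to read from stdin)")
        return
    if argv[1] == "-":
        count = count_iommu_timeouts(sys.stdin)
    else:
        with open(argv[1], encoding="utf-8", errors="replace") as log:
            count = count_iommu_timeouts(log)
    print(f"AMD-Vi completion-wait timeouts: {count}")


if __name__ == "__main__":
    main(sys.argv)
```

Usage example: dmesg | python3 check_amdvi.py -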
(In reply to Nikolai from comment #17)
> using linux
> kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).

Sorry for the typo. It should read 4.19.16 ... 4.20.17.
Created attachment 282035 [details]
lspci -v for HP laptop 17-ak041ur
Created attachment 282037 [details]
lspci for HP laptop 17-ak041ur
(In reply to Nikolai from comment #17)
> Hi,
> I still encounter a same issue concerned to ext4 fs corruption using linux
> kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).
>
> Laptop configs are A6-9220 radeon r4 5 compute cores 2c+3g, 4G RAM, 200GB
> Intel SSD (1st laptop) or 500GB Toshiba HDD (2nd laptop)

This patch works for me: https://lkml.org/lkml/2019/4/8/331
One can also use 'pci=noats' as a temporary countermeasure.
Thanks to Joerg Roedel <jroedel@suse.de> who guided me to a solution.

Does booting with amdgpu.runpm=0 on the kernel command line help?

(In reply to Alex Deucher from comment #25)
> Does booting with amdgpu.runpm=0 on the kernel command line help?

Yes, it does. The system is able to boot and no filesystem corruption occurs either. So which solution is preferable in this case?
Created attachment 282309 [details]
dmesg for HP laptop 17-ak041ur with amdgpu.runpm=0 kernel parameter
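Since this thread compares several boot-parameter workarounds (iommu=off, iommu=soft, amd_iommu=off, amd_iommu=fullflush, pci=noats, amdgpu.runpm=0), it may help to confirm which of them is actually active on a given boot before attaching logs. A minimal Python sketch, assuming /proc/cmdline holds whitespace-separated parameters without quoted values; the function name active_workarounds is hypothetical. Because pci= takes a comma-separated option list, every value is split on commas as a simplification.

```python
from pathlib import Path

# Workarounds discussed in this bug thread, as (parameter, value) pairs.
WORKAROUNDS = {
    ("iommu", "off"),
    ("iommu", "soft"),
    ("amd_iommu", "off"),
    ("amd_iommu", "fullflush"),
    ("pci", "noats"),
    ("amdgpu.runpm", "0"),
}


def active_workarounds(cmdline):
    """Return the sorted workarounds from WORKAROUNDS present in a kernel command line."""
    found = set()
    # Assumes whitespace-separated parameters with no quoted values.
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if not sep:
            continue  # flag-style parameter such as "ro" or "nomodeset"
        # pci= takes a comma-separated option list; every value is split the
        # same way here as a simplification.
        for option in value.split(","):
            if (name, option) in WORKAROUNDS:
                found.add(f"{name}={option}")
    return sorted(found)


def main():
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        print("cannot read /proc/cmdline (not running on Linux?)")
        return
    active = active_workarounds(cmdline)
    print("active workarounds:", ", ".join(active) if active else "none")


if __name__ == "__main__":
    main()
```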