Bug 194521

Summary: AMD-Vi: Completion-Wait loop timed out, IO_PAGE_FAULT and crash
Product: Other Reporter: Vaclav Ovsik (vaclav.ovsik)
Component: Other Assignee: other_other
Status: NEW ---    
Severity: normal CC: alexdeucher, fin4478, mrromanze, nickel
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.9.6 Subsystem:
Regression: No Bisected commit-id:
Attachments: crash logged using netconsole
nomodeset - no crash
iommu=off - no crash
lspci -vvv
/proc/cpuinfo
crash logged using netconsole
logged using netconsole - 4.10.0rc6 with iommu=off, gpu turned off using acpi_call
logged using netconsole - 4.10.0rc6, gpu turned off using acpi_call -> crash
drm-next-4.11-wip: boot into emergency mode - crash after modprobe amdgpu
HP laptop dmesg output with plenty of errors while LAN cable present
HP laptop dmesg output while IOMMU turned on, but no LAN cable
HP laptop HW config via dmidecode
lspci -v for HP laptop 17-ak041ur
lspci for HP laptop 17-ak041ur
dmesg for HP laptop 17-ak041ur with amdgpu.runpm=0 kernel parameter

Description Vaclav Ovsik 2017-02-08 20:32:58 UTC
Created attachment 254611 [details]
crash logged using netconsole

I bought my daughter a notebook HP 15-ba062nc
(http://support.hp.com/us-en/product/HP-15-ba000-Notebook-PC-series/10862317/model/11792430).
Installed is Debian Stretch/Sid with kernel 4.9.6.

Successful boot without crash is possible with
    - disabled amdgpu (e.g. old nomodeset)
    - or disabled iommu (iommu=off)
otherwise the kernel crashes and the file-system is corrupted.

iommu=off is the much better option for now, because the notebook then runs
in an energy-efficient manner - the fan is quiet or stopped.
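
For anyone reproducing this, it helps to confirm that the workaround parameter really reached the running kernel. The following is only a minimal sketch: the helper name cmdline_has_param is made up for illustration, and the GRUB steps in the comments assume a Debian-style GRUB 2 setup, which the report does not state.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel parameter $1 appears as a whole
# word in the command-line file $2 (defaults to /proc/cmdline).
cmdline_has_param() {
    param=$1
    file=${2:-/proc/cmdline}
    [ -r "$file" ] || return 2
    # One parameter per line, then an exact fixed-string line match.
    tr -s ' \t' '\n' < "$file" | grep -qxF -- "$param"
}

# Assumption (not from the report): to make the workaround persistent on a
# Debian-style GRUB 2 system, add the parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=off"
# then run update-grub as root and reboot.

if cmdline_has_param iommu=off; then
    echo "iommu=off is active"
else
    echo "iommu=off is not on the running kernel's command line"
fi
```

The same check works for the other parameters mentioned in this report by passing a different first argument.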

Attached are kernel messages captured using netconsole.
Comment 1 Vaclav Ovsik 2017-02-08 20:35:58 UTC
Created attachment 254621 [details]
nomodeset - no crash
Comment 2 Vaclav Ovsik 2017-02-08 20:37:02 UTC
Created attachment 254631 [details]
iommu=off - no crash
Comment 3 Vaclav Ovsik 2017-02-08 20:48:42 UTC
Created attachment 254641 [details]
lspci -vvv
Comment 4 Vaclav Ovsik 2017-02-08 20:52:17 UTC
Created attachment 254651 [details]
/proc/cpuinfo
Comment 5 Vaclav Ovsik 2017-02-08 21:09:53 UTC
Created attachment 254661 [details]
crash logged using netconsole
Comment 6 fin4478 2017-02-09 12:07:28 UTC
Do you have the amdgpu firmware installed?

When you create bugs against amdgpu driver, use the latest kernel and mesa code:
https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.11-wip
https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers

Problems might be fixed in the latest code.

Latest polaris firmware:
https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/


How to create a custom kernel, see:
https://bugzilla.kernel.org/show_bug.cgi?id=193651
Comment 7 Vaclav Ovsik 2017-02-09 23:34:15 UTC
I doubt the problem is in the amdgpu driver. What about a bug in amd_iommu?
I think so because I tried to switch off the external GPU using the acpi_call module.
The following command was successful:

    echo '\_SB_.PCI0.VGA.PX02' > /proc/acpi/call

while running the kernel with no KMS (no amdgpu). The fan really went silent
after this, but the kernel crashed within several seconds, in a similar way
as with amdgpu and an active iommu.

The filesystem is corrupted after every crash. I'm afraid that the storage
controller goes through the iommu too and the crash causes some random writes
to disk :(. But maybe I am wrong, and this ACPI call is actually illegal
and amdgpu does something wrong regarding the iommu too. Nevertheless, amdgpu
works fine with iommu=off. Maybe the problem is some buggy BIOS/firmware
from the vendor.
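
Whether the storage controller sits behind the IOMMU can be checked from sysfs. A minimal sketch, assuming sysfs is mounted at /sys and the kernel exposes groups under /sys/kernel/iommu_groups; the function name list_iommu_groups is made up for illustration:

```shell
#!/bin/sh
# List every device that the kernel has placed in an IOMMU group.
# The root directory can be overridden, which is handy for testing.
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # no groups: IOMMU off or unsupported
        group=${dev#"$root"/}          # strip the root prefix
        group=${group%%/*}             # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

list_iommu_groups
```

Each printed address can be passed to lspci -s to identify the device. Seeing the storage controller in the list would be consistent with the worry above, but it would not by itself prove that the corruption happens that way.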

I will try a newer kernel.
Comment 8 Vaclav Ovsik 2017-02-10 18:36:11 UTC
I tried kernel 4.10.0-rc6-amd64 from the Debian experimental archive and the result is very similar. To minimize harm to the filesystem I booted into emergency mode with a read-only file-system and tried to switch off the GPU using the ACPI call. There was a warning during the call, but something happened :)

ACPI Warning: \_SB.PCI0.VGA.PX02: Insufficient arguments - Caller passed 0, method requires 1 (20160930/nsarguments-256)

I did this twice - one time with and one time without iommu=off.

I'm attaching netconsole logs...

Is this proof that the problem is in amd_iommu.c?
Comment 9 Vaclav Ovsik 2017-02-10 18:37:59 UTC
Created attachment 254695 [details]
logged using netconsole - 4.10.0rc6 with  iommu=off, gpu turned off using acpi_call
Comment 10 Vaclav Ovsik 2017-02-10 18:39:04 UTC
Created attachment 254697 [details]
logged using netconsole - 4.10.0rc6, gpu turned off using acpi_call  -> crash
Comment 11 fin4478 2017-02-13 03:25:42 UTC
Stock kernels have very little amdgpu driver code; see kernel.org and click diff. You have a very new AMD GPU, so use the command:
git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux

The kernel configuration files of the official Debian kernels are available in /boot, named after the kernel release. Copy the file to the linux directory as .config. Connect all your devices and run the command: make localmodconfig. You can also use the command make defconfig to create an initial .config file.

Use the command: make xconfig and check that you have enabled Reroute Broken IRQ, Virtualization KVM and the 300Hz CPU timer. I also disabled Swap, Kernel Debug, CPU Freq scaling, CPU handling in ACPI, and used the BIOS to control CPU and devices. Under drivers->graphics->amdgpu, enable CIK support for a GCN 1.1 GPU and SI support for a GCN 1.0 GPU.

Create debian kernel package:
export CONCURRENCY_LEVEL=4
fakeroot make-kpkg --initrd kernel_image

Install the kernel package with Gdebi. To make a custom kernel boot, add a line to /etc/initramfs-tools/modules:
unix
And run: sudo update-initramfs -u
Reboot.
Comment 12 Vaclav Ovsik 2017-02-13 20:32:12 UTC
Created attachment 254739 [details]
drm-next-4.11-wip:  boot into emergency mode - crash after modprobe amdgpu
Comment 13 fin4478 2017-02-14 07:56:09 UTC
Comment on attachment 254739 [details]
drm-next-4.11-wip:  boot into emergency mode - crash after modprobe amdgpu

You have Carrizo and Topaz GPUs. Can you disable one of them in the BIOS? The Linux driver does not support AMD Dual Graphics to speed up fps. In the kernel configuration you can try to disable iommu and vgaswitcheroo. From the kernel command line you can blacklist PCI devices.
Comment 14 Vaclav Ovsik 2017-02-15 20:41:34 UTC
The BIOS is really primitive, there is nearly nothing regarding HW that can be changed :-/. I can continue to use iommu=off, it seems to be fine.
Thanks
Comment 15 mrromanze 2017-06-30 15:59:08 UTC
Same issue persists on HP 15-ba028ur using latest mainline kernel. (4.12-rc7)

iommu=off  
and
amd_iommu=fullflush
make boot possible.
Comment 16 mrromanze 2017-06-30 16:38:48 UTC
Patch: https://patchwork.freedesktop.org/patch/157327/
Comment 17 Nikolai 2019-03-21 08:59:30 UTC
Hi,
  I still encounter a same issue concerned to ext4 fs corruption using linux kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).

  Laptop configs are A6-9220 radeon r4 5 compute cores 2c+3g, 4G RAM, 200GB Intel SSD (1st laptop) or 500GB Toshiba HDD (2nd laptop)
I'm using OS ALT linux distribution (www.altlinux.org)
  Booting and installing the system from the LiveCD works flawlessly if the LAN cable is NOT attached. dmesg shows plenty of "AMD-Vi: Completion-Wait loop timed out" errors.
  Connecting the LAN cable during LiveCD boot results in a failure to reach the graphical target or in a kernel panic.
  After the first reboot the system won't boot anyway, and ext4 filesystem corruption occurs.
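
To compare runs such as "LAN cable attached" and "no LAN cable", the messages named in this report can be counted in saved kernel logs. A minimal sketch, assuming the log was saved to a file beforehand (for example with dmesg > dmesg.txt); the function name count_amdvi_errors is made up for illustration:

```shell
#!/bin/sh
# Count the two AMD-Vi messages named in this report in a saved kernel log.
count_amdvi_errors() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| :".
    timeouts=$(grep -c 'AMD-Vi: Completion-Wait loop timed out' "$log") || :
    faults=$(grep -c 'IO_PAGE_FAULT' "$log") || :
    printf 'completion-wait timeouts: %s\nio page faults: %s\n' \
        "$timeouts" "$faults"
}

# Usage: sh count_amdvi.sh dmesg.txt   (script name is illustrative)
if [ $# -gt 0 ]; then
    count_amdvi_errors "$1"
fi
```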

 Investigation revealed that switching the IOMMU off (amd_iommu=off and/or iommu=soft) solves the issue. "amd_iommu=fullflush" doesn't work for me.

I've found several patches addressing (amd_)iommu issues in the linux-kernel mailing list archive, but they are either already applied to the kernels mentioned above or applying them doesn't solve the issue for me.
The patch mentioned above (https://patchwork.freedesktop.org/patch/157327/) no longer applies to the mentioned kernel versions.
  Therefore my question is: am I missing some patch that already solved the issue, or should I provide a more specific bug report?
Comment 18 Nikolai 2019-03-21 13:51:35 UTC
Created attachment 281945 [details]
HP laptop dmesg output with plenty of errors while LAN cable present
Comment 19 Nikolai 2019-03-21 13:53:07 UTC
Created attachment 281947 [details]
HP laptop dmesg output while IOMMU turned on, but no LAN cable
Comment 20 Nikolai 2019-03-21 13:53:40 UTC
Created attachment 281949 [details]
HP laptop HW config via dmidecode
Comment 21 Nikolai 2019-03-26 13:08:31 UTC
(In reply to Nikolai from comment #17)

>    using linux
> kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).

I'm sorry for the typo. It should read 4.19.16... 4.20.17
Comment 22 Nikolai 2019-03-26 14:23:13 UTC
Created attachment 282035 [details]
lspci -v for HP laptop 17-ak041ur
Comment 23 Nikolai 2019-03-26 14:24:06 UTC
Created attachment 282037 [details]
lspci for HP laptop 17-ak041ur
Comment 24 Nikolai 2019-04-08 12:09:28 UTC
(In reply to Nikolai from comment #17)
> Hi,
>   I still encounter a same issue concerned to ext4 fs corruption using linux
> kernels 4.19.16... 4.20.27 on HP laptop 17-ak041ur (2 pcs on hand).
> 
>   Laptop configs are A6-9220 radeon r4 5 compute cores 2c+3g, 4G RAM, 200GB
> Intel SSD (1st laptop) or 500GB Toshiba HDD (2nd laptop)

This patch works for me:

https://lkml.org/lkml/2019/4/8/331

One can also use 'pci=noats' as a temporary countermeasure.

Thanks to Joerg Roedel <jroedel@suse.de> who guided me to a solution.
Comment 25 Alex Deucher 2019-04-10 15:50:08 UTC
Does booting with amdgpu.runpm=0 on the kernel command line help?
Comment 27 Nikolai 2019-04-11 09:41:37 UTC
(In reply to Alex Deucher from comment #25)
> Does booting with amdgpu.runpm=0 on the kernel command line help?

Yes, it does. The system is able to boot and no filesystem corruption occurs either.
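
Whether the parameter took effect can be read back from sysfs after boot. A minimal sketch, assuming the amdgpu module exposes runpm under /sys/module/amdgpu/parameters, as modules with readable parameters normally do; the function name read_module_param is made up for illustration:

```shell
#!/bin/sh
# Print the current value of a module parameter, or "unavailable" if the
# module is not loaded or the parameter is not exposed.
read_module_param() {
    module=$1
    param=$2
    root=${3:-/sys/module}             # overridable for testing
    file=$root/$module/parameters/$param
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"
        return 1
    fi
}

# With amdgpu.runpm=0 on the kernel command line this should print 0.
echo "amdgpu.runpm = $(read_module_param amdgpu runpm)"
```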

So which solution is preferable in such a case then?
Comment 28 Nikolai 2019-04-11 09:43:57 UTC
Created attachment 282309 [details]
dmesg for HP laptop 17-ak041ur with amdgpu.runpm=0 kernel parameter