Created attachment 294707 [details] dmesh showing the issue Most things are working despite issue report. I can't resume from sleep (i get a black screen) After some investigation i found that the issue appeared after this commit about amd/iommu in kernel: https://gitlab.com/linux-kernel/stable/-/commit/05a0542b456e135f362ba83a17ccff73bac0b92f What i don't know is if this commit is causing the issue and need to be reverted or only if it means that this driver needs a rewrite.
Detailed description of the issue. You can find the issue in attached file. The issue is happening at each boot. Everything is working despite the kernel panic except the fact that i can't resume from suspend. Issue started after kernel 5.7.XX Laptop: lenovo Yoga 530-14ARR (81H9) with amd ryzen 2500U and Vega 8 integrated with bios up-to-date. Equipped with MP2 I2C controller and driver i2c-amd-mp2 Deactivating i2c-amd-mp2 stop the issue but as expected touchpad and touchscreen stop working. Deactivating iommu also stop the issue but it may cause other issue on the laptop. So the issue seems to be linked to iommu and i2c-amd-mp2 driver. As i said above, i did some investigation and found out that the issue was linked to this commit: https://lore.kernel.org/linux-iommu/20200527115313.7426-10-joro@8bytes.org/ The issue doesn't happen when i build 5.8 kernel without this commit.
Hi, and sorry for not getting back to you on Github. What happens is that there should be "normal" IO page faults reported by the AMD iommu driver for Lenovo Yoga 530-14ARR users. dmesg from 5.7 kernels and older contain something like: [] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=0xffdefa80 flags=0x0050] [] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=0xffdefaa0 flags=0x0050] (7 more lines) [] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=0xffde9000 flags=0x0050] These page faults are expected, they've been there since the beginning for the Yoga 530-14ARR and aren't due to the I2C layer but to the Wacom touchscreen itself (more likely) or the wacom driver handling it (much less likely, IIRC), one HID report during initialization specifically. Since the errors seemed harmless, I didn't push the investigation further back then. But now since 5.8, amd_iommu_report_page_fault segfaults before reporting the IO page faults. From a quick look it seems that dev_iommu_get() wasn't called yet to initialize dev->iommu for the touchscreen device? Is that possible? Anyway it's clearly an iommu driver bug. As a temporary workaround for those who don't want to compile a kernel, if you mostly need the touchpad and can live without the touchscreen, blacklisting the wacom module makes suspend/resume work again.
Hi Elie, Thx for your answer it will help us a lot ;) i tried to investigate and solve it my self, but it seems that i'm not good enouth. So what do you recommand ? I can build kernel but how will it help ? Except by using older kernel (like 5.4 LTS kernel) Blacklisting the wacom module seems to be the best thing (and i still can activate it and deactivate it if needed) Anyway, if needed i'm also ok to help as much as i can
(In reply to remy.lampure from comment #3) > So what do you recommand ? > > I can build kernel but how will it help ? Except by using older kernel (like > 5.4 LTS kernel) Adding this null check to amd_iommu_report_page_fault() should prevent the issue : - if (pdev) + if (pdev && pdev->dev->iommu) dev_data = dev_iommu_priv_get(&pdev->dev); I don't know the iommu code enough to tell if this is a proper fix though.
Wacom driver maintainer here :) I don't have access to one of these devices so I unfortunately can't help really with debugging this issue or the IO_PAGE_FAULT messages from earlier kernels. I am, however, interested in understanding what might be going on at a bit more technical level. We've recently been getting some suspend/resume bug reports that I can't explain but which sound somewhat similar to this issue... @Elie: do you have a guess about "why" the IO_PAGE_FAULT messages would be generated and if they're related to this segfault at all? You said you expect them to be from the touchscreen itself; could you go into a little more detail? Is there a firmware issue that Lenovo or Wacom should be looking at? I can definitely bug the appropriate people if we know enough about what's wrong :)
Hi, The IO_PAGE_FAULT is triggered by a 72 bytes long HID feature report sent to the Wacom device. When I filtered this report in the I2C layer : @@ -193,11 +193,31 @@ static int i2c_amd_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs, int num) for (i = 0; i < num; i++) { pmsg = &msgs[i]; + + if (pmsg->len == 72) { + dev_err(&i2c_dev->pdev->dev, + "Buggy xfer # (size = %hu, flags = %hu, addr = 0x%08x)\n", + pmsg->len, pmsg->flags, pmsg->addr); + print_hex_dump(KERN_ERR, "Buffer:\n", DUMP_PREFIX_NONE, 16, 1, pmsg->buf, + pmsg->len, false); + dump_stack(); + + goto skipXfer; + } + err = i2c_amd_xfer_msg(i2c_dev, pmsg); if (err) break; } +skipXfer: amd_mp2_pm_runtime_put(i2c_dev->common.mp2_dev); return err ? err : num; } There was no more error, and at least the finger input still works. Styluses are also supported by the touchscreen, may they be the ones affected by blocking this report? (I don't own one so cannot check) Here's the backtrace and the contents of the report: [] i2c_amd_mp2 AMDI0011:00: Buggy xfer # (size = 72, flags = 0, addr = 0x0000000a) [] Buffer: 04 00 3b 03 05 00 42 00 0b 00 00 00 00 00 00 00 [] Buffer: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [] Buffer: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [] Buffer: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [] Buffer: 00 00 00 00 00 00 00 00 [] CPU: 5 PID: 65 Comm: kworker/5:1 Tainted: G OE 5.9.16-050916-generic #202012211331 [] Hardware name: LENOVO 81H9/LNVNB161216, BIOS 8MCN50WW 12/25/2018 [] Workqueue: events wacom_init_work [wacom] [] Call Trace: [] show_stack+0x52/0x58 [] dump_stack+0x70/0x8b [] i2c_amd_xfer.cold+0x52/0x59 [i2c_amd_mp2_plat] [] __i2c_transfer.part.0+0x6b/0x2b0 [] __i2c_transfer+0x24/0x80 [] i2c_transfer+0x52/0x110 [] __i2c_hid_command+0x102/0x2d0 [i2c_hid] [] ? __i2c_transfer.part.0+0x6b/0x2b0 [] ? i2c_adapter_unlock_bus+0x12/0x20 [] i2c_hid_set_or_send_report+0x11c/0x1d0 [i2c_hid] [] i2c_hid_output_raw_report.isra.0+0xac/0xe0 [i2c_hid] [] i2c_hid_raw_request+0x5e/0x190 [i2c_hid] [] ? hid_alloc_report_buf+0x24/0x30 [hid] [] _wacom_query_tablet_data.isra.0+0x137/0x2d0 [wacom] [] wacom_init_work+0x1a/0x30 [wacom] [] process_one_work+0x1e8/0x3b0 [] worker_thread+0x50/0x370 [] kthread+0x12f/0x150 [] ? process_one_work+0x3b0/0x3b0 [] ? __kthread_bind_mask+0x70/0x70 [] ret_from_fork+0x22/0x30 (repeated 5 times because there are 5 attempts by wacom_set_device_mode()) This is why I suspected that it was the touchscreen itself which was in some way accessing off-limit memory, but I don't really know more about the "how" or "why" than that.
BTW, in the DSDT here's the entry for the touchscreen: Device (I2CA) { Name (_HID, "AMDI0011") // _HID: Hardware ID Name (_UID, Zero) // _UID: Unique ID Name (_ADR, Zero) // _ADR: Address Name (_CRS, ResourceTemplate () // _CRS: Current Resource Settings { IRQ (Edge, ActiveHigh, Exclusive, ) {10} Memory32Fixed (ReadWrite, 0xFEDC2000, // Address Base 0x00001000, // Address Length ) }) Name (_DEP, Package (0x01) // _DEP: Dependencies { ^PCI0.GP17.MP2C }) Method (_STA, 0, NotSerialized) // _STA: Status { Return (0x0F) } Method (RSET, 0, NotSerialized) { SRAD (0x05, 0xC8) } } I could be looking at the wrong entirely, but is 0xFEDC2000 the memory region the touchscreen should be accessing instead?
Nevermind, disregard my previous message, this isn't the entry for the touchscreen but for one of the I2C adapters.
Thanks. The buffer that you dumped doesn't look like what I'd expect `_wacom_query_tablet_data` to create, but I'd have to poke around more with an I2C sensor. Either way, I think I see what you mean about the device itself causing the fault. The `flags=0x0050` in the `IO_PAGE_FAULT` message seems to indicate that the access came from the device rather than the host. It may be from the device acknowledging the transfer? Either that, or the call to `wacom_get_report` might be the true trigger, with the short-circuit you wrote just causing it to be skipped (due to `wacom_set_report` returning error codes).
I sent a fix for this yesterday. Can you please try it? Fix: https://lore.kernel.org/r/20210202145419.29143-1-joro@8bytes.org It is already in the IOMMU tree at https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/ next If you want to try it from there.
(In reply to Joerg Roedel from comment #10) > I sent a fix for this yesterday. Can you please try it? > > Fix: https://lore.kernel.org/r/20210202145419.29143-1-joro@8bytes.org > > It is already in the IOMMU tree at > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/ next > > If you want to try it from there. Is there a way to try it without building the entire kernel ? If no, will this patch be available in next RC7 kernel ? (if yes i will wait until fedora team build this)
(In reply to remy.lampure from comment #11) > Is there a way to try it without building the entire kernel ? > > If no, will this patch be available in next RC7 kernel ? (if yes i will wait > until fedora team build this) You have to build the entire kernel. But I will also push it for -rc7, will update here when the patch is upstream.
The fix is merged into v5.11-rc7.
(In reply to Joerg Roedel from comment #13) > The fix is merged into v5.11-rc7. I installed v5.11-rc7. The issue seems to be gone. My touchpad didn't work the first time i booted in new kernel but it worked the next boots. Resume from sleep is now ok. I also get issue with screen rotation sensors not always working but it's not related. The iommu part of this issue is solved for me. Thx PS: I will the dmesg output with this new kernel
Created attachment 295157 [details] Dmesg with fix in 5.11-rc7 kernel
(In reply to remy.lampure from comment #14) > The iommu part of this issue is solved for me. Thx Thanks for the testing and reporting back. This issue can be closed now.
Issue closed because a fix was added to kernel 5.11