Bug 211241

Summary: BUG: kernel NULL pointer dereference for i2c_amd_mp2 driver an iommu
Product: Drivers Reporter: remy.lampure
Component: IOMMUAssignee: drivers_iommu
Status: RESOLVED CODE_FIX    
Severity: normal CC: joro, jroedel, killertofu, nehal-bakulchandra.shah, remy.lampure, shyam-sundar.s-k, syniurge
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: Kernel 5.8 and above Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesh showing the issue
Dmesg with fix in 5.11-rc7 kernel

Description remy.lampure 2021-01-17 16:14:04 UTC
Created attachment 294707 [details]
dmesh showing the issue

Most things are working despite issue report.
I can't resume from sleep (i get a black screen)

After some investigation i found that the issue appeared after this commit about amd/iommu in kernel: https://gitlab.com/linux-kernel/stable/-/commit/05a0542b456e135f362ba83a17ccff73bac0b92f
What i don't know is if this commit is causing the issue and need to be reverted or only if it means that this driver needs a rewrite.
Comment 1 remy.lampure 2021-01-17 16:41:55 UTC
Detailed description of the issue.

You can find the issue in attached file.
The issue is happening at each boot. Everything is working despite the kernel panic except the fact that i can't resume from suspend.

Issue started after kernel 5.7.XX

Laptop: lenovo Yoga 530-14ARR (81H9) with amd ryzen 2500U and Vega 8 integrated with bios up-to-date. Equipped with MP2 I2C controller and driver i2c-amd-mp2

Deactivating i2c-amd-mp2 stop the issue but as expected touchpad and touchscreen stop working.
Deactivating iommu also stop the issue but it may cause other issue on the laptop.

So the issue seems to be linked to iommu and i2c-amd-mp2 driver.

As i said above, i did some investigation and found out that the issue was linked to this commit: https://lore.kernel.org/linux-iommu/20200527115313.7426-10-joro@8bytes.org/

The issue doesn't happen when i build 5.8 kernel without this commit.
Comment 2 Elie Morisse 2021-01-22 20:10:14 UTC
Hi, and sorry for not getting back to you on Github.

What happens is that there should be "normal" IO page faults reported by the AMD iommu driver for Lenovo Yoga 530-14ARR users. dmesg from 5.7 kernels and older contain something like:

[] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=0xffdefa80 flags=0x0050]
[] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=0xffdefaa0 flags=0x0050]
(7 more lines)
[] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=0xffde9000 flags=0x0050]

These page faults are expected, they've been there since the beginning for the Yoga 530-14ARR and aren't due to the I2C layer but to the Wacom touchscreen itself (more likely) or the wacom driver handling it (much less likely, IIRC), one HID report during initialization specifically. Since the errors seemed harmless, I didn't push the investigation further back then.

But now since 5.8, amd_iommu_report_page_fault segfaults before reporting the IO page faults. From a quick look it seems that dev_iommu_get() wasn't called yet to initialize dev->iommu for the touchscreen device? Is that possible?

Anyway it's clearly an iommu driver bug.

As a temporary workaround for those who don't want to compile a kernel, if you mostly need the touchpad and can live without the touchscreen, blacklisting the wacom module makes suspend/resume work again.
Comment 3 remy.lampure 2021-01-22 20:19:18 UTC
Hi Elie,
Thx for your answer it will help us a lot ;) i tried to investigate and solve it my self, but it seems that i'm not good enouth.
So what do you recommand ?

I can build kernel but how will it help ? Except by using older kernel (like 5.4 LTS kernel)

Blacklisting the wacom module seems to be the best thing (and i still can activate it and deactivate it if needed)

Anyway, if needed i'm also ok to help as much as i can
Comment 4 Elie Morisse 2021-01-24 16:08:09 UTC
(In reply to remy.lampure from comment #3)
> So what do you recommand ?
> 
> I can build kernel but how will it help ? Except by using older kernel (like
> 5.4 LTS kernel)

Adding this null check to amd_iommu_report_page_fault() should prevent the issue :

-	if (pdev)
+	if (pdev && pdev->dev->iommu)
		dev_data = dev_iommu_priv_get(&pdev->dev);

I don't know the iommu code enough to tell if this is a proper fix though.
Comment 5 Jason Gerecke 2021-01-26 16:03:08 UTC
Wacom driver maintainer here :) I don't have access to one of these devices so I unfortunately can't help really with debugging this issue or the IO_PAGE_FAULT messages from earlier kernels. I am, however, interested in understanding what might be going on at a bit more technical level. We've recently been getting some suspend/resume bug reports that I can't explain but which sound somewhat similar to this issue...

@Elie: do you have a guess about "why" the IO_PAGE_FAULT messages would be generated and if they're related to this segfault at all? You said you expect them to be from the touchscreen itself; could you go into a little more detail? Is there a firmware issue that Lenovo or Wacom should be looking at? I can definitely bug the appropriate people if we know enough about what's wrong :)
Comment 6 Elie Morisse 2021-01-30 15:14:46 UTC
Hi,

The IO_PAGE_FAULT is triggered by a 72 bytes long HID feature report sent to the Wacom device. When I filtered this report in the I2C layer :

@@ -193,11 +193,31 @@ static int i2c_amd_xfer(struct i2c_adapter
*adap, struct i2c_msg *msgs, int num)

        for (i = 0; i < num; i++) {
                pmsg = &msgs[i];
+
+    if (pmsg->len == 72) {
+        dev_err(&i2c_dev->pdev->dev,
+                       "Buggy xfer # (size = %hu, flags = %hu, addr =
0x%08x)\n",
+                       pmsg->len, pmsg->flags, pmsg->addr);
+        print_hex_dump(KERN_ERR, "Buffer:\n", DUMP_PREFIX_NONE, 16,
1, pmsg->buf,
+                      pmsg->len, false);
+        dump_stack();
+
+        goto skipXfer;
+    }
+
                err = i2c_amd_xfer_msg(i2c_dev, pmsg);
                if (err)
                        break;
        }

+skipXfer:
	amd_mp2_pm_runtime_put(i2c_dev->common.mp2_dev);
	return err ? err : num;
}

There was no more error, and at least the finger input still works. Styluses are also supported by the touchscreen, may they be the ones affected by blocking this report? (I don't own one so cannot check)

Here's the backtrace and the contents of the report:

[] i2c_amd_mp2 AMDI0011:00: Buggy xfer # (size = 72, flags = 0, addr = 0x0000000a)
[] Buffer: 04 00 3b 03 05 00 42 00 0b 00 00 00 00 00 00 00
[] Buffer: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[] Buffer: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[] Buffer: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[] Buffer: 00 00 00 00 00 00 00 00
[] CPU: 5 PID: 65 Comm: kworker/5:1 Tainted: G           OE     5.9.16-050916-generic #202012211331
[] Hardware name: LENOVO 81H9/LNVNB161216, BIOS 8MCN50WW 12/25/2018
[] Workqueue: events wacom_init_work [wacom]
[] Call Trace:
[]  show_stack+0x52/0x58
[]  dump_stack+0x70/0x8b
[]  i2c_amd_xfer.cold+0x52/0x59 [i2c_amd_mp2_plat]
[]  __i2c_transfer.part.0+0x6b/0x2b0
[]  __i2c_transfer+0x24/0x80
[]  i2c_transfer+0x52/0x110
[]  __i2c_hid_command+0x102/0x2d0 [i2c_hid]
[]  ? __i2c_transfer.part.0+0x6b/0x2b0
[]  ? i2c_adapter_unlock_bus+0x12/0x20
[]  i2c_hid_set_or_send_report+0x11c/0x1d0 [i2c_hid]
[]  i2c_hid_output_raw_report.isra.0+0xac/0xe0 [i2c_hid]
[]  i2c_hid_raw_request+0x5e/0x190 [i2c_hid]
[]  ? hid_alloc_report_buf+0x24/0x30 [hid]
[]  _wacom_query_tablet_data.isra.0+0x137/0x2d0 [wacom]
[]  wacom_init_work+0x1a/0x30 [wacom]
[]  process_one_work+0x1e8/0x3b0
[]  worker_thread+0x50/0x370
[]  kthread+0x12f/0x150
[]  ? process_one_work+0x3b0/0x3b0
[]  ? __kthread_bind_mask+0x70/0x70
[]  ret_from_fork+0x22/0x30

(repeated 5 times because there are 5 attempts by wacom_set_device_mode())

This is why I suspected that it was the touchscreen itself which was in some way accessing off-limit memory, but I don't really know more about the "how" or "why" than that.
Comment 7 Elie Morisse 2021-01-30 15:23:36 UTC
BTW, in the DSDT here's the entry for the touchscreen:

    Device (I2CA)
    {
        Name (_HID, "AMDI0011")  // _HID: Hardware ID
        Name (_UID, Zero)  // _UID: Unique ID
        Name (_ADR, Zero)  // _ADR: Address
        Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
        {
            IRQ (Edge, ActiveHigh, Exclusive, )
                {10}
            Memory32Fixed (ReadWrite,
                0xFEDC2000,         // Address Base
                0x00001000,         // Address Length
                )
        })
        Name (_DEP, Package (0x01)  // _DEP: Dependencies
        {
            ^PCI0.GP17.MP2C
        })
        Method (_STA, 0, NotSerialized)  // _STA: Status
        {
            Return (0x0F)
        }

        Method (RSET, 0, NotSerialized)
        {
            SRAD (0x05, 0xC8)
        }
    }

I could be looking at the wrong entirely, but is 0xFEDC2000 the memory region the touchscreen should be accessing instead?
Comment 8 Elie Morisse 2021-01-30 15:26:20 UTC
Nevermind, disregard my previous message, this isn't the entry for the touchscreen but for one of the I2C adapters.
Comment 9 Jason Gerecke 2021-02-02 21:54:12 UTC
Thanks. The buffer that you dumped doesn't look like what I'd expect `_wacom_query_tablet_data` to create, but I'd have to poke around more with an I2C sensor. Either way, I think I see what you mean about the device itself causing the fault. The `flags=0x0050` in the `IO_PAGE_FAULT` message seems to indicate that the access came from the device rather than the host. It may be from the device acknowledging the transfer? Either that, or the call to `wacom_get_report` might be the true trigger, with the short-circuit you wrote just causing it to be skipped (due to `wacom_set_report` returning error codes).
Comment 10 Joerg Roedel 2021-02-03 10:18:22 UTC
I sent a fix for this yesterday. Can you please try it?

Fix: https://lore.kernel.org/r/20210202145419.29143-1-joro@8bytes.org

It is already in the IOMMU tree at

  https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/ next

If you want to try it from there.
Comment 11 remy.lampure 2021-02-03 15:01:22 UTC
(In reply to Joerg Roedel from comment #10)
> I sent a fix for this yesterday. Can you please try it?
> 
> Fix: https://lore.kernel.org/r/20210202145419.29143-1-joro@8bytes.org
> 
> It is already in the IOMMU tree at
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/ next
> 
> If you want to try it from there.

Is there a way to try it without building the entire kernel ?

If no, will this patch be available in next RC7 kernel ? (if yes i will wait until fedora team build this)
Comment 12 Joerg Roedel 2021-02-03 15:04:39 UTC
(In reply to remy.lampure from comment #11)

> Is there a way to try it without building the entire kernel ?
> 
> If no, will this patch be available in next RC7 kernel ? (if yes i will wait
> until fedora team build this)

You have to build the entire kernel. But I will also push it for -rc7, will update here when the patch is upstream.
Comment 13 Joerg Roedel 2021-02-08 09:04:34 UTC
The fix is merged into v5.11-rc7.
Comment 14 remy.lampure 2021-02-09 15:37:18 UTC
(In reply to Joerg Roedel from comment #13)
> The fix is merged into v5.11-rc7.

I installed v5.11-rc7.

The issue seems to be gone.
My touchpad didn't work the first time i booted in new kernel but it worked the next boots.

Resume from sleep is now ok.

I also get issue with screen rotation sensors not always working but it's not related.

The iommu part of this issue is solved for me. Thx

PS: I will the dmesg output with this new kernel
Comment 15 remy.lampure 2021-02-09 15:38:18 UTC
Created attachment 295157 [details]
Dmesg with fix in 5.11-rc7 kernel
Comment 16 Joerg Roedel 2021-02-09 15:55:41 UTC
(In reply to remy.lampure from comment #14)

> The iommu part of this issue is solved for me. Thx

Thanks for the testing and reporting back. This issue can be closed now.
Comment 17 remy.lampure 2021-02-14 14:02:06 UTC
Issue closed because a fix was added to kernel 5.11